Upgrading OpenShift is the ultimate “Day 2” test for an administrator. Because OCP 4.x is Operator-managed, the upgrade is not just a software update; it is a coordinated orchestration across the entire stack—from the Operating System (RHCOS) to the Control Plane and your worker nodes.
Here are the critical “interview-ready” concepts you need to know for OCP upgrades.
1. The Upgrade Flow (The Order Matters)
When you trigger an upgrade via the Web Console or `oc adm upgrade`, the cluster follows a strict sequence to ensure stability:
- Cluster Version Operator (CVO): First, the CVO updates itself. It is the “brain” that knows what the new version of every other operator should be.
- Control Plane Operators: The operators for the API server, Controller Manager, and Scheduler are updated.
- Etcd: The database is updated (usually one node at a time to maintain quorum).
- Control Plane Nodes: The Machine Config Operator (MCO) drains, updates the OS (RHCOS), and reboots the control plane nodes one by one.
- Worker Nodes: Finally, the MCO begins rolling updates through your worker node pools.
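You can watch this sequence play out from the CLI. The helper below is a minimal sketch that assumes the default `oc get mcp` column order (NAME, CONFIG, UPDATED, UPDATING, DEGRADED, ...):

```bash
# Report which MachineConfigPools the MCO is still rolling through.
pools_updating() {
  # stdin: output of `oc get machineconfigpools --no-headers`
  awk '$4 == "True" {print $1}'   # column 4 is UPDATING
}

# Against a live cluster (requires cluster-admin):
#   oc get clusterversion                              # overall CVO progress
#   oc get machineconfigpools --no-headers | pools_updating
```

During a healthy upgrade you should see the master pool finish before the worker pools start, mirroring the order above.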
2. Update Channels
You must choose a “channel” that dictates how fast you receive updates:
- Stable: Validated updates that have been out for a while.
- Fast: Updates that are technically ready but might still be gaining “field experience.”
- Candidate: Early access for testing.
- EUS (Extended Update Support): Even-numbered releases (e.g., 4.14, 4.16, 4.18) with longer support windows. An EUS-to-EUS upgrade (e.g., 4.14 → 4.16) still moves the control plane through the intermediate minor version, but you keep the worker pools paused for the whole journey, so the workers only drain and reboot once.
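Switching channels is itself an admin task. A hedged example, assuming a 4.16 target (`oc adm upgrade channel` is available on recent clients; the `oc patch` form is the long-hand equivalent):

```bash
# Point the cluster at the stable channel for the target minor version.
# Guarded so the snippet is a no-op on a machine without `oc`;
# the trailing `true` keeps it benign when no cluster is reachable.
CHANNEL="stable-4.16"
if command -v oc >/dev/null 2>&1; then
  oc adm upgrade channel "$CHANNEL" \
    || oc patch clusterversion version --type merge \
         -p "{\"spec\":{\"channel\":\"$CHANNEL\"}}" \
    || true
fi
```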
3. The “Canary” Strategy (Custom MCPs)
In a large production cluster, you don’t want all 100 worker nodes to start rebooting at once.
- MachineConfigPool (MCP) Pausing: You can “pause” a pool of nodes. This allows the Control Plane to upgrade, but keeps the Workers on the old version until you are ready.
- Canary Testing: You can create a small “canary” MCP with only 2–3 nodes. Unpause this pool first, verify your apps work on the new version, and then unpause the rest of the cluster.
4. Critical Troubleshooting Questions
An interviewer will likely give you these scenarios:
- “The upgrade is stuck at 57%.” What do you do?
  - Check ClusterOperators: Run `oc get co`. Look for any operator where AVAILABLE=False or PROGRESSING=True.
  - Check Node Status: Run `oc get nodes`. If a node is SchedulingDisabled, the MCO might be struggling to drain a pod (e.g., one protected by a strict Pod Disruption Budget, or one using local storage).
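Those checks can be wrapped into a small triage helper. The filter below is a sketch that assumes the default `oc get co` column order (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE):

```bash
# Print every ClusterOperator that is unavailable, degraded, or stuck progressing.
unhealthy_cos() {
  # stdin: output of `oc get clusteroperators --no-headers`
  awk '$3 != "True" || $4 == "True" || $5 == "True" {print $1}'
}

# Against a live cluster:
#   oc get clusteroperators --no-headers | unhealthy_cos
#   oc get nodes | grep SchedulingDisabled   # is the MCO stuck draining a node?
```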
- “Can you roll back an OpenShift upgrade?”
- NO. This is a trick question. OpenShift does not support rollbacks. Because the etcd database schema changes during upgrades, you can only “roll forward” by fixing the issue or, in a total disaster, by restoring the cluster from an etcd backup taken before the upgrade.
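Because restoring etcd is the only way back, that pre-upgrade backup matters. A hedged sketch of taking it (the node name is an example; `cluster-backup.sh` ships on RHCOS control plane nodes):

```bash
# Snapshot etcd from a control plane node BEFORE starting the upgrade.
# Guarded and kept benign for machines without `oc` or a reachable cluster.
NODE="master-0"   # example name; substitute one of your control plane nodes
if command -v oc >/dev/null 2>&1; then
  oc debug node/"$NODE" -- chroot /host \
    /usr/local/bin/cluster-backup.sh /home/core/assets/backup || true
fi
```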
5. Best Practices for Admins
- Check the Update Graph: Always use the Red Hat OpenShift Update Graph tool to ensure there is a supported path between your current version and your target.
- Review Alerts: Clear all critical alerts before starting. If the cluster isn’t healthy before the upgrade, it definitely won’t be healthy after.
- Pod Disruption Budgets (PDB): Ensure developers have set up PDBs so the upgrade doesn’t accidentally take down all replicas of a critical service at once.
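To make the PDB point concrete, here is a minimal sketch for a hypothetical "frontend" Deployment; the name and labels are illustrative:

```yaml
# Keep at least 2 replicas of "frontend" running while the MCO drains nodes.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: frontend
```

Beware the flip side: a PDB that currently allows zero disruptions will block the node drain and stall the upgrade.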
The Canary Update strategy allows you to test an OpenShift upgrade on a small subset of nodes before rolling it out to the entire cluster. This is the gold standard for high-availability environments.
Here is the exact administrative workflow and commands you would use.
Step 1: Create a “Canary” MachineConfigPool (MCP)
First, you need a pool that targets only the nodes you want to test.
- Label your canary nodes:

```bash
oc label node <node-name> node-role.kubernetes.io/worker-canary=""
```

- Create the MCP: Save this as canary-mcp.yaml and run `oc create -f canary-mcp.yaml`.

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-canary
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker, worker-canary]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-canary: ""
```
Step 2: Pause the Remaining Worker Pools
Before triggering the cluster upgrade, you must “pause” the main worker pool. This tells the Machine Config Operator (MCO): “Update the Control Plane, but do NOT touch these worker nodes yet.”
```bash
# Pause the standard worker pool
oc patch mcp/worker --type='merge' -p '{"spec":{"paused":true}}'
```
Step 3: Trigger the Upgrade
Now, start the cluster upgrade as usual (via Console or CLI).
```bash
oc adm upgrade --to=4.16.x
```
What happens now?
- The Control Plane upgrades and reboots.
- The Worker-Canary pool (which is NOT paused) updates and reboots.
- The Worker pool (which IS paused) stays on the old version.
Step 4: Verify and Complete the Rollout
Once the Canary nodes are successfully updated and your applications are verified, you can roll out the update to the rest of the cluster by unpausing the main pool.
- Check status:

```bash
oc get mcp
```

You should see worker-canary is UPDATED, but worker shows UPDATED=False.

- Unpause the main pool:

```bash
oc patch mcp/worker --type='merge' -p '{"spec":{"paused":false}}'
```

The MCO will now begin the rolling update of the remaining worker nodes.
Critical Interview Warning: The “Pause” Alert
If an interviewer asks: “Is it safe to leave an MCP paused indefinitely?”
- Answer: No. Since OCP 4.11, a critical alert fires if a pool is paused for more than 1 hour during an update.
- Reason: Pausing an MCP prevents certificate rotation. If you leave it paused too long (usually >24 hours during an upgrade cycle), the nodes' kubelet certificates may expire, the nodes will go NotReady, and the cluster can break.
In OpenShift, Operators are the software managers that keep your cluster healthy. When an operator fails, it shows up as Degraded. As an admin, your job is to find the “who, why, and how” of the failure.
Here is the professional troubleshooting sequence for an OCP Operator failure.
1. Identify the Failing Operator
The first step is always to find which operator is complaining.
```bash
# Get the status of all cluster operators
oc get clusteroperators   # short form: oc get co
```
What to look for: Look for DEGRADED=True or AVAILABLE=False. Common ones that fail are authentication, console, image-registry, and machine-config.
2. The Investigation Sequence
Once you identify the degraded operator (e.g., authentication), follow this 4-step drill:
A. Describe the ClusterOperator
This gives you the “high-level” reason for the failure (often a specific error message from the operator itself).
```bash
oc describe clusteroperator authentication
```
B. Check the Operator’s Namespace
Every operator has its own namespace (usually starting with openshift-).
```bash
# Find the operator's namespace and pods
oc get pods -A | grep authentication
```
C. Inspect the Pod Logs
The operator is just a pod. If it’s failing, it will tell you why in its logs.
```bash
oc logs -n openshift-authentication-operator deployment/authentication-operator
```
D. Check Events
Sometimes the problem isn’t the code, but the infrastructure (e.g., “Failed to pull image” or “Insufficient CPU”).
```bash
oc get events -n openshift-authentication-operator --sort-by='.lastTimestamp'
```
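Steps A through D can be chained into one sweep for a given operator. The namespace pattern `openshift-<name>-operator` holds for many core operators (authentication, console, etc.) but is an assumption, not a rule:

```bash
# Run the describe / pods / logs / events drill for one operator in a single call.
co_drill() {
  co="$1"
  ns="openshift-${co}-operator"          # assumed namespace pattern
  oc describe clusteroperator "$co"
  oc get pods -n "$ns"
  oc logs -n "$ns" "deployment/${co}-operator" --tail=50
  oc get events -n "$ns" --sort-by='.lastTimestamp'
}

# Usage (requires cluster-admin):
#   co_drill authentication
```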
3. Common “Admin-Level” Failure Scenarios
In an interview, you can shine by mentioning these specific, real-world failures:
| Failing Operator | Typical Reason | The Fix |
| --- | --- | --- |
| Machine-Config | Node can’t drain because of a Pod Disruption Budget (PDB). | Manually move the pod or adjust the PDB temporarily. |
| Authentication | Etcd is slow or the internal OAuth secret is out of sync. | Check etcd health; sometimes deleting the operator pod to force a restart helps. |
| Image-Registry | The backend storage (S3, Azure Blob, NFS) is full or disconnected. | Check the configs.imageregistry.operator.openshift.io resource and storage backend. |
| Ingress | Port 80/443 is blocked on the LoadBalancer or the Router deployment is scaling. | Check the IngressController custom resource and cloud provider LB status. |
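For the machine-config row, the usual culprit is a PDB that currently allows zero disruptions. A sketch for spotting them, assuming the default `oc get pdb -A` column order (NAMESPACE, NAME, MIN AVAILABLE, MAX UNAVAILABLE, ALLOWED DISRUPTIONS, AGE):

```bash
# List PDBs that permit no evictions -- these block the MCO's node drain.
blocked_pdbs() {
  # stdin: output of `oc get pdb -A --no-headers`
  awk '$5 == "0" {print $1 "/" $2}'   # column 5 is ALLOWED DISRUPTIONS
}

# Against a live cluster:
#   oc get pdb -A --no-headers | blocked_pdbs
```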
4. The “Nuclear” Option: Must-Gather
If the API is behaving so poorly that you can’t even run these commands, or if you need to open a Red Hat Support ticket, use Must-Gather.
```bash
oc adm must-gather
```
Must-Gather is an admin’s best friend. It creates a local directory containing every log, resource definition, and config file from the cluster (with sensitive values such as secrets redacted). You can then use grep or ag locally to find the needle in the haystack.
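A small helper for that local search; the pattern and directory name below are examples:

```bash
# Recursively list files in a must-gather dump that mention a pattern.
search_dump() {
  # $1 = search pattern, $2 = must-gather directory
  grep -ril "$1" "$2" 2>/dev/null
}

# Example (the directory is whatever `oc adm must-gather` printed):
#   search_dump "certificate has expired" must-gather.local.1234567890/
```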
5. Node-Level Debugging (When the API is down)
If the operator is failing because the node itself is unresponsive, you must go under the hood:
```bash
# Access the node via a debug pod (preferred)
oc debug node/<node-name>

# Once inside the debug pod, switch to host binaries
chroot /host

# Check the container runtime (CRI-O)
crictl ps
crictl logs <container_id>
```