Upgrading OpenShift is the ultimate “Day 2” test for an administrator. Because OCP 4.x is Operator-managed, the upgrade is not just a software update; it is a coordinated orchestration across the entire stack—from the Operating System (RHCOS) to the Control Plane and your worker nodes.
Here are the critical “interview-ready” concepts you need to know for OCP upgrades.
1. The Upgrade Flow (The Order Matters)
When you trigger an upgrade via the Web Console or `oc adm upgrade`, the cluster follows a strict sequence to ensure stability:
- Cluster Version Operator (CVO): First, the CVO updates itself. It is the “brain” that knows what the new version of every other operator should be.
- Control Plane Operators: The operators for the API server, Controller Manager, and Scheduler are updated.
- Etcd: The database is updated (usually one node at a time to maintain quorum).
- Control Plane Nodes: The Machine Config Operator (MCO) drains, updates the OS (RHCOS), and reboots the control plane nodes one by one.
- Worker Nodes: Finally, the MCO begins rolling updates through your worker node pools.
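You can watch this sequence play out from the CLI. The helper below is a minimal sketch that assumes the default `oc get mcp` column order (NAME, CONFIG, UPDATED, UPDATING, DEGRADED, ...):

```bash
# Report which MachineConfigPools the MCO is still rolling through.
pools_updating() {
  # stdin: output of `oc get machineconfigpools --no-headers`
  awk '$4 == "True" {print $1}'   # column 4 is UPDATING
}

# Against a live cluster (requires cluster-admin):
#   oc get clusterversion                              # overall CVO progress
#   oc get machineconfigpools --no-headers | pools_updating
```

During a healthy upgrade you should see the master pool finish before the worker pools start, mirroring the order above.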
2. Update Channels
You must choose a “channel” that dictates how fast you receive updates:
- Stable: Validated updates that have been out for a while.
- Fast: Updates that are technically ready but might still be gaining “field experience.”
- Candidate: Early access for testing.
- EUS (Extended Update Support): Even-numbered releases (e.g., 4.14, 4.16, 4.18) with longer support windows. An EUS-to-EUS upgrade (e.g., 4.14 → 4.16) still moves the control plane through the intermediate minor version, but you keep the worker pools paused for the whole journey, so the workers only drain and reboot once.
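Switching channels is itself an admin task. A hedged example, assuming a 4.16 target (`oc adm upgrade channel` is available on recent clients; the `oc patch` form is the long-hand equivalent):

```bash
# Point the cluster at the stable channel for the target minor version.
# Guarded so the snippet is a no-op on a machine without `oc`;
# the trailing `true` keeps it benign when no cluster is reachable.
CHANNEL="stable-4.16"
if command -v oc >/dev/null 2>&1; then
  oc adm upgrade channel "$CHANNEL" \
    || oc patch clusterversion version --type merge \
         -p "{\"spec\":{\"channel\":\"$CHANNEL\"}}" \
    || true
fi
```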
3. The “Canary” Strategy (Custom MCPs)
In a large production cluster, you don’t want all 100 worker nodes to start rebooting at once.
- MachineConfigPool (MCP) Pausing: You can “pause” a pool of nodes. This allows the Control Plane to upgrade, but keeps the Workers on the old version until you are ready.
- Canary Testing: You can create a small “canary” MCP with only 2–3 nodes. Unpause this pool first, verify your apps work on the new version, and then unpause the rest of the cluster.
4. Critical Troubleshooting Questions
An interviewer will likely give you these scenarios:
- “The upgrade is stuck at 57%.” What do you do?
  - Check ClusterOperators: Run `oc get co`. Look for any operator where AVAILABLE=False or PROGRESSING=True.
  - Check Node Status: Run `oc get nodes`. If a node is SchedulingDisabled, the MCO might be struggling to drain a pod (e.g., one protected by a strict Pod Disruption Budget, or one using local storage).
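Those checks can be wrapped into a small triage helper. The filter below is a sketch that assumes the default `oc get co` column order (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE):

```bash
# Print every ClusterOperator that is unavailable, degraded, or stuck progressing.
unhealthy_cos() {
  # stdin: output of `oc get clusteroperators --no-headers`
  awk '$3 != "True" || $4 == "True" || $5 == "True" {print $1}'
}

# Against a live cluster:
#   oc get clusteroperators --no-headers | unhealthy_cos
#   oc get nodes | grep SchedulingDisabled   # is the MCO stuck draining a node?
```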
- “Can you roll back an OpenShift upgrade?”
- NO. This is a trick question. OpenShift does not support rollbacks. Because the etcd database schema changes during upgrades, you can only “roll forward” by fixing the issue or, in a total disaster, by restoring the cluster from an etcd backup taken before the upgrade.
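Because restoring etcd is the only way back, that pre-upgrade backup matters. A hedged sketch of taking it (the node name is an example; `cluster-backup.sh` ships on RHCOS control plane nodes):

```bash
# Snapshot etcd from a control plane node BEFORE starting the upgrade.
# Guarded and kept benign for machines without `oc` or a reachable cluster.
NODE="master-0"   # example name; substitute one of your control plane nodes
if command -v oc >/dev/null 2>&1; then
  oc debug node/"$NODE" -- chroot /host \
    /usr/local/bin/cluster-backup.sh /home/core/assets/backup || true
fi
```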
5. Best Practices for Admins
- Check the Update Graph: Always use the Red Hat OpenShift Update Graph tool to ensure there is a supported path between your current version and your target.
- Review Alerts: Clear all critical alerts before starting. If the cluster isn’t healthy before the upgrade, it definitely won’t be healthy after.
- Pod Disruption Budgets (PDB): Ensure developers have set up PDBs so the upgrade doesn’t accidentally take down all replicas of a critical service at once.
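To make the PDB point concrete, here is a minimal sketch for a hypothetical "frontend" Deployment; the name and labels are illustrative:

```yaml
# Keep at least 2 replicas of "frontend" running while the MCO drains nodes.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: frontend
```

Beware the flip side: a PDB that currently allows zero disruptions will block the node drain and stall the upgrade.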
The Canary Update strategy allows you to test an OpenShift upgrade on a small subset of nodes before rolling it out to the entire cluster. This is the gold standard for high-availability environments.
Here is the exact administrative workflow and commands you would use.
Step 1: Create a “Canary” MachineConfigPool (MCP)
First, you need a pool that targets only the nodes you want to test.
- Label your canary nodes:

```bash
oc label node <node-name> node-role.kubernetes.io/worker-canary=""
```

- Create the MCP: Save this as canary-mcp.yaml and run `oc create -f canary-mcp.yaml`.

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-canary
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker, worker-canary]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-canary: ""
```
Step 2: Pause the Remaining Worker Pools
Before triggering the cluster upgrade, you must “pause” the main worker pool. This tells the Machine Config Operator (MCO): “Update the Control Plane, but do NOT touch these worker nodes yet.”
```bash
# Pause the standard worker pool
oc patch mcp/worker --type='merge' -p '{"spec":{"paused":true}}'
```
Step 3: Trigger the Upgrade
Now, start the cluster upgrade as usual (via Console or CLI).
```bash
oc adm upgrade --to=4.16.x
```
What happens now?
- The Control Plane upgrades and reboots.
- The Worker-Canary pool (which is NOT paused) updates and reboots.
- The Worker pool (which IS paused) stays on the old version.
Step 4: Verify and Complete the Rollout
Once the Canary nodes are successfully updated and your applications are verified, you can roll out the update to the rest of the cluster by unpausing the main pool.
- Check status:

```bash
oc get mcp
```

You should see worker-canary is UPDATED, but worker shows UPDATED=False.

- Unpause the main pool:

```bash
oc patch mcp/worker --type='merge' -p '{"spec":{"paused":false}}'
```

The MCO will now begin the rolling update of the remaining worker nodes.
Critical Interview Warning: The “Pause” Alert
If an interviewer asks: “Is it safe to leave an MCP paused indefinitely?”
- Answer: No. Since OCP 4.11, a critical alert fires if a pool is paused for more than 1 hour during an update.
- Reason: Pausing an MCP prevents certificate rotation. If you leave it paused too long (usually >24 hours during an upgrade cycle), the nodes' kubelet certificates may expire, the nodes will go NotReady, and the cluster can break.
In OpenShift, Operators are the software managers that keep your cluster healthy. When an operator fails, it shows up as Degraded. As an admin, your job is to find the “who, why, and how” of the failure.
Here is the professional troubleshooting sequence for an OCP Operator failure.
1. Identify the Failing Operator
The first step is always to find which operator is complaining.
```bash
# Get the status of all cluster operators
oc get clusteroperators   # short form: oc get co
```
What to look for: Look for DEGRADED=True or AVAILABLE=False. Common ones that fail are authentication, console, image-registry, and machine-config.
2. The Investigation Sequence
Once you identify the degraded operator (e.g., authentication), follow this 4-step drill:
A. Describe the ClusterOperator
This gives you the “high-level” reason for the failure (often a specific error message from the operator itself).
```bash
oc describe clusteroperator authentication
```
B. Check the Operator’s Namespace
Every operator has its own namespace (usually starting with openshift-).
```bash
# Find the operator's namespace and pods
oc get pods -A | grep authentication
```
C. Inspect the Pod Logs
The operator is just a pod. If it’s failing, it will tell you why in its logs.
```bash
oc logs -n openshift-authentication-operator deployment/authentication-operator
```
D. Check Events
Sometimes the problem isn’t the code, but the infrastructure (e.g., “Failed to pull image” or “Insufficient CPU”).
```bash
oc get events -n openshift-authentication-operator --sort-by='.lastTimestamp'
```
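Steps A through D can be chained into one sweep for a given operator. The namespace pattern `openshift-<name>-operator` holds for many core operators (authentication, console, etc.) but is an assumption, not a rule:

```bash
# Run the describe / pods / logs / events drill for one operator in a single call.
co_drill() {
  co="$1"
  ns="openshift-${co}-operator"          # assumed namespace pattern
  oc describe clusteroperator "$co"
  oc get pods -n "$ns"
  oc logs -n "$ns" "deployment/${co}-operator" --tail=50
  oc get events -n "$ns" --sort-by='.lastTimestamp'
}

# Usage (requires cluster-admin):
#   co_drill authentication
```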
3. Common “Admin-Level” Failure Scenarios
In an interview, you can shine by mentioning these specific, real-world failures:
| Failing Operator | Typical Reason | The Fix |
| --- | --- | --- |
| Machine-Config | Node can’t drain because of a Pod Disruption Budget (PDB). | Manually move the pod or adjust the PDB temporarily. |
| Authentication | Etcd is slow or the internal OAuth secret is out of sync. | Check etcd health; sometimes deleting the operator pod to force a restart helps. |
| Image-Registry | The backend storage (S3, Azure Blob, NFS) is full or disconnected. | Check the configs.imageregistry.operator.openshift.io resource and storage backend. |
| Ingress | Port 80/443 is blocked on the LoadBalancer or the Router deployment is scaling. | Check the IngressController custom resource and cloud provider LB status. |
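For the machine-config row, the usual culprit is a PDB that currently allows zero disruptions. A sketch for spotting them, assuming the default `oc get pdb -A` column order (NAMESPACE, NAME, MIN AVAILABLE, MAX UNAVAILABLE, ALLOWED DISRUPTIONS, AGE):

```bash
# List PDBs that permit no evictions -- these block the MCO's node drain.
blocked_pdbs() {
  # stdin: output of `oc get pdb -A --no-headers`
  awk '$5 == "0" {print $1 "/" $2}'   # column 5 is ALLOWED DISRUPTIONS
}

# Against a live cluster:
#   oc get pdb -A --no-headers | blocked_pdbs
```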
4. The “Nuclear” Option: Must-Gather
If the API is behaving so poorly that you can’t even run these commands, or if you need to open a Red Hat Support ticket, use Must-Gather.
```bash
oc adm must-gather
```
Must-Gather is an admin’s best friend. It creates a local directory containing every log, resource definition, and config file from the cluster (with sensitive values such as secrets redacted). You can then use grep or ag locally to find the needle in the haystack.
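A small helper for that local search; the pattern and directory name below are examples:

```bash
# Recursively list files in a must-gather dump that mention a pattern.
search_dump() {
  # $1 = search pattern, $2 = must-gather directory
  grep -ril "$1" "$2" 2>/dev/null
}

# Example (the directory is whatever `oc adm must-gather` printed):
#   search_dump "certificate has expired" must-gather.local.1234567890/
```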
5. Node-Level Debugging (When the API is down)
If the operator is failing because the node itself is unresponsive, you must go under the hood:
```bash
# Access the node via a debug pod (preferred)
oc debug node/<node-name>

# Once inside the debug pod, switch to host binaries
chroot /host

# Check the container runtime (CRI-O)
crictl ps
crictl logs <container_id>
```