OCP troubleshooting

In an interview, the ability to walk through a logical “drilling down” process is more important than knowing the exact answer immediately. Here is a classic scenario for an OpenShift Admin role.


The Scenario: “The Disappearing Images”

The Symptom: You are paged because developers cannot push or pull images to the internal OpenShift registry. You run oc get co and see that the image-registry operator is Degraded.

Your Task: Walk me through how you find the root cause and fix it.


Your Mock Troubleshooting Response

1. The High-Level Check

“First, I’ll check the high-level error message provided by the ClusterOperator resource. This usually gives a hint if it’s a configuration issue or a backend failure.”

Bash

oc describe clusteroperator image-registry

Interview Result: The message says: “Progressing: Unable to apply resources: storage backend not configured” or “Degraded: error creating registry pod: persistentvolumeclaim ‘image-registry-storage’ not found.”

2. Investigate the Operator Configuration

“Since the error mentions storage, I need to look at the Image Registry’s custom configuration to see where it’s trying to store data.”

Bash

oc get configs.imageregistry.operator.openshift.io cluster -o yaml

What you are looking for: Check the spec.storage section. Is it set to pvc, s3, azure, or emptyDir?
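For reference, a healthy PVC-backed configuration looks roughly like this (a sketch of the spec excerpt only; `image-registry-storage` is the operator’s default claim name):

```yaml
# Excerpt of configs.imageregistry.operator.openshift.io/cluster
spec:
  managementState: Managed
  storage:
    pvc:
      claim: image-registry-storage   # default claim name in openshift-image-registry
```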

3. Deep Dive into the Namespace

“I’ll jump into the openshift-image-registry namespace to check the health of the actual registry pods and the status of the PVC.”

Bash

oc get pods,pvc -n openshift-image-registry

Case A (PVC is Pending): “If the PVC is Pending, I’ll run oc describe pvc <pvc-name>. Usually, this reveals that the requested StorageClass doesn’t exist or there is no capacity left in the storage provider.”

Case B (Pod is CrashLoopBackOff): “If the pod is crashing, I’ll check the logs: oc logs <pod_name>. Often, this is a permission issue where the registry container can’t write to the mounted volume due to UID mismatches.”

4. The Fix

“Depending on the findings, I would:”

  • If storage was missing: Update the configs.imageregistry to point to a valid StorageClass.
  • If it’s a bare-metal install: Patch the registry to use emptyDir (for non-prod) or configure a manual PV.
  • If it’s Cloud (AWS/Azure): Check if the Operator has the right IAM permissions to create the S3 bucket or Blob storage.
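For the non-prod emptyDir workaround, the patch is a one-liner (a sketch; note that emptyDir storage is lost whenever the registry pod restarts, so it is never suitable for production):

```shell
# Non-prod only: back the registry with ephemeral node storage
oc patch configs.imageregistry.operator.openshift.io cluster \
  --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'

# If the registry was left unmanaged (common on bare metal installs),
# flip it back to Managed so the operator deploys the registry
oc patch configs.imageregistry.operator.openshift.io cluster \
  --type merge --patch '{"spec":{"managementState":"Managed"}}'
```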

Bonus “Pro” Answer: The Authentication Operator

If you want to impress the interviewer, mention the Authentication Operator and Certificates.

The Scenario: Authentication is degraded because of expired certificates.

The Pro Tip: “I would check the v4-0-config-system-router-certs secret in the openshift-authentication namespace. If the Ingress wildcard cert was manually replaced but the Auth operator wasn’t updated, it will go Degraded because it can no longer validate the OAuth callback URL. I’d fix this by ensuring the router-ca is correctly synced.”

Interviewer Follow-up:

“What if you fix the storage, but the Operator is still showing Degraded after 10 minutes?”

Your Answer: “Sometimes the Operator’s ‘Sync’ loop gets stuck. I would try a graceful restart of the operator pod itself by running oc delete pod -l name=cluster-image-registry-operator -n openshift-image-registry. Since it’s managed by a Deployment, a new pod will spin up, re-scan the environment, and should clear the Degraded status if the underlying issue is resolved.”

A “Pending” pod is one of the most common issues you’ll face. In an interview, the key is to show you understand that Pending = A Scheduling Problem, whereas CrashLoopBackOff = An Application Problem.

Here is how to handle this scenario like a seasoned admin.


1. The Core Diagnostic: oc describe

The first thing you must say is: “I check the Events section.” The scheduler is very vocal about why it can’t place a pod.

Bash

oc describe pod <pod-name>

Look at the very bottom under “Events”. You will usually see a FailedScheduling warning with a specific reason.


2. Common Reasons (The “Big Four”)

A. Insufficient Resources (CPU/Memory)

  • The Message: 0/6 nodes are available: 3 Insufficient cpu, 3 Insufficient memory.
  • The Reality: Kubernetes schedules based on Requests, not actual usage. Even if a node looks idle, if other pods have “reserved” that space via high requests, the scheduler won’t touch it.
  • The Fix: Scale up the cluster (Autoscaler), add nodes, or ask the developer to lower their resources.requests.
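To confirm an “Insufficient cpu/memory” verdict, compare requests against allocatable capacity (a command sketch; `<node-name>` is a placeholder):

```shell
# What each node has allocatable vs. what is already requested by pods
oc describe node <node-name> | grep -A 8 "Allocated resources"

# Actual live usage, for contrast (requires the metrics API to be available)
oc adm top nodes
```

The gap between “Allocated” and “top” output is often the teaching moment: the scheduler only cares about the former.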

B. Mismatched NodeSelectors / Affinity

  • The Message: 0/6 nodes are available: 6 node(s) didn't match node selector.
  • The Reality: The pod is looking for a label like disktype=ssd, but no nodes have that label.
  • The Fix: Label the nodes or fix the typo in the Deployment YAML.
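The label fix is a quick sketch (using the `disktype=ssd` example from above; `<node-name>` is a placeholder):

```shell
# Add the label the pod's nodeSelector expects
oc label node <node-name> disktype=ssd

# Verify which nodes now carry the label
oc get nodes -l disktype=ssd
```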

C. Taints and Tolerations

  • The Message: 0/6 nodes are available: 6 node(s) had taints that the pod didn't tolerate.
  • The Reality: You might have “Infra” nodes or “GPU” nodes that are tainted to keep regular apps off them. If the pod doesn’t have a matching “Toleration,” it’s banned from those nodes.
  • The Fix: Add the correct tolerations to the pod spec.
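A hedged sketch of the toleration, assuming the common convention of tainting infra nodes with `node-role.kubernetes.io/infra`:

```yaml
# Pod spec excerpt: tolerate the infra taint so the pod may land on infra nodes
spec:
  tolerations:
  - key: "node-role.kubernetes.io/infra"
    operator: "Exists"
    effect: "NoSchedule"
```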

D. Unbound PersistentVolumeClaims (PVC)

  • The Message: pod has unbound immediate PersistentVolumeClaims.
  • The Reality: The pod is waiting for a disk. Maybe the StorageClass is wrong, or the disk is in US-East-1a while the nodes are in US-East-1b.
  • The Fix: Check the PVC status with oc get pvc.

3. Advanced Troubleshooting: “Resource Quotas”

If oc describe doesn’t show a scheduling error, check the Namespace Quota.

Bash

oc get quota

The Scenario: If a project has a limit of 10 CPUs and existing pods are already using 9.5, a new pod requesting 1 CPU will stay Pending because it would violate the project’s “budget,” even if the physical nodes have plenty of room.
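A ResourceQuota producing exactly that “budget” behavior would look roughly like this (namespace and values are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: my-project          # hypothetical project name
spec:
  hard:
    requests.cpu: "10"           # new pods stay Pending once total requests hit 10 CPUs
    requests.memory: 20Gi
```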


4. Summary for the Interviewer

“To summarize, if I see a Pending pod, I follow this hierarchy:”

  1. Check Events: Use oc describe to see the scheduler’s ‘FailedScheduling’ message.
  2. Check Resources: Compare pod requests against node allocatable capacity.
  3. Check Constraints: Verify nodeSelectors, Taints, and Affinity rules.
  4. Check Storage: Ensure the PVC is bound and in the correct zone.
  5. Check Quotas: Ensure the namespace hasn’t hit its hard limit.

In an OpenShift (OCP) admin interview, “Networking” is the area where theory meets reality. By 2026, the focus has shifted entirely to OVN-Kubernetes (the default network provider) and complex traffic patterns like Egress Control.

Here are the most common networking scenarios and questions you’ll encounter.


1. OVN-Kubernetes: The Modern Standard

OpenShift transitioned from the legacy “OpenShift SDN” to OVN-Kubernetes. Interviewers will expect you to know why.

  • Question: Why did OpenShift move to OVN-Kubernetes?
    • Answer: OVN-K is built on Open vSwitch (OVS) and provides better scalability for large clusters, native support for IPv6, and advanced features like Egress IPs and IPsec encryption for pod-to-pod traffic.
  • Troubleshooting Tip: If networking feels “sluggish,” check the OVN Northbound and Southbound databases. These are the “brain” of the network. If they get out of sync, pods might have IPs but can’t talk to each other.
    • Command: oc get pods -n openshift-ovn-kubernetes (Check for failing ovnkube-node or ovnkube-control-plane pods).

2. Egress Traffic: “How do we leave the cluster?”

In enterprise environments, security teams often demand that traffic leaving the cluster has a predictable, static IP for firewall whitelisting.

  • Question: How do you give a specific Project a dedicated external IP?
    • Answer: By using an Egress IP. You assign an IP to a Namespace, and any traffic leaving that namespace to the outside world will appear to come from that specific IP, rather than the node’s IP.
  • The “Egress Firewall” (EgressFirewall):
    • Under OVN-Kubernetes this is the EgressFirewall resource (the legacy OpenShift SDN equivalent was EgressNetworkPolicy). It is used to prevent pods from reaching specific external destinations (e.g., “Allow pods to talk to the corporate DB, but block all other internet access”).
    • Limit: You can only have one EgressFirewall per project.
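Hedged sketches of both OVN-Kubernetes objects (the namespace label, project name, and IP/CIDR values are illustrative):

```yaml
# EgressIP: traffic from namespaces labeled env=prod exits via 192.0.2.10
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-prod
spec:
  egressIPs:
  - 192.0.2.10
  namespaceSelector:
    matchLabels:
      env: prod
---
# EgressFirewall: allow the corporate DB subnet, deny everything else
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default            # EgressFirewall objects must be named "default"
  namespace: my-project
spec:
  egress:
  - type: Allow
    to:
      cidrSelector: 10.0.50.0/24
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
```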

3. Service vs. Route vs. Ingress

This is a classic “bread and butter” question.

  • The Problem: A developer says their application is unreachable from the internet.
  • The Admin Drill:
    1. Check the Route: Does it exist? Is it “Admitted” by the Ingress Controller? (oc get route)
    2. Check the Service: Does the Route point to a valid Service? Does that Service have Endpoints? (oc get endpoints)
    3. Check the Pod: Are the pods running? Are they passing their Readiness Probes? If a probe fails, the endpoint is removed, and the Route will return a 503 Service Unavailable.
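The drill above maps to a short command sequence (placeholders in angle brackets):

```shell
# 1. Is the Route present and Admitted by the Ingress Controller?
oc get route <route-name> -n <namespace> -o wide

# 2. Does the Service behind the Route have live Endpoints?
oc get endpoints <service-name> -n <namespace>

# 3. Are the pods Ready? Failing readiness probes empty the endpoint list
#    and produce the 503 from the router.
oc get pods -n <namespace>
oc describe pod <pod-name> -n <namespace>
```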

4. Common Failure: MTU Mismatches

If you can ping a service but large data transfers (like file uploads) hang or fail, it is almost always an MTU (Maximum Transmission Unit) mismatch.

  • Scenario: You are running OCP on a platform (like Azure or a specific VPC) that uses encapsulation (VXLAN/GENEVE).
  • The Fix: The cluster network MTU must be smaller than the physical network MTU to account for the “header overhead.” If the physical network is 1500, your OVN-K network should usually be 1400.
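To compare the two MTUs, a command sketch (the `clusterNetworkMTU` status field exists on the Network config object in recent OCP releases, but verify on your version):

```shell
# Effective cluster network MTU as reported by the Network config object
oc get network.config.openshift.io cluster \
  -o jsonpath='{.status.clusterNetworkMTU}{"\n"}'

# Physical interface MTU on a node, for comparison
oc debug node/<node-name> -- chroot /host ip link show
```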

5. Network Observability (The 2026 Edge)

In 2026, admins don’t just guess; they use the Network Observability Operator.

  • Question: How do you find out which pod is hogging all the bandwidth?
    • Answer: I use the Network Observability Operator. It deploys an eBPF-based flow collector and stores the flows in Loki, then visualizes traffic in the OCP Console. I can see a “Top Talkers” graph to identify which pod or namespace is causing network congestion.

The “Pro” Interview Summary

If you want to sound like an expert, use these keywords:

  • East-West Traffic: Communication between pods (secured by NetworkPolicies).
  • North-South Traffic: Communication into or out of the cluster (managed by Routes/EgressIP).
  • Hairpinning: When a pod tries to reach itself via the external Route (can cause loops if not configured correctly).

In an OpenShift (OCP) interview, storage is a “Day 2” topic. By 2026, the discussion has moved from simply “how to attach a disk” to software-defined storage and data resilience.

Administrators are expected to understand the abstraction layers between the physical disk and the application.


1. The Core Abstraction (PV, PVC, and StorageClass)

Interviewers will start with the basics to ensure you know the “Kubernetes way” of handling state.

  • StorageClass (SC): The “template” for storage. It defines the provider (AWS EBS, VMware vSphere, Azure Disk) and parameters like reclaimPolicy (Delete vs. Retain).
  • PersistentVolumeClaim (PVC): The developer’s request. “I need 10GB of RWO storage.”
  • PersistentVolume (PV): The actual slice of storage that gets bound to the PVC.
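The three layers fit together like this (a sketch; the class name, provisioner, and sizes are illustrative):

```yaml
# The admin's "template" for storage
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com     # AWS EBS CSI driver, as an example
reclaimPolicy: Retain            # keep the PV after the PVC is deleted
allowVolumeExpansion: true
---
# The developer's request; a matching PV is provisioned and bound to it
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  storageClassName: fast-ssd
  accessModes:
  - ReadWriteOnce                # RWO: single-node mount
  resources:
    requests:
      storage: 10Gi
```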

2. OpenShift Data Foundation (ODF)

This is the “Enterprise” way to do storage in OCP. It is based on Ceph and Rook.

  • Question: Why use ODF instead of just direct cloud-native CSI drivers?
    • Answer: ODF provides a unified layer. It gives you Block (RWO), File (RWX), and Object (S3) storage regardless of where the cluster is running. It also enables advanced features like data replication, snapshots, and disaster recovery (DR) across clusters.
  • Key Component (NooBaa): Mention “Multicloud Object Gateway” (NooBaa). It allows you to store data across different cloud providers (e.g., AWS S3 and Azure Blob) while presenting a single S3 endpoint to the app.

3. Access Modes: RWO vs. RWX

This is a frequent “trap” question in interviews.

  • ReadWriteOnce (RWO): Can be mounted by a single node. Best for databases (PostgreSQL, MongoDB).
  • ReadWriteMany (RWX): Can be mounted by many nodes simultaneously. Essential for shared file systems or web servers serving the same static content.
    • Note: Cloud block storage (EBS/Azure Disk) is almost always RWO. To get RWX, you usually need ODF (CephFS) or a managed service like AWS EFS.

4. Critical Admin Tasks & Commands

An interviewer might ask: “A developer says their database is out of space. Walk me through the fix.”

  1. Check Capability: oc get sc <storage-class-name> -o yaml. Look for allowVolumeExpansion: true.
  2. The Fix: Edit the PVC directly: oc edit pvc <pvc-name>.
  3. The Result: If the CSI driver supports it, the PV will expand automatically, and the file system inside the pod will grow without a restart (usually).
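The walkthrough above as a command sketch (placeholders in angle brackets; 20Gi is an illustrative target, and PVCs can only grow, never shrink):

```shell
# 1. Confirm the StorageClass allows expansion
oc get sc <storage-class-name> -o jsonpath='{.allowVolumeExpansion}{"\n"}'

# 2. Raise the request on the PVC
oc patch pvc <pvc-name> -n <namespace> --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

# 3. Watch the resize conditions until the new size is reported
oc describe pvc <pvc-name> -n <namespace>
```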

5. Advanced: LVM Storage vs. Local Storage Operator

For bare metal or Single Node OpenShift (SNO):

  • LVM Storage Operator (LVMS): The modern (2025/2026) choice. It takes local disks and turns them into a Volume Group, allowing dynamic provisioning of small chunks of local storage.
  • Local Storage Operator (LSO): The “old” way. It binds a whole raw disk (or partition) to a single PV. It’s less flexible than LVMS because it offers no dynamic provisioning or resizing.

6. Storage Troubleshooting Checklist

  • PVC stuck in “Pending”:
    • Check oc describe pvc.
    • Cause: No PV or provisioner matches the request, or the StorageClass uses volumeBindingMode: WaitForFirstConsumer, in which case the PVC stays Pending by design until a pod that consumes it is scheduled.
  • Volume stuck in “Terminating”:
    • Cause: A pod is still using the volume. You must find the pod (oc get pods -A | grep <pvc-name>) and delete it before the storage can be released.
  • Multi-Zone Issues:
    • Cause: In AWS/Azure, a volume created in Zone A cannot be mounted by a node in Zone B. This is why “topology-aware” scheduling is critical.
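The standard guard against the multi-zone problem is a topology-aware StorageClass (a sketch; the provisioner is an example):

```yaml
# WaitForFirstConsumer delays volume creation until a pod is scheduled,
# so the disk is provisioned in the same zone as the consuming node
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware
provisioner: ebs.csi.aws.com     # example CSI driver
volumeBindingMode: WaitForFirstConsumer
```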