OCP vs EKS vs AKS — clear, real-world comparison

You’re comparing three ways to run Kubernetes:

  • Red Hat OpenShift (OCP) → full enterprise platform
  • Amazon EKS → AWS-managed Kubernetes
  • Azure Kubernetes Service (AKS) → Azure-managed Kubernetes

One-line mental model

  • OCP = Kubernetes + platform + opinionated tooling
  • EKS / AKS = Kubernetes as a service

⚙️ Core architecture difference

OpenShift (OCP)

  • You manage:
    • cluster (unless using ROSA/ARO)
    • nodes
  • Comes with:
    • registry
    • CI/CD
    • security policies
    • operators
  • Runs:
    • on-prem, cloud, hybrid

EKS (AWS)

  • AWS manages:
    • control plane ✅
  • You manage:
    • worker nodes (or use Fargate)
  • Uses AWS ecosystem:
    • IAM
    • ALB / NLB
    • VPC networking

AKS (Azure)

  • Azure manages:
    • control plane ✅
  • You manage:
    • node pools
  • Uses Azure ecosystem:
    • Azure AD
    • Load Balancer
    • VNets

Security model

| Feature | OCP | EKS | AKS |
|---|---|---|---|
| Default security | 🔒 Very strict | Moderate | Moderate |
| Pod restrictions | SCC (strong) | Pod Security Admission / OPA (optional) | Azure Policy (optional) |
| Identity | RBAC + OAuth | IAM roles | Azure AD |

OCP is the most locked-down by default.


Networking & exposure

| Feature | OCP | EKS | AKS |
|---|---|---|---|
| External access | Routes | Ingress + ALB | Ingress + Azure LB |
| CNI | OVN-Kubernetes | AWS VPC CNI | Azure CNI / Kubenet |

OCP’s Routes = simpler developer experience
EKS/AKS = more cloud-native integrations


CI/CD & Developer Experience

| Feature | OCP | EKS | AKS |
|---|---|---|---|
| Built-in CI/CD | ✅ Yes (BuildConfig, pipelines) | ❌ No | ❌ No |
| Container registry | ✅ Built-in | ❌ (ECR external) | ❌ (ACR external) |
| Developer UI | ✅ Strong | Minimal | Minimal |

OCP is a developer platform, not just infra.


Operations & automation

| Feature | OCP | EKS | AKS |
|---|---|---|---|
| Operators | Core concept | Optional | Optional |
| Cluster upgrades | Operator-driven | AWS-managed | Azure-managed |
| Add-ons | Built-in | AWS add-ons | Azure add-ons |

Cost model (important)

  • OCP
    • license + infra cost
  • EKS
    • control plane fee + AWS resources
  • AKS
    • control plane often free + Azure resources

OCP is usually the most expensive.


Where each shines

Use OpenShift when:

  • enterprise / regulated environments
  • on-prem or hybrid cloud
  • need built-in CI/CD + security
  • platform engineering teams

Use EKS when:

  • you’re deep in AWS ecosystem
  • want flexibility + AWS integrations
  • prefer DIY platform setup

Use AKS when:

  • you’re in Azure ecosystem
  • want simplest managed Kubernetes
  • using Azure AD, DevOps, etc.

Real-world differences that matter

1. Developer experience

  • OCP → “push code → app runs”
  • EKS/AKS → you wire everything yourself

2. Security defaults

  • OCP → restrictive (safe by default)
  • EKS/AKS → flexible (you configure security)

3. Lock-in

  • OCP → Red Hat ecosystem
  • EKS → AWS lock-in
  • AKS → Azure lock-in

Interview-ready answer

“OpenShift is a full Kubernetes platform with built-in CI/CD, registry, and strong security, while EKS and AKS are managed Kubernetes services where the cloud provider manages the control plane. OCP is more opinionated and enterprise-focused, whereas EKS and AKS provide more flexibility but require assembling additional components.”


OpenShift (OCP) – Ingress

In OpenShift (OCP), Ingress is the mechanism that allows external traffic (HTTP/HTTPS) to reach services inside your cluster. While Kubernetes has a standard “Ingress” resource, OpenShift has historically used its own evolved version called Routes.

As of 2026, the landscape has expanded to include the Gateway API, which is the modern successor to both.


1. The Three Ways to Expose Apps

| Feature | Route (Native OCP) | Ingress (K8s Standard) | Gateway API (The Future) |
|---|---|---|---|
| Simplicity | High (very easy to use) | Medium | Medium/High |
| Flexibility | Good | Limited (needs annotations) | Extreme (fine-grained control) |
| Standard | Red Hat proprietary | Kubernetes legacy | Kubernetes modern |
| Best for | Standard OCP apps | Cross-platform migration | Complex routing / canary / blue-green |

2. How the “Router” Works

The Ingress Controller in OCP is an Operator-managed deployment of HAProxy (by default).

  • It sits at the edge of the cluster.
  • It watches for new Routes or Ingresses.
  • It automatically updates its configuration and starts proxying traffic to the correct Pods.

3. Key Concepts for Admins

Ingress Controller Sharding

In large clusters, a single router can become a bottleneck. You can “shard” your ingress traffic by creating multiple Ingress Controllers.

  • Example: Create one router for *.public.example.com and a separate, isolated router for *.internal.example.com.
  • Benefit: Performance isolation and security (e.g., PCI-compliant traffic on specific nodes).

TLS Termination Patterns

Routes support four types of security:

  1. Edge: SSL is decrypted at the Router. Traffic to the Pod is plain HTTP. (Most common).
  2. Passthrough: SSL is sent directly to the Pod. The Router doesn’t see the data.
  3. Re-encryption: SSL is decrypted at the Router, inspected, then re-encrypted before being sent to the Pod.
  4. None: Simple plain HTTP.
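An edge-terminated Route, for example, might look like the following sketch (the app name, namespace, and hostname are illustrative):

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: my-app                 # hypothetical application name
  namespace: my-project
spec:
  host: my-app.apps.cluster.example.com
  to:
    kind: Service
    name: my-app
  port:
    targetPort: 8080
  tls:
    termination: edge                        # TLS decrypted at the router
    insecureEdgeTerminationPolicy: Redirect  # plain HTTP requests redirected to HTTPS
```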

4. Interview “Pro” Tips

  • The “503 Service Unavailable” Error: If a developer sees this on their Route, it almost always means the Readiness Probe for the Pod is failing. The Router won’t send traffic to a Pod that isn’t “Ready.”
  • Host vs Path Routing: OCP Routes excel at host-based routing (app-a.com vs app-b.com). If you need complex path-based routing (e.g., app.com/v1/api going to one service and /v2/api to another), the Gateway API is now the recommended tool over standard Routes.
  • Wildcard DNS: OCP creates a default wildcard (e.g., *.apps.cluster.example.com). Every time you create a Route without a host, OCP generates one for you using this pattern.
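The path-based routing scenario above could be expressed as a Gateway API `HTTPRoute` roughly like this (the Gateway, hostnames, and backend Service names are hypothetical):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-split              # hypothetical
  namespace: my-project
spec:
  parentRefs:
    - name: my-gateway         # an existing Gateway (assumed to be configured)
  hostnames:
    - app.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/api
      backendRefs:
        - name: api-v1         # /v1/api traffic goes to this Service
          port: 8080
    - matches:
        - path:
            type: PathPrefix
            value: /v2/api
      backendRefs:
        - name: api-v2         # /v2/api traffic goes here
          port: 8080
```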

5. Troubleshooting Command Cheat Sheet

```bash
# Check the status of the Ingress Operator
oc get ingresscontroller -n openshift-ingress-operator

# See the HAProxy pods actually doing the work
oc get pods -n openshift-ingress

# Check if a Route is "Admitted" (successfully configured)
oc get route <route-name> -o yaml | grep -A 5 status

# Look at router logs to see traffic errors
oc logs -n openshift-ingress deployment/router-default
```

In a high-scale enterprise environment, you often want to isolate traffic. For example, your **Internal HR App** shouldn't share the same entry point (the router) as your **Public Marketing Site**. This is called **Ingress Sharding**. In OpenShift, we achieve this by creating a second **Ingress Controller** and using **Namespace Selectors**.
---
### 1. The Strategy: Router Sharding
By default, OpenShift has a `default` Ingress Controller that handles everything. To split traffic, we will:
1. Create a new Ingress Controller named `sharded-public`.
2. Tell it to only watch namespaces with a specific label (e.g., `type: public`).
3. Label our target namespace.
---
### 2. Implementation Steps
#### **Step A: Label the Namespace**
First, identify which projects should use the new, isolated router.
```bash
oc label namespace my-public-app type=public
```
#### **Step B: Create the New Ingress Controller**
This YAML creates a new set of HAProxy pods that only serve traffic for namespaces with the `type=public` label.
```yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: sharded-public
  namespace: openshift-ingress-operator
spec:
  domain: public.apps.mycluster.com    # A dedicated subdomain
  endpointPublishingStrategy:
    type: LoadBalancerService          # Creates a new Cloud Load Balancer
  namespaceSelector:
    matchLabels:
      type: public                     # The magic filter
  nodePlacement:                       # Optional: run on specific "DMZ" nodes
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/infra: ""
```
---
### 3. Verification
Once applied, OpenShift will spin up new pods in the `openshift-ingress` namespace.
* **Check Pods:** `oc get pods -n openshift-ingress` (You'll see `router-sharded-public-...`)
* **Check Load Balancer:** `oc get svc -n openshift-ingress` (You'll see a new service with a unique External IP).
Now, any Route created in the `my-public-app` namespace will be picked up by the **new** router and ignored by the **default** one.
---
### 💡 Interview Questions on Sharding
* **Q: Why would you shard an Ingress Controller?**
* **A:** For **Security** (isolating internal vs. external traffic), **Performance** (preventing a noisy neighbor from hogging the CPU/bandwidth of the router), and **Compliance** (ensuring certain data only flows through nodes that meet specific regulatory standards).
* **Q: How does the Route know which router to use?**
* **A:** The Route doesn't "choose." The **Ingress Controllers** choose the Routes based on the `namespaceSelector` or `routeSelector` defined in their configuration.
* **Q: Can a single Route be served by two different Routers?**
* **A:** Yes, if both Ingress Controllers have selectors that match that Route's namespace or labels. This is sometimes used during migrations.
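For reference, the `routeSelector` variant mentioned in the answers might look roughly like this (controller name, domain, and label are illustrative):

```yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: sharded-internal       # hypothetical controller name
  namespace: openshift-ingress-operator
spec:
  domain: internal.apps.mycluster.com
  routeSelector:               # selects individual Routes by label,
    matchLabels:               # instead of selecting whole namespaces
      traffic: internal
```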
---

OCP architecture

OpenShift (OCP) Architecture — Clear, Practical Breakdown

Red Hat OpenShift (OCP) is a Kubernetes-based platform with extra layers for:

  • security
  • developer workflows
  • enterprise operations

Think of it as:

Kubernetes + opinionated enterprise tooling + automation


High-Level Architecture

At the highest level, OpenShift has 3 main layers:

1. Control Plane (Master Nodes)

Manages the cluster

2. Worker Nodes

Run your applications

3. Infrastructure Layer

Networking, storage, registry, ingress


1. Control Plane (Master Nodes)

Core brain of the cluster:

Key components:

  • kube-apiserver
    • entry point for all API calls
  • etcd
    • stores cluster state
  • kube-scheduler
    • assigns pods to nodes
  • kube-controller-manager
    • maintains desired state

OpenShift-specific additions:

  • OpenShift API Server
    • adds OCP-specific APIs (routes, builds, etc.)
  • Controller Manager (OpenShift)
    • handles builds, deployments, image streams

2. Worker Nodes

Where workloads run.

Components:

  • kubelet
    • manages pods on node
  • Container runtime
    • usually CRI-O (default in OpenShift)
  • Pods
    • your apps + sidecars

3. Networking Layer

Key pieces:

  • Cluster Network
    • pod-to-pod communication
  • Service Network
    • stable virtual IPs
  • Ingress / Routes (OpenShift-specific)

OpenShift uses Routes instead of standard Ingress:

  • external traffic → router → service → pod

OpenShift Router (Ingress Controller)

  • based on HAProxy
  • handles:
    • TLS termination
    • load balancing
    • external exposure

4. Image & Build System (OCP unique)

This is where OpenShift stands out.

Image Registry

  • internal container registry

Image Streams

  • track image versions
  • trigger deployments automatically

BuildConfig

  • builds images from:
    • Git
    • Dockerfile
    • Source-to-Image (S2I)
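A minimal S2I BuildConfig sketch, assuming a Git repo and a Node.js builder ImageStreamTag (all names and the repo URL are illustrative):

```yaml
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: my-app                         # hypothetical
spec:
  source:
    type: Git
    git:
      uri: https://github.com/example/my-app.git
  strategy:
    type: Source                       # Source-to-Image (S2I) build
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: nodejs:18-ubi8           # assumed builder ImageStreamTag
  output:
    to:
      kind: ImageStreamTag
      name: my-app:latest              # pushed to the internal registry
  triggers:
    - type: ImageChange                # rebuild when the builder image updates
    - type: ConfigChange
```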

5. Security Layer (very important)

OpenShift is stricter than Kubernetes.

Features:

  • Security Context Constraints (SCC)
    • control what pods can do
    • similar in spirit to the (now-removed) Pod Security Policies
  • No root containers by default
  • SELinux enforced
  • integrated RBAC
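As an illustration, a custom SCC might look like the following trimmed-down sketch (a hypothetical example, not a production policy):

```yaml
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: restricted-no-root   # hypothetical SCC name
allowPrivilegedContainer: false
allowPrivilegeEscalation: false
runAsUser:
  type: MustRunAsRange       # UID must come from the namespace's assigned range
seLinuxContext:
  type: MustRunAs            # SELinux context enforced
fsGroup:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
volumes:                     # only these volume types are allowed
  - configMap
  - secret
  - emptyDir
  - persistentVolumeClaim
users: []                    # grant via RBAC or `oc adm policy add-scc-to-user`
groups: []
```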

6. Operators (Automation Engine)

OpenShift heavily uses Operators.

  • manage apps like:
    • databases
    • monitoring
    • logging

Built-in operators:

  • cluster version operator
  • ingress operator
  • etc.

7. Observability & Logging

Built-in:

  • Prometheus (monitoring)
  • Grafana (dashboards)
  • EFK / Loki stack (logging)

Full Flow Example

Deploying an app:

  1. Push code to Git
  2. BuildConfig builds image
  3. Image stored in registry
  4. Deployment created
  5. Pod runs on worker node
  6. Service exposes pod internally
  7. Route exposes app externally

OpenShift vs Kubernetes (quick view)

| Feature | Kubernetes | OpenShift |
|---|---|---|
| Ingress | Ingress resource | Routes |
| Security | Flexible | Strict by default |
| Builds | External tools | Built-in |
| Registry | Optional | Built-in |
| UI | Optional | Strong web console |

Simple mental model

  • Kubernetes = engine
  • OpenShift = full platform

Interview-ready summary

“OpenShift architecture is built on Kubernetes with control plane and worker nodes, but adds enterprise features like integrated registry, build pipelines, enhanced security via SCC, and a routing layer for external traffic. It also uses operators extensively to automate cluster management.”


OCP troubleshooting

In an interview, the ability to walk through a logical “drilling down” process is more important than knowing the exact answer immediately. Here is a classic scenario for an OpenShift Admin role.


The Scenario: “The Disappearing Images”

The Symptom: You are paged because developers cannot push or pull images to the internal OpenShift registry. You run oc get co and see that the image-registry operator is Degraded.

Your Task: Walk me through how you find the root cause and fix it.


Your Mock Troubleshooting Response

1. The High-Level Check

“First, I’ll check the high-level error message provided by the ClusterOperator resource. This usually gives a hint if it’s a configuration issue or a backend failure.”

```bash
oc describe clusteroperator image-registry
```

Interview Result: The message says: “Progressing: Unable to apply resources: storage backend not configured” or “Degraded: error creating registry pod: persistentvolumeclaim 'image-registry-storage' not found.”

2. Investigate the Operator Configuration

“Since the error mentions storage, I need to look at the Image Registry’s custom configuration to see where it’s trying to store data.”

```bash
oc get configs.imageregistry.operator.openshift.io cluster -o yaml
```

What you are looking for: Check the spec.storage section. Is it set to pvc, s3, azure, or emptyDir?

3. Deep Dive into the Namespace

“I’ll jump into the openshift-image-registry namespace to check the health of the actual registry pods and the status of the PVC.”

```bash
oc get pods,pvc -n openshift-image-registry
```

Case A (PVC is Pending): “If the PVC is Pending, I’ll run oc describe pvc <pvc-name>. Usually, this reveals that the requested StorageClass doesn’t exist or there is no capacity left in the storage provider.”

Case B (Pod is CrashLoopBackOff): “If the pod is crashing, I’ll check the logs: oc logs <pod_name>. Often, this is a permission issue where the registry container can’t write to the mounted volume due to UID mismatches.”

4. The Fix

“Depending on the find, I would:”

  • If storage was missing: Update the configs.imageregistry to point to a valid StorageClass.
  • If it’s a bare-metal install: Patch the registry to use emptyDir (for non-prod) or configure a manual PV.
  • If it’s Cloud (AWS/Azure): Check if the Operator has the right IAM permissions to create the S3 bucket or Blob storage.

Bonus “Pro” Answer: The Authentication Operator

If you want to impress the interviewer, mention the Authentication Operator and Certificates.

The Scenario: Authentication is degraded because of expired certificates.

The Pro Tip: “I would check the v4-0-config-system-router-certs secret in the openshift-authentication namespace. If the Ingress wildcard cert was manually replaced but the Auth operator wasn’t updated, it will go Degraded because it can no longer validate the OAuth callback URL. I’d fix this by ensuring the router-ca is correctly synced.”

Interviewer Follow-up:

“What if you fix the storage, but the Operator is still showing Degraded after 10 minutes?”

Your Answer: “Sometimes the Operator’s ‘Sync’ loop gets stuck. I would try a graceful restart of the operator pod itself by running oc delete pod -l name=cluster-image-registry-operator -n openshift-image-registry-operator. Since it’s a deployment, a new pod will spin up, re-scan the environment, and should clear the Degraded status if the underlying issue is resolved.”

A “Pending” pod is one of the most common issues you’ll face. In an interview, the key is to show you understand that Pending = A Scheduling Problem, whereas CrashLoopBackOff = An Application Problem.

Here is how to handle this scenario like a seasoned admin.


1. The Core Diagnostic: oc describe

The first thing you must say is: “I check the Events section.” The scheduler is very vocal about why it can’t place a pod.

```bash
oc describe pod <pod-name>
```

Look at the very bottom under “Events”. You will usually see a FailedScheduling warning with a specific reason.


2. Common Reasons (The “Big Four”)

A. Insufficient Resources (CPU/Memory)

  • The Message: 0/6 nodes are available: 3 Insufficient cpu, 3 Insufficient memory.
  • The Reality: Kubernetes schedules based on Requests, not actual usage. Even if a node looks idle, if other pods have “reserved” that space via high requests, the scheduler won’t touch it.
  • The Fix: Scale up the cluster (Autoscaler), add nodes, or ask the developer to lower their resources.requests.

B. Mismatched NodeSelectors / Affinity

  • The Message: 0/6 nodes are available: 6 node(s) didn't match node selector.
  • The Reality: The pod is looking for a label like disktype=ssd, but no nodes have that label.
  • The Fix: Label the nodes or fix the typo in the Deployment YAML.

C. Taints and Tolerations

  • The Message: 0/6 nodes are available: 6 node(s) had taints that the pod didn't tolerate.
  • The Reality: You might have “Infra” nodes or “GPU” nodes that are tainted to keep regular apps off them. If the pod doesn’t have a matching “Toleration,” it’s banned from those nodes.
  • The Fix: Add the correct tolerations to the pod spec.

D. Unbound PersistentVolumeClaims (PVC)

  • The Message: pod has unbound immediate PersistentVolumeClaims.
  • The Reality: The pod is waiting for a disk. Maybe the StorageClass is wrong, or the disk is in US-East-1a while the nodes are in US-East-1b.
  • The Fix: Check the PVC status with oc get pvc.
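For the taints case (C), a pod spec fragment that tolerates an infra-node taint might look like this (the taint key `node-role.kubernetes.io/infra` is assumed; use whatever key your nodes are tainted with):

```yaml
# Pod/Deployment spec fragment
spec:
  tolerations:
    - key: node-role.kubernetes.io/infra
      operator: Exists         # tolerate the taint regardless of its value
      effect: NoSchedule
  nodeSelector:                # also steer the pod onto those nodes
    node-role.kubernetes.io/infra: ""
```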

3. Advanced Troubleshooting: “Resource Quotas”

If oc describe doesn’t show a scheduling error, check the Namespace Quota.

```bash
oc get quota
```

The Scenario: If a project has a limit of 10 CPUs and existing pods are already using 9.5, a new pod requesting 1 CPU will stay Pending because it would violate the project’s “budget,” even if the physical nodes have plenty of room.


4. Summary for the Interviewer

“To summarize, if I see a Pending pod, I follow this hierarchy:”

  1. Check Events: Use oc describe to see the scheduler’s ‘FailedScheduling’ message.
  2. Check Resources: Compare pod requests against node allocatable capacity.
  3. Check Constraints: Verify nodeSelectors, Taints, and Affinity rules.
  4. Check Storage: Ensure the PVC is bound and in the correct zone.
  5. Check Quotas: Ensure the namespace hasn’t hit its hard limit.

In an OpenShift (OCP) admin interview, “Networking” is the area where theory meets reality. By 2026, the focus has shifted entirely to OVN-Kubernetes (the default network provider) and complex traffic patterns like Egress Control.

Here are the most common networking scenarios and questions you’ll encounter.


1. OVN-Kubernetes: The Modern Standard

OpenShift transitioned from the legacy “OpenShift SDN” to OVN-Kubernetes. Interviewers will expect you to know why.

  • Question: Why did OpenShift move to OVN-Kubernetes?
    • Answer: OVN-K is built on Open vSwitch (OVS) and provides better scalability for large clusters, native support for IPv6, and advanced features like Egress IPs and IPsec encryption for pod-to-pod traffic.
  • Troubleshooting Tip: If networking feels “sluggish,” check the OVN Northbound and Southbound databases. These are the “brain” of the network. If they get out of sync, pods might have IPs but can’t talk to each other.
    • Command: oc get pods -n openshift-ovn-kubernetes (Check for failing ovnkube-node or ovnkube-control-plane pods).

2. Egress Traffic: “How do we leave the cluster?”

In enterprise environments, security teams often demand that traffic leaving the cluster has a predictable, static IP for firewall whitelisting.

  • Question: How do you give a specific Project a dedicated external IP?
    • Answer: By using an Egress IP. You assign an IP to a Namespace, and any traffic leaving that namespace to the outside world will appear to come from that specific IP, rather than the node’s IP.
  • The “Egress Firewall” (EgressNetworkPolicy):
    • This is used to prevent pods from reaching specific external destinations (e.g., “Allow pods to talk to the corporate DB, but block all other internet access”).
    • Limit: You can only have one EgressNetworkPolicy per project.
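With OVN-Kubernetes, the corresponding CR is `EgressFirewall` (it must be named `default`, one per namespace). A sketch of the "allow the corporate DB, block everything else" rule, with a hypothetical CIDR:

```yaml
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default                # required name; one per namespace
  namespace: my-app
spec:
  egress:
    - type: Allow
      to:
        cidrSelector: 10.0.50.0/24   # hypothetical corporate DB subnet
    - type: Deny
      to:
        cidrSelector: 0.0.0.0/0      # block all other external traffic
```

Rules are evaluated top to bottom, so the broad `Deny` goes last.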

3. Service vs. Route vs. Ingress

This is a classic “bread and butter” question.

  • The Problem: A developer says their application is unreachable from the internet.
  • The Admin Drill:
    1. Check the Route: Does it exist? Is it “Admitted” by the Ingress Controller? (oc get route)
    2. Check the Service: Does the Route point to a valid Service? Does that Service have Endpoints? (oc get endpoints)
    3. Check the Pod: Are the pods running? Are they passing their Readiness Probes? If a probe fails, the endpoint is removed, and the Route will return a 503 Service Unavailable.

4. Common Failure: MTU Mismatches

If you can ping a service but large data transfers (like file uploads) hang or fail, it is almost always an MTU (Maximum Transmission Unit) mismatch.

  • Scenario: You are running OCP on a platform (like Azure or a specific VPC) that uses encapsulation (VXLAN/GENEVE).
  • The Fix: The cluster network MTU must be smaller than the physical network MTU to account for the “header overhead.” If the physical network is 1500, your OVN-K network should usually be 1400.

5. Network Observability (The 2026 Edge)

In 2026, admins don’t just guess; they use the Network Observability Operator.

  • Question: How do you find out which pod is hogging all the bandwidth?
    • Answer: I use the Network Observability Operator (based on Loki). It provides a flow-collector that visualizes traffic in the OCP Console. I can see a “Top Talkers” graph to identify which pod or namespace is causing network congestion.

The “Pro” Interview Summary

If you want to sound like an expert, use these keywords:

  • East-West Traffic: Communication between pods (secured by NetworkPolicies).
  • North-South Traffic: Communication into or out of the cluster (managed by Routes/EgressIP).
  • Hairpinning: When a pod tries to reach itself via the external Route (can cause loops if not configured correctly).

In an OpenShift (OCP) interview, storage is a “Day 2” topic. By 2026, the discussion has moved from simply “how to attach a disk” to software-defined storage and data resilience.

Administrators are expected to understand the abstraction layers between the physical disk and the application.


1. The Core Abstraction (PV, PVC, and StorageClass)

Interviewers will start with the basics to ensure you know the “Kubernetes way” of handling state.

  • StorageClass (SC): The “template” for storage. It defines the provider (AWS EBS, VMware vSphere, Azure Disk) and parameters like reclaimPolicy (Delete vs. Retain).
  • PersistentVolumeClaim (PVC): The developer’s request. “I need 10GB of RWO storage.”
  • PersistentVolume (PV): The actual slice of storage that gets bound to the PVC.
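A sketch tying the three together (names are illustrative, and the AWS EBS CSI driver is assumed as the provisioner):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd               # hypothetical
provisioner: ebs.csi.aws.com   # example: AWS EBS CSI driver
reclaimPolicy: Retain          # keep the volume after the PVC is deleted
allowVolumeExpansion: true     # permit `oc edit pvc` resizes later
volumeBindingMode: WaitForFirstConsumer  # bind in the zone where the pod lands
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  storageClassName: fast-ssd
  accessModes:
    - ReadWriteOnce            # RWO: single-node mount, typical for databases
  resources:
    requests:
      storage: 10Gi
```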

2. OpenShift Data Foundation (ODF)

This is the “Enterprise” way to do storage in OCP. It is based on Ceph and Rook.

  • Question: Why use ODF instead of just direct cloud-native CSI drivers?
    • Answer: ODF provides a unified layer. It gives you Block (RWO), File (RWX), and Object (S3) storage regardless of where the cluster is running. It also enables advanced features like data replication, snapshots, and disaster recovery (DR) across clusters.
  • Key Component (NooBaa): Mention “Multicloud Object Gateway” (NooBaa). It allows you to store data across different cloud providers (e.g., AWS S3 and Azure Blob) while presenting a single S3 endpoint to the app.

3. Access Modes: RWO vs. RWX

This is a frequent “trap” question in interviews.

  • ReadWriteOnce (RWO): Can be mounted by a single node. Best for databases (PostgreSQL, MongoDB).
  • ReadWriteMany (RWX): Can be mounted by many nodes simultaneously. Essential for shared file systems or web servers serving the same static content.
    • Note: Cloud block storage (EBS/Azure Disk) is almost always RWO. To get RWX, you usually need ODF (CephFS) or a managed service like AWS EFS.

4. Critical Admin Tasks & Commands

An interviewer might ask: “A developer says their database is out of space. Walk me through the fix.”

  1. Check Capability: oc get sc <storage-class-name> -o yaml. Look for allowVolumeExpansion: true.
  2. The Fix: Edit the PVC directly: oc edit pvc <pvc-name>.
  3. The Result: If the CSI driver supports it, the PV will expand automatically, and the file system inside the pod will grow without a restart (usually).
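The resulting PVC change is just a larger request (PVC name and sizes illustrative):

```yaml
# After `oc edit pvc db-data`: raise spec.resources.requests.storage
spec:
  resources:
    requests:
      storage: 20Gi   # was 10Gi; the CSI driver expands the volume online if supported
```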

5. Advanced: LVM Storage vs. Local Storage Operator

For bare metal or Single Node OpenShift (SNO):

  • LVM Storage Operator (LVMS): The modern (2025/2026) choice. It takes local disks and turns them into a Volume Group, allowing dynamic provisioning of small chunks of local storage.
  • Local Storage Operator (LSO): The “old” way. It binds a whole raw disk to a single PV. It’s less flexible than LVMS because it lacks dynamic resizing.

6. Storage Troubleshooting Checklist

  • PVC stuck in “Pending”:
    • Check oc describe pvc.
    • Cause: No PV available that matches the request, or the StorageClass doesn’t support “Wait For First Consumer” (scheduling issues).
  • Volume stuck in “Terminating”:
    • Cause: A pod is still using the volume. You must find the pod (oc get pods -A | grep <pvc-name>) and delete it before the storage can be released.
  • Multi-Zone Issues:
    • Cause: In AWS/Azure, a volume created in Zone A cannot be mounted by a node in Zone B. This is why “topology-aware” scheduling is critical.

Understanding OCP Backup: Two Essential Layers

Here’s a comprehensive breakdown of OCP backup — covering the two distinct layers you need to protect.


The two backup layers in OCP

OCP backup is not a single thing — you need two separate strategies working together:

| Layer | What it protects | Tool |
|---|---|---|
| Control plane (etcd) | Cluster state: all Kubernetes/OCP objects, CRDs, configs, RBAC | `cluster-backup.sh` / `EtcdBackup` CR |
| Application data | Namespaces, workloads, PVs/PVCs, images | OADP (OpenShift API for Data Protection) |

Use automated etcd snapshots to protect and recover the cluster itself; use OADP to protect and recover your applications and their data on top of a healthy cluster. The two layers are complementary, not interchangeable: OADP will not back up or restore operators or etcd.


Layer 1 — etcd backup (control plane)

etcd is the key-value store for OpenShift Container Platform, which persists the state of all resource objects. An etcd backup plays a crucial role in disaster recovery.

What the backup produces

Running cluster-backup.sh on a control plane node generates two files:

  • snapshot_<timestamp>.db — the etcd snapshot (all cluster state)
  • static_kuberesources_<timestamp>.tar.gz — static pod manifests + encryption keys (if etcd encryption is enabled)

How to take a manual backup

```bash
# SSH into any control plane node
ssh core@master-0.example.com

# Run the built-in backup script
sudo /usr/local/bin/cluster-backup.sh /home/core/backup

# Copy the backup off-cluster immediately
scp core@master-0:/home/core/backup/* /safe/offsite/location/
```

Automated scheduled backup (OCP 4.14+)

You can create a CRD to define the schedule and retention type of automated backups:

```yaml
# 1. Create a PVC for backup storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etcd-backup-pvc
  namespace: openshift-etcd
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
---
# 2. Schedule recurring backups
apiVersion: config.openshift.io/v1alpha1
kind: Backup
metadata:
  name: etcd-recurring-backup
spec:
  etcd:
    schedule: "20 4 * * *"   # Daily at 04:20 UTC
    timeZone: "UTC"
    pvcName: etcd-backup-pvc
    retentionPolicy:
      retentionType: RetentionNumber
      retentionNumber:
        maxNumberOfBackups: 15
```

Key rules for etcd backups

Do not take an etcd backup before the first certificate rotation completes (roughly 24 hours after installation); otherwise the backup will contain expired certificates. Also take etcd backups during non-peak hours, as the snapshot is a blocking action.

  • Backups only need to be taken from one master — there is no need to run on every master. Store backups in either an offsite location or somewhere off the server.
  • Be sure to take an etcd backup after you upgrade your cluster. When you restore your cluster, you must use an etcd backup that was taken from the same z-stream release — for example, an OCP 4.14.2 cluster must use a backup taken from 4.14.2.

Restore procedure (high level)

```bash
# On the designated recovery control plane node:
sudo -E /usr/local/bin/cluster-restore.sh /home/core/backup

# After restore completes, force etcd redeployment:
oc edit etcd cluster
# Add under spec:
#   unsupportedConfigOverrides: null
#   forceRedeploymentReason: recovery-2025-04-17

# Monitor etcd pods coming back up
oc get pods -n openshift-etcd | grep -v quorum
```

Layer 2 — OADP (application backup)

OADP uses Velero to perform both backup and restore tasks for either resources and/or internal images, while also being capable of working with persistent volumes via Restic or with snapshots.

Install OADP via OperatorHub

Operators → OperatorHub → search "OADP" → Install

Configure a backup location (S3 example)

```yaml
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa-cluster
  namespace: openshift-adp
spec:
  configuration:
    velero:
      defaultPlugins:
        - openshift            # Required for OCP-specific resources
        - aws
    nodeAgent:
      enable: true
      uploaderType: kopia      # Preferred over restic in OADP 1.3+
  backupLocations:
    - name: default
      velero:
        provider: aws
        default: true
        objectStorage:
          bucket: my-ocp-backups
          prefix: cluster-1
        credential:
          name: cloud-credentials
          key: cloud
  snapshotLocations:
    - name: default
      velero:
        provider: aws
        config:
          region: ca-central-1
```

Taking an application backup

```yaml
# Backup a specific namespace
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: my-app-backup
  namespace: openshift-adp
spec:
  includedNamespaces:
    - my-app
    - my-app-db
  defaultVolumesToFsBackup: true   # Use kopia/restic for PVs
  storageLocation: default
  ttl: 720h0m0s                    # 30-day retention
---
# Scheduled backup (daily at 2am)
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-app-backup
  namespace: openshift-adp
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - "*"                        # All namespaces
    excludedNamespaces:
      - openshift-*                # Exclude platform namespaces
      - kube-*
    defaultVolumesToFsBackup: true
    storageLocation: default
    ttl: 168h0m0s                  # 7-day retention
```

Restoring from OADP

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: my-app-restore
  namespace: openshift-adp
spec:
  backupName: my-app-backup
  includedNamespaces:
    - my-app
  restorePVs: true
```

PV backup methods

| Method | How it works | Best for |
|---|---|---|
| CSI Snapshots | Point-in-time volume snapshot via storage driver | Cloud PVs (AWS EBS, Azure Disk, Ceph RBD) |
| Kopia/Restic (fs backup) | File-level copy streamed to object storage | Any PV; slower but universal |

Supported backup storage targets

OADP supports AWS, MS Azure, GCP, Multicloud Object Gateway, and S3-compatible object storage (MinIO, NooBaa, etc.). Snapshot backups can be performed for AWS, Azure, GCP, and CSI snapshot-enabled cloud storage such as Ceph FS and Ceph RBD.


Best practices summary

| Practice | Detail |
|---|---|
| 3-2-1 rule | 3 copies, 2 media types, 1 offsite; etcd snapshots must be stored outside the cluster |
| Test restores | Regularly restore to a test cluster; an untested backup is not a backup |
| Version lock | etcd restores must use a backup from the same OCP z-stream version |
| Frequency | etcd: at minimum daily and before every upgrade; OADP: daily or per RPO requirement |
| Exclude platform namespaces | Don't include `openshift-*` in OADP; OADP doesn't restore operators or etcd |
| Encryption | Encrypt backup storage at rest; the etcd snapshot includes encryption keys if etcd encryption is on |
| Monitor backup jobs | Set up alerts on failed `Schedule` or `EtcdBackup` CRs |

Upgrade OCP cluster

Upgrading OpenShift is the ultimate “Day 2” test for an administrator. Because OCP 4.x is Operator-managed, the upgrade is not just a software update; it is a coordinated orchestration across the entire stack—from the Operating System (RHCOS) to the Control Plane and your worker nodes.

Here are the critical “interview-ready” concepts you need to know for OCP upgrades.


1. The Upgrade Flow (The Order Matters)

When you trigger an upgrade via the Web Console or oc adm upgrade, the cluster follows a strict sequence to ensure stability:

  1. Cluster Version Operator (CVO): First, the CVO updates itself. It is the “brain” that knows what the new version of every other operator should be.
  2. Control Plane Operators: The operators for the API server, Controller Manager, and Scheduler are updated.
  3. Etcd: The database is updated (usually one node at a time to maintain quorum).
  4. Control Plane Nodes: The Machine Config Operator (MCO) drains, updates the OS (RHCOS), and reboots the control plane nodes one by one.
  5. Worker Nodes: Finally, the MCO begins rolling updates through your worker node pools.

2. Update Channels

You must choose a “channel” that dictates how fast you receive updates:

  • Stable: Validated updates that have been out for a while.
  • Fast: Updates that are technically ready but might still be gaining “field experience.”
  • Candidate: Early access for testing.
  • EUS (Extended Update Support): Specific even-numbered versions (e.g., 4.14, 4.16, 4.18) that allow you to skip a minor version during upgrades (e.g., 4.14 → 4.16) to reduce the number of reboots.
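The channel is just a field on the singleton ClusterVersion resource, so switching channels is a one-line change. A minimal sketch (the channel name is an example; substitute your target minor version):

```yaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  name: version            # there is exactly one ClusterVersion, named "version"
spec:
  channel: eus-4.16        # examples: stable-4.16, fast-4.16, candidate-4.16, eus-4.16
```

Equivalently from the CLI: `oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.16"}}'`.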

3. The “Canary” Strategy (Custom MCPs)

In a large production cluster, you don’t want all 100 worker nodes to start rebooting at once.

  • MachineConfigPool (MCP) Pausing: You can “pause” a pool of nodes. This allows the Control Plane to upgrade, but keeps the Workers on the old version until you are ready.
  • Canary Testing: You can create a small “canary” MCP with only 2–3 nodes. Unpause this pool first, verify your apps work on the new version, and then unpause the rest of the cluster.

4. Critical Troubleshooting Questions

An interviewer will likely give you these scenarios:

  • “The upgrade is stuck at 57%.” What do you do?
    • Check ClusterOperators: Run oc get co. Look for any operator where AVAILABLE=False or PROGRESSING=True.
    • Check Node Status: Run oc get nodes. If a node is SchedulingDisabled, the MCO might be struggling to drain a pod (e.g., a pod without a PDB or a local volume).
  • “Can you roll back an OpenShift upgrade?”
    • NO. This is a trick question. OpenShift does not support rollbacks. Because the etcd database schema changes during upgrades, you can only “roll forward” by fixing the issue or, in a total disaster, by restoring the cluster from an etcd backup taken before the upgrade.

5. Best Practices for Admins

  • Check the Update Graph: Always use the Red Hat OpenShift Update Graph tool to ensure there is a supported path between your current version and your target.
  • Review Alerts: Clear all critical alerts before starting. If the cluster isn’t healthy before the upgrade, it definitely won’t be healthy after.
  • Pod Disruption Budgets (PDB): Ensure developers have set up PDBs so the upgrade doesn’t accidentally take down all replicas of a critical service at once.
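A PDB that protects a service during node drains can be sketched like this (names and counts are hypothetical):

```yaml
# Keep at least 2 replicas of "my-critical-service" running while the
# MCO drains and reboots nodes during the upgrade
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-critical-service-pdb
  namespace: my-app
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-critical-service
```

One caution: a PDB whose `minAvailable` equals the replica count can never be satisfied during a drain, which is exactly the "upgrade stuck at 57%" scenario described above.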

The Canary Update strategy allows you to test an OpenShift upgrade on a small subset of nodes before rolling it out to the entire cluster. This is the gold standard for high-availability environments.

Here is the exact administrative workflow and commands you would use.


Step 1: Create a “Canary” MachineConfigPool (MCP)

First, you need a pool that targets only the nodes you want to test.

  1. Label your canary nodes (the label must match the MCP's nodeSelector below):

oc label node <node-name> node-role.kubernetes.io/worker-canary=""

  2. Create the MCP. Save this as canary-mcp.yaml and run oc create -f canary-mcp.yaml:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-canary
spec:
  machineConfigSelector:
    matchExpressions:
    - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker, worker-canary]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-canary: ""

Step 2: Pause the Remaining Worker Pools

Before triggering the cluster upgrade, you must “pause” the main worker pool. This tells the Machine Config Operator (MCO): “Update the Control Plane, but do NOT touch these worker nodes yet.”

# Pause the standard worker pool
oc patch mcp/worker --type='merge' -p '{"spec":{"paused":true}}'


Step 3: Trigger the Upgrade

Now, start the cluster upgrade as usual (via Console or CLI).

oc adm upgrade --to=4.16.x

What happens now?

  • The Control Plane upgrades and reboots.
  • The Worker-Canary pool (which is NOT paused) updates and reboots.
  • The Worker pool (which IS paused) stays on the old version.

Step 4: Verify and Complete the Rollout

Once the Canary nodes are successfully updated and your applications are verified, you can roll out the update to the rest of the cluster by unpausing the main pool.

  1. Check status: run oc get mcp. You should see worker-canary is UPDATED, but worker shows UPDATED=False.
  2. Unpause the main pool:

oc patch mcp/worker --type='merge' -p '{"spec":{"paused":false}}'

Critical Interview Warning: The “Pause” Alert

If an interviewer asks: “Is it safe to leave an MCP paused indefinitely?”

  • Answer: No. Starting in OCP 4.11+, a critical alert will fire if a pool is paused for more than 1 hour during an update.
  • Reason: Pausing an MCP prevents Certificate Rotation. If you leave it paused too long (usually >24 hours during an upgrade cycle), the nodes’ Kubelet certificates may expire, and the nodes will go NotReady, potentially breaking the cluster.

In OpenShift, Operators are the software managers that keep your cluster healthy. When an operator fails, it shows up as Degraded. As an admin, your job is to find the “who, why, and how” of the failure.

Here is the professional troubleshooting sequence for an OCP Operator failure.

1. Identify the Failing Operator

The first step is always to find which operator is complaining.

# Get the status of all cluster operators
oc get clusteroperators   # short form: oc get co

What to look for: Look for DEGRADED=True or AVAILABLE=False. Common ones that fail are authentication, console, image-registry, and machine-config.


2. The Investigation Sequence

Once you identify the degraded operator (e.g., authentication), follow this 4-step drill:

A. Describe the ClusterOperator

This gives you the “high-level” reason for the failure (often a specific error message from the operator itself).

oc describe clusteroperator authentication

B. Check the Operator’s Namespace

Every operator has its own namespace (usually starting with openshift-).

# Find the namespace and pods
oc get pods -A | grep authentication

C. Inspect the Pod Logs

The operator is just a pod. If it’s failing, it will tell you why in its logs.

oc logs -n openshift-authentication-operator deployment/authentication-operator

D. Check Events

Sometimes the problem isn’t the code, but the infrastructure (e.g., “Failed to pull image” or “Insufficient CPU”).

oc get events -n openshift-authentication-operator --sort-by='.lastTimestamp'


3. Common “Admin-Level” Failure Scenarios

In an interview, you can shine by mentioning these specific, real-world failures:

| Failing Operator | Typical Reason | The Fix |
|---|---|---|
| machine-config | Node can't drain because of a Pod Disruption Budget (PDB). | Manually move the pod or adjust the PDB temporarily. |
| authentication | Etcd is slow or the internal OAuth secret is out of sync. | Check etcd health; sometimes deleting the operator pod to force a restart helps. |
| image-registry | The backend storage (S3, Azure Blob, NFS) is full or disconnected. | Check the configs.imageregistry.operator.openshift.io resource and storage backend. |
| ingress | Port 80/443 is blocked on the LoadBalancer, or the Router deployment is failing to scale. | Check the IngressController custom resource and cloud provider LB status. |

4. The “Nuclear” Option: Must-Gather

If the API is behaving so poorly that you can’t even run these commands, or if you need to open a Red Hat Support ticket, use Must-Gather.

oc adm must-gather

Must-Gather is an admin’s best friend. It creates a local directory with every log, secret (redacted), and config file from the cluster. You can then use grep or ag locally to find the needle in the haystack.


5. Node-Level Debugging (When the API is down)

If the operator is failing because the node itself is unresponsive, you must go under the hood:

# Access the node via a debug pod (preferred)
oc debug node/<node-name>

# Once inside the debug pod, switch to host binaries
chroot /host

# Check the container runtime (CRI-O)
crictl ps
crictl logs <container_id>

OpenShift (OCP) interview

For an OpenShift (OCP) interview in 2026, you should expect questions that move beyond basic Kubernetes concepts and focus on enterprise operations, automation (Operators), and security.

Here is a curated list of high-value interview questions categorized by role and complexity.


1. Architectural Concepts

  • What is the role of the Cluster Version Operator (CVO)?
    • Answer: The CVO is the heart of OCP 4.x upgrades. It monitors the “desired state” of the cluster’s operators (the “payload”) and ensures the cluster is updated in a safe, coordinated manner across all components.
  • Explain the difference between an Infrastructure Node and a Worker Node.
    • Answer: Infrastructure nodes are used to host “cluster-level” services like the Router (Ingress), Monitoring (Prometheus/Grafana), and Registry. By labeling nodes as infra, companies can often save on Red Hat subscription costs, as these nodes typically don’t require the same licensing as nodes running application workloads.
  • What is the “Etcd Quorum” and why is it important in OCP?
    • Answer: OpenShift typically requires an odd number of Control Plane nodes (usually 3) to maintain a quorum in the etcd database. If you lose more than half of your masters, the cluster becomes read-only to prevent data corruption.

2. Networking & Traffic (The Gateway API Era)

  • Explain Ingress vs. Route vs. Gateway API. (See previous discussion)
    • Key Focus: Interviewers want to know if you understand that Routes are OCP-native, Ingress is K8s-standard, and Gateway API is the future standard for advanced traffic management (canary, mirroring, etc.).
  • How does “Service Serving Certificate Secrets” work in OCP?
    • Answer: OCP can automatically generate a TLS certificate for a Service. You annotate a Service with service.beta.openshift.io/serving-cert-secret-name. OCP then creates a secret containing a cert/key signed by the internal Cluster CA, allowing for easy end-to-end encryption.
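A minimal sketch of that annotation in practice (the Service name, namespace, and port are illustrative; the annotation key is the documented one):

```yaml
# The service-ca operator sees this annotation and creates a Secret named
# "myapp-serving-cert" containing a cert/key signed by the internal cluster CA
apiVersion: v1
kind: Service
metadata:
  name: myapp                      # hypothetical service
  namespace: my-app
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: myapp-serving-cert
spec:
  selector:
    app: myapp
  ports:
  - port: 8443
    targetPort: 8443
```

The pod then mounts `myapp-serving-cert` and serves TLS on 8443; clients inside the cluster trust it via the injected service CA bundle.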

3. Security (The “Hardest” Category)

  • Scenario: A developer says their pod won’t start because of a “Security Context” error. What do you check?
    • Answer: I would check the Security Context Constraints (SCC). By default, OCP runs pods with the restricted-v2 SCC, which prevents running as root. If the pod requires root or host access, I’d check if the ServiceAccount has been granted a more permissive SCC like anyuid or privileged.
  • What are NetworkPolicies vs. EgressFirewalls?
    • Answer: NetworkPolicies control traffic between pods inside the cluster (East-West). EgressFirewalls (part of OCP’s OVN-Kubernetes) control traffic leaving the cluster to external IPs or CIDR blocks (North-South).
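A hedged EgressFirewall sketch for the North-South case (the namespace and CIDR are placeholders; rules are evaluated in order, and OVN-Kubernetes requires the resource to be named "default"):

```yaml
# Allow egress to one internal subnet, deny all other traffic
# leaving this namespace
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default                    # must be "default"; one per namespace
  namespace: my-app
spec:
  egress:
  - type: Allow
    to:
      cidrSelector: 10.0.50.0/24   # placeholder: internal DB subnet
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
```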

4. Troubleshooting & Operations

  • How do you recover a cluster if the Control Plane certificates have expired?
    • Answer: This usually involves using the oc adm certificate approve command to approve pending CSRs (Certificate Signing Requests) or manually rolling back the cluster clock if it’s an emergency. OCP 4.x generally tries to auto-renew these, but clock drift can break it.
  • Describe the Source-to-Image (S2I) workflow.
    • Answer: S2I takes source code from Git, injects it into a “builder image” (like Node.js or Java), and outputs a ready-to-run container image. It simplifies the CI/CD process for developers who don’t want to write Dockerfiles.

5. Advanced / 2026 Trends

  • What is OpenShift Virtualization (KubeVirt)?
    • Answer: It allows you to run legacy Virtual Machines (VMs) as pods on OpenShift. This is critical for “modernizing” apps where one part is a container and the other is a legacy Windows or Linux VM that can’t be containerized yet.
  • How does Red Hat Advanced Cluster Management (RHACM) help in a multi-cluster setup?
    • Answer: RHACM provides a single pane of glass to manage security policies, application placement, and cluster lifecycle (creation/deletion) across multiple OCP clusters on AWS, Azure, and on-prem.

Quick Tip for the Interview

Whenever you answer, use the phrase “Operator-led design.” OpenShift 4 is built entirely on Operators. If the interviewer asks, “How do I fix the registry?” the best answer starts with, “I would check the status of the Image Registry Operator using oc get clusteroperator.” This shows you understand the fundamental architecture of the platform.

As an OpenShift Administrator, your interview will focus heavily on cluster stability, lifecycle management (upgrades), security enforcement, and the “Day 2” operations that keep an enterprise cluster running.

Here are the top admin-focused interview questions for 2026, divided by functional area.


1. Cluster Lifecycle & Maintenance

  • How does the Cluster Version Operator (CVO) manage upgrades, and what do you check if an upgrade hangs at 57%?
    • Answer: The CVO coordinates with all other cluster operators to reach a specific “desired version.” If it hangs, I check oc get clusteroperators to see which specific operator is degraded. Usually, it’s the Machine Config Operator (MCO) waiting for nodes to drain or the Authentication Operator having issues with etcd.
  • What is the “Must-Gather” tool, and when would you use it?
    • Answer: oc adm must-gather is the primary diagnostic tool. It launches a pod that collects logs, CRD states, and operating system debugging info. I use it before opening a Red Hat support ticket or when a complex issue involves multiple operators.
  • Explain how to back up and restore the etcd database.
    • Answer: I use the cluster-backup.sh script provided on the control plane nodes (/usr/local/bin/cluster-backup.sh). For restoration, I must stop the static pods for the API server and etcd, then run the cluster-restore.sh script with that backup to restore the data directory. It’s critical to do this on a single control plane node first to re-establish a quorum.

2. Node & Infrastructure Management

  • What is a MachineConfigPool (MCP), and why would you pause it?
    • Answer: An MCP groups nodes (like master or worker) so the MCO can apply configurations to them. I would pause an MCP during a sensitive maintenance window or when troubleshooting a configuration change that I don’t want to roll out to all nodes at once.
  • How do you add a custom SSH key or a CronJob to the underlying RHCOS nodes?
    • Answer: You don’t log into the nodes manually. You create a MachineConfig YAML. The MCO then detects this, reboots the nodes (if necessary), and applies the change to the immutable filesystem.
  • What happens if a node enters a NotReady state?
    • Answer: First, I check node pressure (CPU/Memory/Disk). Then I check the kubelet and crio services on the node using oc debug node/<node-name>. I also check for network reachability between the node and the Control Plane.
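The MachineConfig pattern from the SSH-key answer can be sketched as follows (the key value is a placeholder; the MCO detects the new object and rolls it out to the worker pool node by node):

```yaml
# Add an SSH public key for the "core" user on all worker nodes
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-ssh
  labels:
    machineconfiguration.openshift.io/role: worker   # targets the worker MCP
spec:
  config:
    ignition:
      version: 3.2.0
    passwd:
      users:
      - name: core
        sshAuthorizedKeys:
        - ssh-ed25519 AAAAC3Nza...placeholder admin@example.com
```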

3. Networking & Security

  • What is the benefit of OVN-Kubernetes over the legacy OpenShift SDN?
    • Answer: OVN-K is the default in 4.x. It supports modern features like IPsec encryption for pod-to-pod traffic, smarter load balancing, and Egress IPs for specific projects to exit the cluster via a fixed IP address for firewall white-listing.
  • A user is complaining they can’t reach a service in another project. What do you check?
    • Answer:
      1. NetworkPolicies: Is there a policy blocking “Cross-Namespace” traffic?
      2. Service/Endpoints: Does the Service have active Endpoints (oc get endpoints)?
      3. Namespace labels: If using a high-isolation network plugin, do the namespaces have the correct labels to “talk” to each other?
  • How do you restrict a specific group of users from creating LoadBalancer type services?
    • Answer: I would use an Admission Controller or a specialized RBAC role that removes the update/create verbs for the services/status resource, or more commonly, use a Policy Engine like Gatekeeper/OPA to deny the request.
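For the cross-namespace case above, the usual fix is a NetworkPolicy that explicitly admits the peer namespace. A sketch with illustrative labels and ports:

```yaml
# Allow pods in namespaces labeled team=frontend to reach any pod
# in the "backend" namespace on TCP 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
  namespace: backend
spec:
  podSelector: {}                  # applies to all pods in "backend"
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          team: frontend
    ports:
    - protocol: TCP
      port: 8080
```

If the frontend namespace lacks the `team: frontend` label, traffic is still dropped, which is exactly the "namespace labels" check in the answer above.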

4. Storage & Capacity Planning

  • How do you handle “Volume Expansion” if a database runs out of space?
    • Answer: If the underlying StorageClass supports allowVolumeExpansion: true, I simply edit the PersistentVolumeClaim (PVC) and increase the storage value. OpenShift and the CSI driver handle the resizing of the file system on the fly.
  • What is the difference between ReadWriteOnce (RWO) and ReadWriteMany (RWX)?
    • Answer: RWO allows only one node to mount the volume (good for databases). RWX allows multiple nodes/pods to mount it simultaneously (required for shared file storage like NFS or ODF).
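The volume-expansion answer can be sketched in two pieces (the class name and driver are examples):

```yaml
# Expansion only works if the StorageClass permits it
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-expandable            # hypothetical class
provisioner: ebs.csi.aws.com       # example CSI driver
allowVolumeExpansion: true
---
# Then growing the volume is just editing the PVC's storage request
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: fast-expandable
  resources:
    requests:
      storage: 100Gi               # was 50Gi; bump and re-apply
```

The CSI driver resizes the backing volume and the filesystem; shrinking is not supported.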

5. Scenario-Based: “The Midnight Call”

  • Scenario: The Web Console is down, and oc commands are timing out. Where do you start?
    • Answer: This sounds like an API Server or etcd failure. I would:
      1. Log into a Control Plane node directly via SSH.
      2. Check the status of static pods in /etc/kubernetes/manifests.
      3. Run crictl ps to see if the kube-apiserver or etcd containers are crashing.
      4. Check the node’s disk space (etcd often fails if the disk is 100% full).

Pro-Tip for Admin Interviews:

In 2026, emphasize GitOps. Mention that you prefer managing cluster configurations (like HTPasswd providers or Quota objects) via ArgoCD rather than manual oc apply commands. This shows you are an admin who values Idempotency and Disaster Recovery.

cert-manager in Kubernetes

cert-manager in Kubernetes is an add-on that automates TLS certificate management.

It helps you:

  • request certificates
  • renew them before they expire
  • store them as Kubernetes Secrets
  • connect them to Ingresses, gateways, and apps

Common uses:

  • Let’s Encrypt certificates for public apps
  • internal PKI certificates for cluster services
  • webhook, mTLS, and service cert automation

Core pieces:

  • Issuer / ClusterIssuer: tells cert-manager where certs come from
    • Issuer = namespaced
    • ClusterIssuer = cluster-wide
  • Certificate: declares the cert you want
  • Secret: where the cert and private key get stored
  • Challenge / Order: used mainly with ACME providers like Let’s Encrypt

Typical flow:

  1. You install cert-manager.
  2. Create an Issuer or ClusterIssuer.
  3. Create a Certificate resource or annotate an Ingress.
  4. cert-manager gets the cert and saves it into a Secret.
  5. Your Ingress/controller uses that Secret for HTTPS.

Very simple example with Let’s Encrypt:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    email: you@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx

Then a certificate:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myapp-cert
  namespace: default
spec:
  secretName: myapp-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - example.com

Why people use it:

  • avoids manual certificate renewal
  • reduces downtime from expired certs
  • works natively with Kubernetes resources
  • supports public and private certificate authorities

Alright, let’s walk through cert-manager in Kubernetes step by step, focusing on three areas: installation, Ingress integration, and the Issuer / ClusterIssuer / Certificate model.


1. How to install cert-manager

The easiest and most reliable way is using Helm.

Add the Helm repo

helm repo add jetstack https://charts.jetstack.io
helm repo update

Install cert-manager (with CRDs)

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true

Verify installation

kubectl get pods -n cert-manager

You should see pods like:

  • cert-manager
  • cert-manager-webhook
  • cert-manager-cainjector

2. How cert-manager works with Ingress

This is the most common real-world setup.

Flow (important concept)

  1. You create an Ingress
  2. Add a special annotation → tells cert-manager to issue a cert
  3. cert-manager:
    • requests cert from Let’s Encrypt
    • completes challenge (HTTP-01 or DNS-01)
    • stores cert in a Secret
  4. Ingress uses that Secret → HTTPS works automatically

Example Ingress with cert-manager

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - example.com
    secretName: myapp-tls
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-service
            port:
              number: 80

Key things:

  • annotation → tells cert-manager which issuer to use
  • tls.secretName → where the cert will be stored
  • cert-manager fills that Secret automatically

3. Issuer vs ClusterIssuer vs Certificate

This is where people usually get confused — here’s the clean breakdown:


Issuer

  • Namespace-scoped
  • Only works inside one namespace
kind: Issuer
metadata:
  name: my-issuer
  namespace: default

Use when:

  • small apps
  • strict namespace isolation

ClusterIssuer

  • Cluster-wide
  • Can be used from any namespace
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod

Use when:

  • shared infrastructure
  • most production setups ✅

Certificate

  • Declares what cert you want
  • cert-manager turns this into a real cert
kind: Certificate
metadata:
  name: my-cert
spec:
  secretName: my-tls
  dnsNames:
  - example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

How they connect together

Think of it like this:

  • Issuer / ClusterIssuer → “Who gives me the cert?”
  • Certificate → “I want a cert for this domain”
  • Secret → “Here’s your actual cert + key”
  • Ingress → “Use this cert for HTTPS”

Typical production setup

Most teams use:

  • ClusterIssuer (Let’s Encrypt)
  • ✅ Ingress annotation (auto certs)
  • ❌ No manual Certificate objects (optional)

Bonus: HTTP-01 vs DNS-01

HTTP-01 (most common)

  • cert-manager creates a temporary endpoint
  • Let’s Encrypt verifies via HTTP
  • works with Ingress

DNS-01

  • adds a DNS record to prove ownership
  • needed for:
    • wildcard certs (*.example.com)
    • internal services
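A hedged sketch of a DNS-01 issuer, using Route53 as an example provider (the zone is a placeholder, and credentials are assumed to come from ambient AWS identity such as IRSA; other providers like Cloudflare or Azure DNS plug in the same way):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    email: you@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-dns
    solvers:
    - dns01:
        route53:
          region: us-east-1
      selector:
        dnsZones:
        - example.com          # this solver handles certs for *.example.com too
```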

This is where cert-manager becomes really powerful.

At a high level:

👉 cert-manager = certificate lifecycle automation
👉 Service mesh (Istio / Linkerd) = uses certificates for mTLS between services

So cert-manager can act as the certificate authority (or CA manager) for your mesh.


Big picture: how they fit together

cert-manager → issues certificates → service mesh → uses them for mTLS → secure pod-to-pod communication

What mTLS in a service mesh actually means

In both Istio and Linkerd:

  • Every pod gets a certificate + private key
  • Pods authenticate each other using certs
  • Traffic is:
    • encrypted ✅
    • authenticated ✅
    • tamper-proof ✅

Option 1: Built-in CA (default behavior)

Istio / Linkerd by default:

  • run their own internal CA
  • automatically issue certs to pods
  • rotate certs

👉 This works out-of-the-box and is easiest.


Option 2: Using cert-manager as the CA

This is where integration happens.

Instead of mesh managing certs itself:

👉 cert-manager becomes the source of truth for certificates


Architecture with cert-manager

cert-manager (Issuer / ClusterIssuer)
        ↓
Mesh control plane (Istio / Linkerd)
        ↓
Sidecars / proxies in pods

Istio + cert-manager

Default Istio:

  • uses istiod as CA

With cert-manager:

  • you replace Istio’s CA with:
    • cert-manager + external CA (Vault, Let’s Encrypt, internal PKI)

Common approach: Istio + cert-manager + external CA

cert-manager:

  • manages root/intermediate certs

Istio:

  • requests workload certs from that CA

Why do this?

  • centralized certificate management
  • enterprise PKI integration (e.g. HashiCorp Vault)
  • compliance requirements

Linkerd + cert-manager

Linkerd has cleaner native integration.

Linkerd actually recommends using cert-manager.


How it works:

  • cert-manager issues:
    • trust anchor (root cert)
    • issuer cert
  • Linkerd uses those to:
    • issue certs to proxies
    • rotate automatically

Example flow:

  1. Create a ClusterIssuer (e.g. self-signed or Vault)
  2. cert-manager generates:
    • root cert
    • intermediate cert
  3. Linkerd control plane uses them
  4. Sidecars get short-lived certs
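That flow can be sketched with two resources, following the shape of Linkerd's documented cert-manager setup (the trust-anchor secret is assumed to be pre-created from your root CA; durations and algorithm are illustrative):

```yaml
# An Issuer backed by the root (trust anchor) secret
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: linkerd-trust-anchor
  namespace: linkerd
spec:
  ca:
    secretName: linkerd-trust-anchor   # pre-created root CA cert/key
---
# A Certificate cert-manager keeps rotated for the Linkerd identity service
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
spec:
  secretName: linkerd-identity-issuer
  duration: 48h
  renewBefore: 25h
  issuerRef:
    name: linkerd-trust-anchor
    kind: Issuer
  commonName: identity.linkerd.cluster.local
  isCA: true
  privateKey:
    algorithm: ECDSA
  usages:
  - cert sign
  - crl sign
  - server auth
  - client auth
```

cert-manager then rotates the intermediate automatically, while Linkerd's identity service issues the short-lived per-proxy certs from it.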

Certificate lifecycle in mesh (with cert-manager)

  1. cert-manager creates CA certs
  2. mesh control plane uses them
  3. sidecars request short-lived certs
  4. certs rotate automatically

When to use cert-manager with a mesh

✅ Use cert-manager if:

  • you need custom CA / PKI
  • you want centralized certificate control
  • you’re integrating with:
    • Vault
    • enterprise PKI
  • compliance/security requirements

❌ Skip it if:

  • you just want simple mTLS
  • default mesh CA is enough

Important distinction

👉 cert-manager does NOT handle:

  • traffic encryption itself
  • service-to-service routing

👉 service mesh does NOT handle:

  • external certificate issuance (well)
  • complex PKI integrations (alone)

Simple mental model

  • cert-manager = certificate factory
  • Istio / Linkerd = security + traffic engine

Interview-style summary

If you need a sharp answer:

“cert-manager integrates with service meshes by acting as an external certificate authority. While Istio and Linkerd can issue certificates internally, cert-manager enables centralized PKI management, supports external CAs like Vault, and provides automated rotation, making it useful for production-grade mTLS setups.”


Here’s a real-world debugging checklist for cert-manager + service mesh / mTLS, organized in the order that usually finds the issue fastest.

1. Start with the symptom, not the YAML

First sort the failure into one of these buckets:

  • Certificate issuance problem: Secrets are missing, Certificate is not Ready, ACME challenges fail, or issuer/webhook errors appear. cert-manager’s troubleshooting flow centers on the Certificate, CertificateRequest, Order, and Challenge resources. (cert-manager)
  • Mesh identity / mTLS problem: certificates exist, but workloads still fail handshakes, sidecars can’t get identities, or mesh health checks fail. Istio and Linkerd both separate certificate management from runtime identity distribution. (Istio)

That split matters because cert-manager can be healthy while the mesh is broken, and vice versa. (cert-manager)

2. Confirm the control planes are healthy

Check the obvious first:

kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n linkerd

For cert-manager, the important core components are the controller, webhook, and cainjector; webhook issues are a documented source of certificate failures. (cert-manager)

For Linkerd, run:

linkerd check

Linkerd’s official troubleshooting starts with linkerd check, and many identity and certificate problems show up there directly. (Linkerd)

For Istio, check control-plane health and then inspect config relevant to CA integration if you are using istio-csr or another external CA path. Istio’s cert-manager integration for workload certificates requires specific CA-server changes. (cert-manager)

3. Check the certificate objects before the Secrets

If cert-manager is involved, do this before anything else:

kubectl get certificate -A
kubectl describe certificate <name> -n <ns>
kubectl get certificaterequest -A
kubectl describe certificaterequest <name> -n <ns>

cert-manager’s own troubleshooting guidance points to these resources first because they expose the reason issuance or renewal failed. (cert-manager)

What you’re looking for:

  • Ready=False
  • issuer not found
  • permission denied
  • webhook validation errors
  • failed renewals
  • pending requests that never progress

If you’re using ACME, continue with:

kubectl get order,challenge -A
kubectl describe order <name> -n <ns>
kubectl describe challenge <name> -n <ns>

ACME failures are usually visible at the Order / Challenge level. (cert-manager)

4. Verify the issuer chain and secret contents

Typical failure pattern: the Secret exists, but it is the wrong Secret, wrong namespace, missing keys, or signed by the wrong CA.

Check:

kubectl get issuer,clusterissuer -A
kubectl describe issuer <name> -n <ns>
kubectl describe clusterissuer <name>
kubectl get secret <secret-name> -n <ns> -o yaml

For mesh-related certs, validate:

  • the Secret name matches what the mesh expects
  • the Secret is in the namespace the mesh component actually reads
  • the chain is correct
  • the certificate has not expired
  • the issuer/trust anchor relationship is the intended one

In Linkerd specifically, the trust anchor and issuer certificate are distinct, and Linkerd documents that workload certs rotate automatically but the control-plane issuer/trust-anchor credentials do not unless you set up rotation. (Linkerd)
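The expiry and chain checks can be done locally with openssl. A self-contained sketch: it generates a throwaway self-signed cert as a stand-in for one pulled out of a Secret, then inspects it exactly as you would a real tls.crt (file names and the CN are arbitrary):

```shell
# Stand-in for extracting the real cert, e.g.:
#   kubectl get secret <secret> -n <ns> -o jsonpath='{.data.tls\.crt}' | base64 -d > tls.crt
openssl req -x509 -newkey rsa:2048 -nodes -keyout tls.key -out tls.crt \
  -days 30 -subj "/CN=identity.example.cluster.local" 2>/dev/null

# Who signed it, and when does it expire?
openssl x509 -in tls.crt -noout -subject -issuer -enddate

# Exit 0 only if it is still valid 7 days from now (604800 seconds)
openssl x509 -in tls.crt -noout -checkend 604800 && echo "ok: >7 days left"
```

The same `-checkend` test works well in a cron job or CI check against certs extracted from live Secrets.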

5. Check expiration and rotation next

A lot of “random” mesh outages are just expired identity material.

For Linkerd, verify:

  • trust anchor validity
  • issuer certificate validity
  • whether rotation was automated or done manually

Linkerd’s docs are explicit that proxy workload certs rotate automatically, but issuer and trust anchor rotation require separate handling; expired root or issuer certs are a known failure mode. (Linkerd)

For Istio, if using a custom CA or Kubernetes CSR integration, verify the configured CA path and signing certs are still valid and match the active mesh configuration. (cert-manager)

6. If this is Istio, verify whether the mesh is using its built-in CA or an external one

This is a very common confusion point.

If you use cert-manager with Istio workloads, you are typically not just “adding cert-manager”; you are replacing or redirecting the CA flow, often through istio-csr or Kubernetes CSR integration. cert-manager’s Istio integration docs call out changes like disabling the built-in CA server and setting the CA address. (cert-manager)

So check:

  • Is istiod acting as CA, or is an external CA path configured?
  • Is caAddress pointing to the expected service?
  • If istio-csr is used, is it healthy and reachable?
  • Are workload cert requests actually reaching the intended signer?

If that split-brain exists, pods may get no certs or certs from the wrong signer. That is an inference from how Istio’s custom CA flow is wired. (cert-manager)

7. If this is Linkerd, run the identity checks early

For Linkerd, do not guess. Run:

linkerd check
linkerd check --proxy

The Linkerd troubleshooting docs center on linkerd check, and certificate / identity issues often surface there more quickly than raw Kubernetes inspection. (Linkerd)

Then look for:

  • identity component failures
  • issuer/trust-anchor mismatch
  • certificate expiration warnings
  • injected proxies missing identity

If linkerd check mentions expired identity material, go straight to issuer/trust-anchor rotation docs. (Linkerd)

8. Verify sidecar or proxy injection happened

If the pod is not meshed, mTLS debugging is a distraction.

Check:

kubectl get pod <pod> -n <ns> -o yaml

Look for the expected sidecar/proxy containers and mesh annotations. If they are absent, the issue is injection or policy, not certificate issuance. Istio and Linkerd both rely on the dataplane proxy to actually use workload identities for mTLS. (Istio)

9. Check policy mismatches after identities are confirmed

Once certificates and proxies look correct, inspect whether the traffic policy demands mTLS where the peer does not support it.

For Istio, check authentication policy objects such as PeerAuthentication and any destination-side expectations. Istio’s authentication docs cover how mTLS policy is applied. (Istio)

Classic symptom:

  • one side is strict mTLS
  • the other side is plaintext, outside mesh, or not injected

That usually produces handshake/reset errors even when cert-manager is completely fine. This is an inference from Istio’s mTLS policy model. (Istio)
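For reference, a namespace-scoped strict policy is a single small object (namespace name illustrative); temporarily switching the mode to PERMISSIVE is a common way to confirm a policy mismatch before fixing injection on the plaintext side:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: my-ns        # applies to all workloads in this namespace
spec:
  mtls:
    mode: STRICT          # reject plaintext from unmeshed peers
```

If resets stop under PERMISSIVE, the certificates were never the problem; one of the two endpoints is outside the mesh.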

10. Read the logs in this order

When the issue is still unclear, the best signal usually comes from logs in this order:

  1. cert-manager controller
  2. cert-manager webhook
  3. mesh identity/CA component (istiod, istio-csr, or Linkerd identity)
  4. the source and destination proxy containers

Use:

kubectl logs -n cert-manager deploy/cert-manager
kubectl logs -n cert-manager deploy/cert-manager-webhook
kubectl logs -n istio-system deploy/istiod
kubectl logs -n <istio-csr-namespace> deploy/istio-csr
kubectl logs -n linkerd deploy/linkerd-identity
kubectl logs <pod> -n <ns> -c <proxy-container>

cert-manager specifically documents webhook and issuance troubleshooting as core paths. Linkerd and Istio docs likewise center on their identity components for mesh cert issues. (cert-manager)

11. For ingress or gateway TLS, separate north-south from east-west

A lot of teams mix up:

  • ingress/gateway TLS
  • service-to-service mTLS

With Istio, cert-manager integration for gateways is straightforward and separate from workload identity. Istio’s docs show cert-manager managing gateway TLS credentials, while workload certificate management is handled through different CA mechanisms. (Istio)

So ask:

  • Is the failure only at ingress/gateway?
  • Or only pod-to-pod?
  • Or both?

If only ingress is broken, inspect the gateway Secret and gateway config, not mesh identity. (Istio)

12. Fast triage map

Use this shortcut:

  • Certificate not Ready → inspect CertificateRequest, Order, Challenge, issuer, webhook. (cert-manager)
  • Secret exists but mesh still fails → inspect trust chain, expiry, namespace, and mesh CA configuration. (cert-manager)
  • Linkerd only → run linkerd check, then inspect issuer/trust anchor status. (Linkerd)
  • Istio + cert-manager for workloads → verify external CA wiring, especially CA server disablement and caAddress. (cert-manager)
  • Handshake failures with healthy certs → inspect mesh policy and whether both endpoints are actually meshed. (Istio)

13. The three most common root causes

In practice, the big ones are:

  1. Expired or non-rotated issuer / trust anchor, especially in Linkerd. (Linkerd)
  2. Istio external CA miswiring, especially when using cert-manager for workloads rather than just gateway TLS. (cert-manager)
  3. Policy/injection mismatch, where strict mTLS is enabled but one side is not part of the mesh. (Istio)

14. Minimal command pack to keep handy

kubectl get certificate,certificaterequest,issuer,clusterissuer -A
kubectl describe certificate <name> -n <ns>
kubectl get order,challenge -A
kubectl logs -n cert-manager deploy/cert-manager
kubectl logs -n cert-manager deploy/cert-manager-webhook
linkerd check
linkerd check --proxy
kubectl logs -n istio-system deploy/istiod
kubectl get pods -A -o wide
kubectl get secret -A

Flux (or FluxCD)

Flux (or FluxCD) is a GitOps continuous delivery tool for Kubernetes. Here’s a concise breakdown:


What it does

Flux is an operator that runs in your Kubernetes cluster, constantly comparing the cluster’s live state to the state defined in your Git repo. If they differ, Flux automatically makes changes to the cluster to match the repo. In other words, Git is the single source of truth — you push a change to Git, Flux detects it and applies it to the cluster automatically, with no manual kubectl apply needed.


How it works — core components

Core components of FluxCD (the GitOps Toolkit) include the Source Controller, Kustomize Controller, Helm Controller, and Notification Controller. Each is a separate Kubernetes controller responsible for one concern:

  • Source Controller — watches Git repos, Helm repos, OCI registries, and S3 buckets for changes
  • Kustomize Controller — applies raw YAML and Kustomize overlays to the cluster
  • Helm Controller — manages HelmRelease objects (declarative Helm chart deployments)
  • Notification Controller — sends alerts to Slack, Teams, etc. when syncs succeed or fail

Key characteristics

  • Pull-based model: Flux enables pure pull-based GitOps deployments: the cluster pulls changes itself, so neither the Git provider nor any external system needs cluster credentials. This is more secure than push-based pipelines, where the CI system holds cluster credentials.
  • Drift detection: If your live cluster diverges from Git (e.g., due to manual edits), Flux will detect the drift and revert it, ensuring deterministic deployments.
  • Kubernetes-native: Flux v2 is built from the ground up to use Kubernetes’ API extension system. Everything is a CRD — GitRepository, Kustomization, HelmRelease, etc.
  • Security-first: Flux enforces true Kubernetes RBAC via impersonation, supports multiple Git repositories, favors pull over push, runs with least privilege, and integrates tightly with Kubernetes security policies and tooling.
  • Multi-cluster: Flux can use one Kubernetes cluster to manage apps in either the same or other clusters, spin up additional clusters, and manage cluster fleets.
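The reconciliation loop described above is driven by two small CRs; the repository URL and path here are hypothetical:

```yaml
# Where to pull from: the Source Controller polls this repo.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/my-org/my-app   # hypothetical repo
  ref:
    branch: main
---
# What to apply: the Kustomize Controller reconciles this path.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: my-app
  path: ./deploy
  prune: true   # delete cluster objects removed from Git (drift correction)
```

`prune: true` is what makes the "revert manual edits" behavior bidirectional: objects deleted from Git are also deleted from the cluster.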

CNCF standing & adoption

Flux is a Cloud Native Computing Foundation (CNCF) graduated project, used in production by various organisations and cloud providers. Notable users include Deutsche Telekom (managing 200+ clusters with just 10 engineers), the US Department of Defense, and Microsoft Azure (which uses Flux natively in AKS and Azure Arc).


Flux vs. Argo CD (the main alternative)

Flux CD is highly composable: use only the controllers you need. It is preferred by teams who already think in CRDs and reconciliation loops, and is excellent for infrastructure-as-code and complex dependency handling. The main trade-offs are the lack of a native UI and a steeper learning curve. Argo CD is the better choice if your team wants a rich visual dashboard out of the box.


Relation to OCP

Flux is commonly used with OpenShift as the GitOps engine for managing cluster configuration and application deployments. Red Hat also ships OpenShift GitOps (based on Argo CD) as an official operator, so in OCP environments you’ll encounter both — Flux tends to be chosen by platform engineering teams who want tighter Kubernetes-native control, while OpenShift GitOps is the supported out-of-the-box option from Red Hat.

Here’s a thorough breakdown of how Flux integrates with OCP:


Installation — two options

Option 1: Flux Operator via OperatorHub (recommended)

Flux can be installed on a Red Hat OpenShift cluster directly from OperatorHub using the Flux Operator, an open-source project in the Flux ecosystem that provides a declarative API for lifecycle management of the Flux controllers on OpenShift.

Once installed, you declare a FluxInstance CR with cluster.type: openshift:

apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux
  namespace: flux-system
spec:
  distribution:
    version: "2.x"
    registry: "ghcr.io/fluxcd"
  cluster:
    type: openshift # ← tells Flux it's on OCP
    multitenant: true
    networkPolicy: true
  sync:
    kind: GitRepository
    url: "https://my-git-server.com/my-org/my-fleet.git"
    ref: "refs/heads/main"
    path: "clusters/my-cluster"

Option 2: flux bootstrap CLI

The recommended way to install Flux on OpenShift from the CLI is the flux bootstrap command, which works with GitHub, GitLab, and generic Git providers. Cluster-admin privileges are required to install Flux on OpenShift.


The OCP-specific challenge: SCCs

OCP’s default restricted-v2 SCC blocks containers from running as root — and Flux controllers, like many Kubernetes tools, need specific adjustments to run cleanly. The official integration handles this by:

  • Shipping a scc.yaml manifest that grants Flux controllers the correct non-root SCC permissions
  • Patching the Kustomization to remove the default seccomp profile and enforce the UID expected by the Flux images, preventing OCP from altering the container user

The cluster.type: openshift flag in the FluxInstance spec automatically applies these adjustments — no manual SCC patching needed when using the Flux Operator.


What the integration looks like end-to-end

┌─────────────────────────────────────────────────────┐
│ Git Repository                                      │
│   clusters/my-cluster/                              │
│   ├── flux-system/  (Flux bootstrap manifests)      │
│   ├── namespaces/   (OCP Projects)                  │
│   ├── rbac/         (Roles, RoleBindings, SCCs)     │
│   └── apps/         (Deployments, Routes, etc.)     │
└────────────────────┬────────────────────────────────┘
                     │ pull (every ~1 min)
┌────────────────────┴────────────────────────────────┐
│ OCP Cluster (flux-system ns)                        │
│   source-controller    → watches Git/OCI/Helm repos │
│   kustomize-controller → applies YAML/Kustomize     │
│   helm-controller      → manages HelmReleases       │
│   notification-ctrl    → sends alerts to Slack etc. │
└─────────────────────────────────────────────────────┘

Multi-tenancy on OCP

When multitenant: true is set, Flux uses true Kubernetes RBAC via impersonation — meaning each tenant’s Kustomization runs under its own service account, scoped to its own namespace. This maps naturally to OCP Projects, where each team or app gets an isolated namespace with its own SCC and RBAC policies.

The pattern looks like this in Git:

# tenants/team-a/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-a-apps
  namespace: flux-system
spec:
  serviceAccountName: team-a-reconciler  # impersonates this SA
  targetNamespace: team-a                # deploys into this OCP Project
  path: ./tenants/team-a/apps
  sourceRef:
    kind: GitRepository
    name: fleet-repo

Each team-a-reconciler service account only has permissions within team-a’s namespace — enforced by both RBAC and the namespace’s SCC policies.
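A minimal sketch of the tenant-side RBAC that enforces this boundary; the role name and resource list are illustrative:

```yaml
# What team-a's reconciler may do, and only inside the team-a namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-deployer
  namespace: team-a
rules:
  - apiGroups: ["", "apps"]
    resources: ["deployments", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# Bind the impersonated service account to that Role.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-reconciler
  namespace: team-a
subjects:
  - kind: ServiceAccount
    name: team-a-reconciler
    namespace: flux-system   # matches the Kustomization's serviceAccountName
roleRef:
  kind: Role
  name: team-a-deployer
  apiGroup: rbac.authorization.k8s.io
```

Because the binding is a namespaced Role rather than a ClusterRole, any manifest in team-a's Git path that targets another namespace fails reconciliation instead of silently escaping the boundary.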


Key considerations for OCP + Flux

| Topic | Detail |
| --- | --- |
| Testing | Flux v2.3 was the first release end-to-end tested on OpenShift. |
| Operator lifecycle | When a subscription is applied, OpenShift’s Operator Lifecycle Manager (OLM) automatically handles upgrading Flux. |
| Enterprise support | Vendors such as ControlPlane offer enterprise support for Flux, including backwards compatibility with older versions of Kubernetes and OpenShift. |
| vs. OpenShift GitOps | Red Hat ships its own GitOps operator (based on Argo CD) as the officially supported option. Flux on OCP is community/third-party supported, preferred by teams who want a more Kubernetes-native, CLI-driven approach. |
| NetworkPolicy | Setting networkPolicy: true in the FluxInstance spec automatically creates NetworkPolicies for the flux-system namespace, restricting controller-to-controller traffic. |

OCP (OpenShift Container Platform) Security Best Practices


Identity & Access Control

  • RBAC & Least Privilege: Every user, service account, and process should possess only the absolute minimum permissions needed. Isolate workloads using distinct service accounts, each bound to Roles containing relevant permissions, and avoid attaching sensitive permissions directly to user accounts.
  • Strong Authentication: Implement robust authentication mechanisms such as multi-factor authentication (MFA) or integrate with existing identity management systems to prevent unauthorized access.
  • Audit Regularly: Regularly audit Roles, ClusterRoles, RoleBindings, and SCC usage to ensure they remain aligned with the principle of least privilege and current needs.
  • Avoid kubeadmin: Don’t use the default kubeadmin superuser account in production — integrate with an enterprise identity provider instead.

Cluster & Node Hardening

  • Use RHCOS for nodes: Run every OCP cluster node on the most recent Red Hat Enterprise Linux CoreOS (RHCOS). RHCOS is designed to be as immutable as possible, and changes to a node are rolled out through the Machine Config Operator rather than by direct user access.
  • Control plane HA: Configure a minimum of three control-plane nodes so the cluster (and etcd quorum) survives the loss of a node.
  • Network isolation: Strict network isolation prevents unauthorized external ingress to OpenShift cluster API endpoints, nodes, or pod containers. The DNS, Ingress Controller, and API server can be set to private after installation.

Container Image Security

  • Scan images continuously: Use image scanning tools to detect vulnerabilities and malware within container images. Use trusted container images from reputable sources and regularly update them to include the latest security patches.
  • Policy enforcement: Define and enforce security policies for container images, ensuring that only images meeting specific criteria — such as being signed by trusted sources or containing no known vulnerabilities — are deployed.
  • No root containers: OpenShift has stricter security policies than vanilla Kubernetes — running a container as root is forbidden by default.

Security Context Constraints (SCCs)

OpenShift uses Security Context Constraints (SCCs) to give your cluster a strong security baseline. By default, OpenShift prevents containers from accessing protected Linux features such as shared file systems, root access, and certain Linux capabilities such as KILL. Always use the most restrictive SCC that still allows your workload to function.


Network Security

  • Zero-trust networking: Apply granular access controls between individual pods, namespaces, and services in Kubernetes clusters and external resources, including databases, internal applications, and third-party cloud APIs.
  • Use NetworkPolicies to restrict east-west traffic between namespaces and pods by default.
  • Egress control: Use Egress Gateways or policies to control outbound traffic from pods.

Compliance & Monitoring

  • Compliance Operator: The OpenShift Compliance Operator supports profiles for standards including PCI-DSS versions 3.2.1 and 4.0, enabling automated compliance scanning across the cluster.
  • Continuous monitoring: Use robust logging and monitoring solutions to gain visibility into container behavior, network flows, and resource utilization. Set up alerts for abnormalities like unusually high memory or CPU usage that could indicate compromise.
  • Track CVEs proactively: Security, bug fix, and enhancement updates for OCP are released as asynchronous errata through the Red Hat Network. Registry images should be scanned upon notification and patched if affected by new vulnerabilities.

Namespace & Project Isolation

Using projects and namespaces simplifies management and enhances security by limiting the potential impact of compromised applications, segregating resources based on application/team/environment, and ensuring users can only access the resources they are authorized to use.


Key tools to leverage: Advanced Cluster Security (ACS/StackRox), Compliance Operator, OpenShift built-in image registry with scanning, and NetworkPolicy/Calico for zero-trust networking.

SCCs (Security Context Constraints) are OpenShift’s pod-level security gate — separate from RBAC. The golden rules are: always start from restricted-v2, never modify built-in SCCs, create custom ones when needed, assign them to dedicated service accounts (not users), and never grant anyuid or privileged to app workloads.
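One clean way to assign an SCC to a dedicated service account is through RBAC rather than editing the SCC's users list; the names here are illustrative, and nonroot-v2 is one of the built-in SCCs:

```yaml
# Grant the right to *use* a specific SCC via the "use" verb.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: use-nonroot-v2-scc
  namespace: payments
rules:
  - apiGroups: ["security.openshift.io"]
    resources: ["securitycontextconstraints"]
    resourceNames: ["nonroot-v2"]
    verbs: ["use"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-sa-use-nonroot-v2
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: payments-sa
    namespace: payments
roleRef:
  kind: Role
  name: use-nonroot-v2-scc
  apiGroup: rbac.authorization.k8s.io
```

This keeps the built-in SCC untouched and scopes the grant to one service account in one namespace, which is exactly the pattern the golden rules above call for.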

RBAC controls what users and service accounts can do via the API. The key principle is deny-by-default — bind roles to groups rather than individuals, keep bindings namespace-scoped unless cross-namespace is genuinely needed, audit regularly with oc auth can-i and oc policy who-can, and never touch default system ClusterRoles.

Network Policies implement microsegmentation at the pod level. The pattern is always: default-deny first, then explicitly open only what’s needed — ingress from the router, traffic from the same namespace, and specific app-to-app flows. For egress, use EgressFirewall (on OVN-Kubernetes; older OpenShift SDN clusters used EgressNetworkPolicy) to allowlist specific CIDRs or DNS names and block everything else.
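As a sketch, the default-deny-then-allow pattern might look like this; the namespace is hypothetical, and the ingress namespace label shown is the one OpenShift applies on OVN-Kubernetes clusters (verify the label on your cluster):

```yaml
# Step 1: deny all ingress to every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-app
spec:
  podSelector: {}          # selects every pod
  policyTypes: ["Ingress"]
---
# Step 2: explicitly re-open traffic from the OpenShift router.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
  namespace: my-app
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              policy-group.network.openshift.io/ingress: ""
  policyTypes: ["Ingress"]
```

Same-namespace and app-to-app allow rules follow the same shape, each as its own narrowly scoped policy.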

All three layers work together: RBAC controls the API plane, SCCs control the node plane, and NetworkPolicies control the network plane. A strong OCP security posture needs all three.