OpenShift (OCP) interview

For an OpenShift (OCP) interview in 2026, you should expect questions that move beyond basic Kubernetes concepts and focus on enterprise operations, automation (Operators), and security.

Here is a curated list of high-value interview questions categorized by role and complexity.


1. Architectural Concepts

  • What is the role of the Cluster Version Operator (CVO)?
    • Answer: The CVO is the heart of OCP 4.x upgrades. It monitors the “desired state” of the cluster’s operators (the “payload”) and ensures the cluster is updated in a safe, coordinated manner across all components.
  • Explain the difference between an Infrastructure Node and a Worker Node.
    • Answer: Infrastructure nodes host “cluster-level” services like the Router (Ingress Controller), the Monitoring stack (Prometheus/Alertmanager), and the internal Registry. By labeling nodes as infra, companies can often save on Red Hat subscription costs, because nodes running only infrastructure workloads don’t require the same entitlements as nodes running application workloads.
  • What is the “Etcd Quorum” and why is it important in OCP?
    • Answer: etcd requires a majority (quorum) of its members to agree before committing writes, which is why OpenShift runs an odd number of Control Plane nodes (usually 3). If you lose more than half of the etcd members, etcd stops accepting writes and the cluster API becomes effectively read-only, which prevents data corruption.
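The quorum arithmetic behind that answer can be sketched as a toy calculation (this is plain majority math, not an OpenShift API):

```python
# Quorum math for etcd: writes commit only when a majority of members agree.
def quorum(members: int) -> int:
    """Minimum number of healthy members needed to accept writes."""
    return members // 2 + 1

def failures_tolerated(members: int) -> int:
    """How many members can fail before the cluster loses quorum."""
    return members - quorum(members)

for n in (1, 3, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

Note that an even member count adds no fault tolerance: a 4-member cluster needs 3 healthy members and still only tolerates 1 failure, which is why 3 or 5 control plane nodes are the norm.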

2. Networking & Traffic (The Gateway API Era)

  • Explain Ingress vs. Route vs. Gateway API.
    • Key Focus: Interviewers want to know if you understand that Routes are OCP-native, Ingress is K8s-standard, and Gateway API is the future standard for advanced traffic management (canary, mirroring, etc.).
  • How does “Service Serving Certificate Secrets” work in OCP?
    • Answer: OCP can automatically generate a TLS certificate for a Service. You annotate a Service with service.beta.openshift.io/serving-cert-secret-name. OCP then creates a secret containing a cert/key signed by the internal Cluster CA, allowing for easy end-to-end encryption.
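As a sketch, a Service requesting a serving certificate looks like this (the name myapp and the port are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  annotations:
    # OCP creates the secret "myapp-serving-cert" with a cert/key
    # signed by the internal Cluster CA
    service.beta.openshift.io/serving-cert-secret-name: myapp-serving-cert
spec:
  selector:
    app: myapp
  ports:
  - port: 8443
    targetPort: 8443
```

The pod then mounts the generated secret and serves TLS with it; in-cluster clients trust it via the service CA bundle.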

3. Security (The “Hardest” Category)

  • Scenario: A developer says their pod won’t start because of a “Security Context” error. What do you check?
    • Answer: I would check the Security Context Constraints (SCC). By default, OCP runs pods with the restricted-v2 SCC, which prevents running as root. If the pod requires root or host access, I’d check if the ServiceAccount has been granted a more permissive SCC like anyuid or privileged.
  • What are NetworkPolicies vs. EgressFirewalls?
    • Answer: NetworkPolicies control traffic between pods inside the cluster (East-West). EgressFirewalls (part of OCP’s OVN-Kubernetes) control traffic leaving the cluster to external IPs or CIDR blocks (North-South).
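A minimal EgressFirewall sketch for OVN-Kubernetes (the namespace and CIDR are placeholders; rules are evaluated top-down, and the object must be named default):

```yaml
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default          # one EgressFirewall per namespace, named "default"
  namespace: myproject
spec:
  egress:
  - type: Allow
    to:
      cidrSelector: 192.0.2.0/24   # allow this external range
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0      # deny everything else leaving the cluster
```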

4. Troubleshooting & Operations

  • How do you recover a cluster if the Control Plane certificates have expired?
    • Answer: OCP 4.x normally auto-rotates these certificates, but a long cluster shutdown can leave the kubelet client/serving certificates expired. Recovery usually means approving the pending CSRs (Certificate Signing Requests) with oc adm certificate approve so the nodes can rejoin the cluster; severe clock drift can also break rotation and must be corrected at the source.
  • Describe the Source-to-Image (S2I) workflow.
    • Answer: S2I takes source code from Git, injects it into a “builder image” (like Node.js or Java), and outputs a ready-to-run container image. It simplifies the CI/CD process for developers who don’t want to write Dockerfiles.

5. Advanced / 2026 Trends

  • What is OpenShift Virtualization (KubeVirt)?
    • Answer: It allows you to run legacy Virtual Machines (VMs) as pods on OpenShift. This is critical for “modernizing” apps where one part is a container and the other is a legacy Windows or Linux VM that can’t be containerized yet.
  • How does Red Hat Advanced Cluster Management (RHACM) help in a multi-cluster setup?
    • Answer: RHACM provides a single pane of glass to manage security policies, application placement, and cluster lifecycle (creation/deletion) across multiple OCP clusters on AWS, Azure, and on-prem.

Quick Tip for the Interview

Whenever you answer, use the phrase “Operator-led design.” OpenShift 4 is built entirely on Operators. If the interviewer asks, “How do I fix the registry?” the best answer starts with, “I would check the status of the Image Registry Operator using oc get clusteroperator.” This shows you understand the fundamental architecture of the platform.

As an OpenShift Administrator, your interview will focus heavily on cluster stability, lifecycle management (upgrades), security enforcement, and the “Day 2” operations that keep an enterprise cluster running.

Here are the top admin-focused interview questions for 2026, divided by functional area.


1. Cluster Lifecycle & Maintenance

  • How does the Cluster Version Operator (CVO) manage upgrades, and what do you check if an upgrade hangs at 57%?
    • Answer: The CVO coordinates with all other cluster operators to reach a specific “desired version.” If it hangs, I check oc get clusteroperators to see which specific operator is degraded. Usually, it’s the Machine Config Operator (MCO) waiting for nodes to drain or the Authentication Operator having issues with etcd.
  • What is the “Must-Gather” tool, and when would you use it?
    • Answer: oc adm must-gather is the primary diagnostic tool. It launches a pod that collects logs, CRD states, and operating system debugging info. I use it before opening a Red Hat support ticket or when a complex issue involves multiple operators.
  • Explain how to back up and restore the etcd database.
    • Answer: I run the /usr/local/bin/cluster-backup.sh script on a control plane node, which produces an etcd snapshot plus the static pod resources. For restoration, I use cluster-restore.sh on a single control plane node (stopping the static pods for the API server and etcd and restoring the data directory), then bring the other members back in to re-establish quorum.

2. Node & Infrastructure Management

  • What is a MachineConfigPool (MCP), and why would you pause it?
    • Answer: An MCP groups nodes (like master or worker) so the MCO can apply configurations to them. I would pause an MCP during a sensitive maintenance window or when troubleshooting a configuration change that I don’t want to roll out to all nodes at once.
  • How do you add a custom SSH key or a CronJob to the underlying RHCOS nodes?
    • Answer: You don’t log into the nodes manually. You create a MachineConfig YAML. The MCO then detects this, reboots the nodes (if necessary), and applies the change to the immutable filesystem.
  • What happens if a node enters a NotReady state?
    • Answer: First, I check node pressure (CPU/Memory/Disk). Then I check the kubelet and crio services on the node using oc debug node/<node-name>. I also check for network reachability between the node and the Control Plane.
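A hedged sketch of the MachineConfig approach described above, adding an SSH key for the core user on worker nodes (the key itself is a placeholder; the MCO applies it per the worker role label):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-custom-ssh
  labels:
    machineconfiguration.openshift.io/role: worker   # targets the worker MCP
spec:
  config:
    ignition:
      version: 3.2.0
    passwd:
      users:
      - name: core                                   # the only login user on RHCOS
        sshAuthorizedKeys:
        - ssh-ed25519 AAAA... user@example.com       # placeholder key
```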

3. Networking & Security

  • What is the benefit of OVN-Kubernetes over the legacy OpenShift SDN?
    • Answer: OVN-Kubernetes is the default CNI plugin in OCP 4.x. It supports modern features like IPsec encryption for pod-to-pod traffic, smarter load balancing, and Egress IPs that let specific projects exit the cluster via a fixed IP address for firewall allow-listing.
  • A user is complaining they can’t reach a service in another project. What do you check?
    • Answer:
      1. NetworkPolicies: Is there a policy blocking “Cross-Namespace” traffic?
      2. Service/Endpoints: Does the Service have active Endpoints (oc get endpoints)?
      3. Namespace labels: If using a high-isolation network plugin, do the namespaces have the correct labels to “talk” to each other?
  • How do you restrict a specific group of users from creating LoadBalancer type services?
    • Answer: RBAC alone can’t distinguish Service types, since the create verb applies to all Services equally. The clean solution is an admission-time policy: a Policy Engine like Gatekeeper/OPA or Kyverno (or a Kubernetes ValidatingAdmissionPolicy) that denies Services with type: LoadBalancer for that group of users.
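For the cross-namespace NetworkPolicy check above, a typical fix is a policy in the target project that admits traffic from the caller’s namespace — a sketch (namespace names team-a/team-b are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-team-b
  namespace: team-a            # the namespace hosting the service
spec:
  podSelector: {}              # applies to all pods in team-a
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: team-b   # built-in label, K8s 1.21+
```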

4. Storage & Capacity Planning

  • How do you handle “Volume Expansion” if a database runs out of space?
    • Answer: If the underlying StorageClass supports allowVolumeExpansion: true, I simply edit the PersistentVolumeClaim (PVC) and increase the storage value. OpenShift and the CSI driver handle the resizing of the file system on the fly.
  • What is the difference between ReadWriteOnce (RWO) and ReadWriteMany (RWX)?
    • Answer: RWO allows only one node to mount the volume (good for databases). RWX allows multiple nodes/pods to mount it simultaneously (required for shared file storage like NFS or ODF).
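To make both answers concrete, here is a sketch of an expandable StorageClass plus an RWX PVC (the provisioner and the RWX class name are assumptions — ODF’s CephFS class is one common RWX option; check what your cluster actually offers):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable-rwo
provisioner: ebs.csi.aws.com          # example CSI driver; assumption
allowVolumeExpansion: true            # lets you grow PVCs after creation
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-assets
spec:
  accessModes:
  - ReadWriteMany                     # multiple nodes mount simultaneously
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-cephfs   # example RWX class; assumption
```

To expand later, you would simply raise spec.resources.requests.storage on the PVC (e.g. via oc edit pvc or oc patch) and let the CSI driver resize the filesystem.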

5. Scenario-Based: “The Midnight Call”

  • Scenario: The Web Console is down, and oc commands are timing out. Where do you start?
    • Answer: This sounds like an API Server or etcd failure. I would:
      1. Log into a Control Plane node directly via SSH.
      2. Check the status of static pods in /etc/kubernetes/manifests.
      3. Run crictl ps to see if the kube-apiserver or etcd containers are crashing.
      4. Check the node’s disk space (etcd often fails if the disk is 100% full).

Pro-Tip for Admin Interviews:

In 2026, emphasize GitOps. Mention that you prefer managing cluster configurations (like HTPasswd providers or Quota objects) via ArgoCD rather than manual oc apply commands. This shows you are an admin who values idempotency and disaster recovery.

cert-manager in Kubernetes

cert-manager in Kubernetes is an add-on that automates TLS certificate management.

It helps you:

  • request certificates
  • renew them before they expire
  • store them as Kubernetes Secrets
  • connect them to Ingresses, gateways, and apps

Common uses:

  • Let’s Encrypt certificates for public apps
  • internal PKI certificates for cluster services
  • webhook, mTLS, and service cert automation

Core pieces:

  • Issuer / ClusterIssuer: tells cert-manager where certs come from
    • Issuer = namespaced
    • ClusterIssuer = cluster-wide
  • Certificate: declares the cert you want
  • Secret: where the cert and private key get stored
  • Challenge / Order: used mainly with ACME providers like Let’s Encrypt

Typical flow:

  1. You install cert-manager.
  2. Create an Issuer or ClusterIssuer.
  3. Create a Certificate resource or annotate an Ingress.
  4. cert-manager gets the cert and saves it into a Secret.
  5. Your Ingress/controller uses that Secret for HTTPS.

Very simple example with Let’s Encrypt:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    email: you@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx

Then a certificate:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myapp-cert
  namespace: default
spec:
  secretName: myapp-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - example.com

Why people use it:

  • avoids manual certificate renewal
  • reduces downtime from expired certs
  • works natively with Kubernetes resources
  • supports public and private certificate authorities

Alright — let’s walk through cert-manager in Kubernetes step by step, focusing on the three things you asked for:


1️⃣ How to install cert-manager

The easiest and most reliable way is using Helm.

Add the Helm repo

helm repo add jetstack https://charts.jetstack.io
helm repo update

Install cert-manager (with CRDs)

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true

Verify installation

kubectl get pods -n cert-manager

You should see pods like:

  • cert-manager
  • cert-manager-webhook
  • cert-manager-cainjector

2️⃣ How cert-manager works with Ingress

This is the most common real-world setup.

🔁 Flow (important concept)

  1. You create an Ingress
  2. Add a special annotation → tells cert-manager to issue a cert
  3. cert-manager:
    • requests cert from Let’s Encrypt
    • completes challenge (HTTP-01 or DNS-01)
    • stores cert in a Secret
  4. Ingress uses that Secret → HTTPS works automatically

🧩 Example Ingress with cert-manager

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - example.com
    secretName: myapp-tls
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-service
            port:
              number: 80

👉 Key things:

  • annotation → tells cert-manager which issuer to use
  • tls.secretName → where the cert will be stored
  • cert-manager fills that Secret automatically

3️⃣ Issuer vs ClusterIssuer vs Certificate

This is where people usually get confused — here’s the clean breakdown:


Issuer

  • Namespace-scoped
  • Only works inside one namespace
kind: Issuer
metadata:
  name: my-issuer
  namespace: default

Use when:

  • small apps
  • strict namespace isolation

ClusterIssuer

  • Cluster-wide
  • Can be used from any namespace
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod

Use when:

  • shared infrastructure
  • most production setups ✅

🟣 Certificate

  • Declares what cert you want
  • cert-manager turns this into a real cert
kind: Certificate
metadata:
  name: my-cert
spec:
  secretName: my-tls
  dnsNames:
  - example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

How they connect together

Think of it like this:

  • Issuer / ClusterIssuer → “Who gives me the cert?”
  • Certificate → “I want a cert for this domain”
  • Secret → “Here’s your actual cert + key”
  • Ingress → “Use this cert for HTTPS”

Typical production setup

Most teams use:

  • ClusterIssuer (Let’s Encrypt)
  • ✅ Ingress annotation (auto certs)
  • ❌ Usually no manual Certificate objects (the Ingress annotation creates them for you)

Bonus: HTTP-01 vs DNS-01

HTTP-01 (most common)

  • cert-manager creates a temporary endpoint
  • Let’s Encrypt verifies via HTTP
  • works with Ingress

DNS-01

  • adds a DNS record to prove ownership
  • needed for:
    • wildcard certs (*.example.com)
    • internal services
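A DNS-01 ClusterIssuer sketch using the Route53 solver (credentials wiring is omitted; the region and zone are placeholders, and other DNS providers have analogous solver blocks):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    email: you@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-dns
    solvers:
    - dns01:
        route53:
          region: us-east-1        # credentials via IAM role or secret, omitted
      selector:
        dnsZones:
        - example.com              # lets this solver handle *.example.com too
```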

Great question — this is where cert-manager becomes really powerful.

At a high level:

👉 cert-manager = certificate lifecycle automation
👉 Service mesh (Istio / Linkerd) = uses certificates for mTLS between services

So cert-manager can act as the certificate authority (or CA manager) for your mesh.


🧠 Big picture: how they fit together

cert-manager → issues certificates
service mesh → uses them for mTLS
result → secure pod-to-pod communication

🔐 What mTLS in a service mesh actually means

In both Istio and Linkerd:

  • Every pod gets a certificate + private key
  • Pods authenticate each other using certs
  • Traffic is:
    • encrypted ✅
    • authenticated ✅
    • tamper-proof ✅

⚙️ Option 1: Built-in CA (default behavior)

Istio / Linkerd by default:

  • run their own internal CA
  • automatically issue certs to pods
  • rotate certs

👉 This works out-of-the-box and is easiest.


🧩 Option 2: Using cert-manager as the CA

This is where integration happens.

Instead of mesh managing certs itself:

👉 cert-manager becomes the source of truth for certificates


🧱 Architecture with cert-manager

cert-manager (Issuer / ClusterIssuer)
        ↓ issues CA / issuer certs
Mesh control plane (Istio / Linkerd)
        ↓ issues workload certs
Sidecars / proxies in pods

🔵 Istio + cert-manager

Default Istio:

  • uses istiod as CA

With cert-manager:

  • you replace Istio’s CA with:
    • cert-manager + external CA (Vault, Let’s Encrypt, internal PKI)

Common approach: Istio + cert-manager + external CA

cert-manager:

  • manages root/intermediate certs

Istio:

  • requests workload certs from that CA

Why do this?

  • centralized certificate management
  • enterprise PKI integration (e.g. HashiCorp Vault)
  • compliance requirements

Linkerd + cert-manager

Linkerd has cleaner native integration.

👉 Linkerd actually recommends using cert-manager.


How it works:

  • cert-manager issues:
    • trust anchor (root cert)
    • issuer cert
  • Linkerd uses those to:
    • issue certs to proxies
    • rotate automatically

Example flow:

  1. Create a ClusterIssuer (e.g. self-signed or Vault)
  2. cert-manager generates:
    • root cert
    • intermediate cert
  3. Linkerd control plane uses them
  4. Sidecars get short-lived certs
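Based on Linkerd’s documented cert-manager pattern, the issuer certificate from step 2 can be declared roughly like this (a sketch; the Issuer named linkerd-trust-anchor is an assumption — it would be an Issuer backed by your root CA secret):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
spec:
  secretName: linkerd-identity-issuer   # Linkerd identity reads this secret
  duration: 48h
  renewBefore: 25h                      # cert-manager rotates it automatically
  issuerRef:
    name: linkerd-trust-anchor          # assumed Issuer backed by the root CA
    kind: Issuer
  commonName: identity.linkerd.cluster.local
  dnsNames:
  - identity.linkerd.cluster.local
  isCA: true                            # this cert signs the proxies' certs
  privateKey:
    algorithm: ECDSA
  usages:
  - cert sign
  - crl sign
  - server auth
  - client auth
```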

🔁 Certificate lifecycle in mesh (with cert-manager)

  1. cert-manager creates CA certs
  2. mesh control plane uses them
  3. sidecars request short-lived certs
  4. certs rotate automatically

When to use cert-manager with a mesh

✅ Use cert-manager if:

  • you need custom CA / PKI
  • you want centralized certificate control
  • you’re integrating with:
    • Vault
    • enterprise PKI
  • compliance/security requirements

❌ Skip it if:

  • you just want simple mTLS
  • default mesh CA is enough

Important distinction

👉 cert-manager does NOT handle:

  • traffic encryption itself
  • service-to-service routing

👉 service mesh does NOT handle:

  • external certificate issuance (well)
  • complex PKI integrations (alone)

Simple mental model

  • cert-manager = certificate factory
  • Istio / Linkerd = security + traffic engine

Interview-style summary

If you need a sharp answer:

“cert-manager integrates with service meshes by acting as an external certificate authority. While Istio and Linkerd can issue certificates internally, cert-manager enables centralized PKI management, supports external CAs like Vault, and provides automated rotation, making it useful for production-grade mTLS setups.”


Here’s a real-world debugging checklist for cert-manager + service mesh / mTLS, organized in the order that usually finds the issue fastest.

1. Start with the symptom, not the YAML

First sort the failure into one of these buckets:

  • Certificate issuance problem: Secrets are missing, Certificate is not Ready, ACME challenges fail, or issuer/webhook errors appear. cert-manager’s troubleshooting flow centers on the Certificate, CertificateRequest, Order, and Challenge resources. (cert-manager)
  • Mesh identity / mTLS problem: certificates exist, but workloads still fail handshakes, sidecars can’t get identities, or mesh health checks fail. Istio and Linkerd both separate certificate management from runtime identity distribution. (Istio)

That split matters because cert-manager can be healthy while the mesh is broken, and vice versa. (cert-manager)

2. Confirm the control planes are healthy

Check the obvious first:

kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n linkerd

For cert-manager, the important core components are the controller, webhook, and cainjector; webhook issues are a documented source of certificate failures. (cert-manager)

For Linkerd, run:

linkerd check

Linkerd’s official troubleshooting starts with linkerd check, and many identity and certificate problems show up there directly. (Linkerd)

For Istio, check control-plane health and then inspect config relevant to CA integration if you are using istio-csr or another external CA path. Istio’s cert-manager integration for workload certificates requires specific CA-server changes. (cert-manager)

3. Check the certificate objects before the Secrets

If cert-manager is involved, do this before anything else:

kubectl get certificate -A
kubectl describe certificate <name> -n <ns>
kubectl get certificaterequest -A
kubectl describe certificaterequest <name> -n <ns>

cert-manager’s own troubleshooting guidance points to these resources first because they expose the reason issuance or renewal failed. (cert-manager)

What you’re looking for:

  • Ready=False
  • issuer not found
  • permission denied
  • webhook validation errors
  • failed renewals
  • pending requests that never progress

If you’re using ACME, continue with:

kubectl get order,challenge -A
kubectl describe order <name> -n <ns>
kubectl describe challenge <name> -n <ns>

ACME failures are usually visible at the Order / Challenge level. (cert-manager)

4. Verify the issuer chain and secret contents

Typical failure pattern: the Secret exists, but it is the wrong Secret, wrong namespace, missing keys, or signed by the wrong CA.

Check:

kubectl get issuer,clusterissuer -A
kubectl describe issuer <name> -n <ns>
kubectl describe clusterissuer <name>
kubectl get secret <secret-name> -n <ns> -o yaml

For mesh-related certs, validate:

  • the Secret name matches what the mesh expects
  • the Secret is in the namespace the mesh component actually reads
  • the chain is correct
  • the certificate has not expired
  • the issuer/trust anchor relationship is the intended one
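To check the chain and expiry concretely, pull the cert out of the Secret and inspect it with openssl. The sketch below is self-contained — it generates a throwaway cert just to demonstrate the inspection commands; against a real cluster you would extract tls.crt from the Secret first (the commented kubectl line):

```shell
# Generate a throwaway self-signed cert so the inspection step is runnable here.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/demo.key -out /tmp/demo.crt -days 1 \
  -subj "/CN=identity.linkerd.cluster.local" 2>/dev/null

# Against a live cluster, extract the cert from the Secret instead:
#   kubectl get secret <secret-name> -n <ns> \
#     -o jsonpath='{.data.tls\.crt}' | base64 -d > /tmp/demo.crt

# Inspect subject, issuer, and expiry — the fields you validate in step 4.
openssl x509 -in /tmp/demo.crt -noout -subject -issuer -enddate
```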

In Linkerd specifically, the trust anchor and issuer certificate are distinct, and Linkerd documents that workload certs rotate automatically but the control-plane issuer/trust-anchor credentials do not unless you set up rotation. (Linkerd)

5. Check expiration and rotation next

A lot of “random” mesh outages are just expired identity material.

For Linkerd, verify:

  • trust anchor validity
  • issuer certificate validity
  • whether rotation was automated or done manually

Linkerd’s docs are explicit that proxy workload certs rotate automatically, but issuer and trust anchor rotation require separate handling; expired root or issuer certs are a known failure mode. (Linkerd)

For Istio, if using a custom CA or Kubernetes CSR integration, verify the configured CA path and signing certs are still valid and match the active mesh configuration. (cert-manager)

6. If this is Istio, verify whether the mesh is using its built-in CA or an external one

This is a very common confusion point.

If you use cert-manager with Istio workloads, you are typically not just “adding cert-manager”; you are replacing or redirecting the CA flow, often through istio-csr or Kubernetes CSR integration. cert-manager’s Istio integration docs call out changes like disabling the built-in CA server and setting the CA address. (cert-manager)

So check:

  • Is istiod acting as CA, or is an external CA path configured?
  • Is caAddress pointing to the expected service?
  • If istio-csr is used, is it healthy and reachable?
  • Are workload cert requests actually reaching the intended signer?

If that split-brain exists, pods may get no certs or certs from the wrong signer. That is an inference from how Istio’s custom CA flow is wired. (cert-manager)

7. If this is Linkerd, run the identity checks early

For Linkerd, do not guess. Run:

linkerd check
linkerd check --proxy

The Linkerd troubleshooting docs center on linkerd check, and certificate / identity issues often surface there more quickly than raw Kubernetes inspection. (Linkerd)

Then look for:

  • identity component failures
  • issuer/trust-anchor mismatch
  • certificate expiration warnings
  • injected proxies missing identity

If linkerd check mentions expired identity material, go straight to issuer/trust-anchor rotation docs. (Linkerd)

8. Verify sidecar or proxy injection happened

If the pod is not meshed, mTLS debugging is a distraction.

Check:

kubectl get pod <pod> -n <ns> -o yaml

Look for the expected sidecar/proxy containers and mesh annotations. If they are absent, the issue is injection or policy, not certificate issuance. Istio and Linkerd both rely on the dataplane proxy to actually use workload identities for mTLS. (Istio)

9. Check policy mismatches after identities are confirmed

Once certificates and proxies look correct, inspect whether the traffic policy demands mTLS where the peer does not support it.

For Istio, check authentication policy objects such as PeerAuthentication and any destination-side expectations. Istio’s authentication docs cover how mTLS policy is applied. (Istio)

Classic symptom:

  • one side is strict mTLS
  • the other side is plaintext, outside mesh, or not injected

That usually produces handshake/reset errors even when cert-manager is completely fine. This is an inference from Istio’s mTLS policy model. (Istio)

10. Read the logs in this order

When the issue is still unclear, the best signal usually comes from logs in this order:

  1. cert-manager controller
  2. cert-manager webhook
  3. mesh identity/CA component (istiod, istio-csr, or Linkerd identity)
  4. the source and destination proxy containers

Use:

kubectl logs -n cert-manager deploy/cert-manager
kubectl logs -n cert-manager deploy/cert-manager-webhook
kubectl logs -n istio-system deploy/istiod
kubectl logs -n <istio-csr-namespace> deploy/istio-csr
kubectl logs -n linkerd deploy/linkerd-identity
kubectl logs <pod> -n <ns> -c <proxy-container>

cert-manager specifically documents webhook and issuance troubleshooting as core paths. Linkerd and Istio docs likewise center on their identity components for mesh cert issues. (cert-manager)

11. For ingress or gateway TLS, separate north-south from east-west

A lot of teams mix up:

  • ingress/gateway TLS
  • service-to-service mTLS

With Istio, cert-manager integration for gateways is straightforward and separate from workload identity. Istio’s docs show cert-manager managing gateway TLS credentials, while workload certificate management is handled through different CA mechanisms. (Istio)

So ask:

  • Is the failure only at ingress/gateway?
  • Or only pod-to-pod?
  • Or both?

If only ingress is broken, inspect the gateway Secret and gateway config, not mesh identity. (Istio)

12. Fast triage map

Use this shortcut:

  • Certificate not Ready → inspect CertificateRequest, Order, Challenge, issuer, webhook. (cert-manager)
  • Secret exists but mesh still fails → inspect trust chain, expiry, namespace, and mesh CA configuration. (cert-manager)
  • Linkerd only → run linkerd check, then inspect issuer/trust anchor status. (Linkerd)
  • Istio + cert-manager for workloads → verify external CA wiring, especially CA server disablement and caAddress. (cert-manager)
  • Handshake failures with healthy certs → inspect mesh policy and whether both endpoints are actually meshed. (Istio)

13. The three most common root causes

In practice, the big ones are:

  1. Expired or non-rotated issuer / trust anchor, especially in Linkerd. (Linkerd)
  2. Istio external CA miswiring, especially when using cert-manager for workloads rather than just gateway TLS. (cert-manager)
  3. Policy/injection mismatch, where strict mTLS is enabled but one side is not part of the mesh. (Istio)

14. Minimal command pack to keep handy

kubectl get certificate,certificaterequest,issuer,clusterissuer -A
kubectl describe certificate <name> -n <ns>
kubectl get order,challenge -A
kubectl logs -n cert-manager deploy/cert-manager
kubectl logs -n cert-manager deploy/cert-manager-webhook
linkerd check
linkerd check --proxy
kubectl logs -n istio-system deploy/istiod
kubectl get pods -A -o wide
kubectl get secret -A

Flux (or FluxCD)

Flux (or FluxCD) is a GitOps continuous delivery tool for Kubernetes. Here’s a concise breakdown:


What it does

Flux is an operator that runs in your Kubernetes cluster, constantly comparing the cluster’s live state to the state defined in your Git repo. If they differ, Flux automatically makes changes to the cluster to match the repo. In other words, Git is the single source of truth — you push a change to Git, Flux detects it and applies it to the cluster automatically, with no manual kubectl apply needed.


How it works — core components

Core components of FluxCD (the GitOps Toolkit) include the Source Controller, Kustomize Controller, Helm Controller, and Notification Controller. Each is a separate Kubernetes controller responsible for one concern:

  • Source Controller — watches Git repos, Helm repos, OCI registries, and S3 buckets for changes
  • Kustomize Controller — applies raw YAML and Kustomize overlays to the cluster
  • Helm Controller — manages HelmRelease objects (declarative Helm chart deployments)
  • Notification Controller — sends alerts to Slack, Teams, etc. when syncs succeed or fail
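To make the controller roles concrete, a minimal source-plus-sync pair might look like this (the repository URL and path are placeholders; the Source Controller fetches the repo, the Kustomize Controller applies what it finds):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 1m                        # how often to poll the repo
  url: https://github.com/my-org/my-app   # placeholder repo URL
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m                       # how often to re-reconcile
  sourceRef:
    kind: GitRepository
    name: my-app
  path: ./deploy                      # placeholder path inside the repo
  prune: true                         # delete resources removed from Git
```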

Key characteristics

  • Pull-based model: Flux enables pure pull-based GitOps application deployments — no access to clusters is needed by the source repo or by any other cluster. This is more secure than push-based pipelines where your CI system needs cluster credentials.
  • Drift detection: If your live cluster diverges from Git (e.g., due to manual edits), Flux will detect the drift and revert it, ensuring deterministic deployments.
  • Kubernetes-native: Flux v2 is built from the ground up to use Kubernetes’ API extension system. Everything is a CRD — GitRepository, Kustomization, HelmRelease, etc.
  • Security-first: Flux uses true Kubernetes RBAC via impersonation and supports multiple Git repositories. It follows a pull vs. push model, least amount of privileges, and adheres to Kubernetes security policies with tight integration with security tools.
  • Multi-cluster: Flux can use one Kubernetes cluster to manage apps in either the same or other clusters, spin up additional clusters, and manage cluster fleets.

CNCF standing & adoption

Flux is a Cloud Native Computing Foundation (CNCF) graduated project, used in production by various organisations and cloud providers. Notable users include Deutsche Telekom (managing 200+ clusters with just 10 engineers), the US Department of Defense, and Microsoft Azure (which uses Flux natively in AKS and Azure Arc).


Flux vs. Argo CD (the main alternative)

Flux CD is highly composable — use only the controllers you need. It’s preferred by teams who already think in CRDs and reconciliation loops, and is excellent for infrastructure-as-code and complex dependency handling. The main trade-off is that Flux has some drawbacks such as lack of a native UI and a steep learning curve. Argo CD is the better choice if your team wants a rich visual dashboard out of the box.


Relation to OCP

Flux is commonly used with OpenShift as the GitOps engine for managing cluster configuration and application deployments. Red Hat also ships OpenShift GitOps (based on Argo CD) as an official operator, so in OCP environments you’ll encounter both — Flux tends to be chosen by platform engineering teams who want tighter Kubernetes-native control, while OpenShift GitOps is the supported out-of-the-box option from Red Hat.

Here’s a thorough breakdown of how Flux integrates with OCP:


Installation — two options

Option 1: Flux Operator via OperatorHub (recommended)

Flux can be installed on a Red Hat OpenShift cluster directly from OperatorHub using the Flux Operator, an open-source project in the Flux ecosystem that provides a declarative API for lifecycle management of the Flux controllers on OpenShift.

Once installed, you declare a FluxInstance CR with cluster.type: openshift:

apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux
  namespace: flux-system
spec:
  distribution:
    version: "2.x"
    registry: "ghcr.io/fluxcd"
  cluster:
    type: openshift # ← tells Flux it's on OCP
    multitenant: true
    networkPolicy: true
  sync:
    kind: GitRepository
    url: "https://my-git-server.com/my-org/my-fleet.git"
    ref: "refs/heads/main"
    path: "clusters/my-cluster"

Option 2: flux bootstrap CLI

The best way to install Flux on OpenShift via CLI is to use the flux bootstrap command. This command works with GitHub, GitLab, as well as generic Git providers. You require cluster-admin privileges to install Flux on OpenShift.


The OCP-specific challenge: SCCs

OCP’s default restricted-v2 SCC blocks containers from running as root — and Flux controllers, like many Kubernetes tools, need specific adjustments to run cleanly. The official integration handles this by:

  • Shipping a scc.yaml manifest that grants Flux controllers the correct non-root SCC permissions
  • Patching the Kustomization to remove the default SecComp profile and enforce the correct UID expected by Flux images, preventing OCP from altering the container user

The cluster.type: openshift flag in the FluxInstance spec automatically applies these adjustments — no manual SCC patching needed when using the Flux Operator.


What the integration looks like end-to-end

Git Repository (clusters/my-cluster/)
 ├── flux-system/   (Flux bootstrap manifests)
 ├── namespaces/    (OCP Projects)
 ├── rbac/          (Roles, RoleBindings, SCCs)
 └── apps/          (Deployments, Routes, etc.)
         │  pull (every ~1 min)
         ▼
OCP Cluster (flux-system namespace)
 ├── source-controller        → watches Git/OCI/Helm repos
 ├── kustomize-controller     → applies YAML/Kustomize
 ├── helm-controller          → manages HelmReleases
 └── notification-controller  → sends alerts to Slack etc.

Multi-tenancy on OCP

When multitenant: true is set, Flux uses true Kubernetes RBAC via impersonation — meaning each tenant’s Kustomization runs under its own service account, scoped to its own namespace. This maps naturally to OCP Projects, where each team or app gets an isolated namespace with its own SCC and RBAC policies.

The pattern looks like this in Git:

# tenants/team-a/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-a-apps
  namespace: flux-system
spec:
  serviceAccountName: team-a-reconciler   # impersonates this SA
  targetNamespace: team-a                 # deploys into this OCP Project
  path: ./tenants/team-a/apps
  sourceRef:
    kind: GitRepository
    name: fleet-repo

The team-a-reconciler service account only has permissions within the team-a namespace, enforced by both RBAC and the namespace's SCC policies.
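A minimal sketch of that tenant wiring follows. The names match the example above, but the exact role is an assumption and would be tailored per team:

```yaml
# Hypothetical tenant RBAC — adjust the role to the team's actual needs.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: team-a-reconciler
  namespace: team-a
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-reconciler-admin
  namespace: team-a              # the binding is namespace-scoped
subjects:
- kind: ServiceAccount
  name: team-a-reconciler
  namespace: team-a
roleRef:
  kind: ClusterRole
  name: admin                    # default aggregated role, scoped here by the RoleBinding
  apiGroup: rbac.authorization.k8s.io
```

Because the RoleBinding lives in team-a, the broad `admin` ClusterRole grants nothing outside that namespace.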


Key considerations for OCP + Flux

| Topic | Detail |
| --- | --- |
| Testing | Flux v2.3 was the first release end-to-end tested on OpenShift. |
| Operator lifecycle | When a subscription is applied, OpenShift's Operator Lifecycle Manager (OLM) automatically handles upgrading Flux. |
| Enterprise support | Vendors such as ControlPlane provide enterprise support for Flux, including backwards compatibility with older Kubernetes and OpenShift versions. |
| vs. OpenShift GitOps | Red Hat ships its own GitOps operator (based on Argo CD) as the officially supported option. Flux on OCP is community/third-party supported, preferred by teams who want a more Kubernetes-native, CLI-driven approach. |
| NetworkPolicy | Setting networkPolicy: true in the FluxInstance spec automatically creates NetworkPolicies for the flux-system namespace, restricting controller-to-controller traffic. |

OCP (OpenShift Container Platform) Security Best Practices


Identity & Access Control

  • RBAC & Least Privilege: Every user, service account, and process should possess only the absolute minimum permissions needed. Isolate workloads using distinct service accounts, each bound to Roles containing relevant permissions, and avoid attaching sensitive permissions directly to user accounts.
  • Strong Authentication: Implement robust authentication mechanisms such as multi-factor authentication (MFA) or integrate with existing identity management systems to prevent unauthorized access.
  • Audit Regularly: Regularly audit Roles, ClusterRoles, RoleBindings, and SCC usage to ensure they remain aligned with the principle of least privilege and current needs.
  • Avoid kubeadmin: Don’t use the default kubeadmin superuser account in production — integrate with an enterprise identity provider instead.
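As one example of replacing kubeadmin logins, the cluster OAuth resource can point at an enterprise OIDC provider. The issuer URL, client ID, and secret name below are placeholders:

```yaml
# Sketch: wiring an OIDC identity provider into the cluster OAuth config.
# The referenced client secret must exist in the openshift-config namespace.
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: corp-sso
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: openshift-console          # placeholder client ID
      clientSecret:
        name: corp-sso-client-secret       # placeholder Secret name
      issuer: https://sso.example.com/realms/corp
      claims:
        preferredUsername: [preferred_username]
        email: [email]
```

Once a real identity provider works, the kubeadmin secret can be removed so the local superuser can no longer log in.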

Cluster & Node Hardening

  • Use RHCOS for nodes: Use the most recent Red Hat Enterprise Linux CoreOS (RHCOS) for all OCP cluster nodes. RHCOS is designed to be as immutable as possible: changes to a node are applied through the Machine Config Operator rather than through direct user access.
  • Control plane HA: Configure a minimum of three control-plane nodes so that the cluster (and etcd quorum) remains available if a node fails.
  • Network isolation: Strict network isolation prevents unauthorized external ingress to OpenShift cluster API endpoints, nodes, or pod containers. The DNS, Ingress Controller, and API server can be set to private after installation.

Container Image Security

  • Scan images continuously: Use image scanning tools to detect vulnerabilities and malware within container images. Use trusted container images from reputable sources and regularly update them to include the latest security patches.
  • Policy enforcement: Define and enforce security policies for container images, ensuring that only images meeting specific criteria — such as being signed by trusted sources or containing no known vulnerabilities — are deployed.
  • No root containers: OpenShift has stricter security policies than vanilla Kubernetes — running a container as root is forbidden by default.

Security Context Constraints (SCCs)

OpenShift uses Security Context Constraints (SCCs) that give your cluster a strong security base. By default, OpenShift prevents cluster containers from accessing protected Linux features such as shared file systems, root access, and certain kernel capabilities such as KILL. Always use the most restrictive SCC that still allows your workload to function.
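A few `oc` commands are useful when working with SCCs in practice. The SCC name, service account, and namespace below are placeholders:

```shell
# List the SCCs available on the cluster and inspect the default.
oc get scc
oc describe scc restricted-v2

# Which SCC was actually applied to a running pod?
oc get pod my-pod -o jsonpath='{.metadata.annotations.openshift\.io/scc}'

# Grant a custom SCC to a dedicated service account (not to a user).
oc adm policy add-scc-to-user my-custom-scc -z my-app-sa -n my-project
```

The `openshift.io/scc` annotation on a pod records which SCC admitted it, which is handy when debugging "why is my pod mutated" questions.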


Network Security

  • Zero-trust networking: Apply granular access controls between individual pods, namespaces, and services in Kubernetes clusters and external resources, including databases, internal applications, and third-party cloud APIs.
  • Use NetworkPolicies to restrict east-west traffic between namespaces and pods by default.
  • Egress control: Use Egress Gateways or policies to control outbound traffic from pods.
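The "default deny, then explicitly allow" pattern above might be expressed as a pair of policies. The namespace name and router label are assumptions to adapt to your cluster:

```yaml
# Deny all ingress to every pod in this namespace; allow rules are layered on top.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-project
spec:
  podSelector: {}          # selects all pods in the namespace
  policyTypes:
  - Ingress
---
# Then explicitly allow traffic from the OpenShift router (ingress) namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
  namespace: my-project
spec:
  podSelector: {}
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          policy-group.network.openshift.io/ingress: ""
```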

Compliance & Monitoring

  • Compliance Operator: The OpenShift Compliance Operator supports profiles for standards including PCI-DSS versions 3.2.1 and 4.0, enabling automated compliance scanning across the cluster.
  • Continuous monitoring: Use robust logging and monitoring solutions to gain visibility into container behavior, network flows, and resource utilization. Set up alerts for abnormalities like unusually high memory or CPU usage that could indicate compromise.
  • Track CVEs proactively: Security, bug fix, and enhancement updates for OCP are released as asynchronous errata through the Red Hat Network. Registry images should be scanned upon notification and patched if affected by new vulnerabilities.

Namespace & Project Isolation

Using projects and namespaces simplifies management and enhances security by limiting the potential impact of compromised applications, segregating resources based on application/team/environment, and ensuring users can only access the resources they are authorized to use.


Key tools to leverage: Advanced Cluster Security (ACS/StackRox), Compliance Operator, OpenShift built-in image registry with scanning, and NetworkPolicy/Calico for zero-trust networking.

SCCs (Security Context Constraints) are OpenShift’s pod-level security gate — separate from RBAC. The golden rules are: always start from restricted-v2, never modify built-in SCCs, create custom ones when needed, assign them to dedicated service accounts (not users), and never grant anyuid or privileged to app workloads.

RBAC controls what users and service accounts can do via the API. The key principle is deny-by-default — bind roles to groups rather than individuals, keep bindings namespace-scoped unless cross-namespace is genuinely needed, audit regularly with oc auth can-i and oc policy who-can, and never touch default system ClusterRoles.
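The audit commands mentioned above look like this in practice (the namespace and service account are placeholders):

```shell
# Can this service account create deployments in team-a?
oc auth can-i create deployments -n team-a \
  --as=system:serviceaccount:team-a:team-a-reconciler

# Who currently has permission to delete pods in team-a?
oc policy who-can delete pods -n team-a

# Review the bindings that grant those permissions.
oc get rolebindings -n team-a
```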

Network Policies implement microsegmentation at the pod level. The pattern is always: default-deny first, then explicitly open only what’s needed — ingress from the router, traffic from the same namespace, and specific app-to-app flows. For egress, use EgressFirewall (the OVN-Kubernetes successor to EgressNetworkPolicy) to allow specific CIDRs or DNS names and block everything else.

All three layers work together: RBAC controls the API plane, SCCs control the node plane, and NetworkPolicies control the network plane. A strong OCP security posture needs all three.

AKS – Security Best Practice

For a brand-new microservices project in 2026, security isn’t just a “layer” you add at the end—it’s baked into the infrastructure. AKS has introduced several “secure-by-default” features that simplify this.

Here are the essential security best practices for your new setup:


1. Identity over Secrets (Zero Trust)

In 2026, storing connection strings or client secrets in Kubernetes “Secrets” is considered an anti-pattern.

  • Best Practice: Use Microsoft Entra Workload ID.
  • Why: Instead of your app having a password to access a database, your Pod is assigned a “Managed Identity.” Azure confirms the Pod’s identity via a signed token, granting it access without any static secrets that could be leaked.
  • New in 2026: Enable Conditional Access for Workload Identities to ensure a microservice can only connect to your database if it’s running inside your specific VNet.
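A sketch of what workload identity looks like at the Kubernetes level follows. The client ID, names, and image are placeholders; the annotation and label are the standard Azure Workload Identity markers:

```yaml
# The service account is federated with a managed identity via its annotation.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders-api
  namespace: orders
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"  # placeholder
---
# The label tells the webhook to inject the federated token into the pod.
apiVersion: v1
kind: Pod
metadata:
  name: orders-api
  namespace: orders
  labels:
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: orders-api
  containers:
  - name: app
    image: myregistry.azurecr.io/orders-api:1.0   # placeholder image
```

The app then authenticates to Azure services with the injected token, so no connection string or client secret is ever stored in the cluster.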

2. Harden the Host (Azure Linux 3.0)

The operating system running your nodes is part of your attack surface.

  • Best Practice: Standardize on Azure Linux 3.0 (CBL-Mariner).
  • Why: It is a “distroless-adjacent” host OS. It contains ~500 packages compared to the thousands in Ubuntu, drastically reducing the number of vulnerabilities (CVEs) you have to patch.
  • Advanced Isolation: For sensitive services (like payment processing), enable Pod Sandboxing. This uses Kata Containers to run the service in a dedicated hardware-isolated micro-VM, preventing “container breakout” attacks where a hacker could jump from your app to the node.

3. Network “Blast Radius” Control

If one microservice is compromised, you don’t want the attacker to move laterally through your entire cluster.

  • Best Practice: Use Cilium for Network Policy.
  • Why: As of 2026, Cilium is the gold standard for AKS networking. It uses eBPF technology to filter traffic at the kernel level.
  • Strategy: Implement a Default Deny policy. By default, no service should be able to talk to any other service unless you explicitly write a rule allowing it.
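With the default-deny baseline in place, each permitted flow gets an explicit rule. Cilium enforces standard Kubernetes NetworkPolicy, so a sketch can stay vendor-neutral (namespace, labels, and port are placeholders):

```yaml
# After default-deny: allow only checkout -> payment on one TCP port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-checkout-to-payment
  namespace: shop
spec:
  podSelector:
    matchLabels:
      app: payment            # the protected service
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: checkout       # the only permitted caller
    ports:
    - protocol: TCP
      port: 8443
```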

4. API Server Protection

The Kubernetes API server is the “front door” to your cluster. If someone gets in here, they own everything.

  • Best Practice: Use API Server VNet Integration (Private Clusters).
  • Why: This ensures your cluster’s management endpoint is not reachable from the public internet. It exists only inside your private network.
  • Access Control: Use Microsoft Entra RBAC (Role-Based Access Control). Never use the “cluster-admin” local account. Link permissions to your team’s Entra ID groups so that when an employee leaves the company, their cluster access is revoked instantly.

5. Continuous Supply Chain Security

Security starts before the code even reaches AKS.

  • Best Practice: Enable Defender for Cloud and Binary Authorization.
  • Why: Defender for Cloud scans your images in the Azure Container Registry (ACR) for malware and secrets.
  • Enforcement: Use Azure Policy for Kubernetes to block any deployment that hasn’t been scanned or isn’t coming from your trusted registry.

Summary Security Checklist

| Area | 2026 Standard |
| --- | --- |
| Identity | Microsoft Entra Workload ID (no secrets) |
| OS | Azure Linux 3.0 with OS Guard |
| Network | Cilium with mTLS (service mesh) |
| Access | Private cluster + Entra RBAC |
| Governance | Azure Policy “Restricted” baseline |

Pro-Tip: Check your Secure Score in Microsoft Defender for Cloud weekly. It will give you a prioritized list of “quick fixes” for your specific AKS cluster based on real-time threats.

With a Service Mesh (specifically the Istio-based add-on for AKS), you are moving toward a “Zero Trust” network architecture. In this setup, the network is no longer trusted by default; every connection must be verified and encrypted.

Here is the 2026 security blueprint for running microservices with Istio on AKS.


1. Automated mTLS (Encryption in Transit)

By default, traffic between Kubernetes Pods is unencrypted. With Istio, you can enforce Strict Mutual TLS (mTLS) without changing a single line of application code.

  • The Best Practice: Apply a PeerAuthentication policy at the namespace level set to STRICT.
  • The Result: Any service that tries to connect via plain text will be instantly rejected by the sidecar proxy. This ensures that even if an attacker gains access to your internal network, they cannot “sniff” sensitive data (like headers or tokens) passing between services.
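The namespace-wide STRICT policy described above is a short manifest. The namespace name is a placeholder:

```yaml
# Enforce strict mutual TLS for every workload in this namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: shop
spec:
  mtls:
    mode: STRICT
```

Applying the same resource in the Istio root namespace (typically istio-system) would make STRICT the mesh-wide default instead.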

2. Identity-Based Authorization

IP addresses are ephemeral in Kubernetes and shouldn’t be used for security. Istio uses SPIFFE identities based on the service’s Kubernetes Service Account.

  • The Best Practice: Use AuthorizationPolicy to define “Who can talk to Whom.”
  • Example: You can create a rule that says the Email Service can only receive requests from the Orders Service, and only if the request is a POST to the /send-receipt endpoint. Everything else is blocked at the source.
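That Orders-to-Email rule might be written as follows; the namespace, labels, and service account name are assumptions:

```yaml
# Only the orders service account may POST /send-receipt to the email service.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: email-allow-orders
  namespace: shop
spec:
  selector:
    matchLabels:
      app: email              # applies to the email workload's sidecars
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/shop/sa/orders"]   # SPIFFE identity
    to:
    - operation:
        methods: ["POST"]
        paths: ["/send-receipt"]
```

Because an ALLOW policy exists for the workload, any request not matching a rule is denied.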

3. Secure the “Front Door” (Ingress Gateway)

In 2026, the Kubernetes Gateway API has reached full GA (General Availability) for the AKS Istio add-on.

  • The Best Practice: Use the Gateway and HTTPRoute resources instead of the older Ingress objects.
  • Security Benefit: It allows for better separation of concerns. Your platform team can manage the physical load balancer (the Gateway), while your developers manage the routing rules (HTTPRoute) for their specific microservices.

4. Dapr + Istio: The “Power Couple”

Since you are building microservices, you might also use Dapr for state and messaging. In 2026, these two work together seamlessly but require one key configuration:

  • The Best Practice: If both are present, let Istio handle the mTLS and Observability, and disable mTLS in Dapr.
  • Why: Having two layers of encryption (“double wrapping” packets) adds significant latency and makes debugging network issues a nightmare.

5. Visualizing the “Blast Radius”

The biggest security risk in microservices is lateral movement.

  • The Best Practice: Use the Kiali dashboard (integrated with AKS) to view your service graph in real-time.
  • The Security Win: If you see a weird line of communication between your Public Web Frontend and your Internal Payment Database that shouldn’t exist, you’ve found a security hole or a misconfiguration before it becomes a breach.

Summary Security Checklist for Istio on AKS

| Task | 2026 Recommended Tool |
| --- | --- |
| Transport security | PeerAuthentication (mode: STRICT) |
| Service permissions | Istio AuthorizationPolicy |
| External traffic | Kubernetes Gateway API (managed Istio ingress) |
| Egress (outgoing) | ServiceEntry (block all traffic to external sites except specific approved domains) |
| Auditing | Azure Monitor for Containers + Istio access logs |

Warning for 2026: Ensure your worker nodes have enough “headroom.” Istio sidecars (Envoy proxies) consume roughly 0.5 to 1.0 vCPU and several hundred MBs of RAM per Pod. For a project with many small microservices, this “sidecar tax” can add up quickly.
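The egress row in the checklist above can be sketched as a ServiceEntry. This assumes the mesh is configured with outboundTrafficPolicy: REGISTRY_ONLY, so only declared hosts are reachable; the domain is a placeholder:

```yaml
# With REGISTRY_ONLY, external hosts must be declared before the mesh allows them.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: allow-payments-provider
  namespace: shop
spec:
  hosts:
  - api.payments.example.com     # placeholder external domain
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: https
    protocol: TLS
  resolution: DNS
```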

Ingress and API Gateways

In the world of Kubernetes and OpenShift, both Ingress and API Gateways serve as the entry point for external traffic. While they overlap in functionality, they operate at different levels of the networking stack and offer different “intelligence” regarding how they handle requests.

Think of Ingress as a simple receptionist directing people to the right room, while an API Gateway is a concierge who also checks IDs, translates languages, and limits how many people enter at once.


1. What is Ingress?

Ingress is a native Kubernetes resource (Layer 7) that manages external access to services, typically HTTP and HTTPS.

  • Primary Job: Simple routing based on the URL path (e.g., /api) or the hostname (e.g., app.example.com).
  • Implementation: In OCP, this is usually handled by the OpenShift Ingress Controller (based on HAProxy) using Routes.
  • Pros: Lightweight, standard across Kubernetes, and built-in.
  • Cons: Limited “logic.” It’s hard to do complex things like rate limiting, authentication, or request transformation without custom annotations.

2. What is an API Gateway?

An API Gateway is a more sophisticated proxy that sits in front of your microservices to provide “cross-cutting concerns.”

  • Primary Job: API Management. It handles security, monitoring, and orchestration.
  • Key Features:
    • Authentication/Authorization: Validating JWT tokens or API keys before the request hits the service.
    • Rate Limiting: Ensuring one user doesn’t spam your backend.
    • Payload Transformation: Changing an XML request to JSON for a modern backend.
    • Circuit Breaking: Stopping traffic to a failing service to prevent a total system crash.
  • Examples: Kong, Tyk, Apigee, or the Red Hat 3scale API Management platform.

Key Comparison Table

| Feature | Ingress / Route | API Gateway |
| --- | --- | --- |
| OSI Layer | Layer 7 (HTTP/S) | Layer 7 + application logic |
| Main Goal | Expose services to the internet | Protect and manage APIs |
| Complexity | Low | High |
| Security | Basic SSL/TLS termination | JWT, OAuth, mTLS, IP whitelisting |
| Traffic Control | Simple load balancing | Rate limiting, quotas, retries |
| Cost | Usually free (built into OCP) | Often requires licensing or extra infra |

When to use which?

  • Use Ingress/Routes when: You have a web application and just need to point a domain name to a service. It’s the “plumbing” of the cluster.
  • Use an API Gateway when: You are exposing APIs to third parties, need strict usage tracking (monetization), or want to centralize security logic so your developers don’t have to write auth code for every single microservice.

The “Modern” Middle Ground: Gateway API

There is a newer Kubernetes standard called the Gateway API. It is designed to replace Ingress by providing the power of an API Gateway (like header-based routing and traffic splitting) while remaining a standard part of the Kubernetes ecosystem. In OpenShift, you can enable the Gateway API through the Operator.

To help you see the evolution, here is how the “old” standard (Ingress) compares to the “new” standard (Gateway API).

1. The Traditional Ingress

Ingress is a single, “flat” resource. It’s simple but limited because the person who owns the app (the developer) and the person who owns the network (the admin) have to share the same file.

YAML

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
  • The Problem: If you want to do something fancy like a “Canary deployment” (sending 10% of traffic to a new version), you usually have to use messy, vendor-specific annotations.

2. The Modern Gateway API

The Gateway API breaks the configuration into pieces. This allows the Cluster Admin to define the entry point (the Gateway) and the Developer to define how their specific app is reached (the HTTPRoute).

The Admin’s Part (The Infrastructure):

YAML

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-gateway
spec:
  gatewayClassName: openshift-default
  listeners:
  - name: http
    protocol: HTTP
    port: 80

The Developer’s Part (The Logic & Traffic Splitting):

YAML

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app-route
spec:
  parentRefs:
  - name: external-gateway
  hostnames:
  - "app.example.com"
  rules:
  - matches:
    - path: { type: PathPrefix, value: /api }
    backendRefs:
    - name: api-v1
      port: 80
      weight: 90   # 90% of traffic here
    - name: api-v2
      port: 80
      weight: 10   # 10% of traffic to the new version!

Summary of Differences

| Feature | Ingress | Gateway API |
| --- | --- | --- |
| Structure | Monolithic (one file for everything) | Role-based (separate resources for admin vs. dev) |
| Traffic Splitting | Requires non-standard annotations | Built-in (weights/canary) |
| Extensibility | Limited | High (supports TCP, UDP, TLS, gRPC) |
| Portability | High (but annotations are not) | Very high (standardized across vendors) |

Why OpenShift is moving this way

OpenShift has adopted the Gateway API (initially as a Technology Preview, reaching GA in 4.19) because it solves the “annotation hell” that occurred when users tried to make basic Ingress act like a full API Gateway. It gives you the power of a professional Gateway (like Kong or Istio) but stays within the native Kubernetes language.

In OpenShift 4.15 and later (reaching General Availability in 4.19), the Gateway API is managed by the Cluster Ingress Operator. Unlike standard Kubernetes where you might have to install many CRDs manually, OpenShift streamlines this by bundling the controller logic into its existing operators.

Here is the step-by-step process to enable and use it.


1. Enable the Gateway API CRDs

In newer versions of OCP, the CRDs are often present but “dormant” until a GatewayClass is created. The Ingress Operator watches for a specific controllerName to trigger the installation of the underlying proxy (which is Istio/Envoy in the Red Hat implementation).

Create the GatewayClass:

YAML

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: openshift-default
spec:
  controllerName: openshift.io/gateway-controller/v1

What happens next? The Ingress Operator will automatically detect this and start a deployment called istiod-openshift-gateway in the openshift-ingress namespace.


2. Set up a Wildcard Certificate (Required)

Unlike standard Routes, the Gateway API in OCP does not automatically generate a default certificate. You need to provide a TLS secret in the openshift-ingress namespace.

Bash

# Example: Creating a self-signed wildcard for testing
oc -n openshift-ingress create secret tls gwapi-wildcard \
--cert=wildcard.crt --key=wildcard.key

3. Deploy the Gateway

The Gateway represents the actual “entry point” or load balancer.

YAML

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-gateway
  namespace: openshift-ingress
spec:
  gatewayClassName: openshift-default
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    hostname: "*.apps.mycluster.com"
    tls:
      mode: Terminate
      certificateRefs:
      - name: gwapi-wildcard

4. Create an HTTPRoute (Developer Task)

Now that the “door” (Gateway) is open, a developer in a different namespace can “attach” their application to it.

YAML

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app-route
  namespace: my-app-project
spec:
  parentRefs:
  - name: my-gateway
    namespace: openshift-ingress
  hostnames:
  - "myapp.apps.mycluster.com"
  rules:
  - backendRefs:
    - name: my-app-service
      port: 8080

Summary Checklist for the Interview

If you are asked how to set this up in an interview, remember these four pillars:

  1. Operator-Led: It’s managed by the Ingress Operator; no separate “Gateway Operator” is needed for the default Red Hat implementation.
  2. Implementation: OpenShift uses Envoy (via a lightweight Istio control plane) as the engine behind the Gateway API.
  3. Namespace: The Gateway object itself almost always lives in openshift-ingress.
  4. Service Type: Creating a Gateway usually triggers the creation of a Service type: LoadBalancer automatically.

Ingress vs Service mesh

Ingress and a service mesh solve different networking problems.

Ingress
Ingress is a Kubernetes API object for managing external access into the cluster, typically HTTP/HTTPS. It routes inbound requests based on hosts and paths to backend Services. Kubernetes now says Ingress is stable but frozen, and recommends the newer Gateway API for future development. (Kubernetes)

Service mesh
A service mesh is an infrastructure layer for service-to-service communication inside and around your app, adding things like traffic policy, observability, and zero-trust security without changing app code. In Istio, this includes traffic routing, retries, timeouts, fault injection, mTLS, authentication, and authorization. (Istio)

Practical difference
Think of it like this:

  • Ingress = the front door to your cluster
  • Service mesh = the road system and security checkpoints between services inside the cluster

Use Ingress when
You need:

  • a public endpoint for your app
  • host/path routing like api.example.com or /shop
  • TLS termination for incoming web traffic

That is the classic “internet → cluster → service” problem. (Kubernetes)

Use a service mesh when
You need:

  • service-to-service observability
  • mutual TLS between workloads
  • canary / weighted routing between versions
  • retries, timeouts, circuit breaking
  • policy and identity for east-west traffic
  • control over some outbound traffic too

Istio’s docs specifically describe percentage routing, version-aware routing, external service entries, retries, timeouts, and circuit breakers. (Istio)

Do they overlap?
A little. Both can influence traffic routing, but at different scopes:

  • Ingress mainly handles north-south traffic: outside users coming in
  • Service mesh mainly handles east-west traffic: service-to-service traffic inside the platform

A mesh can also handle ingress/egress via its own gateways, but that is a broader and heavier solution than plain Kubernetes Ingress. (Kubernetes)

Which should you choose?

  • For a simple web app exposing a few services: Ingress is usually enough.
  • For microservices that need security, tracing, traffic shaping, and resilience: service mesh is worth considering.
  • Many teams use both: one for external entry, one for internal communication.

One current note: for new Kubernetes edge-routing designs, Gateway API is the direction Kubernetes recommends over Ingress. (Kubernetes)

Here’s a concrete example.

Example app

Imagine an e-commerce app running on Kubernetes:

  • web-frontend
  • product-api
  • cart-api
  • checkout-api
  • payment-service
  • user-service

Customers come from the internet. The services call each other inside the cluster.

With Ingress only

Traffic flow:

Internet → Ingress controller → Kubernetes Service → Pods

Example:

  • shop.example.com goes to web-frontend
  • shop.example.com/api/* goes to product-api

What Ingress is doing here:

  • expose the app publicly
  • terminate TLS
  • route by host/path
  • maybe do some basic load balancing

So a request might go:

  1. User opens https://shop.example.com
  2. Ingress sends / to web-frontend
  3. web-frontend calls cart-api
  4. cart-api calls user-service
  5. checkout-api calls payment-service

The key point: Ingress mostly helps with step 1, the outside-in entry point. It does not, by itself, give you rich control/security/telemetry for steps 3–5. Ingress is for external access, and the Kubernetes project notes the API is stable but frozen, with Gateway API recommended for newer traffic-management work. (Kubernetes)

With Ingress + service mesh

Now add a mesh like Istio.

Traffic flow becomes:

Internet → Ingress/Gateway → web-frontend → mesh-controlled service-to-service traffic

Now you still have an entry point, but inside the cluster the mesh handles communication between services.

What the mesh adds:

  • mTLS between services
  • retries/timeouts
  • canary routing
  • traffic splitting
  • telemetry/tracing
  • authz policies between workloads

Example:

  • checkout-api sends 95% of traffic to payment-service v1 and 5% to payment-service v2
  • calls from cart-api to user-service get a 2-second timeout and one retry
  • only checkout-api is allowed to call payment-service
  • all service-to-service traffic is encrypted with mutual TLS

Those are standard service-mesh capabilities described in Istio’s traffic-management and security docs. (Istio)
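As one illustration, the 95/5 payment split above might be written as an Istio VirtualService. A DestinationRule defining the v1/v2 subsets is assumed to exist, and the names are placeholders:

```yaml
# Hypothetical 95/5 canary split between payment-service subsets.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
  namespace: shop
spec:
  hosts:
  - payment-service             # the in-cluster service name
  http:
  - route:
    - destination:
        host: payment-service
        subset: v1              # subsets defined in a DestinationRule
      weight: 95
    - destination:
        host: payment-service
        subset: v2
      weight: 5
```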

Simple diagram

Ingress only

[User on Internet]
        |
        v
    [Ingress]
        |
        v
 [web-frontend]
        |
        v
 [product-api] -> [cart-api] -> [checkout-api] -> [payment-service]

Ingress + service mesh

[User on Internet]
        |
        v
[Ingress / Gateway]
        |
        v
 [web-frontend]
        |
        v
+---------------------------------------------------+
| Service mesh inside cluster                       |
|                                                   |
|  [product-api] <-> [cart-api] <-> [checkout-api]  |
|          \             |              /           |
|           +---> [user-service] <-----+            |
|                        |                          |
|                [payment-service]                  |
|                                                   |
|  mTLS, retries, tracing, canaries, policy         |
+---------------------------------------------------+

Real-world way teams choose

Use just Ingress when:

  • you have a small app
  • you mostly need public routing
  • internal service communication is simple
  • you do not need per-service security/policy

Add a service mesh when:

  • you have many microservices
  • debugging internal calls is hard
  • you need zero-trust service identity
  • you do canaries/traffic shaping often
  • you want consistent retries/timeouts/policies

One important 2026 note

For brand-new Kubernetes edge-routing setups, many teams are moving toward Gateway API instead of classic Ingress. Kubernetes recommends Gateway over Ingress for future-facing work, and Istio also supports Gateway API for traffic management. (Kubernetes)

Rule of thumb

  • Ingress/Gateway API: “How does traffic get into my cluster?”
  • Service mesh: “How do services inside my platform talk securely and reliably?”

Kubernetes networking

Kubernetes networking is often considered the most complex part of the system because it operates on a “flat network” model. In a traditional setup, you might worry about port conflicts or how to reach a specific VM. In Kubernetes, every Pod gets its own unique IP address, and every Pod can talk to every other Pod without NAT (Network Address Translation).

To understand how traffic flows, we break it down into four distinct “layers” of communication.


1. Pod-to-Pod Communication

Every Pod in a cluster has its own internal IP. Kubernetes mandates that Pods on one node must be able to reach Pods on another node without any special configuration.

  • The Container Network Interface (CNI): This is the plugin (like Calico, Cilium, or OpenShift SDN) that actually builds the “pipes” between nodes.
  • The Experience: From the perspective of a container, it feels like it’s on a standard Ethernet network. It doesn’t care if the target Pod is on the same physical server or one across the data center.

2. Pod-to-Service Communication

Pods are “ephemeral”—they die and get replaced constantly, and their IP addresses change every time. You can’t hardcode a Pod IP into your app.

  • The Service: A Service is a stable “virtual IP” (ClusterIP) that sits in front of a group of Pods.
  • Kube-Proxy: This is a process running on every node that watches the API server. When you try to hit a Service IP, kube-proxy intercepts that traffic and redirects it to one of the healthy backend Pods.
  • CoreDNS: Kubernetes includes a built-in DNS service. Instead of an IP, your app just connects to http://my-database-service.

3. External-to-Service (Ingress & Egress)

How do people outside the cluster reach your apps?

  • NodePort: Opens a specific port (30000–32767) on every single node. High maintenance and insecure for production.
  • LoadBalancer: Tells your cloud provider (AWS/Azure/GCP) to spin up a physical Load Balancer that points to your nodes.
  • Ingress/Route: As we discussed earlier, this is a Layer 7 “smart router” that lets you use hostnames (like api.example.com) to direct traffic to different services using a single entry point.

4. Network Policies (The Firewall)

By default, Kubernetes is “open”—any Pod can talk to any other Pod. In production, you usually want to restrict this for security.

  • NetworkPolicy: These are the firewall rules of Kubernetes. You can define rules like: “Only the ‘Frontend’ Pod is allowed to talk to the ‘Database’ Pod on port 5432.”
  • Implementation: These rules are enforced by the CNI (e.g., Calico or Cilium) at the packet level.
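The “only the Frontend may talk to the Database on port 5432” rule above could be written as follows; the pod labels are assumptions:

```yaml
# Allow only frontend pods to reach the database pods on PostgreSQL's port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-frontend
spec:
  podSelector:
    matchLabels:
      app: database           # the pods being protected
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend       # the only permitted client
    ports:
    - protocol: TCP
      port: 5432
```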

Summary of Components

| Component | Level | Purpose |
| --- | --- | --- |
| CNI | Layer 3 (IP) | Connects Pods across different nodes. |
| Service | Layer 4 (TCP/UDP) | Provides a stable IP/DNS for a group of Pods. |
| Ingress | Layer 7 (HTTP) | Handles routing, SSL, and hostnames. |
| NetworkPolicy | Security | Defines which Pods are allowed to talk to each other. |

The Path of a Packet

If a user types example.com into their browser:

  1. DNS resolves the name to the Ingress Controller’s IP.
  2. The Ingress Controller looks at the host header and finds the correct Service.
  3. The Service (via kube-proxy) picks a healthy Pod and sends the traffic there.
  4. The CNI routes that packet across the internal network to the node where the Pod is living.

To understand how data actually moves between nodes, we have to look at the CNI (Container Network Interface).

Since every Pod has a unique IP but lives on a host (Node) with its own separate IP, the CNI’s job is to “tunnel” the Pod’s traffic across the physical network. The two most common ways it does this are Encapsulation (Overlay) and Direct Routing.


1. Encapsulation (The “Envelope” Method)

This is the most common approach (used by Flannel (VXLAN) and OpenShift SDN). It treats the physical network as a “carrier” for a private, virtual network.

  • How it works: When Pod A (on Node 1) sends a packet to Pod B (on Node 2), the CNI takes that entire packet and wraps it inside a new UDP packet.
  • The “Outer” Header: Points from Node 1’s IP to Node 2’s IP.
  • The “Inner” Header: Points from Pod A’s IP to Pod B’s IP.
  • Arrival: When the packet hits Node 2, the CNI “unwraps” the outer envelope and delivers the original inner packet to Pod B.

The Downside: This adds a small amount of overhead (usually about 50 bytes per packet) because of the extra headers. This is why you often see the MTU (Maximum Transmission Unit) set slightly lower in Kubernetes (e.g., 1450 instead of 1500).
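The 50-byte figure is simple arithmetic: VXLAN over an IPv4 underlay adds an outer IP header (20 bytes), an outer UDP header (8), the VXLAN header (8), and the encapsulated inner Ethernet header (14):

```shell
# VXLAN encapsulation overhead for an IPv4 underlay (bytes per header)
OUTER_IP=20; OUTER_UDP=8; VXLAN_HDR=8; INNER_ETH=14
OVERHEAD=$((OUTER_IP + OUTER_UDP + VXLAN_HDR + INNER_ETH))
POD_MTU=$((1500 - OVERHEAD)) # Physical MTU minus the encapsulation tax
echo "overhead=${OVERHEAD}B pod_mtu=${POD_MTU}"
```

This prints `overhead=50B pod_mtu=1450`, matching the 1450 MTU commonly seen on Pod interfaces. (Geneve, used by OVN-Kubernetes, has a variable-length header, so its overhead can be slightly larger.)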


2. Direct Routing (The “BGP” Method)

Used by Calico (in non-overlay mode) and Cilium, this method avoids the “envelope” entirely for better performance.

  • How it works: The nodes act like standard network routers. They use BGP (Border Gateway Protocol) to tell each other: “Hey, if you want to reach the 10.244.1.0/24 subnet, send those packets to me (Node 1).”
  • The Experience: Packets travel “naked” across the wire with no extra headers.
  • The Requirement: Your physical network routers must be able to handle these extra routes, or the nodes must all be on the same Layer 2 segment (the same VLAN/Switch).

3. The Role of the “veth” Pair

Regardless of how the data moves between nodes, getting data out of a container uses a Virtual Ethernet (veth) pair.

Think of a veth pair as a virtual patch cable:

  1. One end is plugged into the container (usually named eth0).
  2. The other end is plugged into the host’s network namespace (often named something like vethabc123).
  3. The host end is usually connected to a Bridge (like cni0 or br0), which acts like a virtual switch for all Pods on that specific node.

4. Comparing Popular CNIs

| CNI Plugin | Primary Method | Best Use Case |
| --- | --- | --- |
| Flannel | VXLAN (Overlay) | Simple clusters; works almost anywhere. |
| Calico | BGP or IP-in-IP | High performance and advanced Network Policies. |
| Cilium | eBPF | Next-gen performance, deep security, and observability. |
| OpenShift SDN | VXLAN | Default for older OCP; very stable and integrated. |
| OVN-Kubernetes | Geneve (Overlay) | Modern OCP default; supports massive scale and Windows nodes. |

Which one should you choose?

  • If you are on OpenShift 4.x, you are likely using OVN-Kubernetes. It’s powerful and handles complex routing for you.
  • If you are building a Vanilla K8s cluster and want the absolute fastest networking, Cilium is the current industry gold standard because it uses eBPF to bypass much of the traditional kernel networking path.

To understand why eBPF (Extended Berkeley Packet Filter) is the “holy grail” of modern Kubernetes networking, we first have to look at how the “old way” (iptables) works.

1. The Old Way: iptables (The “List of Rules”)

For years, Kubernetes used iptables (a standard Linux kernel feature) to route traffic.

  • How it works: Imagine a giant printed list of instructions. Every time a packet arrives, the CPU has to read the list from top to bottom: “Is it for Service A? No. Service B? No. Service C? Yes.”
  • The Problem: As you add more Services and Pods, this list grows to thousands of lines. If a packet is destined for the 5,000th service on the list, the CPU has to perform 5,000 checks for every single packet.
  • Result: High latency and high CPU “overhead” just to move data.

2. The New Way: eBPF (The “Direct Shortcut”)

eBPF allows you to run small, sandboxed programs directly inside the Linux kernel without changing the kernel code.

  • How it works: Instead of a long list of rules, eBPF creates a high-speed “lookup table” (a Hash Map) in the kernel’s memory.
  • The Benefit: When a packet arrives, the eBPF program looks at the destination and instantly knows where it goes. It doesn’t matter if you have 10 services or 10,000—the lookup time is exactly the same (O(1) complexity).
  • Bypassing the Stack: eBPF can catch a packet the moment it hits the Network Interface Card (NIC) and send it straight to the Pod, bypassing almost the entire Linux networking stack.

3. Why Cilium + eBPF is a Game Changer

Cilium is the most popular CNI that uses eBPF. It provides three massive advantages over traditional networking:

| Feature | iptables / Standard CNI | Cilium (eBPF) |
| --- | --- | --- |
| Performance | Slows down as the cluster grows. | Consistently fast at any scale. |
| Observability | Hard to see “who is talking to who” without sidecars. | Hubble (Cilium’s UI) shows every flow, drop, and latency in real-time. |
| Security | IP-based filtering (hard to manage). | Identity-based filtering. It knows a packet belongs to “Service-Frontend” regardless of its IP. |

4. Why OpenShift is Moving to OVN (Geneve)

While Cilium is the “shiny new toy,” Red Hat chose OVN-Kubernetes (using the Geneve protocol) as the default for OCP 4.

  • Scale: OVN is built on Open vSwitch, which was designed for massive telco-grade clouds.
  • Feature Parity: It handles complex things like “Egress IPs” (giving a specific namespace a static IP for exiting the cluster) and Hybrid networking (Linux + Windows nodes) much more maturely than basic eBPF implementations did a few years ago.

Summary: The “Speed” Evolution

  1. iptables: Standard, but slow at scale.
  2. IPVS: A middle ground that uses hash tables but is still tied to the old kernel stack.
  3. eBPF (Cilium): The fastest possible way to move data in Linux today.

In OpenShift, the modern way to see these network flows is through the Network Observability Operator. This tool uses the eBPF technology we discussed to capture traffic data without slowing down your pods.

Here is how you can access and use these views.


1. Using the Web Console (The GUI Way)

Once the operator is installed, a new menu appears in your OpenShift Console.

  1. Navigate to Observe -> Network Traffic in the Administrator perspective.
  2. Overview Tab: This gives you a high-level “Sankey” diagram or graph showing which namespaces are talking to each other. It’s perfect for spotting “top talkers” (apps using the most bandwidth).
  3. Traffic Flows Tab: This is like a “Wireshark-lite” for your cluster. You can see every individual connection, including:
    • Source/Destination: Which pod is talking to which service.
    • Byte Rate: How much data is moving.
    • RTT (Latency): Exactly how many milliseconds a packet takes to travel between pods.
  4. Topology Tab: This provides a visual map of your network. You can group by “Namespace” or “Node” to see how traffic crosses physical boundaries.

2. Using the CLI (The “oc netobserv” plugin)

If you prefer the terminal, there is a specific plugin called oc netobserv. This is incredibly useful for live debugging when you don’t want to leave your shell.

Capture live flows:

Bash

oc netobserv flows --protocol=TCP --port=80

This will stream live traffic data directly to your terminal.

Filter for specific issues:

You can filter by namespace or even look for dropped packets (great for debugging firewall/NetworkPolicy issues):

Bash

oc netobserv flows --namespace=my-app --action=Drop

3. The “Loki” Backend

Behind the scenes, the Network Observability Operator stores these flows in Loki (a log aggregation system). This allows you to “go back in time.”

If an application crashed at 2:00 AM, you can go to the Network Traffic page, set the time filter to 2:00 AM, and see if there was a sudden spike in traffic or if a connection was being blocked by a security policy at that exact moment.


4. Advanced Debugging: Packet Drops

One of the best features of the eBPF-based observer is Packet Drop tracking. Traditional tools often can’t tell you why a packet disappeared. With this tool, the kernel can report the exact reason:

  • Filter Drop: A NetworkPolicy blocked it.
  • TCP Timeout: The other side didn’t respond.
  • Congestion: The network interface was overloaded.

Summary: What can you find?

  • Security: Is my database receiving traffic from an unauthorized namespace?
  • Performance: Which pods have the highest latency (RTT)?
  • Cost: Which services are sending the most data to external (Internet) IPs?

In Kubernetes, a NetworkPolicy is your cluster’s internal firewall. By default, Kubernetes has a “non-isolated” policy—meaning every pod can talk to every other pod.

To secure your app, you should follow the “Principle of Least Privilege”: block everything, then specifically allow only what is necessary.


1. The “Default Deny” (The Foundation)

Before you write specific rules, it is a best practice to create a “Default Deny” policy for your namespace. This locks all doors so that nothing can enter or leave unless you explicitly say so.

YAML

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-secure-app
spec:
  podSelector: {} # Matches all pods in this namespace
  policyTypes:
  - Ingress
  - Egress

2. Allowing Specific Traffic (The “Rule”)

Now that everything is blocked, let’s say you have a Database pod and you only want your Frontend pod to talk to it on port 5432.

YAML

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-db
  namespace: my-secure-app
spec:
  podSelector:
    matchLabels:
      app: database # This policy applies to pods labeled 'app: database'
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: frontend # Only allow pods labeled 'role: frontend'
    ports:
    - protocol: TCP
      port: 5432

3. Three Ways to Target Traffic

You can control traffic based on three different criteria:

  1. podSelector: Target pods within the same namespace (e.g., “Frontend to Backend”).
  2. namespaceSelector: Target entire namespaces (e.g., “Allow everything from the ‘Monitoring’ namespace”).
  3. ipBlock: Target specific IP ranges outside the cluster (e.g., “Allow traffic from our corporate VPN range 10.0.0.0/24”).
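The three selectors can be mixed in a single rule. A sketch (the labels, namespace name, and CIDR are placeholders):

```yaml
ingress:
- from:
  - podSelector:            # 1. Pods in THIS namespace
      matchLabels:
        role: frontend
  - namespaceSelector:      # 2. Any pod in the 'monitoring' namespace
      matchLabels:
        kubernetes.io/metadata.name: monitoring
  - ipBlock:                # 3. External IPs, e.g. a corporate VPN range
      cidr: 10.0.0.0/24
```

Note that separate entries in the same `from:` list are OR’ed: traffic matching any one of the three selectors is allowed.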

4. Troubleshooting NetworkPolicies

If you apply a policy and your app stops working, here is how to debug:

  • Check Labels: NetworkPolicies rely 100% on labels. If your Frontend pod is labeled app: front-end but your policy looks for role: frontend, it will fail silently.
  • The “Blind” Policy: Standard Kubernetes doesn’t “log” when a policy blocks a packet. This is why we use the Network Observability Operator (as we discussed) to see the “Drop” events.
  • CNI Support: Remember, the CNI (Calico, OVN, etc.) is what actually enforces these rules. If your CNI doesn’t support NetworkPolicies (like basic Flannel), the YAML will be accepted but it won’t actually block anything!

Summary: Ingress vs. Egress

  • Ingress: Controls traffic coming into the pod (Who can talk to me?).
  • Egress: Controls traffic leaving the pod (Who can I talk to?).

A Zero Trust architecture in Kubernetes means that no pod is trusted by default. Even if a pod is inside your cluster, it shouldn’t be allowed to talk to anything else unless you specifically permit it.

In this scenario, we have a 3-tier app: Frontend, Backend, and Database.


1. The “Lockdown” (Default Deny)

First, we apply this to the entire namespace. This ensures that any new pod you deploy in the future is “secure by default” and cannot communicate until you add a rule for it.

YAML

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-app-stack
spec:
  podSelector: {} # Matches ALL pods
  policyTypes:
  - Ingress
  - Egress

2. Tier 1: The Frontend

The Frontend needs to receive traffic from the Internet (via the Ingress Controller) and send traffic only to the Backend.

YAML

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-policy
  namespace: my-app-stack
spec:
  podSelector:
    matchLabels:
      tier: frontend
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          network.openshift.io/policy-group: ingress # Allows OpenShift Ingress Controller
  egress:
  - to:
    - podSelector:
        matchLabels:
          tier: backend # ONLY allowed to talk to Backend

3. Tier 2: The Backend

The Backend should only accept traffic from the Frontend and is only allowed to talk to the Database.

YAML

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-policy
  namespace: my-app-stack
spec:
  podSelector:
    matchLabels:
      tier: backend
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          tier: frontend # ONLY accepts Frontend traffic
  egress:
  - to:
    - podSelector:
        matchLabels:
          tier: database # ONLY allowed to talk to DB

4. Tier 3: The Database

The Database is the most sensitive. It should never initiate a connection (no Egress) and only accept traffic from the Backend.

YAML

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: database-policy
  namespace: my-app-stack
spec:
  podSelector:
    matchLabels:
      tier: database
  policyTypes:
  - Ingress
  - Egress # We include Egress to ensure it's blocked by default
  ingress:
  - from:
    - podSelector:
        matchLabels:
          tier: backend
    ports:
    - protocol: TCP
      port: 5432 # Postgres port

Important: Don’t Forget DNS!

When you apply a “Default Deny” Egress policy, your pods can no longer talk to CoreDNS, which means they can’t resolve service names like http://backend-service.

To fix this, you must add one more policy to allow UDP Port 53 to the openshift-dns namespace:

YAML

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress # Illustrative name
  namespace: my-app-stack
spec:
  podSelector: {} # Applies to all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP # DNS falls back to TCP for large responses
      port: 53


Summary of the Strategy

  • Labels are everything: If you typo tier: backend as tier: back-end, the wall stays up and the app breaks.
  • Layered Security: Even if a hacker compromises your Frontend pod, they cannot “scan” your network or reach your Database directly; they are stuck only being able to talk to the Backend.

To test your Zero Trust setup, we will perform a “Penetration Test” from inside the cluster. We’ll use a temporary debug pod to see if our firewall rules actually block unauthorized access.

1. The “Attacker” Pod

We will spin up a temporary pod with no labels, based on the alpine image (its built-in BusyBox tools include nc; curl and dig can be added with apk add curl bind-tools). Since our “Default Deny” policy targets all pods, this pod should be completely isolated the moment it starts.

Bash

# Run a temporary alpine pod
oc run network-tester --image=alpine --restart=Never -- /bin/sh -c "sleep 3600"

2. Test 1: Can an unknown pod talk to the Database?

Even if this pod is in the same namespace, it should not be able to reach the database because it doesn’t have the tier: backend label.

Bash

# Try to connect to the database on port 5432
oc exec network-tester -- nc -zv database-service 5432

Expected Result: Connection timed out (The packet is dropped by the NetworkPolicy).


3. Test 2: Can the Frontend “Scan” the Database?

Now, let’s pretend a hacker compromised your Frontend pod. Can they bypass the Backend and talk directly to the Database?

Bash

# Exec into your existing Frontend pod
oc exec deployment/frontend -- nc -zv database-service 5432

Expected Result: Connection timed out.

Even though the Frontend is “trusted” to talk to the Backend, it is not trusted to talk to the Database.


4. Test 3: Can the Database reach the Internet? (Egress Test)

A common hacker tactic is to steal data and send it to an external server (Exfiltration). Let’s see if the Database can “phone home.”

Bash

# Try to ping Google from the Database pod
oc exec deployment/database -- curl -I google.com

Expected Result: Could not resolve host or Timeout.

Since we didn’t add an Egress rule for the Database, it is physically unable to send data out of its own pod.


5. How to see the “Deny” in real-time

If you want to prove the policy is working without just guessing based on timeouts, use the Network Observability tool we set up earlier:

  1. Go to Observe -> Network Traffic.
  2. Filter by Action: Deny or Action: Drop.
  3. You will see a red entry showing:
    • Source: network-tester
    • Destination: database
    • Reason: NetworkPolicy

6. Cleaning Up

When you’re done testing, don’t forget to remove the tester pod:

Bash

oc delete pod network-tester

Summary of the Lab

| Test | Source | Target | Status | Why? |
| --- | --- | --- | --- | --- |
| Unauthorized | Debug Pod | Database | BLOCKED | Missing tier: backend label. |
| Lateral Movement | Frontend | Database | BLOCKED | Policy only allows Frontend -> Backend. |
| Exfiltration | Database | Internet | BLOCKED | No Egress rules defined for DB. |

Ingress

In Kubernetes, Ingress is an API object that acts as a “smart router” for your cluster. While a standard Service (like a LoadBalancer) simply opens a hole in the firewall for one specific app, Ingress allows you to consolidate many services behind a single entry point and route traffic based on the URL or path.

Think of it as the receptionist of an office building: instead of every employee having their own front door, everyone uses one main entrance, and the receptionist directs visitors to the correct room based on who they are looking for.


1. How Ingress Works

There are two distinct parts required to make this work:

  1. Ingress Resource: A YAML file where you define your “rules” (e.g., “Send all traffic for myapp.com/api to the api-service”).
  2. Ingress Controller: The actual software (like NGINX, HAProxy, or Traefik) that sits at the edge of your cluster, reads those rules, and physically moves the traffic. Kubernetes does not come with a controller by default; you must install one.

2. Key Capabilities

Ingress is much more powerful than a simple NodePort or LoadBalancer Service because it operates at Layer 7 (HTTP/HTTPS).

  • Host-based Routing: Route blue.example.com to the Blue Service and green.example.com to the Green Service using a single IP.
  • Path-based Routing: Route example.com/login to the Auth service and example.com/search to the Search service.
  • SSL/TLS Termination: You can handle your SSL certificates at the Ingress level so your individual application pods don’t have to deal with encryption/decryption.
  • Name-based Virtual Hosting: Supporting multiple domain names on the same IP address.

3. Ingress vs. LoadBalancer vs. NodePort

Choosing how to expose your app is a common point of confusion. Here is the breakdown:

| Method | Best For | Pros/Cons |
| --- | --- | --- |
| NodePort | Testing/Dev | Opens a high-range port (30000+) on every node. Very insecure for production. |
| LoadBalancer | Single Apps | Standard way to get an external IP. Expensive if you have 50 apps, as you pay for 50 cloud load balancers. |
| Ingress | Production | Multiple apps share one Load Balancer. Extremely cost-effective and feature-rich. |

4. What about OpenShift “Routes”?

Since you were asking about OCP earlier, it’s important to note that OpenShift has its own version called a Route.

  • Routes were created by Red Hat before the Kubernetes Ingress standard existed.
  • In modern OpenShift, an Ingress is automatically converted into a Route behind the scenes.
  • The main difference: Routes natively support advanced HAProxy features (such as TLS passthrough and re-encrypt termination), while Ingress is the industry-standard, portable way to do it.
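For comparison, a minimal OpenShift Route with edge TLS termination might look like this (the host, Service name, and port are illustrative):

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: my-app-route
spec:
  host: myapp.example.com
  to:
    kind: Service
    name: my-web-service
  port:
    targetPort: 8080
  tls:
    termination: edge # TLS ends at the router; 'passthrough' and 'reencrypt' also exist
```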

5. Basic Ingress Example

This YAML routes traffic to two different services based on the URL path:

YAML

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: main-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /billing
        pathType: Prefix
        backend:
          service:
            name: billing-service
            port:
              number: 80
      - path: /inventory
        pathType: Prefix
        backend:
          service:
            name: inventory-service
            port:
              number: 80

To set up HTTPS for an Ingress, you need a TLS Secret. This secret acts as a container for your SSL certificate and private key, which the Ingress Controller (like NGINX or HAProxy) uses to encrypt the traffic before it reaches your pods.


1. Create the TLS Secret

First, you need your certificate files (tls.crt and tls.key). You can create a Kubernetes Secret from these files using the CLI:

Bash

kubectl create secret tls my-app-tls \
  --cert=path/to/tls.crt \
  --key=path/to/tls.key

Note: If you are using OpenShift, you can often skip this manual step by using Service Serving Certificates, which allow OpenShift to generate and manage the certificates for you automatically.


2. Update the Ingress YAML

Now, you tell the Ingress resource to use that secret for a specific hostname.

YAML

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: secure-ingress
  annotations:
    # This annotation tells the controller to redirect HTTP to HTTPS
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
  - hosts:
    - myapp.example.com
    secretName: my-app-tls # This must match the secret name created in Step 1
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-web-service
            port:
              number: 80

3. Automation with Cert-Manager

Manually updating certificates before they expire is a headache. Most production clusters use Cert-Manager.

Cert-Manager is an operator that talks to certificate authorities like Let’s Encrypt. You simply add an annotation to your Ingress, and Cert-Manager handles the rest:

The “Magic” Annotation:

YAML

metadata:
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"

Once you add this, Cert-Manager will:

  1. See the Ingress request.
  2. Reach out to Let’s Encrypt to verify you own the domain.
  3. Generate the tls.crt and tls.key.
  4. Create the Secret for you and renew it every 90 days automatically.

Summary Checklist for HTTPS

| Step | Action |
| --- | --- |
| 1. Certificate | Obtain a CA-signed cert or use Let’s Encrypt. |
| 2. Secret | Store the cert/key in a kind: Secret (type kubernetes.io/tls). |
| 3. Ingress Spec | Add the tls: section to your Ingress YAML. |
| 4. DNS | Ensure your domain points to the Ingress Controller’s IP. |

To automate SSL certificates with Cert-Manager, you need a ClusterIssuer. This is a cluster-wide resource that tells Cert-Manager how to talk to a Certificate Authority (CA) like Let’s Encrypt.

Before you start, ensure the Cert-Manager Operator is installed in your cluster (in OpenShift, you can find this in the OperatorHub).


1. Create a ClusterIssuer (The “Account”)

This YAML defines your identity with Let’s Encrypt. It uses the ACME (Automated Certificate Management Environment) protocol.

YAML

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # The ACME server address for Let's Encrypt production
    server: https://acme-v02.api.letsencrypt.org/directory
    # Your email address for expiration notices
    email: admin@yourdomain.com
    # Name of a secret used to store the ACME account private key
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    # Enable the HTTP-01 challenge provider
    solvers:
    - http01:
        ingress:
          class: nginx # Or 'openshift-default' depending on your ingress controller

2. Update your Ingress to “Request” the Cert

Once the ClusterIssuer is created, you don’t need to manually create secrets anymore. You just “tag” your Ingress with an annotation. Cert-Manager will see this, perform the challenge, and create the secret for you.

YAML

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-secure-app
  annotations:
    # THIS IS THE TRIGGER: It links the Ingress to your ClusterIssuer
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
  - hosts:
    - app.yourdomain.com
    secretName: app-tls-cert # Cert-Manager will create this secret automatically
  rules:
  - host: app.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-service
            port:
              number: 80

3. How to verify it’s working

After you apply the Ingress, Cert-Manager creates a Certificate object and a Challenge object. You can track the progress:

  • Check the certificate status: kubectl get certificate (look for READY: True).
  • Check the order status (if it’s stuck): kubectl get challenges.
  • Check the secret: kubectl get secret app-tls-cert (if this exists, your site is now HTTPS!).

Why use Let’s Encrypt?

  1. Cost: It is 100% free.
  2. Trust: It is recognized by all major browsers (unlike self-signed certs).
  3. No Maintenance: Cert-Manager automatically renews the cert 30 days before it expires.

A Small Warning:

Let’s Encrypt has rate limits. If you are just testing, use the “Staging” URL (https://acme-staging-v02.api.letsencrypt.org/directory) first. Browsers will show a warning for staging certs, but you won’t get blocked for hitting limit thresholds while debugging.
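A staging issuer is typically just a second ClusterIssuer that differs only in its name and server URL. A sketch (email and secret name are placeholders):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # Let's Encrypt STAGING endpoint: lenient rate limits, untrusted certs
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: admin@yourdomain.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
    - http01:
        ingress:
          class: nginx
```

To switch an app between environments, change its Ingress annotation to cert-manager.io/cluster-issuer: "letsencrypt-staging" (and back to "letsencrypt-prod" once everything works).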

When Cert-Manager fails to issue a certificate, it usually gets stuck in the Challenge phase. Let’s look at how to diagnose and fix the most common “Let’s Encrypt” roadblocks.


1. The Troubleshooting Command Chain

If your certificate isn’t appearing, follow this hierarchy to find where the “handshake” broke:

  • Check the Certificate status: oc get certificate. If READY is False, move to the next step.
  • Check the Order: oc get order. The Order is the request sent to Let’s Encrypt; look at the STATE column.
  • Check the Challenge (the most important step): oc get challenges. If a challenge exists, it means Let’s Encrypt is trying to verify your domain but can’t.
  • Describe the Challenge for the error message: oc describe challenge <challenge-name>

2. Common Failure Reasons

A. The “I Can’t See You” (Firewall/Network)

Let’s Encrypt uses the HTTP-01 challenge. It tries to reach http://yourdomain.com/.well-known/acme-challenge/<TOKEN>.

  • The Problem: Your firewall, Security Group (AWS/Azure), or OpenShift Ingress Controller is blocking Port 80.
  • The Fix: Ensure Port 80 is open to the public internet. Let’s Encrypt cannot verify your domain over Port 443 (HTTPS) because the certificate doesn’t exist yet!

B. DNS Record Mismatch

  • The Problem: Your DNS A record or CNAME for app.yourdomain.com hasn’t propagated yet or is pointing to the wrong Load Balancer IP.
  • The Fix: Use dig app.yourdomain.com or nslookup to ensure the domain points exactly to your Ingress Controller’s external IP.

C. Rate Limiting

  • The Problem: You’ve tried to issue the same certificate too many times in one week (Let’s Encrypt has a limit of 5 duplicate certs per week).
  • The Fix: Switch your ClusterIssuer to use the Staging URL (mentioned in the previous step) until your configuration is 100% correct, then switch back to Production.

3. Dealing with Internal/Private Clusters

If your OpenShift cluster is behind a VPN and not accessible from the public internet, the HTTP-01 challenge will always fail because Let’s Encrypt can’t “see” your pods.

The Solution: DNS-01 Challenge

Instead of a web check, Cert-Manager proves ownership by adding a temporary TXT record to your DNS provider (Route53, Cloudflare, Azure DNS).

Example DNS-01 Issuer (Route53):

YAML

spec:
  acme:
    solvers:
    - dns01:
        route53:
          region: us-east-1
          hostedZoneID: Z123456789
Summary Checklist

  1. Is Port 80 open?
  2. Does DNS point to the cluster?
  3. Are you hitting Rate Limits?
  4. Is your Ingress Class correct in the Issuer?

Persistent Volumes (PV) and Persistent Volume Claims (PVC)

In Kubernetes, storage is handled separately from your application’s logic. To understand Persistent Volumes (PV) and Persistent Volume Claims (PVC), it helps to use the “Electricity” analogy:

  • PV (The Infrastructure): This is like the power plant and the grid. It’s the actual physical storage (a disk, a cloud drive, or a network share).
  • PVC (The Request): This is like the power outlet in your wall. Your application “plugs in” to the PVC to get what it needs without needing to know where the power plant is.

1. Persistent Volume (PV)

A PV is a piece of storage in the cluster that has been provisioned by an administrator or by a storage class. It is a cluster-level resource (like a Node) and exists independently of any individual Pod.

  • Capacity: How much space is available (e.g., 5Gi, 100Gi).
  • Access Modes:
    • ReadWriteOnce (RWO): Can be mounted by one node at a time.
    • ReadOnlyMany (ROX): Many nodes can read it simultaneously.
    • ReadWriteMany (RWX): Many nodes can read and write at the same time (requires specific hardware like NFS or ODF).
  • Reclaim Policy: What happens to the data when you delete the PVC? (Retain it for manual cleanup or Delete it immediately).

2. Persistent Volume Claim (PVC)

A PVC is a request for storage by a user. If a Pod needs a “hard drive,” it doesn’t look for a specific disk; it creates a PVC asking for “10Gi of storage with ReadWriteOnce access.”

  • The “Binding” Process: Kubernetes looks at all available PVs. If it finds a PV that matches the PVC’s request, it “binds” them together.
  • Namespace Scoped: Unlike PVs, PVCs live inside a specific Namespace.

3. Dynamic Provisioning (StorageClasses)

In modern clusters (like OpenShift), admins don’t manually create 100 different PVs. Instead, they use a StorageClass.

  1. The user creates a PVC.
  2. The StorageClass notices the request.
  3. It automatically talks to the cloud provider (AWS/Azure/GCP) to create a new disk.
  4. It automatically creates the PV and binds it to the PVC.
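Under the hood, a StorageClass is itself just a small object. Here is a sketch for the AWS EBS CSI driver; the name is illustrative and the parameters vary by provisioner:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-gp3 # Illustrative name; this is what PVCs reference
provisioner: ebs.csi.aws.com # The driver that actually creates the disks
parameters:
  type: gp3 # Provisioner-specific: EBS volume type
reclaimPolicy: Delete             # Delete the disk when the PVC is deleted
volumeBindingMode: WaitForFirstConsumer # Create the disk in the Pod's AZ
allowVolumeExpansion: true        # Permit growing the PVC later
```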

4. How a Pod uses it

Once the PVC is bound to a PV, you tell your Pod to use that “outlet.”

YAML

spec:
  containers:
  - name: my-db
    image: postgres
    volumeMounts:
    - mountPath: "/var/lib/postgresql/data"
      name: my-storage
  volumes:
  - name: my-storage
    persistentVolumeClaim:
      claimName: task-pv-claim # This matches the name of your PVC

Summary Comparison

| Feature | Persistent Volume (PV) | Persistent Volume Claim (PVC) |
| --- | --- | --- |
| Who creates it? | Administrator or Storage System | Developer / Application |
| Scope | Cluster-wide | Namespace-specific |
| Analogy | The actual Hard Drive | The request for a Hard Drive |
| Lifecycle | Exists even if no one uses it | Tied to the application’s needs |

Here is a standard YAML example for a Persistent Volume Claim (PVC).

In this scenario, we aren’t manually creating a disk. Instead, we are telling OpenShift/Kubernetes: “I need 10Gi of fast storage. Please go talk to the cloud provider or storage backend and create it for me.”

1. The PVC Definition

This is the “request” for storage.

YAML

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamic-pvc-example
  namespace: my-app-project
spec:
  storageClassName: gp3-csi # Or 'thin', 'ocs-storagecluster-ceph-rbd', etc.
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

2. How the Binding Works

When you apply this YAML, the following chain reaction happens:

  1. The Claim: You submit the PVC.
  2. The Provisioner: The StorageClass (e.g., AWS EBS, Azure Disk, or OpenShift Data Foundation) sees the request.
  3. The Asset: The storage backend creates a physical 10Gi volume.
  4. The Volume: Kubernetes automatically creates a PersistentVolume (PV) object to represent that physical disk.
  5. The Binding: The PVC status changes from Pending to Bound.

3. Attaching the PVC to a Pod

A PVC is useless until a Pod “claims” it. Here is how you mount that 10Gi disk into a container:

YAML

apiVersion: v1
kind: Pod
metadata:
  name: storage-test-pod
spec:
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - name: my-data-volume
      mountPath: /usr/share/nginx/html # Where the disk appears inside the container
  volumes:
  - name: my-data-volume
    persistentVolumeClaim:
      claimName: dynamic-pvc-example # Must match the name in the PVC YAML

Important “Gotchas” with PVCs

  • Access Modes:
    • ReadWriteOnce (RWO): Most common. If Pod A is using the disk on Node 1, Pod B cannot use it from Node 2.
    • ReadWriteMany (RWX): Required if you want multiple Pods across different nodes to share the same files (common for web servers sharing an uploads folder).
  • Expansion: Many modern StorageClasses allow you to increase the storage size in the PVC YAML after it’s created, and Kubernetes will expand the disk on the fly (provided the underlying storage supports it).
  • Sticky Nodes: If you use a cloud-based RWO disk (like AWS EBS), your Pod becomes “stuck” to the availability zone where that disk was created.

Checking for available StorageClasses is one of the most common tasks for an OpenShift administrator or developer. It tells you exactly what “flavors” of storage are available for your apps.

1. Using the CLI (Recommended)

Run the following command to see a list of all storage providers configured in your cluster:

Bash

oc get storageclass

(Or use the shorthand: oc get sc)

Example Output:

Plaintext

NAME                PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp3-csi (default)   ebs.csi.aws.com            Delete          WaitForFirstConsumer   true                   45d
thin                kubernetes.io/vsphere      Delete          Immediate              false                  102d
ocs-storage-rbd     openshift-storage.rbd...   Delete          Immediate              true                   12d

2. How to Read the Results

  • NAME: This is what you put in the storageClassName field of your PVC YAML.
  • (default): If you see this next to a name, it means any PVC that doesn’t specify a class will automatically get this one.
  • PROVISIONER: This tells you the underlying technology (e.g., AWS EBS, VMware vSphere, or Ceph/ODF).
  • RECLAIMPOLICY:
    • Delete: When you delete the PVC, the physical disk is also deleted.
    • Retain: When you delete the PVC, the physical disk stays (so you can recover data manually).
  • VOLUMEBINDINGMODE:
    • Immediate: The disk is created the second you create the PVC.
    • WaitForFirstConsumer: The disk isn’t created until a Pod actually tries to use it. This is smarter because it ensures the disk is created in the same Availability Zone as the Pod.

3. Inspecting a Specific StorageClass

If you want to see the “fine print” (like encryption settings or IOPS), you can look at the YAML of a specific class:

Bash

oc get sc gp3-csi -o yaml

4. Using the Web Console

If you prefer the GUI:

  1. Log in to the OpenShift Web Console.
  2. Ensure you are in the Administrator perspective.
  3. Navigate to Storage -> StorageClasses.
  4. Here, you can see all classes, which one is the default, and even create new ones using a wizard.

Pro-Tip: If your cluster has no default storage class, your PVCs will stay in a Pending state forever unless you explicitly name one in your YAML.
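Which class is the default is controlled by a single annotation on the StorageClass object (a YAML fragment):

```yaml
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
```

In practice you can set it in place with a patch, e.g. oc patch storageclass gp3-csi -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}' (the class name here is illustrative). Make sure only one class carries the annotation, or behavior becomes ambiguous.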