Troubleshooting Egress Issues in OpenShift Namespaces

This is a classic OpenShift case because egress controls can be namespace-scoped, so one project can reach the internet while another cannot even though both are on the same cluster. In OpenShift with OVN-Kubernetes, the main things to check are Kubernetes NetworkPolicy egress rules, OpenShift EgressFirewall objects, and sometimes EgressIP if the namespace is supposed to leave the cluster from a specific source IP.

OpenShift documents EgressFirewall as a namespace-level object, and Kubernetes documents that once a pod is selected by an egress policy, only the explicitly allowed outbound traffic is permitted. (Red Hat Documentation)

Scenario

Pods in namespace team-a can reach external sites, but pods in team-b cannot.

Examples:

oc exec -n team-a deploy/app -- curl https://example.com # works
oc exec -n team-b deploy/app -- curl https://example.com # fails

That pattern strongly suggests the problem is policy attached to the namespace, not a cluster-wide outage. OpenShift’s EgressFirewall is evaluated per namespace, and if there is no matching rule then traffic is allowed by default unless something else, like a NetworkPolicy, restricts it. (Red Hat Documentation)

Diagram

          Namespace team-a                 Namespace team-b
      +---------------------+           +---------------------+
      | pod -> external IP  |           | pod -> external IP  |
      +----------+----------+           +----------+----------+
                 |                                 |
                 v                                 v
        [no blocking policy]            [NetworkPolicy and/or
                 |                     EgressFirewall applies]
                 v                                 |
          traffic allowed                          v
                                        traffic denied or limited

Where namespace-specific egress can break:
1) Egress NetworkPolicy in that namespace
2) EgressFirewall object in that namespace
3) EgressIP expected for that namespace but misconfigured
4) DNS works, but external traffic is filtered after resolution


How to debug it

1. Prove it is really namespace-specific

Run the same test from a working namespace and a failing one:

oc exec -n team-a deploy/app -- curl -I https://example.com
oc exec -n team-b deploy/app -- curl -I https://example.com

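Then test DNS and a direct IP separately from the failing namespace. The address below is only illustrative, so substitute an external IP you know is reachable; -k is needed because a certificate will not validate against a bare IP:

oc exec -n team-b deploy/app -- nslookup example.com
oc exec -n team-b deploy/app -- curl -kI --connect-timeout 5 https://93.184.216.34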

If DNS works but outbound HTTP to external IPs fails, that points more toward egress filtering than DNS. This is an inference from Kubernetes DNS and policy behavior together. (Kubernetes)

2. Check NetworkPolicy in the failing namespace

This is the first thing I’d inspect:

oc get networkpolicy -n team-b
oc get networkpolicy -n team-b -o yaml

Kubernetes says that if a pod is selected by a policy with policyTypes: [Egress], the allowed outbound traffic is restricted to what the policy permits. A “default deny all ingress and all egress” policy is a standard pattern. (Kubernetes)

Typical bad case:

policyTypes:
- Egress
egress:
- to:
  - namespaceSelector:
      matchLabels:
        name: internal-only

That would allow only a narrow set of destinations and block internet egress.

3. Check for an OpenShift EgressFirewall

OpenShift provides EgressFirewall as a namespace object for controlling traffic from pods to destinations outside the cluster. It is specific to OVN-Kubernetes. (Red Hat Documentation)

Commands:

oc get egressfirewall -n team-b
oc get egressfirewall -n team-b -o yaml

OpenShift documents that traffic to an IP outside the cluster is checked against the namespace’s EgressFirewall rules in order. If a rule matches, that action applies; if no rule matches, traffic is allowed by default. (Red Hat Documentation)
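
The first-match behavior described above can be sketched as a small bash model. This is only an illustration, not how OVN-Kubernetes implements it: the function names are hypothetical, rules are passed as "Action CIDR" strings, and only IPv4 CIDR rules are modeled (DNS-name rules are left out).

```shell
# Hypothetical helper: convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local IFS=.
  set -- $1                      # split the address on dots
  echo "$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))"
}

# Succeeds when the address ($2) falls inside the CIDR ($1).
cidr_contains() {
  local net=${1%/*} bits=${1#*/}
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$2") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Walk the rules in order; the first matching rule decides.
# With no matching rule, the default is Allow.
egress_verdict() {
  local dest=$1 rule
  shift
  for rule in "$@"; do
    if cidr_contains "${rule#* }" "$dest"; then
      echo "${rule%% *}"
      return 0
    fi
  done
  echo "Allow"
}

egress_verdict 93.184.216.34 "Allow 10.0.0.0/8" "Deny 0.0.0.0/0"   # prints Deny
```

Reordering the two rules changes the result for internal addresses, which is why rule order is worth reading carefully in the real object.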

A realistic blocking example is a namespace with rules allowing only a few CIDRs or DNS names and denying everything else.

4. Check whether the namespace is supposed to use EgressIP

If the application depends on a fixed source IP for outbound allowlisting, verify whether EgressIP is configured and healthy. OpenShift documents that an egress IP can be assigned to a namespace and is distinct from an egress router. (Red Hat Documentation)

Check:

oc get egressip
oc describe egressip <name>

If team-b is expected to leave via a specific egress IP and that configuration is broken, outbound access to third-party systems may fail even though generic internet access from other namespaces works. That last part is an inference from how vendor allowlists usually interact with source IP–based egress. (Red Hat Documentation)

5. Verify DNS separately

Sometimes people say “egress is broken” when the real failure is DNS.

oc exec -n team-b deploy/app -- nslookup example.com
oc exec -n team-b deploy/app -- curl -I https://example.com
oc exec -n team-b deploy/app -- curl -kI https://93.184.216.34

Interpretation:

  • nslookup fails, IP curl fails: maybe DNS or broader networking
  • nslookup works, IP curl fails: likely egress filtering
  • nslookup fails, IP curl works: DNS-only issue

That distinction follows from Kubernetes DNS behavior plus the documented policy mechanisms above. (Kubernetes)
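
The interpretation list above can be written down as a tiny lookup. This is a sketch with a hypothetical function name; the "ok"/"fail" arguments are assumed to come from the exit status of the nslookup and direct-IP curl tests, and the ok/ok case is an addition that the list does not cover.

```shell
# Hypothetical helper: map the two test results to the reading given above.
# Arguments: DNS result, then direct-IP result, each "ok" or "fail".
classify_egress() {
  case "$1:$2" in
    ok:ok)     echo "no DNS or IP-level block seen" ;;
    ok:fail)   echo "likely egress filtering" ;;
    fail:ok)   echo "DNS-only issue" ;;
    fail:fail) echo "maybe DNS or broader networking" ;;
    *)         echo "usage: classify_egress ok|fail ok|fail" >&2; return 2 ;;
  esac
}

classify_egress ok fail   # prints: likely egress filtering
```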

6. Compare with a working namespace

This is one of the fastest ways to spot the difference:

oc get networkpolicy -n team-a -o yaml
oc get networkpolicy -n team-b -o yaml
oc get egressfirewall -n team-a -o yaml
oc get egressfirewall -n team-b -o yaml

When only one namespace is failing, the delta between those objects often explains it immediately.

7. Check whether the block is by destination type

OpenShift supports EgressFirewall rules for external destinations, and OpenShift also documents audit logging for egress firewall and network policy, which can help when you need proof of what is being denied. (Red Hat Documentation)

Ask:

  • does external IP fail?
  • does internal service traffic still work?
  • does only one external domain fail?

That helps separate “internet blocked” from “specific destinations blocked.”

What this usually turns out to be

Most common causes:

  • Default deny egress NetworkPolicy in the failing namespace. Kubernetes explicitly documents this pattern. (Kubernetes)
  • Namespace EgressFirewall allowing only selected external destinations. OpenShift documents EgressFirewall as namespace-scoped and processed rule by rule for external IP traffic. (Red Hat Documentation)
  • Broken or missing EgressIP where the app depends on outbound source-IP allowlists. OpenShift documents namespace egress IP configuration separately from egress routers. (Red Hat Documentation)
  • Misdiagnosed DNS problem, where name resolution fails and looks like internet egress failure. (Red Hat Documentation)

Fast triage sequence

oc exec -n team-b deploy/app -- nslookup example.com
oc exec -n team-b deploy/app -- curl -kI https://93.184.216.34
oc get networkpolicy -n team-b -o yaml
oc get egressfirewall -n team-b -o yaml
oc get egressip

Mental model

When egress fails only in some namespaces:

  • think namespace policy first
  • then think OpenShift EgressFirewall
  • then think EgressIP expectations
  • only after that think cluster-wide OVN trouble

Because if it were a true cluster-wide OVN failure, you would usually see the problem across many namespaces, not just one. That last point is an operational inference, but it is a very useful one. (Red Hat Documentation)

Debugging DNS Issues in OpenShift Pods

DNS works for some pods but not others: this one is tricky because it often looks like OVN, but a lot of the time it is actually DNS path, namespace lookup, or pod DNS config.

In OpenShift, the DNS Operator manages CoreDNS for pod and service name resolution, and CoreDNS runs as the dns-default daemon set in openshift-dns. Pods rely on kubelet-provided DNS settings in /etc/resolv.conf to reach those DNS servers. (Red Hat Documentation)

Scenario

Some pods can resolve service names, but others cannot.

Examples:

  • Pod A: nslookup backend-service succeeds
  • Pod B: nslookup backend-service fails

That usually means one of these:

  • the failing pod has bad DNS settings,
  • the query is being made from the wrong namespace,
  • only some nodes can reach the DNS pods,
  • or the DNS pods themselves are unhealthy on part of the cluster. (Red Hat Documentation)

Diagram

                +------------------------------+
                |        failing pod           |
                |  /etc/resolv.conf            |
                |  nameserver -> DNS service   |
                +--------------+---------------+
                               |
                               v
                    +---------------------+
                    |   CoreDNS /         |
                    |   dns-default pods  |
                    |   in openshift-dns  |
                    +----------+----------+
                               |
                 resolves svc/pod names from cluster state
                               |
                               v
                    +---------------------+
                    |  Service / Pod DNS  |
                    |  records            |
                    +---------------------+

Where it breaks:
1) Pod resolv.conf is wrong
2) Pod queries wrong namespace
3) Pod/node cannot reach dns-default
4) dns-default pods unhealthy
5) Name exists, but target service/endpoints are wrong


How to debug it

1. Prove it is DNS and not general networking

From a good pod and a bad pod, test both DNS and direct IP access:

oc exec -it <good-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- curl http://<service-cluster-ip>:<port>
oc exec -it <bad-pod> -- curl http://<pod-ip>:<port>

If IP-based access works but nslookup fails, that points strongly to DNS rather than OVN datapath routing. Kubernetes service and pod discovery are meant to work through DNS records. (Kubernetes)

2. Check the failing pod’s /etc/resolv.conf

This is one of the fastest checks:

oc exec -it <bad-pod> -- cat /etc/resolv.conf

A normal pod DNS config should include a cluster DNS nameserver and search domains such as the pod namespace, svc.cluster.local, and cluster.local; Kubernetes documents options ndots:5 as typical too. If those are missing or odd, the pod DNS setup is wrong. (Kubernetes)
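
As a rough aid, the check above can be scripted. The sketch below uses a hypothetical function name, reads resolv.conf text on stdin, and assumes the default cluster domain cluster.local; a pod that intentionally uses the node's DNS settings (for example dnsPolicy: Default) would be flagged even though nothing is wrong.

```shell
# Hypothetical helper: report the first thing that looks wrong for a pod
# that is expected to use cluster DNS. Reads resolv.conf text on stdin.
check_resolv_conf() {
  local conf
  conf=$(cat)
  if ! printf '%s\n' "$conf" | grep -q '^nameserver[[:space:]]'; then
    echo "no nameserver line"
    return 1
  fi
  if ! printf '%s\n' "$conf" | grep -Eq '^search[[:space:]].*svc\.cluster\.local'; then
    echo "search list lacks svc.cluster.local"
    return 1
  fi
  echo "looks normal"
}

printf 'search team-b.svc.cluster.local svc.cluster.local cluster.local\nnameserver 172.30.0.10\noptions ndots:5\n' | check_resolv_conf
```

Against a live pod, the input would come from the oc exec ... cat /etc/resolv.conf command shown above, piped into the function.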

3. Make sure the pod is querying the right namespace

A very common false alarm:

oc exec -it <bad-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- nslookup backend-service.<namespace>

Kubernetes says unqualified service names are resolved relative to the pod’s own namespace. So backend-service from namespace frontend will not find a service that lives in namespace backend unless you query backend-service.backend. (Kubernetes)
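
To see why the short name misses, it helps to list what the resolver actually asks for. The sketch below models glibc-style search-list handling under stated assumptions: the function name is hypothetical, the search domains are the usual ones for a pod in namespace frontend, and other resolver libraries may order the attempts differently.

```shell
# Hypothetical helper: list the fully qualified names a glibc-style resolver
# would try, in order. Arguments: name, ndots, then the search domains.
dns_candidates() {
  local name=$1 ndots=$2 dots_only domain
  shift 2
  dots_only=${name//[!.]/}              # keep only the dots, to count them
  if [ "${#dots_only}" -ge "$ndots" ]; then
    echo "$name."                       # enough dots: try the name as-is first
  fi
  for domain in "$@"; do
    echo "$name.$domain."
  done
  if [ "${#dots_only}" -lt "$ndots" ]; then
    echo "$name."                       # otherwise the bare name is tried last
  fi
}

dns_candidates backend-service 5 frontend.svc.cluster.local svc.cluster.local cluster.local
```

None of the names generated for backend-service point at namespace backend, whereas backend-service.backend expands to backend-service.backend.svc.cluster.local on the second attempt.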

4. Check whether the DNS pods are healthy

In OpenShift, look at the DNS operator and DNS pods:

oc get clusteroperator dns
oc get pods -n openshift-dns
oc get pods -n openshift-dns-operator

Red Hat documents that the DNS Operator manages CoreDNS, and that CoreDNS runs as the dns-default daemon set. If those pods are crashlooping, pending, or missing on expected nodes, pods may lose name resolution. (Red Hat Documentation)

5. Check whether only some nodes are affected

If only pods on one worker fail DNS, compare node placement:

oc get pods -A -o wide | grep <failing-node>
oc get pods -n openshift-dns -o wide

Red Hat notes DNS is available to all pods if DNS pods are running on some nodes and nodes without DNS pods still have network connectivity to nodes with DNS pods. So “only pods on node X fail DNS” often means node-to-DNS connectivity is broken rather than CoreDNS being globally broken. (Red Hat Documentation)

6. Test from a clean debug pod

This removes app-side noise:

oc run dns-debug --image=registry.k8s.io/e2e-test-images/agnhost:2.39 -it --rm --command -- sh
nslookup kubernetes.default
nslookup backend-service.<namespace>
cat /etc/resolv.conf

Kubernetes recommends creating a simple test pod and using nslookup kubernetes.default as a baseline DNS test. (Kubernetes)

7. Check DNS service reachability from the bad pod

If you know the DNS service IP from /etc/resolv.conf, test whether the pod can even reach it. If the DNS nameserver is unreachable from only some pods or nodes, the issue is likely network path to DNS, not DNS records themselves. This is an inference from the Kubernetes debug flow and OpenShift’s note about node connectivity to DNS pods. (Kubernetes)

8. Check logs from the DNS pods

If the DNS pods are up but resolution still fails:

oc logs -n openshift-dns <dns-default-pod>

If you are testing a workaround, Red Hat documents that the DNS Operator can be set to Unmanaged, but they also note you cannot upgrade while it remains unmanaged. (Red Hat Documentation)

What this usually turns out to be

Most common causes:

  • Wrong namespace lookup: querying service instead of service.namespace. (Kubernetes)
  • Bad pod DNS config: strange or missing nameserver/search domains in /etc/resolv.conf. (Kubernetes)
  • DNS pods unhealthy: dns-default issues in openshift-dns. (Red Hat Documentation)
  • Node-specific connectivity issue: pods on one node cannot reach DNS pods running elsewhere. (Red Hat Documentation)
  • Service confusion: DNS resolves, but the target service or endpoints are wrong, making it look like DNS. Kubernetes DNS only gives you the name-to-record mapping; the service still has to be valid. (Kubernetes)

Fast triage sequence

oc exec -it <bad-pod> -- cat /etc/resolv.conf
oc exec -it <bad-pod> -- nslookup kubernetes.default
oc exec -it <bad-pod> -- nslookup <service>.<namespace>
oc get clusteroperator dns
oc get pods -n openshift-dns -o wide
oc logs -n openshift-dns <dns-default-pod>

Mental model

When DNS fails only for some pods:

  • if all traffic is broken, think OVN/node networking
  • if IP access works but names fail, think DNS
  • if short names fail but FQDN works, think namespace/search path
  • if only one node’s pods fail, think node-to-dns connectivity

Debugging ClusterIP Issues in OVN Kubernetes


Scenario

Service works via pod IP, but fails via ClusterIP (service name/IP)

Environment:

  • frontend → calling backend
  • Direct call works: curl http://10.128.2.15:8080 ✅
  • Service call fails: curl http://backend-service ❌

What this means (important insight)

If pod IP works but service fails, then:

Pod networking (OVN routing) is working
The problem is in the service path: the Service definition and its endpoints, DNS, or the load-balancing layer inside OVN-Kubernetes


Mental model

Interpretation:

  • Pod → Pod = direct routing (works)
  • Pod → Service = goes through OVN load balancer (broken here)

Step-by-step debugging

Step 1: Confirm endpoints exist

oc get endpoints backend-service

If EMPTY:

Likely root cause = the Service selector does not match the pod labels (or no matching pod is Ready)

Example:

# Service selector
selector:
  app: backend

But pod has:

labels:
  app: api   # ❌ mismatch

Fix labels → service starts working instantly
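
The matching rule itself is simple: every key=value pair in the Service selector must be present on the pod. A minimal sketch, with a hypothetical function name and labels passed as comma-separated key=value strings:

```shell
# Hypothetical helper: succeed only if every key=value pair in the selector
# ($1, comma-separated) is present in the pod's labels ($2, comma-separated).
selector_matches() {
  local labels=",$2," pair
  local IFS=,
  for pair in $1; do
    case "$labels" in
      *",$pair,"*) ;;          # this selector pair is on the pod
      *) return 1 ;;           # missing key or different value: not selected
    esac
  done
  return 0
}

if selector_matches "app=backend" "app=api,pod-template-hash=abc123"; then
  echo "selected"
else
  echo "not selected"          # this branch runs: app=api is not app=backend
fi
```

Extra labels on the pod do not matter; only the pairs named in the selector are compared.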


Step 2: Verify service definition

oc get svc backend-service -o yaml

Check:

  • correct port
  • correct targetPort

Common mistake:

port: 80
targetPort: 8080   # ✅ must match container port

Step 3: Test ClusterIP directly

curl <ClusterIP>:<port>

Results:

  • ❌ fails → OVN load balancer issue
  • ✅ works → DNS issue instead

Step 4: Check DNS (don’t skip this)

From pod:

nslookup backend-service

If fails:

→ Not OVN
→ Check:

oc get pods -n openshift-dns

Step 5: Inspect OVN load balancer

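The OVN CLI tools are not installed on the host itself, so run them inside an OVN-Kubernetes pod rather than from an oc debug node shell. On OpenShift 4.14 and later the northbound database runs in each ovnkube-node pod; on earlier releases use an ovnkube-master pod instead:

oc get pods -n openshift-ovn-kubernetes -o wide
oc exec -n openshift-ovn-kubernetes <ovnkube-node-pod> -c northd -- ovn-nbctl lb-list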

You should see something like:

VIP: 172.30.0.10:80 → 10.128.2.15:8080

If missing:

OVN didn’t program the service
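
Scanning a long load balancer listing by eye is error-prone, so a whole-field match can help. This is a sketch only: the function name is hypothetical, it assumes the VIP appears as a whitespace-separated ip:port token, and the sample line is made up for illustration rather than captured from a cluster.

```shell
# Hypothetical helper: succeed if the given VIP (ip:port) appears as a whole
# whitespace-separated field in the listing text read from stdin.
lb_has_vip() {
  awk -v vip="$1" '
    { for (i = 1; i <= NF; i++) if ($i == vip) found = 1 }
    END { exit !found }
  '
}

# Made-up sample line, not real command output.
sample='demo-lb tcp 172.30.0.10:80 10.128.2.15:8080'
printf '%s\n' "$sample" | lb_has_vip 172.30.0.10:80 && echo "VIP is programmed"
```

Matching whole fields avoids the false positive a plain substring search would give for 172.30.0.10:8080.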


Step 6: Check OVN logs

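oc logs -n openshift-ovn-kubernetes <ovnkube-master> --all-containers

(On OpenShift 4.14 and later there is no ovnkube-master pod; check the ovnkube-controller container of the relevant ovnkube-node pod instead.)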

Look for:

  • load balancer sync errors
  • endpoint update failures

Step 7: Check kube-proxy replacement

In OpenShift Container Platform, OVN replaces kube-proxy.

So if service routing is broken:
It’s handled by OVN, not iptables


Real root causes (from production)

1. Label mismatch (MOST COMMON)

  • Service selector doesn’t match pod
    → no endpoints → service dead

2. Wrong port/targetPort

  • Service pointing to wrong container port
    → connection refused

3. OVN load balancer not programmed

  • OVN DB out of sync
    → ClusterIP has no backend mapping

4. NetworkPolicy blocking service traffic

  • Pod allows direct IP but blocks service path
    (less common but happens)

5. DNS issue (misdiagnosed often)

  • Service name fails, ClusterIP works

Fast debugging logic (this is gold)

When pod IP works but service fails:

  1. Endpoints exist?
    • ❌ → labels problem
  2. ClusterIP works?
    • ❌ → OVN load balancing
  3. DNS works?
    • ❌ → DNS issue
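
The three questions above can be walked in order with a few lines of bash. The function name is hypothetical and the yes/no arguments are assumed to come from the oc get endpoints, ClusterIP curl, and nslookup checks described earlier.

```shell
# Hypothetical helper: walk the three questions in order and name the layer.
# Arguments (each "yes" or "no"): endpoints exist, ClusterIP works, DNS works.
triage_service() {
  local endpoints=$1 clusterip=$2 dns=$3
  if [ "$endpoints" != yes ]; then
    echo "labels problem"
  elif [ "$clusterip" != yes ]; then
    echo "OVN load balancing"
  elif [ "$dns" != yes ]; then
    echo "DNS issue"
  else
    echo "service path looks healthy"
  fi
}

triage_service yes yes no   # prints: DNS issue
```

The order matters: an empty endpoints list makes the later answers meaningless, so it is checked first.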

Pro tip (what experts do fast)

From a debug pod:

oc run debug --image=busybox -it --rm -- sh

Run (busybox ships wget rather than curl):

nslookup backend-service
wget -qO- -T 5 http://<ClusterIP>:<port>
wget -qO- -T 5 http://<pod-IP>:<port>

This instantly isolates:

  • DNS
  • service
  • networking

Key takeaway

  • Pod IP = routing layer (OVN switching)
  • Service IP = OVN load balancer layer
  • If one works and the other doesn’t → you know exactly where to look

Troubleshooting Node-Specific Pod Traffic Failures

Scenario

Traffic works for pods on node A, but fails for pods on node B.

That usually points to a node-local OVN/OVS problem, not an app problem.

Example:

  • frontend on worker-1 can reach backend
  • same app on worker-2 cannot

That pattern is a huge clue.


How to debug it

1. Prove it’s node-specific

List pods and nodes:

oc get pods -A -o wide

Run the same network test from a pod on each node:

oc exec -it <good-pod> -- curl http://<target-pod-ip>:<port>
oc exec -it <bad-pod> -- curl http://<target-pod-ip>:<port>

If one node always works and another always fails, focus on the bad node.


2. Check the OVN pod on the bad node

Find the ovnkube-node pod for that worker:

oc get pods -n openshift-ovn-kubernetes -o wide

Look for the pod scheduled on the failing node.

Then inspect:

oc describe pod -n openshift-ovn-kubernetes <ovnkube-node-pod>
oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod>

Things that matter:

  • restarts
  • readiness failures
  • DB connection errors
  • OVS/flow programming errors

If ovnkube-node is unhealthy there, that is often the root cause.
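
When there are many nodes, filtering the pod table is quicker than reading it. A minimal sketch, assuming the default oc get pods column order (NAME READY STATUS RESTARTS ...); the function name and the sample pod names are made up.

```shell
# Hypothetical helper: from `oc get pods` table text on stdin, print pods that
# are not fully ready, not Running, or have restarted at least once.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if (ready[1] != ready[2] || $3 != "Running" || $4 + 0 > 0) {
      print $1, "ready=" $2, "status=" $3, "restarts=" $4
    }
  }'
}

printf '%s\n' \
  'NAME READY STATUS RESTARTS AGE' \
  'ovnkube-node-aaaaa 8/8 Running 0 3d' \
  'ovnkube-node-bbbbb 7/8 Running 12 3d' | unhealthy_pods
# prints: ovnkube-node-bbbbb ready=7/8 status=Running restarts=12
```

On a cluster, the input would be the oc get pods -n openshift-ovn-kubernetes -o wide output shown above.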


3. Check node readiness and basic health

oc get node
oc describe node <bad-node>

Look for:

  • NotReady
  • memory/disk pressure
  • network-related events

Sometimes OVN is fine and the node itself is degraded.


4. Inspect OVS on the bad node

Open a debug shell:

oc debug node/<bad-node>
chroot /host

Then:

ovs-vsctl show

You want to see expected bridges such as br-int.

Also useful:

ovs-ofctl dump-ports br-int
ovs-appctl bond/show

Red flags:

  • missing br-int
  • interfaces missing
  • counters not increasing on expected ports

If OVS is broken on that node, pod traffic there will fail even while the rest of the cluster looks fine.


5. Check the node’s host networking

Still on the node:

ip addr
ip route
ip link

Look for:

  • missing routes
  • down interfaces
  • wrong MTU

A node can have OVN running, but if the host interface or route is wrong, encapsulated traffic will still fail.


6. Compare MTU with a working node

MTU mismatches are sneaky.

On both a good node and bad node:

ip link

Look at the main NIC and OVN-related interfaces.

Symptoms of MTU trouble:

  • DNS works sometimes
  • small pings work
  • larger curls/higher-volume traffic fail or hang

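A quick test from a pod can help. The -s value is the ICMP payload only, so the packet on the wire is 28 bytes larger (20-byte IPv4 header plus 8-byte ICMP header). With the default OVN-Kubernetes pod MTU of 1400 on a 1500-byte network, the largest payload that fits is 1372:

ping -M do -s 1372 <target-ip>

If smaller packets work and packets at the expected MTU fail, suspect MTU.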
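
The arithmetic behind a sensible ping size is worth spelling out. In this sketch the function names are hypothetical; the header sizes are the standard IPv4 (20 bytes, no options) and ICMP (8 bytes) values, and the 100-byte overhead is the figure Red Hat documents for OVN-Kubernetes encapsulation.

```shell
# Hypothetical helpers for MTU arithmetic.
# Largest ICMP payload (ping -s value) that fits in a given MTU.
ping_payload_for_mtu() { echo $(( $1 - 20 - 8 )); }
# Pod MTU that leaves room for the overlay on a given NIC MTU.
pod_mtu_for_nic()      { echo $(( $1 - 100 )); }

pod_mtu=$(pod_mtu_for_nic 1500)                                    # 1400
echo "largest ping -s value: $(ping_payload_for_mtu "$pod_mtu")"   # 1372
```

A ping with a payload larger than this and the don't-fragment flag set is expected to fail even on a healthy node, so it only proves something when the sizes are chosen around the real MTU.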


7. Check if pod wiring exists on the bad node

From the failing node’s ovnkube-node logs, check whether the affected pod sandbox/interface got programmed correctly.

Also inspect pods on that node:

oc get pods -A -o wide | grep <bad-node>

If all pods on that node fail, it is likely node OVN/OVS or host network.
If only one pod fails, it may be a pod-specific attachment/setup issue.


8. Test service vs direct pod IP

From a failing pod:

curl http://<target-pod-ip>:<port>
curl http://<service-cluster-ip>:<port>

Interpretation:

  • both fail → node/local OVN path likely broken
  • pod IP works, service fails → service/load-balancer programming problem
  • DNS name fails, ClusterIP works → DNS problem

This helps avoid blaming OVN for the wrong layer.


9. Check for node-local firewall or host changes

On the bad node, inspect whether something changed outside OpenShift:

iptables -S
nft list ruleset
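systemctl status ovs-vswitchd
systemctl status ovsdb-server

On OpenShift, ovn-controller runs as a container inside the ovnkube-node pod rather than as a host service, so check it with oc logs -c ovn-controller instead of systemctl. A manual host change, bad firewall rule, or failed service can break just one node.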


10. Restart scope carefully

If evidence points clearly to the bad node’s OVN stack, a targeted recovery step is safer than broad cluster changes.

Typical sequence:

  • cordon/drain the bad node if workloads are impacted
  • restart or recover the bad node’s OVN/OVS components
  • verify traffic before uncordoning

Avoid random restarts cluster-wide unless you’ve ruled out a local issue.


What this usually turns out to be

Most common causes:

  • ovnkube-node unhealthy on one node
  • broken or stale OVS state on that node
  • host NIC / route / MTU mismatch
  • node-specific firewall or kernel/network issue
  • the node recently rebooted or partially lost connectivity to OVN DB

Fast triage checklist

When traffic fails only on one node, I’d do this in order:

oc get pods -A -o wide
oc get pods -n openshift-ovn-kubernetes -o wide
oc logs -n openshift-ovn-kubernetes <ovnkube-node-on-bad-node>
oc debug node/<bad-node>
chroot /host
ovs-vsctl show
ip route
ip link
systemctl status ovs-vswitchd
systemctl status ovsdb-server

That usually gets you very close.


Mental model

When only one node is broken:

  • cluster-wide policy is less likely
  • app config is less likely
  • service config is less likely
  • node-local data plane is most likely

So think:
bad node → ovnkube-node → OVS → host NIC/route/MTU


Here’s a realistic example:

  • pods on worker-2 cannot reach anything off-node
  • pods on worker-1 are fine
  • ovnkube-node on worker-2 shows repeated connection/programming errors
  • ovs-vsctl show on worker-2 is missing expected state

That strongly suggests the fix is on worker-2, not in the app or service definitions.

Debugging OVN Issues in OpenShift

Let’s walk through a realistic, production-style OVN debugging scenario in OpenShift Container Platform using OVN-Kubernetes.


Scenario

A frontend pod cannot reach a backend service

You have:

  • frontend pod
  • backend pod
  • backend-service (ClusterIP)

And:

curl http://backend-service

fails


Step-by-step debugging (real flow)

Step 1: Check if backend pod is healthy

oc get pods -o wide

You want:

  • Backend pod = Running
  • Has an IP (e.g., 10.128.2.15)

If pod is not running → stop here (not an OVN issue)


Step 2: Test direct pod-to-pod connectivity

From frontend pod:

oc exec -it frontend -- curl http://10.128.2.15:8080

Outcomes:

Case A: This FAILS

→ Problem is networking (OVN / policy / routing)

Case B: This WORKS

→ Networking is fine → problem is service layer


Branch A: Pod-to-pod FAILS (OVN issue)

Step 3A: Check NetworkPolicies

oc get networkpolicy -A

Look for anything like:

  • Deny all ingress
  • Missing allow rules

Quick test:
Create temporary allow-all policy

If it suddenly works → root cause = NetworkPolicy


Step 4A: Check node-level OVN

Find nodes:

oc get pods -o wide

Then:

oc get pods -n openshift-ovn-kubernetes -o wide

Check:

  • Is ovnkube-node running on both nodes?
  • Any restarts?

Step 5A: Test OVS health

oc debug node/<node>
chroot /host
ovs-vsctl show

Look for:

  • br-int bridge
  • Proper interfaces

Missing interfaces = OVN not wiring pods correctly


Step 6A: Check OVN logs

oc logs -n openshift-ovn-kubernetes <ovnkube-node>

Common errors:

  • Flow install failures
  • DB sync issues

Branch B: Pod-to-pod WORKS, Service FAILS

This is VERY common and often misunderstood.


Step 3B: Check service

oc get svc backend-service -o wide

Check:

  • ClusterIP exists
  • Correct port

Step 4B: Check endpoints

oc get endpoints backend-service

If EMPTY:

→ Service is not linked to pods

Root cause:

  • Wrong selector labels

Fix:

selector:
  app: backend

Step 5B: Test service IP directly

curl <ClusterIP>

Fails but pod IP works:

→ OVN load-balancing issue


Step 6B: Check OVN load balancer

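From an OVN-Kubernetes pod (the OVN CLI is not on the host; use an ovnkube-node pod on OpenShift 4.14 and later, ovnkube-master on earlier releases):

oc exec -n openshift-ovn-kubernetes <ovnkube-node-pod> -c northd -- ovn-nbctl lb-list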

You should see:

  • Service IP mapped to pod IPs

If missing → OVN not programming service


Bonus: DNS check (often confused with OVN)

From frontend:

nslookup backend-service

If fails:

→ DNS issue, NOT OVN

Check:

oc get pods -n openshift-dns

Real root cause examples (from production)

Case 1: Wrong labels

  • Service selector doesn’t match pod
    → No endpoints → service fails

Case 2: NetworkPolicy blocking traffic

  • Default deny policy applied
    → Pods isolated

Case 3: OVN desync

  • Pod exists but not in OVN DB
    → No routing

Case 4: Node issue

  • Only pods on one node fail
    → ovnkube-node broken there

Case 5: MTU mismatch

  • Small packets work, large fail
    → Very tricky to spot

The mental model (this is what experts use)

When debugging:

  1. Pod IP → works?
    • ❌ → OVN / policy / routing
    • ✅ → go to service layer
  2. Service endpoints exist?
    • ❌ → labels problem
    • ✅ → OVN load balancing
  3. DNS works?
    • ❌ → DNS, not OVN

Pro move (what senior engineers do)

Spin up a debug pod:

oc run debug --image=busybox -it --rm -- sh

Then test:

  • ping
  • wget (busybox does not include curl)
  • nslookup

This removes app complexity completely.


Understanding OVN in OpenShift: A Networking Overview

In OpenShift Container Platform (OCP), OVN refers to Open Virtual Network, used via OVN-Kubernetes. It’s the default networking solution in modern OpenShift clusters.


What OVN is (in simple terms)

OVN is a software-defined networking (SDN) system that:

  • Creates virtual networks for pods
  • Handles routing, switching, and network policies
  • Replaces older OpenShift SDN implementations

Think of it as the “network brain” of your cluster.


How OVN works in OCP

Core components

  • OVN Northbound DB → stores high-level logical network config
  • ovn-northd → translates Northbound config into logical flows
  • OVN Southbound DB → stores logical flows and port bindings
  • OVN Controller (on each node) → programs the local OVS from the Southbound DB
  • Open vSwitch (OVS) → does the actual packet forwarding

What it actually does

1. Pod networking

Each pod gets:

  • Its own IP address
  • Ability to talk to other pods across nodes

2. Routing

Handles:

  • Pod-to-pod communication
  • Pod-to-service traffic
  • External access (ingress/egress)

3. Network policies

Controls traffic like:

  • “Only allow frontend → backend”
  • “Block everything except specific ports”

4. Load balancing

Implements Kubernetes Services internally


Traffic flow (simplified)

  1. Pod sends traffic
  2. Goes through Open vSwitch (OVS) on the node
  3. OVN rules decide:
    • Where it goes
    • Whether it’s allowed
  4. Traffic reaches destination pod/service

Why OCP uses OVN

Compared to older SDN:

  • Better performance and scalability
  • Native support for Kubernetes NetworkPolicy
  • More flexible routing and IP management
  • Cleaner architecture (decoupled control + data plane)

OVN vs OpenShift SDN

Feature          OVN-Kubernetes   OpenShift SDN (legacy)
Performance      Higher           Lower
NetworkPolicy    Full support     Limited
IPv6             Supported        Not supported
Future support   ✅ Active        ❌ Deprecated

Key concepts to remember

  • OVN = control plane for networking
  • OVS = data plane (moves packets)
  • Pods communicate via virtual networking managed by OVN
  • Policies are enforced at the network layer

Real-world example

If you deploy:

  • frontend pod
  • backend pod

With OVN:

  • Both get IPs
  • OVN ensures routing between them
  • A NetworkPolicy can allow only frontend → backend traffic

Debugging OVN in OpenShift Container Platform (via OVN-Kubernetes) can feel overwhelming at first, but there’s a clear, structured way to approach it.

Below is a practical, field-tested workflow you can follow.


0. Start with the symptom

Before touching OVN internals, identify the issue type:

  • ❌ Pod can’t reach another pod
  • ❌ Pod can’t reach a service
  • ❌ External traffic not working
  • ❌ DNS failing
  • ❌ Only some nodes affected

This helps narrow the scope fast.


1. Check cluster networking health

oc get co network

  • Should be Available=True
  • If Degraded → OVN problem likely
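
The two conditions above can be read straight out of the table. A small sketch, assuming the usual oc get co column order (NAME VERSION AVAILABLE PROGRESSING DEGRADED ...); the function name and the version string in the sample are made up.

```shell
# Hypothetical helper: summarize `oc get co` table text read from stdin.
co_status() {
  awk 'NR > 1 {
    if ($3 != "True") {
      print $1 ": not available"
    } else if ($5 == "True") {
      print $1 ": degraded"
    } else {
      print $1 ": healthy"
    }
  }'
}

printf '%s\n' \
  'NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE' \
  'network 4.x.y True False True 5m' | co_status
# prints: network: degraded
```

On a cluster this would be fed from oc get co network; for anything beyond a quick look, oc describe co network shows the condition messages as well.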

2. Check OVN pods

oc get pods -n openshift-ovn-kubernetes

Look for:

  • CrashLoopBackOff
  • NotReady pods

Key pods:

  • ovnkube-node (runs on every node)
  • ovnkube-master (ovnkube-control-plane on OpenShift 4.14 and later)

3. Check logs (most important step)

Node-level (data plane issues)

oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod>

Control plane

oc logs -n openshift-ovn-kubernetes <ovnkube-master-pod>

Look for:

  • Flow programming errors
  • DB connection failures
  • OVS issues

4. Validate pod networking

Get pod IPs:

oc get pods -o wide

Test connectivity:

oc exec -it <pod> -- ping <other-pod-ip>

If this fails:

  • Likely OVN routing or policy issue

5. Check NetworkPolicies

oc get networkpolicy -A

Common mistake:

  • Policy blocking traffic unintentionally

Test by temporarily removing policy or creating an allow-all:

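apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-temp   # hypothetical name; delete the policy after the test
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - {}
  egress:
  - {}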

6. Check Open vSwitch (OVS)

Open a debug shell on a node:

oc debug node/<node-name>
chroot /host

Then:

ovs-vsctl show

Look for:

  • Bridges (like br-int)
  • Missing interfaces = problem

7. Inspect OVN DB state

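From an OVN-Kubernetes pod (ovnkube-node on OpenShift 4.14 and later, ovnkube-master on earlier releases):

oc exec -n openshift-ovn-kubernetes <ovnkube-node-pod> -c northd -- ovn-nbctl show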

Check:

  • Logical switches
  • Ports for pods

If missing → OVN not programming correctly


8. Check services & kube-proxy replacement

OVN replaces kube-proxy.

Check:

oc get svc

Test:

curl <service-cluster-ip>

If service fails but pod IP works:
→ Load balancing issue in OVN


9. Check egress / external connectivity

From pod:

curl google.com

If fails:

  • Check EgressFirewall / EgressIP
  • Check node routing

10. Use must-gather (for deep issues)

oc adm must-gather -- /usr/bin/gather_network_logs

This collects:

  • OVN DB state
  • OVS config
  • Logs

Common real-world issues

1. MTU mismatch

Symptoms:

  • Intermittent connectivity
  • Large packets fail

2. NetworkPolicy blocking traffic

Very common in production


3. OVN DB not syncing

Symptoms:

  • Pods exist but no routes

4. Node-specific issues

  • Only pods on one node fail → check that node’s ovnkube-node

5. DNS issues (often misdiagnosed as OVN)

Check:

oc get pods -n openshift-dns

Debugging mindset (this is key)

Always go in this order:

  1. Is cluster networking healthy?
  2. Are OVN pods running?
  3. Is traffic blocked (policy)?
  4. Is routing broken (OVN/OVS)?
  5. Is it actually DNS or app issue?

Pro tip

Use a debug pod:

oc run test --image=busybox -it --rm -- sh

From there:

  • ping
  • nslookup
  • wget (busybox does not include curl)

This isolates networking from your app.

Understanding Tekton: A Comprehensive CI/CD Framework for Kubernetes

Tekton is a cloud-native CI/CD framework built for Kubernetes. Here’s a full breakdown:


What it is

Tekton is a Kubernetes-native open source framework for creating continuous integration and continuous delivery (CI/CD) systems. It installs and runs as an extension on a Kubernetes cluster and comprises a set of Kubernetes Custom Resources that define the building blocks you can create and reuse for your pipelines.

Tekton standardizes CI/CD tooling and processes across vendors, languages, and deployment environments. It lets you create CI/CD systems quickly, giving you scalable, serverless, cloud-native execution out of the box.


Core building blocks

Everything in Tekton is composed of these layers:

  • Step — the most basic entity, such as running unit tests or compiling a program. Tekton performs each step with a provided container image.
  • Task — a collection of steps in a specific order. Tekton runs a task in the form of a Kubernetes pod, where each step becomes a running container in the pod.
  • Pipeline — a collection of tasks in a specific order. Tekton collects all tasks, connects them in a directed acyclic graph (DAG), and executes the graph in sequence.
  • TaskRun — a specific execution of a task.
  • PipelineRun — a specific execution of a pipeline.

Example pipeline (clone → build → deploy)

# Step 1: Define a Task
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: build-and-push
spec:
  params:
    - name: IMAGE
      type: string
  steps:
    - name: build
      image: gcr.io/kaniko-project/executor:latest
      args:
        - --destination=$(params.IMAGE)
        - --context=/workspace/source
---
# Step 2: Compose Tasks into a Pipeline
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci-pipeline
spec:
  tasks:
    - name: clone
      taskRef:
        name: git-clone # from Tekton Catalog
    - name: build
      runAfter: [clone]
      taskRef:
        name: build-and-push
    - name: deploy
      runAfter: [build]
      taskRef:
        name: kubectl-apply
---
# Step 3: Trigger a run
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: ci-pipeline-run-001
spec:
  pipelineRef:
    name: ci-pipeline

Major components

The Tekton ecosystem includes:

  • Pipelines — the core CRDs (Task, Pipeline, etc.)
  • Triggers — allows you to create pipelines based on event payloads, such as triggering a run every time a merge request is opened against a Git repo
  • CLI (tkn) — command-line interface to interact with Tekton from your terminal
  • Dashboard — a web-based graphical interface showing pipeline execution history
  • Catalog — a repository of high-quality, community-contributed reusable Tasks and Pipelines
  • Chains — manages supply chain security, including artifact signing and SLSA provenance

Key advantages

  • Truly Kubernetes-native: every TaskRun executes as a real Kubernetes pod; no external CI server needed
  • Reusable and composable — Tasks from the Tekton Hub can be dropped into any pipeline
  • Event-driven — Triggers fire pipelines automatically on Git webhooks, image pushes, etc.
  • Scalable — each step runs in its own container; pipelines scale with the cluster
  • Supply chain security — Tekton Chains can sign images and generate SLSA provenance automatically

Tekton on OpenShift

Red Hat ships Tekton as OpenShift Pipelines — the officially supported Tekton operator available directly from OperatorHub. It adds OCP-specific integrations like integration with the OpenShift internal image registry, S2I (Source-to-Image) tasks, and the OpenShift console Pipeline UI. Tekton is the basis for OpenShift Pipelines, making it the natural CI tool to pair with Argo CD or Flux for a full GitOps workflow on OCP (Tekton handles CI/build, Argo CD or Flux handles CD/deploy).


The rest of this section is a practical reference for how Tekton (CI) and Argo CD / Flux (CD) work together on OCP, covering everything you need to wire it up.


How the two halves divide responsibility

When code changes are pushed to a Git repository, OpenShift Pipelines initiates a pipeline run. This pipeline might include tasks such as building container images, running unit tests, and generating artifacts. Argo CD, meanwhile, continuously monitors the Git repository that holds the application manifests. Once the pipeline commits the new image version there, Argo CD synchronizes the application state to match the declared state in Git.

The key insight is that Tekton owns the source repo (code → image) and Argo CD / Flux owns the config repo (manifests → cluster). Tekton never deploys directly. It commits the new image tag to a separate GitOps manifests repo, then hands off.


Step 1 — Install both operators on OCP

# OpenShift Pipelines (Tekton) — via OperatorHub
# Operators → OperatorHub → "Red Hat OpenShift Pipelines" → Install
# OpenShift GitOps (Argo CD) — via OperatorHub
# Operators → OperatorHub → "Red Hat OpenShift GitOps" → Install
# Verify both are running
oc get pods -n openshift-pipelines
oc get pods -n openshift-gitops

Step 2 — The Tekton CI pipeline

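On every push or pull request to the source Git repository, the following steps execute within the Tekton pipeline: code is cloned and unit tests are run; the application is analyzed by SonarQube in parallel; a container image is built using S2I and pushed to the OpenShift internal registry; then Kubernetes manifests are updated in the Git repository with the image digest that was built within the pipeline. The pipeline below is a simplified version of that flow: it omits the SonarQube analysis, builds with Buildah instead of S2I, and adds a Trivy image scan. The kind: ClusterTask references assume an OpenShift Pipelines release that still ships ClusterTasks; recent releases deprecate them in favor of resolvers.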

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci-pipeline
  namespace: cicd
spec:
  workspaces:
    - name: source
    - name: dockerconfig
  params:
    - name: GIT_URL
      type: string
    - name: IMAGE
      type: string
    - name: GIT_MANIFEST_URL # separate repo for k8s manifests
      type: string
  tasks:
    - name: clone
      taskRef:
        name: git-clone
        kind: ClusterTask
      workspaces:
        - name: output
          workspace: source
      params:
        - name: url
          value: $(params.GIT_URL)
    - name: unit-test
      runAfter: [clone]
      taskRef:
        name: maven
        kind: ClusterTask
      workspaces:
        - name: source
          workspace: source
    - name: build-image
      runAfter: [unit-test]
      taskRef:
        name: buildah
        kind: ClusterTask
      params:
        - name: IMAGE
          value: $(params.IMAGE)
      workspaces:
        - name: source
          workspace: source
        - name: dockerconfig
          workspace: dockerconfig
    - name: scan-image
      runAfter: [build-image]
      taskRef:
        name: trivy-scanner # from Tekton Hub
      params:
        - name: IMAGE
          value: $(params.IMAGE)
    - name: update-manifest # THE HANDOFF to GitOps
      runAfter: [scan-image]
      taskRef:
        name: git-cli
        kind: ClusterTask
      params:
        - name: GIT_USER_NAME
          value: tekton-bot
        - name: COMMANDS
          value: |
            git clone $(params.GIT_MANIFEST_URL) /workspace/manifest
            cd /workspace/manifest
            # Update image tag in kustomization
            kustomize edit set image myapp=$(params.IMAGE)
            git add -A
            git commit -m "ci: update image to $(params.IMAGE)"
            git push

Step 3 — Tekton Triggers (webhook → pipeline)

# EventListener: receives the GitHub/GitLab webhook
apiVersion: triggers.tekton.dev/v1beta1
kind: EventListener
metadata:
  name: git-push-listener
  namespace: cicd
spec:
  serviceAccountName: pipeline
  triggers:
    - name: push-trigger
      bindings:
        - ref: github-push-binding
      template:
        ref: pipeline-trigger-template
---
# TriggerTemplate: what to create when the webhook fires
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerTemplate
metadata:
  name: pipeline-trigger-template
  namespace: cicd
spec:
  params:
    - name: git-revision
    - name: git-repo-url
  resourcetemplates:
    - apiVersion: tekton.dev/v1
      kind: PipelineRun
      metadata:
        generateName: ci-run-
      spec:
        pipelineRef:
          name: ci-pipeline
        params:
          - name: GIT_URL
            value: $(tt.params.git-repo-url)
          - name: IMAGE
            value: image-registry.openshift-image-registry.svc:5000/myapp/app:$(tt.params.git-revision)

Expose the EventListener as an OCP Route so GitHub/GitLab can reach it:

oc expose svc el-git-push-listener -n cicd
# Then add the route URL as a webhook in GitHub/GitLab

Step 4 — Argo CD watches and deploys

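Once the manifests repo is updated by Tekton, Argo CD detects the change on its next poll of the repository (every three minutes by default) or right away if a Git webhook is configured. With automated sync enabled (prune: true and selfHeal: true), it then deploys the new revision without manual intervention.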

# Argo CD Application: dev environment
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-dev
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/manifests.git
    targetRevision: main
    path: environments/dev # Kustomize overlay for dev
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp-dev
  syncPolicy:
    automated:
      prune: true # remove resources deleted from Git
      selfHeal: true # revert manual changes to the cluster
    syncOptions:
      - CreateNamespace=true
---
# Promotion to staging requires a PR merge (no auto-deploy to prod)
# Abbreviated: project, source.repoURL and destination are omitted here and follow the dev Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-staging
  namespace: openshift-gitops
spec:
  source:
    path: environments/staging
    targetRevision: staging # separate branch = manual promotion
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

The GitOps repo layout Tekton writes to

manifests-repo/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
└── environments/
    ├── dev/
    │   └── kustomization.yaml   ← Tekton updates image tag here
    ├── staging/
    │   └── kustomization.yaml   ← promoted via PR merge
    └── prod/
        └── kustomization.yaml   ← promoted via PR merge + approval
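To experiment with this layout locally, the skeleton can be created with a few shell commands. This is only a sketch: the function name scaffold_manifests_repo is hypothetical, and the files are created empty, so the real Deployment, Service, and kustomization content still has to be written.

```shell
# Hypothetical helper: create the empty directory and file skeleton shown above.
# touch leaves existing files untouched, so rerunning it is harmless.
scaffold_manifests_repo() {
  root="$1"
  mkdir -p "$root/base"
  for f in deployment.yaml service.yaml kustomization.yaml; do
    touch "$root/base/$f"
  done
  for env in dev staging prod; do
    mkdir -p "$root/environments/$env"
    touch "$root/environments/$env/kustomization.yaml"
  done
}
```

Usage: scaffold_manifests_repo ./manifests-repo, then fill in the base manifests and one overlay per environment.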

Promotion flow (dev → staging → prod)

Once the pipeline finishes successfully, the image reference in the manifests repo is updated and automatically deployed to the dev environment by Argo CD. To promote to staging, a pull request is generated targeting the staging branch. Merging that PR triggers Argo CD to sync the staging environment. Production follows the same pattern with an additional approval gate.

The promotion task in Tekton creates a PR automatically:

    - name: promote-to-staging
      runAfter: [update-manifest]
      taskRef:
        name: github-open-pr # from Tekton Hub
      params:
        - name: REPO_FULL_NAME
          value: my-org/manifests
        - name: HEAD
          value: feature/new-image-$(params.git-revision)
        - name: BASE
          value: staging
        - name: TITLE
          value: "Promote $(params.IMAGE) to staging"

Putting it all together — the complete flow

Step | Actor           | Action
1    | Developer       | git push to source repo
2    | GitHub/GitLab   | Sends webhook to Tekton EventListener
3    | Tekton          | Clones, tests, builds image with Buildah/S2I
4    | Tekton          | Scans image with Trivy / ACS
5    | Tekton          | Pushes image to OCP internal registry or Quay
6    | Tekton          | Updates image tag in manifests repo, opens PR to staging
7    | Argo CD / Flux  | Detects change in manifests repo, deploys to dev automatically
8    | Team            | Reviews and merges PR → staging auto-deploys
9    | Team            | Approves prod PR → production deploys

This pattern — Tekton handles CI, Argo CD / Flux handles CD, and Git is the only bridge between them — is the standard GitOps delivery model on OCP.

Mastering OpenShift on VMware and Bare Metal: Key Insights

Administering OpenShift on VMware vSphere or Bare Metal is significantly more complex than cloud environments because you are responsible for the “underlay” (the physical or virtual infrastructure) as well as the “overlay” (OpenShift).

In a 2026 interview, expect a focus on automation, connectivity in restricted environments, and hardware lifecycle.


1. Installation & Provisioning (The Foundation)

Q1: Compare IPI vs. UPI in the context of VMware vSphere.

  • IPI (Installer-Provisioned Infrastructure): The installer has the vCenter credentials. It automatically creates the Folder, Virtual Machines, and Resource Pools. It also handles the VIP (Virtual IPs) for the API and Ingress via Keepalived.
  • UPI (User-Provisioned Infrastructure): You manually create the VMs, set up the Load Balancers (F5, HAProxy), and configure DNS.
  • Interview Tip: Mention that IPI is preferred for speed and “automated scaling,” but UPI is often mandatory in “Brownfield” environments where the networking team won’t give the installer full control over the VLANs.

Q2: How does OpenShift interact with physical hardware for Bare Metal?

Answer: It uses the Metal3 project and the Bare Metal Operator (BMO).

  • The admin provides the BMC (Baseboard Management Controller) details—like IPMI, iDRAC (Dell), or iLO (HP)—to OpenShift.
  • OpenShift uses these to remotely power on the server, PXE boot it, and install RHCOS (Red Hat Enterprise Linux CoreOS).

2. Infrastructure Operations

Q3: What is a “Disconnected” (Air-Gapped) Installation?

Answer: Common in on-prem data centers with high security.

  • The Problem: OpenShift usually pulls images from quay.io.
  • The Solution: You must set up a Local Mirror Registry (like Red Hat Quay or Sonatype Nexus).
  • Process: You use the oc-mirror plugin to download all required images to a portable disk, move it inside the secure zone, and push them to your local registry. You then configure an ImageContentSourcePolicy (or, on OpenShift 4.13 and later, an ImageDigestMirrorSet) so that image pulls are redirected to your local registry.

Q4: How do you handle storage on VMware vs. Bare Metal?

  • VMware: Use the vSphere CSI Driver. This allows OpenShift to talk to vCenter and dynamically provision .vmdk files as Persistent Volumes (PVs).
  • Bare Metal: You typically use the Local Storage Operator or LVM Storage for fast local SSDs, or OpenShift Data Foundation (ODF), which is based on Ceph. ODF is a common choice on-prem because it provides S3-compatible, Block, and File storage within the cluster itself.

3. High Availability & Networking

Q5: On Bare Metal, how do you handle Load Balancing for the API and Ingress?

Answer: Since there is no “AWS ELB” on-prem, you have two choices:

  1. External: Use a physical appliance like an F5 Big-IP or a pair of HAProxy nodes managed by your team.
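  2. Internal (MetalLB): Use the MetalLB Operator. It lets you assign a range of IPs from your corporate network to Services of type LoadBalancer, so a Service in front of the OpenShift Router can act like a cloud load balancer. MetalLB is not what fronts the API endpoint: on IPI installs the API and Ingress VIPs are handled by the built-in Keepalived setup, and on UPI installs by your external load balancer.

Q6: What happens if a Master (Control Plane) node dies in a Bare Metal cluster?

Answer:

  • Quorum: A three-member etcd cluster needs two healthy members for quorum. If one master dies, the cluster survives. If two die, quorum is lost and the API stops accepting writes or becomes unavailable.
  • Recovery: On Bare Metal, recovery is largely manual. You remove the failed member from etcd (etcdctl member remove), reinstall or replace the node, and approve its CSRs. Once the new control plane node has joined, the cluster-etcd-operator adds it back as an etcd member.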

4. Advanced Troubleshooting

Q7: A worker node is “NotReady” on VMware. What is your first check?

Answer: Beyond the logs, I check the VMware Tools status and Time Sync.

  • If the ESXi host and the VM have a clock drift (common if NTP is misconfigured), the certificates for the Kubelet will fail to validate, and the node will go NotReady.
  • I would also check the MachineConfigPool (MCP). If the node is stuck in “Updating,” it might be failing to pull the new OS image or to apply a MachineConfig.

Q8: What is “Assisted Installer”?

Answer: It’s the modern way to install OpenShift on-prem. It provides a web-based GUI that generates a “Discovery ISO.” You boot your physical servers with this ISO; they “check in” to the portal, and you can then click “Install” to deploy the whole cluster without writing complex YAML files.


Technical “Buzzwords” for 2026:

  • OVN-Kubernetes: The default network plugin (replaces OpenShift SDN).
  • LVM Storage: Used for high-performance databases on bare metal.
  • Red Hat Advanced Cluster Management (RHACM): If the company has multiple on-prem clusters, they will use this to manage them all from one dashboard.

Debugging etcd is one of the most demanding parts of OpenShift administration. A healthy etcd is a precondition for a healthy cluster; if etcd is failing, the API will be sluggish or completely unresponsive.

Here is the technical deep-dive on how to diagnose and fix etcd on-premise.


1. Checking the High-Level Status

Before diving into logs, check if the Etcd Operator is happy. If the operator is degraded, it usually means it’s struggling to manage the quorum.

# Check the status of the etcd cluster operator
oc get clusteroperator etcd
# Check the status of the individual etcd pods
oc get pods -n openshift-etcd -l app=etcd

2. Testing Quorum and Health (The etcdctl way)

In OpenShift 4.x, etcd runs as Static Pods on the master nodes. To run diagnostic commands, you must use a helper script or exec into the container.

The “Is it alive?” check:

# Check the health of every etcd member
oc rsh -n openshift-etcd etcd-master-0 etcdctl endpoint health --cluster -w table

The Performance check (Disk Latency):

On-premise (especially VMware), Disk I/O latency is the #1 killer of etcd. If your storage is slow, etcd members miss heartbeats, which triggers leader elections and can cost the cluster its quorum.

# Run etcd's built-in performance check (it generates load for about 60 seconds)
oc rsh -n openshift-etcd etcd-master-0 etcdctl check perf

Interview Pro-Tip: Mention that etcd expects the 99th percentile of WAL fsync latency to stay below 10ms. If it’s higher, your underlying VMware datastore or Bare Metal disks are too slow for an enterprise cluster.


3. Investigating Logs

If a pod is crashing, check the logs specifically for “leader” issues or “wal” (Write Ahead Log) errors.

# View the last 100 lines of logs from a specific member
oc logs -n openshift-etcd etcd-master-0 -c etcd --tail=100

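What to look for:

  • "lost leader": Indicates network instability between master nodes.
  • "apply entries took too long" (newer etcd versions log "apply request took too long"): Indicates slow disk or high CPU pressure on the master node.
  • "database space exceeded": The 8GB quota has been reached (requires compaction, a defrag, and then clearing the NOSPACE alarm with etcdctl alarm disarm).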
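The patterns above can be turned into a small log filter. This is a rough sketch rather than a full parser: the function name classify_etcd_log is hypothetical, the hints are paraphrased from the list, and the "took too long" match is deliberately loose so that it catches different wordings of the slow-apply warning.

```shell
# Hypothetical helper: read etcd log lines on stdin and print a hint for each
# known pattern. Lines that match nothing are ignored.
classify_etcd_log() {
  while IFS= read -r line; do
    case "$line" in
      *"lost leader"*)
        echo "network: check connectivity between control plane nodes" ;;
      *"took too long"*)
        echo "performance: check disk latency and CPU on the node" ;;
      *"database space exceeded"*)
        echo "quota: compact and defragment, then clear the alarm" ;;
    esac
  done
}

# Demo on two made-up log lines (not real cluster output).
printf '%s\n' \
  'raft: lost leader at term 4' \
  'mvcc: database space exceeded' | classify_etcd_log
```

On a live cluster the input would come from the logs command shown earlier: oc logs -n openshift-etcd etcd-master-0 -c etcd --tail=100 | classify_etcd_log.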

4. Critical Recovery: The “Master Node Replacement”

If a master node (e.g., master-1) hardware fails permanently on Bare Metal, you must follow these steps to restore the cluster health:

  1. Remove the ghost member: Tell etcd to stop looking for the dead node.
     oc rsh -n openshift-etcd etcd-master-0 etcdctl member list
     oc rsh -n openshift-etcd etcd-master-0 etcdctl member remove <dead-member-id>
  2. Clean up the Node object:
     oc delete node master-1
  3. Re-provision: Boot the new hardware with the RHCOS ISO. If using IPI, the Machine API might do this for you. If UPI, you must manually trigger the CSR (Certificate Signing Request) approval.
  4. Approve CSRs: The new master won’t join until you approve its certificates:
     oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
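The CSR one-liner in step 4 selects any line that contains the word Pending. A slightly stricter variant, sketched below, keys on the CONDITION column instead. It assumes CONDITION is the last column of oc get csr output; the function name pending_csr_names and the sample rows are made up for illustration.

```shell
# Hypothetical helper: print the names of CSRs whose last column is exactly
# "Pending". Reads `oc get csr` output on stdin and skips the header row.
pending_csr_names() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Demo on illustrative rows (not real cluster output).
printf '%s\n' \
  'NAME       AGE  SIGNERNAME                                   REQUESTOR             REQUESTEDDURATION  CONDITION' \
  'csr-aaaaa  15m  kubernetes.io/kube-apiserver-client-kubelet  node-bootstrapper     <none>             Pending' \
  'csr-bbbbb  15m  kubernetes.io/kubelet-serving                system:node:master-1  <none>             Approved,Issued' \
  | pending_csr_names
```

On a live cluster: oc get csr | pending_csr_names | xargs oc adm certificate approve. Review the names before approving; a new node usually needs two rounds, first the client CSR and then the serving CSR.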

5. Compaction and Defragmentation

Over time, etcd keeps versions of objects. If the database grows too large, the cluster will stop accepting writes (Error: mvcc: database space exceeded).

The Fix:

OpenShift normally handles this automatically, but as an admin, you might need to force it:

# Defragment all members in one go (Red Hat's documented procedure defragments one member at a time, with the leader last)
oc rsh -n openshift-etcd etcd-master-0 etcdctl defrag --cluster

The “Final Boss” Interview Question:

“We lost 2 out of 3 master nodes. The API is down. How do you recover?”

The Answer:

  1. Since quorum is lost (a cluster of $n$ members needs $\lfloor n/2 \rfloor + 1$ of them, so 2 of 3), you must perform a Single Master Recovery.
  2. Stop etcd on the remaining healthy master (it runs as a static pod, not as a systemd service).
  3. Run the restore script shipped with OpenShift (cluster-restore.sh in current releases; early 4.x releases used etcd-snapshot-restore.sh) using a previous backup.
  4. This forces the remaining master to become a “New Cluster” of one.
  5. Once the API is back up, you re-join the other two nodes as brand-new members.
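The quorum arithmetic behind this answer is easy to check by hand or in a shell. The two function names below are hypothetical; the calculation is the standard majority rule, using integer division.

```shell
# Members required for quorum: floor(n / 2) + 1 (shell division is integer).
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while quorum is still held.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(fault_tolerance "$n")"
done
# members=1 quorum=1 tolerated_failures=0
# members=3 quorum=2 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
```

With three masters, losing two leaves one member against a required two, which is why the recovery has to start from a backup on a single node.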

Since OpenShift 4.12, OVN-Kubernetes has been the default network provider for new installations, replacing the older OpenShift SDN. For an on-premise administrator, understanding this is vital because it changes how traffic flows from your physical switches into your pods.


1. OVN-Kubernetes Architecture

Unlike the old SDN which used Open vSwitch (OVS) in a basic way, OVN (Open Virtual Network) brings a distributed logical router and switch to every node.

  • Geneve Encap: OVN uses Geneve (Generic Network Virtualization Encapsulation) instead of VXLAN to tunnel traffic between nodes. It’s more flexible and allows for more metadata.
  • The Gateway: Every node has a “Gateway” that handles traffic entering and exiting the cluster. On-premise, this is where your physical network interface (e.g., eno1 or ens192) meets the virtual world.

2. On-Premise Networking Challenges

Q1: How does OpenShift handle “External” IPs on-prem?

In the cloud, you have a LoadBalancer service. On-prem, you don’t.

The Admin Solution: MetalLB.

As an admin, you configure a MetalLB Operator with an IP address pool from your actual data center VLAN. When a developer creates a Service of type LoadBalancer, MetalLB uses ARP (Layer 2) or BGP (Layer 3) to announce that IP address to your physical routers.

Q2: What is the “Ingress VIP” and “API VIP”?

During a VMware/Bare Metal IPI install, you are asked for two IPs:

  1. API VIP: The floating IP used to talk to the control plane (Port 6443).
  2. Ingress VIP: The floating IP for all application traffic (Ports 80/443).

Mechanism: OpenShift uses Keepalived and HAProxy internally to float these IPs between the master nodes (for API) or worker nodes (for Ingress). If the node holding the IP fails, it “floats” to another node in seconds.

3. Troubleshooting the Network

If pods can’t talk to each other, follow this “inside-out” path:

Step 1: Check the Cluster Network Operator (CNO)

If the CNO is degraded, the entire network is unstable.

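oc get clusteroperator network

Step 2: Check the built-in connectivity checks

On OpenShift 4.x, the Cluster Network Operator runs continuous connectivity checks from a source pod to targets such as the API servers and a check pod on every node. The results are recorded as PodNetworkConnectivityCheck objects:

oc get podnetworkconnectivitycheck -n openshift-network-diagnostics

To trace a single flow between two specific pods, OVN-Kubernetes also provides the ovnkube-trace tool.

Step 3: Inspect the OVN Database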

Since OVN stores the network state in a database (Northbound and Southbound DBs), you can check if the logical flows are actually created.

# Get the logs of the ovnkube-master pods
# (on OpenShift 4.14 and later the control plane pods are labeled app=ovnkube-control-plane)
oc logs -n openshift-ovn-kubernetes -l app=ovnkube-master

4. Key Concepts for Interview Scenarios

Scenario: “Applications are slow only when talking to external databases.”

  • Likely Culprit: MTU Mismatch.
  • Explanation: Geneve encapsulation adds 100 bytes of overhead to every packet. If your physical network is set to standard MTU (1500), but OpenShift is also sending 1500, the packets get fragmented, causing a massive performance hit.
  • The Fix: Ensure the cluster MTU is set to 1400 (1500 – 100) or enable Jumbo Frames (9000) on your physical switches.
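The MTU arithmetic from the fix can be sketched as a one-line helper. The function name cluster_mtu is hypothetical, and the default overhead of 100 bytes is the Geneve figure quoted above.

```shell
# Largest cluster (overlay) MTU that fits inside the physical MTU once the
# encapsulation overhead is subtracted. The overhead defaults to 100 bytes.
cluster_mtu() {
  physical="$1"
  overhead="${2:-100}"
  echo $(( physical - overhead ))
}

cluster_mtu 1500   # prints 1400
cluster_mtu 9000   # prints 8900
```

With jumbo frames on the physical network, the same subtraction gives the largest value the overlay can safely use.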
Scenario: “How do you isolate traffic between two departments on the same cluster?”

  • The Answer: NetworkPolicies.
  • OVN-Kubernetes supports standard Kubernetes NetworkPolicy objects. By default, all pods can talk to all pods. I would implement a “Deny-All” default policy and then explicitly allow traffic only between required microservices.

Summary for Administrator Interview

Feature         | OpenShift SDN (Old) | OVN-Kubernetes (New/Standard)
Encapsulation   | VXLAN               | Geneve
Network Policy  | Limited             | Fully Featured (Egress/Ingress)
Hybrid Cloud    | Hard to implement   | Designed for it (IPsec support)
Windows Support | No                  | Yes

Essential OpenShift Q&A: Architecture, Security & Workflow

In an OpenShift interview, the questions typically fall into three categories: Architecture/Concepts, Security (SCCs/RBAC), and Developer Workflow (S2I/Builds).

Here is a curated list of the most common and high-impact questions for 2026.


1. Core Architecture & Concepts

Q1: What is the fundamental difference between OpenShift and Kubernetes?

Answer: While Kubernetes is an open-source orchestration engine, OpenShift is a downstream, enterprise-grade distribution of Kubernetes by Red Hat.

  • The “Plus” Factor: OpenShift includes everything in Kubernetes but adds a built-in container registry, integrated CI/CD pipelines (Tekton), a developer-friendly web console, and enhanced security defaults.
  • Security: By default, OpenShift forbids containers from running as root, whereas vanilla Kubernetes is “open” by default.

Q2: What is an OpenShift “Project” vs. a Kubernetes “Namespace”?

Answer: A Project is simply an abstraction on top of a Kubernetes Namespace.

  • It adds metadata and facilitates Self-Service: users can request projects via the CLI (oc new-project) or Web Console.
  • Through a customized project template, admins can have default Resource Quotas and Limit Ranges applied to every new namespace to prevent a single user from exhausting the cluster. These are not applied out of the box.

Q3: Explain the role of the Router (HAProxy) in OpenShift.

Answer: In vanilla Kubernetes, you typically install an Ingress Controller (like NGINX). In OpenShift, the Router (based on HAProxy) is a core component. It provides the external entry point for traffic, mapping an external URL (a Route) to an internal Service.


2. Developer & Build Workflow

Q4: What is Source-to-Image (S2I) and why is it used?

Answer: S2I is a toolkit that allows developers to provide only their source code (via a Git URL). OpenShift then:

  1. Detects the language (Java, Python, Node, etc.).
  2. Injects the code into a “Builder Image.”
  3. Assembles the final application image.

Benefit: Developers don’t need to know how to write a Dockerfile or manage base images, ensuring consistent security patches at the base layer.

Q5: What is a BuildConfig?

Answer: A BuildConfig is the definition of the entire build process. It specifies:

  • Source: Where the code is (Git).
  • Strategy: How to build it (S2I, Docker, or Pipeline).
  • Output: Where to push the resulting image (internal registry).
  • Triggers: Events that start a build (e.g., a code commit or an update to the base image).

3. Security & Operations

Q6: What are Security Context Constraints (SCCs)?

Answer: SCCs are one of the most important security features in OpenShift. They control what actions a pod can perform.

  • Restricted SCC: The default (restricted-v2 on OpenShift 4.11 and later). It prevents pods from running as root and limits access to the host filesystem.
  • Anyuid SCC: Often used when migrating legacy Docker images that must run as a specific user.
  • Privileged SCC: Full access (usually reserved for infra components like logging or monitoring).

Q7: How does OpenShift handle Persistent Storage?

Answer: OpenShift uses the Persistent Volume (PV) and Persistent Volume Claim (PVC) model.

  • An administrator provisions PVs (storage chunks).
  • A developer requests storage via a PVC.
  • OpenShift uses Storage Classes to dynamically provision storage on the fly (e.g., on AWS EBS or VMware vSphere) when a PVC is created.

4. Scenario-Based “Pro” Question

Q8: “A pod is failing with a CrashLoopBackOff. How do you troubleshoot it in OpenShift?”

Answer: Walk through these 4 steps to show you have hands-on experience:

  1. Check Status: oc get pods to see the status.
  2. Examine Logs: oc logs <pod_name> (use --previous if the container already restarted).
  3. Inspect Events: oc describe pod <pod_name> to look for failed mounts, scheduling issues, or “Back-off” events.
  4. Debug Session: Use oc debug pod/<pod_name> to launch a terminal inside a clone of the failing pod to inspect the filesystem and environment variables.
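The first three steps are read-only and can be chained into one helper. This is a sketch under stated assumptions: the function name crashloop_triage is hypothetical, and the interactive debug session from step 4 is only printed as a suggestion rather than started.

```shell
# Hypothetical helper: run the read-only triage steps for one pod, then print
# the interactive debug command instead of launching it.
crashloop_triage() {
  pod="$1"
  ns="$2"
  oc get pod "$pod" -n "$ns"
  # --previous shows the logs of the container instance that crashed.
  oc logs "$pod" -n "$ns" --previous
  oc describe pod "$pod" -n "$ns"
  echo "Next step: oc debug pod/$pod -n $ns"
}
```

Usage: crashloop_triage my-app-7d9c5 team-b, where the pod and namespace names are placeholders.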

5. Rapid-Fire Command Cheat Sheet

Task           | Command
Login          | oc login <api-url>
Create App     | oc new-app https://github.com/user/repo
Scale App      | oc scale --replicas=3 deployment/my-app
Expose Service | oc expose svc/my-service
View Resources | oc get all
Check SCCs     | oc get scc

For the Administrator track, the interview will shift away from “how to deploy an app” toward Cluster Health, Lifecycle Management, and Infrastructure Stability.

In OpenShift 4.x (the modern standard), the “Operator-focused” architecture is the star of the show. Here are the heavy-hitting admin questions you should be ready for.


1. The Operator Framework

Q1: What is the “Operator Pattern” and why is it central to OpenShift 4?

Answer: In OpenShift 4, the entire cluster is managed by Operators. An Operator is a custom controller that encodes human operational knowledge into software.

  • The Loop: It constantly monitors the Actual State of a component (like the Internal Registry or Monitoring stack) and compares it to the Desired State. If they differ, the Operator automatically fixes it.
  • Cluster Version Operator (CVO): This is the “Master Operator” that manages the updates of the cluster itself, ensuring all core components are at the correct version.

Q2: How do you perform a Cluster Upgrade in OpenShift 4?

Answer: Upgrades are managed via the Cluster Version Operator (CVO).

  • Process: You typically update the “Channel” (e.g., stable-4.14) and then trigger the upgrade via the console or the CLI: oc adm upgrade --to-latest=true. Plain oc adm upgrade only shows the current status and the available updates.
  • Mechanism: The CVO orchestrates the update of every operator in the cluster. The Machine Config Operator (MCO) handles the rolling reboot of nodes to update the underlying Red Hat Enterprise Linux CoreOS (RHCOS).

2. Infrastructure & Nodes

Q3: What is the Machine Config Operator (MCO)?

Answer: The MCO is one of the most important components for an admin. It treats the underlying nodes like “cattle, not pets.”

  • It manages the operating system (RHCOS) itself.
  • If you need to change a kernel parameter, add an SSH key, or change an NTP setting across 50 nodes, you create a MachineConfig object. The MCO then applies that change and reboots nodes in a rolling fashion to avoid cluster downtime.

Q4: Explain the difference between IPI and UPI installation.

Answer:

  • IPI (Installer-Provisioned Infrastructure): Full automation. The OpenShift installer has credentials to your cloud (AWS, Azure, etc.) and creates the VMs, VPCs, and Load Balancers for you.

  • UPI (User-Provisioned Infrastructure): The admin manually creates the infrastructure (VMs, networking, storage). You then run the installer to “bootstrap” OpenShift onto those pre-existing resources. (Common in highly regulated or air-gapped environments).

3. Storage & Networking

Q5: How do you troubleshoot a Node that is in “NotReady” status?

Answer: I follow a systematic checklist:

  1. Check Node Details: oc describe node <node_name> to look at the “Conditions” section (e.g., MemoryPressure, DiskPressure, or NetworkUnavailable).
  2. Verify Kubelet: SSH into the node (or use oc debug node) and check the kubelet logs: journalctl -u kubelet.
  3. Resource Usage: Check if the node has run out of PIDs or Disk space.
  4. CSRs: If the node was recently added/rebooted, check if there are pending Certificate Signing Requests: oc get csr and approve them if necessary.

Q6: What is the “In-tree” to CSI migration?

Answer: Older versions of OpenShift used storage drivers built directly into the Kubernetes binary (“In-tree”). Modern OpenShift is moving to CSI (Container Storage Interface) drivers. As an admin, this means storage is now handled by standalone operators, allowing for easier updates without upgrading the whole cluster.


4. Security & Etcd

Q7: Why is the etcd backup critical, and how do you perform it?

Answer: etcd is the “brain” of the cluster; it stores every configuration and state. If etcd is lost, the cluster is dead.

  • Backup: The cluster-etcd-operator places a backup script on the control plane nodes. I would run it from a debug session on one master, inside the host filesystem: oc debug node/<master-node> -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup.
  • Strategy: Always take a backup before a cluster upgrade.

5. Monitoring & Logging

Q8: What stack does OpenShift use for Cluster Monitoring?

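Answer: OpenShift comes with a pre-configured monitoring stack built on Prometheus and Alertmanager (managed by the Monitoring Operator). Older releases also bundled Grafana; recent releases show the dashboards in the web console instead.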

  • Note: Admins use this to monitor cluster health (CPU/Mem of nodes).
  • User Workload Monitoring: In newer versions, admins can enable “User Workload Monitoring” to allow developers to use the same Prometheus stack for their own applications without interfering with the cluster’s core monitoring.

Summary Checklist for your Interview

Tip: If they ask about a problem you can’t solve, always mention “Looking at the Operators.” In OpenShift 4, if something is broken, check oc get clusteroperators. If one is DEGRADED=True, that is your smoking gun.
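Picking the degraded operators out of a long list is easy to script. The sketch below assumes the default column order of oc get clusteroperators (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE, MESSAGE) and that the VERSION column is populated. The function name degraded_operators and the sample rows are made up for illustration.

```shell
# Hypothetical helper: print the names of cluster operators whose DEGRADED
# column (fifth field) is True. Reads `oc get clusteroperators` output on stdin.
degraded_operators() {
  awk 'NR > 1 && $5 == "True" { print $1 }'
}

# Demo on illustrative rows (not real cluster output).
printf '%s\n' \
  'NAME     VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE' \
  'etcd     4.14.8   True       False        False     12d' \
  'network  4.14.8   True       False        True      3h' \
  | degraded_operators
```

On a live cluster: oc get clusteroperators | degraded_operators, followed by oc describe clusteroperator <name> for each name it prints.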

OCP can run in different environments, such as on-premise (VMware/Bare Metal) or as a managed service (ROSA/ARO).

Linux network

Linux network issues can come from a lot of places—misconfiguration, hardware problems, DNS failures, firewall rules, or even upstream network outages. The trick is to diagnose step-by-step instead of guessing. Here’s a practical workflow you can follow.


1. Check Basic Connectivity

Start simple—make sure the system is even connected.

ip a

  • Look for an IP address (not just 127.0.0.1)
  • Check if the interface is UP

ip link

  • Verify your interface (e.g., eth0, wlan0) is not DOWN

If it’s down:

sudo ip link set eth0 up

2. Test Network Reachability

Try pinging known addresses:

ping -c 4 8.8.8.8

  • If this fails → no internet or routing issue

Then test DNS:

ping -c 4 google.com

  • If IP ping works but domain fails → DNS problem
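The decision rule behind these two pings can be written down explicitly. The sketch below takes the two exit codes as arguments so that the logic can be read without a network. The function name diagnose_reachability and the wording of the messages are illustrative.

```shell
# Hypothetical helper: interpret the exit codes of the IP ping and the
# hostname ping (0 means success) according to the rule above.
diagnose_reachability() {
  ip_status="$1"
  name_status="$2"
  if [ "$ip_status" -ne 0 ]; then
    echo "routing or upstream connectivity problem"
  elif [ "$name_status" -ne 0 ]; then
    echo "DNS problem"
  else
    echo "connectivity and DNS look fine"
  fi
}

# Usage on a live host:
#   ping -c 4 8.8.8.8 >/dev/null 2>&1; ip_rc=$?
#   ping -c 4 google.com >/dev/null 2>&1; name_rc=$?
#   diagnose_reachability "$ip_rc" "$name_rc"
```

Testing the IP first matters: if the IP ping fails, the hostname ping will fail too, so its result says nothing about DNS.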

3. Check Routing Table

ip route

Look for a default route:

default via 192.168.1.1 dev eth0

If missing, add it:

sudo ip route add default via <gateway-ip>
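The check for a default route can be scripted before anything is added. This is a minimal sketch that assumes the usual ip route output, where the default route line starts with the word default; the function name has_default_route is hypothetical.

```shell
# Hypothetical helper: succeed if the routing table on stdin has a default route.
has_default_route() {
  grep -q '^default '
}

# Demo on an illustrative table that has no default route.
printf '%s\n' '192.168.1.0/24 dev eth0 scope link' \
  | has_default_route || echo "no default route"
```

On a live host: ip route | has_default_route || echo "no default route".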

4. Verify DNS Configuration

Check:

cat /etc/resolv.conf

You should see something like:

nameserver 8.8.8.8

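If not, temporarily add a nameserver line by editing the file (on many systems the file is regenerated, so the change will not persist):

sudo nano /etc/resolv.conf

Or, if using systemd-resolved:

resolvectl status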

5. Check Network Manager / Services

Depending on your system:

Using NetworkManager

nmcli device status
nmcli connection show

Restart:

sudo systemctl restart NetworkManager

Using systemd-networkd

networkctl status
sudo systemctl restart systemd-networkd

6. Inspect Firewall Rules

A firewall can silently block traffic.

sudo iptables -L -n

Or if using UFW:

sudo ufw status

Temporarily disable for testing only, then re-enable with sudo ufw enable:

sudo ufw disable

7. Check Open Ports

ss -tulnp

This shows:

  • Listening services
  • Ports in use

8. Test with traceroute

See where packets stop:

traceroute 8.8.8.8

If it stops early → local network issue
If it goes far but fails → ISP or remote issue


9. Check Logs

Logs often reveal the real problem:

journalctl -u NetworkManager
journalctl -xe
dmesg | grep -i network

10. Wireless-Specific Checks

If using Wi-Fi:

iwconfig

Scan networks:

sudo iw dev wlan0 scan

Reconnect:

nmcli device wifi connect "SSID" password "PASSWORD"

11. Hardware / Driver Issues

Check drivers:

lspci | grep -i network
lsmod | grep <driver>

If missing drivers, you may need to install firmware packages.


Common Problems & Quick Fixes

  • No IP address → DHCP issue → restart network service
  • Can ping IP but not domain → DNS misconfigured
  • Interface down → enable with ip link set
  • Connected but no internet → gateway missing
  • Wi-Fi connected but slow → interference or driver issue

On Ubuntu Desktop, networking is usually managed by NetworkManager (Ubuntu Server typically uses systemd-networkd through Netplan), so troubleshooting is a bit more standardized. The following is a focused Ubuntu workflow.


1. Check if NetworkManager is running

sudo systemctl status NetworkManager

If it’s not active:

sudo systemctl restart NetworkManager

2. See connection status

nmcli device status

You’ll see something like:

  • connected → good
  • disconnected → problem
  • unmanaged → config issue
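
The state-to-meaning mapping above could be applied to every device at once. The sketch below assumes nmcli's terse output (nmcli -t -f DEVICE,STATE device status), which prints one device:state pair per line; the function name and hint strings are invented for illustration.

```shell
# Hypothetical helper: read `nmcli -t -f DEVICE,STATE device status`
# output on stdin (one "device:state" pair per line) and print a hint.
explain_nmcli_states() {
  while IFS=: read -r dev state; do
    case "$state" in
      connected*)   echo "$dev: good" ;;          # also "connected (externally)"
      disconnected) echo "$dev: problem" ;;
      unmanaged)    echo "$dev: config issue" ;;
      *)            echo "$dev: $state" ;;        # pass other states through
    esac
  done
}

# Usage:
#   nmcli -t -f DEVICE,STATE device status | explain_nmcli_states
```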

To reconnect:

nmcli connection up <connection-name>

3. Check IP address

ip a

Look for something like:

inet 192.168.x.x

❌ No IP? Try renewing DHCP:

sudo dhclient -v

4. Test connectivity

ping -c 4 8.8.8.8
  • ❌ fails → network/gateway issue
  • ✅ works → move to DNS test:
ping -c 4 google.com
  • ❌ fails → DNS problem

5. Check default gateway

ip route

You should see:

default via 192.168.x.1

If missing (substitute your own gateway address):

sudo ip route add default via 192.168.1.1

6. Fix DNS (very common on Ubuntu)

Check:

cat /etc/resolv.conf

If it’s broken or empty, try:

sudo nano /etc/systemd/resolved.conf

Set, under the [Resolve] section:

DNS=8.8.8.8 1.1.1.1

Then restart:

sudo systemctl restart systemd-resolved

7. Wi-Fi specific (if applicable)

Check:

nmcli device wifi list

Reconnect:

nmcli device wifi connect "SSID" password "PASSWORD"

If the Wi-Fi adapter is missing from the list, check that the hardware is detected:

lspci | grep -i network

8. Firewall check

Ubuntu often uses UFW:

sudo ufw status

Temporarily disable:

sudo ufw disable

9. Look at logs (very useful)

journalctl -u NetworkManager --no-pager | tail -50

Fast “reset everything” fix (works surprisingly often)

sudo systemctl restart NetworkManager
sudo systemctl restart systemd-resolved
sudo dhclient