Troubleshooting Egress Issues in OpenShift Namespaces

April 20, 2026 | techhadoop | OCP | azure, cloud, devops, kubernetes, technology

Egress that works from one namespace and fails from another is a classic OpenShift case: egress controls can be namespace-scoped, so one project can reach the internet while another cannot, even though both are on the same cluster. In OpenShift with OVN-Kubernetes, the main things to check are Kubernetes NetworkPolicy egress rules, OpenShift EgressFirewall objects, and sometimes EgressIP if the namespace is supposed to leave the cluster from a specific source IP.

OpenShift documents EgressFirewall as a namespace-level object, and Kubernetes documents that once a pod is selected by an egress policy, only the explicitly allowed outbound traffic is permitted. (Red Hat Documentation)

Scenario

Pods in namespace team-a can reach external sites, but pods in team-b cannot.

Examples:

			
oc exec -n team-a deploy/app -- curl https://example.com   # works
oc exec -n team-b deploy/app -- curl https://example.com   # fails

That pattern strongly suggests the problem is policy attached to the namespace, not a cluster-wide outage. OpenShift’s EgressFirewall is evaluated per namespace, and if there is no matching rule then traffic is allowed by default unless something else, like a NetworkPolicy, restricts it. (Red Hat Documentation)

Diagram

          Namespace team-a                 Namespace team-b
      +---------------------+           +---------------------+
      | pod -> external IP  |           | pod -> external IP  |
      +----------+----------+           +----------+----------+
                 |                                 |
                 v                                 v
        [no blocking policy]            [NetworkPolicy and/or
                 |                     EgressFirewall applies]
                 v                                 |
           traffic allowed                         v
                                        traffic denied or limited

Where namespace-specific egress can break:
1) Egress NetworkPolicy in that namespace
2) EgressFirewall object in that namespace
3) EgressIP expected for that namespace but misconfigured
4) DNS works, but external traffic is filtered after resolution

How to debug it

1. Prove it is really namespace-specific

Run the same test from a working namespace and a failing one:

			
oc exec -n team-a deploy/app -- curl -I https://example.com
oc exec -n team-b deploy/app -- curl -I https://example.com

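Then test DNS and a direct IP separately from the failing namespace. Substitute an external IP you know is reachable, and pass -k because the server certificate will not match a bare IP:

oc exec -n team-b deploy/app -- nslookup example.com
oc exec -n team-b deploy/app -- curl -kI --connect-timeout 5 https://93.184.216.34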

If DNS works but outbound HTTP to external IPs fails, that points more toward egress filtering than DNS. This is an inference from Kubernetes DNS and policy behavior together. (Kubernetes)

2. Check `NetworkPolicy` in the failing namespace

This is the first thing I’d inspect:

			
oc get networkpolicy -n team-b
oc get networkpolicy -n team-b -o yaml

Kubernetes says that if a pod is selected by a policy with policyTypes: [Egress], the allowed outbound traffic is restricted to what the policy permits. A “default deny all ingress and all egress” policy is a standard pattern. (Kubernetes)
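A minimal Python sketch of that rule, assuming a simplified policy shape: podSelector and policyTypes mirror the real fields, while egressCidrs is a hypothetical stand-in for ipBlock rules. It illustrates the semantics only, not how OVN-Kubernetes implements them.

```python
import ipaddress


def selects(pod_selector, pod_labels):
    # An empty selector ({}) selects every pod in the namespace.
    return all(pod_labels.get(k) == v for k, v in pod_selector.items())


def egress_allowed(pod_labels, policies, dest_ip):
    """Return True if dest_ip is reachable under the given policies."""
    selecting = [
        p for p in policies
        if "Egress" in p["policyTypes"] and selects(p["podSelector"], pod_labels)
    ]
    if not selecting:
        return True  # no egress policy selects the pod: nothing is restricted
    dest = ipaddress.ip_address(dest_ip)
    # Policies are additive: any selecting policy that allows the destination wins.
    return any(
        dest in ipaddress.ip_network(cidr)
        for p in selecting
        for cidr in p["egressCidrs"]
    )


# Hypothetical policies for illustration.
default_deny = {"podSelector": {}, "policyTypes": ["Egress"], "egressCidrs": []}
allow_internal = {
    "podSelector": {"app": "api"},
    "policyTypes": ["Egress"],
    "egressCidrs": ["10.0.0.0/8"],
}

print(egress_allowed({"app": "api"}, [default_deny, allow_internal], "10.1.2.3"))
```

Note that a default-deny policy plus one narrow allow rule is enough to cut off the internet for every pod in the namespace.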

Typical bad case:

			
policyTypes:
- Egress
egress:
- to:
  - namespaceSelector:
      matchLabels:
        name: internal-only

		

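That would allow only a narrow set of destinations and block internet egress. Because DNS queries are egress traffic too, it would also block lookups against the cluster DNS pods unless the openshift-dns namespace matches the selector, so name resolution can fail along with the external connection.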

3. Check for an OpenShift `EgressFirewall`

OpenShift provides EgressFirewall as a namespace object for controlling traffic from pods to destinations outside the cluster. It is specific to OVN-Kubernetes. (Red Hat Documentation)

Commands:

			
oc get egressfirewall -n team-b
oc get egressfirewall -n team-b -o yaml

OpenShift documents that traffic to an IP outside the cluster is checked against the namespace’s EgressFirewall rules in order. If a rule matches, that action applies; if no rule matches, traffic is allowed by default. (Red Hat Documentation)
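The first-match behavior can be sketched in a few lines of Python. The (action, CIDR) tuples are a hypothetical in-memory form of the rules, not the EgressFirewall schema, and DNS-name rules are left out.

```python
import ipaddress


def egress_firewall_verdict(rules, dest_ip):
    """Return the action of the first rule whose CIDR contains dest_ip."""
    dest = ipaddress.ip_address(dest_ip)
    for action, cidr in rules:
        if dest in ipaddress.ip_network(cidr):
            return action
    return "Allow"  # documented default when no rule matches


# Hypothetical rule list: allow one partner range, deny everything else.
rules = [
    ("Allow", "203.0.113.0/24"),
    ("Deny", "0.0.0.0/0"),
]

print(egress_firewall_verdict(rules, "203.0.113.10"))
print(egress_firewall_verdict(rules, "198.51.100.7"))
```

Order matters: moving the catch-all Deny to the top would block the partner range as well.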

A realistic blocking example is a namespace with rules allowing only a few CIDRs or DNS names and denying everything else.

4. Check whether the namespace is supposed to use `EgressIP`

If the application depends on a fixed source IP for outbound allowlisting, verify whether EgressIP is configured and healthy. OpenShift documents that an egress IP can be assigned to a namespace and is distinct from an egress router. (Red Hat Documentation)

Check:

			
oc get egressip
oc describe egressip <name>

If team-b is expected to leave via a specific egress IP and that configuration is broken, outbound access to third-party systems may fail even though generic internet access from other namespaces works. That last part is an inference from how vendor allowlists usually interact with source IP–based egress. (Red Hat Documentation)

5. Verify DNS separately

Sometimes people say “egress is broken” when the real failure is DNS.

			
oc exec -n team-b deploy/app -- nslookup example.com
oc exec -n team-b deploy/app -- curl -I https://example.com
oc exec -n team-b deploy/app -- curl -kI https://93.184.216.34

Interpretation:

nslookup fails, IP curl fails: maybe DNS or broader networking
nslookup works, IP curl fails: likely egress filtering
nslookup fails, IP curl works: DNS-only issue

That distinction follows from Kubernetes DNS behavior plus the documented policy mechanisms above. (Kubernetes)
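The interpretation table can be written as a tiny Python function. The first three outcomes come from the list above; the wording of the fourth (both checks pass) is an assumption added to make the function total.

```python
def classify(dns_ok, ip_ok):
    """Map the nslookup and direct-IP curl results to a likely cause."""
    if not dns_ok and not ip_ok:
        return "DNS or broader networking"
    if dns_ok and not ip_ok:
        return "likely egress filtering"
    if not dns_ok and ip_ok:
        return "DNS-only issue"
    # Assumed fourth case, not in the original table.
    return "both layers work; look at the specific destination"


print(classify(dns_ok=True, ip_ok=False))
```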

6. Compare with a working namespace

This is one of the fastest ways to spot the difference:

			
oc get networkpolicy -n team-a -o yaml
oc get networkpolicy -n team-b -o yaml
oc get egressfirewall -n team-a -o yaml
oc get egressfirewall -n team-b -o yaml

When only one namespace is failing, the delta between those objects often explains it immediately.

7. Check whether the block is by destination type

OpenShift supports EgressFirewall rules for external destinations, and OpenShift also documents audit logging for egress firewall and network policy, which can help when you need proof of what is being denied. (Red Hat Documentation)

Ask:

does external IP fail?
does internal service traffic still work?
does only one external domain fail?

That helps separate “internet blocked” from “specific destinations blocked.”

What this usually turns out to be

Most common causes:

Default deny egress NetworkPolicy in the failing namespace. Kubernetes explicitly documents this pattern. (Kubernetes)
Namespace EgressFirewall allowing only selected external destinations. OpenShift documents EgressFirewall as namespace-scoped and processed rule by rule for external IP traffic. (Red Hat Documentation)
Broken or missing EgressIP where the app depends on outbound source-IP allowlists. OpenShift documents namespace egress IP configuration separately from egress routers. (Red Hat Documentation)
Misdiagnosed DNS problem, where name resolution fails and looks like internet egress failure. (Red Hat Documentation)

Fast triage sequence

			
oc exec -n team-b deploy/app -- nslookup example.com
oc exec -n team-b deploy/app -- curl -kI https://93.184.216.34
oc get networkpolicy -n team-b -o yaml
oc get egressfirewall -n team-b -o yaml
oc get egressip

		

Mental model

When egress fails only in some namespaces:

think namespace policy first
then think OpenShift EgressFirewall
then think EgressIP expectations
only after that think cluster-wide OVN trouble

Because if it were a true cluster-wide OVN failure, you would usually see the problem across many namespaces, not just one. That last point is an operational inference, but it is a very useful one. (Red Hat Documentation)

Debugging DNS Issues in OpenShift Pods

April 20, 2026 | techhadoop | OCP | cloud, devops, kubernetes, linux, technology

DNS works for some pods but not others: this one is tricky because it often looks like OVN, but a lot of the time it is actually DNS path, namespace lookup, or pod DNS config.

In OpenShift, the DNS Operator manages CoreDNS for pod and service name resolution, and CoreDNS runs as the dns-default daemon set in openshift-dns. Pods rely on kubelet-provided DNS settings in /etc/resolv.conf to reach those DNS servers. (Red Hat Documentation)

Scenario

Some pods can resolve service names, but others cannot.

Examples:

Pod A: nslookup backend-service ✅
Pod B: nslookup backend-service ❌

That usually means one of these:

the failing pod has bad DNS settings,
the query is being made from the wrong namespace,
only some nodes can reach the DNS pods,
or the DNS pods themselves are unhealthy on part of the cluster. (Red Hat Documentation)

Diagram

                +------------------------------+
                |        failing pod           |
                |  /etc/resolv.conf            |
                |  nameserver -> DNS service   |
                +--------------+---------------+
                               |
                               v
                    +---------------------+
                    |   CoreDNS /         |
                    |   dns-default pods  |
                    |   in openshift-dns  |
                    +----------+----------+
                               |
                 resolves svc/pod names from cluster state
                               |
                               v
                    +---------------------+
                    |  Service / Pod DNS  |
                    |  records            |
                    +---------------------+

Where it breaks:
1) Pod resolv.conf is wrong
2) Pod queries wrong namespace
3) Pod/node cannot reach dns-default
4) dns-default pods unhealthy
5) Name exists, but target service/endpoints are wrong

How to debug it

1. Prove it is DNS and not general networking

From a good pod and a bad pod, test both DNS and direct IP access:

			
oc exec -it <good-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- curl http://<service-cluster-ip>:<port>
oc exec -it <bad-pod> -- curl http://<pod-ip>:<port>

If IP-based access works but nslookup fails, that points strongly to DNS rather than OVN datapath routing. Kubernetes service and pod discovery are meant to work through DNS records. (Kubernetes)

2. Check the failing pod’s `/etc/resolv.conf`

This is one of the fastest checks:

oc exec -it <bad-pod> -- cat /etc/resolv.conf

A normal pod DNS config should include a cluster DNS nameserver and search domains such as <namespace>.svc.cluster.local, svc.cluster.local, and cluster.local; Kubernetes documents options ndots:5 as typical too. If those are missing or odd, the pod DNS setup is wrong. (Kubernetes)
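A rough Python sketch of that check, assuming the default cluster domain cluster.local and a made-up sample file (the nameserver IP is only an example value):

```python
def parse_resolv_conf(text):
    """Collect nameserver, search, and options entries from resolv.conf text."""
    conf = {"nameservers": [], "search": [], "options": []}
    for line in text.splitlines():
        parts = line.split("#", 1)[0].split()  # drop comments and blank lines
        if not parts:
            continue
        key, values = parts[0], parts[1:]
        if key == "nameserver":
            conf["nameservers"].extend(values)
        elif key == "search":
            conf["search"] = values
        elif key == "options":
            conf["options"].extend(values)
    return conf


def looks_like_cluster_dns(conf, namespace, cluster_domain="cluster.local"):
    """True if a nameserver is set and the usual search domains are present."""
    expected = [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    return bool(conf["nameservers"]) and all(d in conf["search"] for d in expected)


# Sample content for illustration only.
SAMPLE = """search team-b.svc.cluster.local svc.cluster.local cluster.local
nameserver 172.30.0.10
options ndots:5
"""

print(looks_like_cluster_dns(parse_resolv_conf(SAMPLE), "team-b"))
```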

3. Make sure the pod is querying the right namespace

A very common false alarm:

			
oc exec -it <bad-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- nslookup backend-service.<namespace>

Kubernetes says unqualified service names are resolved relative to the pod’s own namespace. So backend-service from namespace frontend will not find a service that lives in namespace backend unless you query backend-service.backend. (Kubernetes)
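To see why the short name misses, here is a small Python sketch of how a resolver typically expands a name using the search list and ndots. It assumes the common resolver behavior (search domains first when the name has fewer dots than ndots) and a pod in namespace frontend.

```python
def candidate_names(name, search, ndots=5):
    """Return the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        return [name]  # already absolute, the search list is not used
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


# Search list of a pod running in namespace "frontend".
search = ["frontend.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for candidate in candidate_names("backend-service", search):
    print(candidate)
```

None of the candidates for backend-service lands in the backend namespace, while backend-service.backend expands to backend-service.backend.svc.cluster.local. on the second try.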

4. Check whether the DNS pods are healthy

In OpenShift, look at the DNS operator and DNS pods:

			
oc get clusteroperator dns
oc get pods -n openshift-dns
oc get pods -n openshift-dns-operator

Red Hat documents that the DNS Operator manages CoreDNS, and that CoreDNS runs as the dns-default daemon set. If those pods are crashlooping, pending, or missing on expected nodes, pods may lose name resolution. (Red Hat Documentation)

5. Check whether only some nodes are affected

If only pods on one worker fail DNS, compare node placement:

			
oc get pods -A -o wide | grep <failing-node>
oc get pods -n openshift-dns -o wide

Red Hat notes DNS is available to all pods if DNS pods are running on some nodes and nodes without DNS pods still have network connectivity to nodes with DNS pods. So “only pods on node X fail DNS” often means node-to-DNS connectivity is broken rather than CoreDNS being globally broken. (Red Hat Documentation)

6. Test from a clean debug pod

This removes app-side noise:

			
oc run dns-debug --image=registry.k8s.io/e2e-test-images/agnhost:2.39 -it --rm --command -- sh
nslookup kubernetes.default
nslookup backend-service.<namespace>
cat /etc/resolv.conf

Kubernetes recommends creating a simple test pod and using nslookup kubernetes.default as a baseline DNS test. (Kubernetes)

7. Check DNS service reachability from the bad pod

If you know the DNS service IP from /etc/resolv.conf, test whether the pod can even reach it. If the DNS nameserver is unreachable from only some pods or nodes, the issue is likely network path to DNS, not DNS records themselves. This is an inference from the Kubernetes debug flow and OpenShift’s note about node connectivity to DNS pods. (Kubernetes)

8. Check logs from the DNS pods

If the DNS pods are up but resolution still fails:

oc logs -n openshift-dns <dns-default-pod>

If you are testing a workaround, Red Hat documents that the DNS Operator can be set to Unmanaged, but they also note you cannot upgrade while it remains unmanaged. (Red Hat Documentation)

What this usually turns out to be

Most common causes:

Wrong namespace lookup: querying service instead of service.namespace. (Kubernetes)
Bad pod DNS config: strange or missing nameserver/search domains in /etc/resolv.conf. (Kubernetes)
DNS pods unhealthy: dns-default issues in openshift-dns. (Red Hat Documentation)
Node-specific connectivity issue: pods on one node cannot reach DNS pods running elsewhere. (Red Hat Documentation)
Service confusion: DNS resolves, but the target service or endpoints are wrong, making it look like DNS. Kubernetes DNS only gives you the name-to-record mapping; the service still has to be valid. (Kubernetes)

Fast triage sequence

			
oc exec -it <bad-pod> -- cat /etc/resolv.conf
oc exec -it <bad-pod> -- nslookup kubernetes.default
oc exec -it <bad-pod> -- nslookup <service>.<namespace>
oc get clusteroperator dns
oc get pods -n openshift-dns -o wide
oc logs -n openshift-dns <dns-default-pod>

		

Mental model

When DNS fails only for some pods:

if all traffic is broken, think OVN/node networking
if IP access works but names fail, think DNS
if short names fail but FQDN works, think namespace/search path
if only one node’s pods fail, think node-to-dns connectivity

Debugging ClusterIP Issues in OVN Kubernetes

April 20, 2026 | techhadoop | OCP | cloud, devops, docker, kubernetes, technology

This post walks through another very common real-world issue: a Service that is reachable by pod IP but not by ClusterIP.

Scenario

Service works via pod IP, but fails via ClusterIP (service name/IP)

Environment:

frontend → calling backend
Direct call works: curl http://10.128.2.15:8080 ✅
Service call fails: curl http://backend-service ❌

What this means (important insight)

If pod IP works but service fails, then:

Pod networking (OVN routing) is working
The problem is in the service layer: the Service definition, its endpoints, DNS, or the OVN-Kubernetes load balancer

Mental model

Interpretation:

Pod → Pod = direct routing (works)
Pod → Service = goes through OVN load balancer (broken here)

Step-by-step debugging

Step 1: Confirm endpoints exist

oc get endpoints backend-service

If EMPTY:

Root cause = usually a selector/label mismatch (or no backend pod is Ready)

Example:

			
# Service selector
selector:
  app: backend

But pod has:

			
labels:
  app: api   # ❌ mismatch

Fix labels → service starts working instantly
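The matching rule itself is simple enough to sketch in Python: every key/value in the Service selector must appear in the pod's labels. This assumes a non-empty selector; a Service without a selector does not get endpoints managed for it.

```python
def selector_matches(selector, pod_labels):
    """True if every selector key/value pair is present in the pod labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


service_selector = {"app": "backend"}

print(selector_matches(service_selector, {"app": "api"}))
print(selector_matches(service_selector, {"app": "backend", "tier": "db"}))
```

Extra labels on the pod are fine; a missing or different value for any selector key is not.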

Step 2: Verify service definition

oc get svc backend-service -o yaml

Check:

correct port
correct targetPort

Common mistake: targetPort does not match the port the container listens on. Correct form:

port: 80
targetPort: 8080   # ✅ must match container port

Step 3: Test ClusterIP directly

curl <ClusterIP>:<port>

Results:

❌ fails → OVN load balancer issue
✅ works → DNS issue instead

Step 4: Check DNS (don’t skip this)

From pod:

nslookup backend-service

If fails:

→ Not OVN
→ Check:

oc get pods -n openshift-dns

Step 5: Inspect OVN load balancer

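From an OVN-Kubernetes pod (ovn-nbctl ships in the OVN containers, not on the host; the pod and container that hold the northbound DB vary by OpenShift version, and nbdb is the usual container name):

oc get pods -n openshift-ovn-kubernetes -o wide
oc exec -n openshift-ovn-kubernetes <ovnkube-pod> -c nbdb -- ovn-nbctl lb-list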

You should see something like:

VIP: 172.30.0.10:80 → 10.128.2.15:8080

If missing:

OVN didn’t program the service

Step 6: Check OVN logs

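oc logs -n openshift-ovn-kubernetes <ovnkube-master> --all-containers

(On OpenShift 4.14 and later there is no ovnkube-master pod; check ovnkube-control-plane and the ovnkube-node pod on the affected node instead.)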

Look for:

load balancer sync errors
endpoint update failures

Step 7: Check kube-proxy replacement

In OpenShift Container Platform, OVN replaces kube-proxy.

So if service routing is broken:
It’s handled by OVN, not iptables

Real root causes (from production)

1. Label mismatch (MOST COMMON)

Service selector doesn’t match pod
→ no endpoints → service dead

2. Wrong port/targetPort

Service pointing to wrong container port
→ connection refused

3. OVN load balancer not programmed

OVN DB out of sync
→ ClusterIP has no backend mapping

4. NetworkPolicy blocking service traffic

Pod allows direct IP but blocks service path
(less common but happens)

5. DNS issue (misdiagnosed often)

Service name fails, ClusterIP works

Fast debugging logic (this is gold)

When pod IP works but service fails:

Endpoints exist?
- ❌ → labels problem
ClusterIP works?
- ❌ → OVN load balancing
DNS works?
- ❌ → DNS issue
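The same logic as a Python function, checked in the order given above. The final "healthy" return value is an assumption added so the function covers the case where all three checks pass.

```python
def next_suspect(endpoints_exist, cluster_ip_ok, dns_ok):
    """Return the first layer to investigate when pod IP works but the service fails."""
    if not endpoints_exist:
        return "labels problem"
    if not cluster_ip_ok:
        return "OVN load balancing"
    if not dns_ok:
        return "DNS issue"
    return "service path looks healthy"  # assumed wording, not from the list above


print(next_suspect(endpoints_exist=True, cluster_ip_ok=False, dns_ok=True))
```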

Pro tip (what experts do fast)

From a debug pod:

oc run debug --image=busybox -it --rm -- sh

Run (busybox ships wget rather than curl):

nslookup backend-service
wget -qO- -T 5 http://<ClusterIP>:<port>
wget -qO- -T 5 http://<pod-IP>:<port>

This instantly isolates:

DNS
service
networking

Key takeaway

Pod IP = routing layer (OVN switching)
Service IP = OVN load balancer layer
If one works and the other doesn’t → you know exactly where to look

Troubleshooting Node-Specific Pod Traffic Failures

April 20, 2026 | techhadoop | OCP | cloud, devops, docker, kubernetes, technology

Scenario

Traffic works for pods on node A, but fails for pods on node B.

That usually points to a node-local OVN/OVS problem, not an app problem.

Example:

frontend on worker-1 can reach backend
same app on worker-2 cannot

That pattern is a huge clue.

How to debug it

1. Prove it’s node-specific

List pods and nodes:

oc get pods -A -o wide

Run the same network test from a pod on each node:

			
oc exec -it <good-pod> -- curl http://<target-pod-ip>:<port>
oc exec -it <bad-pod> -- curl http://<target-pod-ip>:<port>

If one node always works and another always fails, focus on the bad node.

2. Check the OVN pod on the bad node

Find the ovnkube-node pod for that worker:

oc get pods -n openshift-ovn-kubernetes -o wide

Look for the pod scheduled on the failing node.

Then inspect:

			
oc describe pod -n openshift-ovn-kubernetes <ovnkube-node-pod>
oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod>

Things that matter:

restarts
readiness failures
DB connection errors
OVS/flow programming errors

If ovnkube-node is unhealthy there, that is often the root cause.

3. Check node readiness and basic health

			
oc get node
oc describe node <bad-node>

Look for:

NotReady
memory/disk pressure
network-related events

Sometimes OVN is fine and the node itself is degraded.

4. Inspect OVS on the bad node

Open a debug shell:

			
oc debug node/<bad-node>
chroot /host

Then:

ovs-vsctl show

You want to see expected bridges such as br-int.

Also useful:

			
ovs-ofctl dump-ports br-int
ovs-appctl bond/show

Red flags:

missing br-int
interfaces missing
counters not increasing on expected ports

If OVS is broken on that node, pod traffic there will fail even while the rest of the cluster looks fine.

5. Check the node’s host networking

Still on the node:

			
ip addr
ip route
ip link

Look for:

missing routes
down interfaces
wrong MTU

A node can have OVN running, but if the host interface or route is wrong, encapsulated traffic will still fail.

6. Compare MTU with a working node

MTU mismatches are sneaky.

On both a good node and bad node:

ip link

Look at the main NIC and OVN-related interfaces.

Symptoms of MTU trouble:

DNS works sometimes
small pings work
larger curls/higher-volume traffic fail or hang

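A quick test from a pod can help:

ping -M do -s 1372 <target-ip>

The -s value is the ICMP payload; the IPv4 and ICMP headers add 28 bytes, so -s 1372 produces a 1400-byte packet, which matches the default OVN-Kubernetes pod MTU on a 1500-byte network. Raise or lower it to bracket the MTU you expect. If smaller packets work and larger ones fail, suspect MTU.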
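The arithmetic behind the ping size is worth spelling out. A short Python sketch, assuming IPv4 without options (20-byte header) and the 8-byte ICMP echo header:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu):
    """Largest ping -s value that fits in one unfragmented packet at this MTU."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def packet_size(payload):
    """On-the-wire IP packet size for a given ping -s value."""
    return payload + IPV4_HEADER_BYTES + ICMP_HEADER_BYTES


print(max_ping_payload(1400))
print(packet_size(1400))
```

So `-s 1400` sends a 1428-byte packet, which exceeds a 1400-byte MTU even on a healthy path.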

7. Check if pod wiring exists on the bad node

From the failing node’s ovnkube-node logs, check whether the affected pod sandbox/interface got programmed correctly.

Also inspect pods on that node:

oc get pods -A -o wide | grep <bad-node>

If all pods on that node fail, it is likely node OVN/OVS or host network.
If only one pod fails, it may be a pod-specific attachment/setup issue.
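A small Python sketch of that reasoning, assuming you have already collected (pod, node, ok) results from your own connectivity tests; the pod and node names below are hypothetical:

```python
from collections import defaultdict


def classify_nodes(results):
    """Group test results by node and label each node's failure pattern."""
    by_node = defaultdict(list)
    for _pod, node, ok in results:
        by_node[node].append(ok)
    verdicts = {}
    for node, outcomes in by_node.items():
        if all(outcomes):
            verdicts[node] = "healthy"
        elif not any(outcomes):
            verdicts[node] = "node-wide failure"
        else:
            verdicts[node] = "pod-specific failure"
    return verdicts


# Hypothetical results for illustration.
results = [
    ("frontend-1", "worker-1", True),
    ("frontend-2", "worker-2", False),
    ("cache-1", "worker-2", False),
    ("api-1", "worker-3", True),
    ("api-2", "worker-3", False),
]

print(classify_nodes(results))
```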

8. Test service vs direct pod IP

From a failing pod:

			
curl http://<target-pod-ip>:<port>
curl http://<service-cluster-ip>:<port>

Interpretation:

both fail → node/local OVN path likely broken
pod IP works, service fails → service/load-balancer programming problem
DNS name fails, ClusterIP works → DNS problem

This helps avoid blaming OVN for the wrong layer.

9. Check for node-local firewall or host changes

On the bad node, inspect whether something changed outside OpenShift:

			
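iptables -S
nft list ruleset
systemctl status ovs-vswitchd

ovn-controller runs as a container in the ovnkube-node pod rather than as a host service, so check it from outside the debug shell:

oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod> -c ovn-controller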

A manual host change, bad firewall rule, or failed service can break just one node.

10. Restart scope carefully

If evidence points clearly to the bad node’s OVN stack, a targeted recovery step is safer than broad cluster changes.

Typical sequence:

cordon/drain the bad node if workloads are impacted
restart or recover the bad node’s OVN/OVS components
verify traffic before uncordoning

Avoid random restarts cluster-wide unless you’ve ruled out a local issue.

What this usually turns out to be

Most common causes:

ovnkube-node unhealthy on one node
broken or stale OVS state on that node
host NIC / route / MTU mismatch
node-specific firewall or kernel/network issue
the node recently rebooted or partially lost connectivity to OVN DB

Fast triage checklist

When traffic fails only on one node, I’d do this in order:

			
oc get pods -A -o wide
oc get pods -n openshift-ovn-kubernetes -o wide
oc logs -n openshift-ovn-kubernetes <ovnkube-node-on-bad-node>
oc debug node/<bad-node>
chroot /host
ovs-vsctl show
ip route
ip link
systemctl status ovs-vswitchd
oc logs -n openshift-ovn-kubernetes <ovnkube-node-on-bad-node> -c ovn-controller   # run outside the debug shell

		

That usually gets you very close.

Mental model

When only one node is broken:

cluster-wide policy is less likely
app config is less likely
service config is less likely
node-local data plane is most likely

So think:
bad node → ovnkube-node → OVS → host NIC/route/MTU

Here’s a realistic example:

pods on worker-2 cannot reach anything off-node
pods on worker-1 are fine
ovnkube-node on worker-2 shows repeated connection/programming errors
ovs-vsctl show on worker-2 is missing expected state

That strongly suggests the fix is on worker-2, not in the app or service definitions.

Debugging OVN Issues in OpenShift

April 20, 2026 | techhadoop | OCP | ai, cloud, devops, kubernetes, technology

Let’s walk through a realistic, production-style OVN debugging scenario in
OpenShift Container Platform using OVN-Kubernetes.

Scenario

A frontend pod cannot reach a backend service

You have:

frontend pod
backend pod
backend-service (ClusterIP)

And:

curl http://backend-service

fails

Step-by-step debugging (real flow)

Step 1: Check if backend pod is healthy

oc get pods -o wide

You want:

Backend pod = Running
Has an IP (e.g., 10.128.2.15)

If pod is not running → stop here (not an OVN issue)

Step 2: Test direct pod-to-pod connectivity

From frontend pod:

oc exec -it frontend -- curl http://10.128.2.15

Outcomes:

Case A: This FAILS

→ Problem is networking (OVN / policy / routing)

Case B: This WORKS

→ Networking is fine → problem is service layer

Branch A: Pod-to-pod FAILS (OVN issue)

Step 3A: Check NetworkPolicies

oc get networkpolicy -A

Look for anything like:

Deny all ingress
Missing allow rules

Quick test:
Create temporary allow-all policy

If it suddenly works → root cause = NetworkPolicy

Step 4A: Check node-level OVN

Find nodes:

oc get pods -o wide

Then:

oc get pods -n openshift-ovn-kubernetes -o wide

Check:

Is ovnkube-node running on both nodes?
Any restarts?

Step 5A: Test OVS health

			
oc debug node/<node>
chroot /host
ovs-vsctl show

Look for:

br-int bridge
Proper interfaces

Missing interfaces = OVN not wiring pods correctly

Step 6A: Check OVN logs

oc logs -n openshift-ovn-kubernetes <ovnkube-node>

Common errors:

Flow install failures
DB sync issues

Branch B: Pod-to-pod WORKS, Service FAILS

This is VERY common and often misunderstood.

Step 3B: Check service

oc get svc backend-service -o wide

Check:

ClusterIP exists
Correct port

Step 4B: Check endpoints

oc get endpoints backend-service

If EMPTY:

→ Service is not linked to pods

Root cause:

Wrong selector labels

Fix:

			
selector:
  app: backend

Step 5B: Test service IP directly

curl <ClusterIP>

Fails but pod IP works:

→ OVN load-balancing issue

Step 6B: Check OVN load balancer

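From the OVN pod that holds the northbound DB (ovn-nbctl is not installed on the host):

oc exec -n openshift-ovn-kubernetes <ovnkube-pod> -c nbdb -- ovn-nbctl lb-list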

You should see:

Service IP mapped to pod IPs

If missing → OVN not programming service

Bonus: DNS check (often confused with OVN)

From frontend:

nslookup backend-service

If fails:

→ DNS issue, NOT OVN

Check:

oc get pods -n openshift-dns

Real root cause examples (from production)

Case 1: Wrong labels

Service selector doesn’t match pod
→ No endpoints → service fails

Case 2: NetworkPolicy blocking traffic

Default deny policy applied
→ Pods isolated

Case 3: OVN desync

Pod exists but not in OVN DB
→ No routing

Case 4: Node issue

Only pods on one node fail
→ ovnkube-node broken there

Case 5: MTU mismatch

Small packets work, large fail
→ Very tricky to spot

The mental model (this is what experts use)

When debugging:

Pod IP → works?
- ❌ → OVN / policy / routing
- ✅ → go to service layer
Service endpoints exist?
- ❌ → labels problem
- ✅ → OVN load balancing
DNS works?
- ❌ → DNS, not OVN

Pro move (what senior engineers do)

Spin up a debug pod:

oc run debug --image=busybox -it --rm -- sh

Then test:

ping
wget (busybox does not include curl)
nslookup

This removes app complexity completely.

Understanding OVN in OpenShift: A Networking Overview

April 20, 2026 | techhadoop | OCP | cloud, containers, devops, kubernetes, technology

In OpenShift Container Platform (OCP), OVN refers to Open Virtual Network, used via OVN-Kubernetes. It’s the default networking solution in modern OpenShift clusters.

What OVN is (in simple terms)

OVN is a software-defined networking (SDN) system that:

Creates virtual networks for pods
Handles routing, switching, and network policies
Replaces older OpenShift SDN implementations

Think of it as the “network brain” of your cluster.

How OVN works in OCP

Core components

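OVN Northbound DB → stores high-level network config (logical switches, routers, load balancers, ACLs)
ovn-northd → translates the northbound config into logical flows
OVN Southbound DB → stores those logical flows plus per-node bindings
OVN Controller (on each node) → turns logical flows into OpenFlow rules for the local OVS
Open vSwitch (OVS) → does the actual packet forwarding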

What it actually does

1. Pod networking

Each pod gets:

Its own IP address
Ability to talk to other pods across nodes

2. Routing

Handles:

Pod-to-pod communication
Pod-to-service traffic
External access (ingress/egress)

3. Network policies

Controls traffic like:

“Only allow frontend → backend”
“Block everything except specific ports”

4. Load balancing

Implements Kubernetes Services internally

Traffic flow (simplified)

Pod sends traffic
Goes through Open vSwitch (OVS) on the node
OVN rules decide:
- Where it goes
- Whether it’s allowed
Traffic reaches destination pod/service

Why OCP uses OVN

Compared to older SDN:

Better performance and scalability
Native support for Kubernetes NetworkPolicy
More flexible routing and IP management
Cleaner architecture (decoupled control + data plane)

OVN vs OpenShift SDN

Feature	OVN-Kubernetes	OpenShift SDN (legacy)
Performance	Higher	Lower
NetworkPolicy	Full support	Limited
IPv6	Supported	Not supported
Future support	✅ Active	❌ Deprecated

Key concepts to remember

OVN = control plane for networking
OVS = data plane (moves packets)
Pods communicate via virtual networking managed by OVN
Policies are enforced at the network layer

Real-world example

If you deploy:

frontend pod
backend pod

With OVN:

Both get IPs
OVN ensures routing between them
A NetworkPolicy can allow only frontend → backend traffic

Debugging OVN in OpenShift Container Platform (via OVN-Kubernetes) can feel overwhelming at first, but there’s a clear, structured way to approach it.

Below is a practical, field-tested workflow you can follow.

0. Start with the symptom

Before touching OVN internals, identify the issue type:

❌ Pod can’t reach another pod
❌ Pod can’t reach a service
❌ External traffic not working
❌ DNS failing
❌ Only some nodes affected

This helps narrow the scope fast.

1. Check cluster networking health

oc get co network

Should be Available=True
If Degraded → OVN problem likely
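If you want to script this check, the conditions can be read from `oc get co network -o json`. A minimal Python sketch, using a trimmed, made-up sample of that JSON rather than real cluster output:

```python
import json

# Trimmed sample shaped like a ClusterOperator object; values are illustrative.
SAMPLE = """
{"status": {"conditions": [
  {"type": "Available", "status": "True"},
  {"type": "Progressing", "status": "False"},
  {"type": "Degraded", "status": "False"}
]}}
"""


def network_operator_healthy(cluster_operator):
    """True when the operator reports Available=True and is not Degraded."""
    status = {c["type"]: c["status"] for c in cluster_operator["status"]["conditions"]}
    return status.get("Available") == "True" and status.get("Degraded") != "True"


print(network_operator_healthy(json.loads(SAMPLE)))
```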

2. Check OVN pods

oc get pods -n openshift-ovn-kubernetes

Look for:

CrashLoopBackOff
NotReady pods

Key pods:

ovnkube-node (runs on every node)
ovnkube-master (ovnkube-control-plane on OpenShift 4.14 and later)

3. Check logs (most important step)

Node-level (data plane issues)

oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod>

Control plane

oc logs -n openshift-ovn-kubernetes <ovnkube-master-pod>

Look for:

Flow programming errors
DB connection failures
OVS issues

4. Validate pod networking

Get pod IPs:

oc get pods -o wide

Test connectivity:

oc exec -it <pod> -- ping <other-pod-ip>

If this fails:

Likely OVN routing or policy issue

5. Check NetworkPolicies

oc get networkpolicy -A

Common mistake:

Policy blocking traffic unintentionally

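Test by temporarily removing the policy or creating an allow-all (the name below is a placeholder):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: temp-allow-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - {}
  egress:
  - {}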

		

6. Check Open vSwitch (OVS)

Open a debug shell on a node:

			
oc debug node/<node-name>
chroot /host

Then:

ovs-vsctl show

Look for:

Bridges (like br-int)
Missing interfaces = problem

7. Inspect OVN DB state

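From an OVN pod that holds the northbound DB (ovn-nbctl is not installed on the host):

oc exec -n openshift-ovn-kubernetes <ovnkube-pod> -c nbdb -- ovn-nbctl show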

Check:

Logical switches
Ports for pods

If missing → OVN not programming correctly

8. Check services & kube-proxy replacement

OVN replaces kube-proxy.

Check:

oc get svc

Test:

curl <service-cluster-ip>

If service fails but pod IP works:
→ Load balancing issue in OVN

9. Check egress / external connectivity

From pod:

curl google.com

If fails:

Check EgressFirewall / EgressIP
Check node routing

10. Use must-gather (for deep issues)

oc adm must-gather -- /usr/bin/gather_network_logs

This collects:

OVN DB state
OVS config
Logs

Common real-world issues

1. MTU mismatch

Symptoms:

Intermittent connectivity
Large packets fail

2. NetworkPolicy blocking traffic

Very common in production

3. OVN DB not syncing

Symptoms:

Pods exist but no routes

4. Node-specific issues

Only pods on one node fail → check that node’s ovnkube-node

5. DNS issues (often misdiagnosed as OVN)

Check:

oc get pods -n openshift-dns

Debugging mindset (this is key)

Always go in this order:

Is cluster networking healthy?
Are OVN pods running?
Is traffic blocked (policy)?
Is routing broken (OVN/OVS)?
Is it actually DNS or app issue?

Pro tip

Use a debug pod:

oc run test --image=busybox -it --rm -- sh

From there:

ping
nslookup
wget (busybox does not include curl)

This isolates networking from your app.

Understanding Tekton: A Comprehensive CI/CD Framework for Kubernetes

April 20, 2026 | May 2, 2026 | techhadoop | kubernetes | ai, artificial-intelligence, cloud, devops, technology

Tekton is a cloud-native CI/CD framework built for Kubernetes. Here’s a full breakdown:

What it is

Tekton is a Kubernetes-native open source framework for creating continuous integration and continuous delivery (CI/CD) systems. It installs and runs as an extension on a Kubernetes cluster and comprises a set of Kubernetes Custom Resources that define the building blocks you can create and reuse for your pipelines.

Tekton standardizes CI/CD tooling and processes across vendors, languages, and deployment environments. It lets you create CI/CD systems quickly, giving you scalable, serverless, cloud-native execution out of the box.

Core building blocks

Everything in Tekton is composed of these layers:

Step — the most basic entity, such as running unit tests or compiling a program. Tekton performs each step with a provided container image.
Task — a collection of steps in a specific order. Tekton runs a task in the form of a Kubernetes pod, where each step becomes a running container in the pod.
Pipeline — a collection of tasks in a specific order. Tekton collects all tasks, connects them in a directed acyclic graph (DAG), and executes the graph in sequence.
TaskRun — a specific execution of a task.
PipelineRun — a specific execution of a pipeline.
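To make the DAG idea concrete, here is a small Python sketch (not Tekton code) that orders tasks by their dependencies. The task names and the dependency map are assumptions that mirror a clone → build → deploy pipeline, where each task lists the tasks it must run after.

```python
from graphlib import TopologicalSorter

# Each task maps to the tasks it must run after (clone -> build -> deploy).
run_after = {
    "clone": [],
    "build": ["clone"],
    "deploy": ["build"],
}


def execution_order(graph):
    """Return one valid order in which the tasks could start."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(run_after)
print(order)  # ['clone', 'build', 'deploy']
```

For a linear chain there is only one valid order. Tasks with no dependency between them have no fixed relative order, and Tekton is free to run them in parallel.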

Example pipeline (clone → build → deploy)

			
# Step 1: Define a Task
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: build-and-push
spec:
  params:
    - name: IMAGE
      type: string
  steps:
    - name: build
      image: gcr.io/kaniko-project/executor:latest
      args:
        - --destination=$(params.IMAGE)
        - --context=/workspace/source
---
# Step 2: Compose Tasks into a Pipeline
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci-pipeline
spec:
  tasks:
    - name: clone
      taskRef:
        name: git-clone        # from Tekton Catalog
    - name: build
      runAfter: [clone]
      taskRef:
        name: build-and-push
    - name: deploy
      runAfter: [build]
      taskRef:
        name: kubectl-apply
---
# Step 3: Trigger a run
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: ci-pipeline-run-001
spec:
  pipelineRef:
    name: ci-pipeline

		

Major components

The Tekton ecosystem includes:

Pipelines — the core CRDs (Task, Pipeline, etc.)
Triggers — allows you to create pipelines based on event payloads, such as triggering a run every time a merge request is opened against a Git repo
CLI (tkn) — command-line interface to interact with Tekton from your terminal
Dashboard — a web-based graphical interface showing pipeline execution history
Catalog — a repository of high-quality, community-contributed reusable Tasks and Pipelines
Chains — manages supply chain security, including artifact signing and SLSA provenance

Key advantages

Truly Kubernetes-native — every pipeline run is a real Kubernetes pod; no external CI server needed
Reusable and composable — Tasks from the Tekton Hub can be dropped into any pipeline
Event-driven — Triggers fire pipelines automatically on Git webhooks, image pushes, etc.
Scalable — each step runs in its own container; pipelines scale with the cluster
Supply chain security — Tekton Chains can sign images and generate SLSA provenance automatically

Tekton on OpenShift

Red Hat ships Tekton as OpenShift Pipelines, the officially supported Tekton operator available directly from OperatorHub. It adds OCP-specific features such as integration with the OpenShift internal image registry, S2I (Source-to-Image) tasks, and the Pipelines view in the OpenShift console. Because Tekton is the basis for OpenShift Pipelines, it is the natural CI tool to pair with Argo CD or Flux for a full GitOps workflow on OCP (Tekton handles CI/build, Argo CD or Flux handles CD/deploy).

Here is how Tekton (CI) and Argo CD / Flux (CD) work together on OCP, followed by a practical reference for wiring it up.

How the two halves divide responsibility

When code changes are pushed to a Git repository, OpenShift Pipelines initiates a pipeline run. This pipeline might include tasks such as building container images, running unit tests, and generating artifacts. Argo CD, meanwhile, continuously monitors the Git repository that holds the application manifests. Once the pipeline completes and the new image version is committed there, Argo CD synchronizes the cluster state to match the state declared in Git.

The key insight is that Tekton owns the source repo (code → image) and Argo CD / Flux owns the config repo (manifests → cluster). Tekton never deploys directly. It commits the new image tag to a separate GitOps manifests repo, then hands off.
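The handoff boils down to rewriting one image entry in a kustomization file. The Python sketch below illustrates that edit on an in-memory dict. It is an assumption-laden stand-in for what a tool like kustomize does, the helper names split_image_ref and set_image are hypothetical, and digest references (name@sha256:...) are deliberately not handled.

```python
def split_image_ref(ref: str):
    """Split 'registry:port/repo/name:tag' into (name, tag); tag is None if absent."""
    last_slash = ref.rfind("/")
    colon = ref.rfind(":")
    if colon > last_slash:
        # A colon after the last slash marks the tag
        return ref[:colon], ref[colon + 1:]
    # Any colon seen belongs to the registry port, so there is no tag
    return ref, None


def set_image(kustomization: dict, name: str, ref: str) -> dict:
    """Return a copy of the kustomization with the image entry for `name` replaced."""
    new_name, new_tag = split_image_ref(ref)
    entry = {"name": name, "newName": new_name}
    if new_tag is not None:
        entry["newTag"] = new_tag
    images = [i for i in kustomization.get("images", []) if i.get("name") != name]
    images.append(entry)
    return {**kustomization, "images": images}
```

The split on the last colon after the last slash matters on OCP because the internal registry address itself contains a port (svc:5000), which a naive split on ":" would mistake for a tag.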

Step 1 — Install both operators on OCP

			
# OpenShift Pipelines (Tekton) — via OperatorHub
# Operators → OperatorHub → "Red Hat OpenShift Pipelines" → Install
# OpenShift GitOps (Argo CD) — via OperatorHub
# Operators → OperatorHub → "Red Hat OpenShift GitOps" → Install
# Verify both are running
oc get pods -n openshift-pipelines
oc get pods -n openshift-gitops

		

Step 2 — The Tekton CI pipeline

On every push or pull request to the source Git repository, the Tekton pipeline runs these steps: the code is cloned and unit tests are run; the application can be analyzed by SonarQube in parallel; a container image is built (with S2I or, as in the example below, Buildah) and pushed to a registry such as the OpenShift internal registry; the image is scanned; then the Kubernetes manifests in the Git repository are updated with the reference of the image that was just built.

			
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci-pipeline
  namespace: cicd
spec:
  workspaces:
    - name: source
    - name: dockerconfig
  params:
    - name: GIT_URL
      type: string
    - name: IMAGE
      type: string
    - name: GIT_MANIFEST_URL   # separate repo for k8s manifests
      type: string
  tasks:
    - name: clone
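      taskRef:
        name: git-clone
        # ClusterTask is deprecated; newer OpenShift Pipelines releases provide these tasks through the cluster resolver instead
        kind: ClusterTask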
      workspaces:
        - name: output
          workspace: source
      params:
        - name: url
          value: $(params.GIT_URL)
    - name: unit-test
      runAfter: [clone]
      taskRef:
        name: maven
        kind: ClusterTask
      workspaces:
        - name: source
          workspace: source
    - name: build-image
      runAfter: [unit-test]
      taskRef:
        name: buildah
        kind: ClusterTask
      params:
        - name: IMAGE
          value: $(params.IMAGE)
      workspaces:
        - name: source
          workspace: source
        - name: dockerconfig
          workspace: dockerconfig
    - name: scan-image
      runAfter: [build-image]
      taskRef:
        name: trivy-scanner      # from Tekton Hub
      params:
        - name: IMAGE
          value: $(params.IMAGE)
    - name: update-manifest      # THE HANDOFF to GitOps
      runAfter: [scan-image]
      taskRef:
        name: git-cli
        kind: ClusterTask
      params:
        - name: GIT_USER_NAME
          value: tekton-bot
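        - name: GIT_SCRIPT           # the catalog git-cli task takes its script in GIT_SCRIPT
          value: |
            git clone $(params.GIT_MANIFEST_URL) /workspace/manifest
            cd /workspace/manifest/environments/dev
            # Update image tag in the dev overlay (assumes kustomize is available in the task image)
            kustomize edit set image myapp=$(params.IMAGE)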
            git add -A
            git commit -m "ci: update image to $(params.IMAGE)"
            git push

		

Step 3 — Tekton Triggers (webhook → pipeline)

			
# EventListener — receives the GitHub/GitLab webhook
apiVersion: triggers.tekton.dev/v1beta1
kind: EventListener
metadata:
  name: git-push-listener
  namespace: cicd
spec:
  serviceAccountName: pipeline
  triggers:
    - name: push-trigger
      bindings:
        - ref: github-push-binding
      template:
        ref: pipeline-trigger-template
---
# TriggerTemplate — what to create when the webhook fires
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerTemplate
metadata:
  name: pipeline-trigger-template
  namespace: cicd
spec:
  params:
    - name: git-revision
    - name: git-repo-url
  resourcetemplates:
    - apiVersion: tekton.dev/v1
      kind: PipelineRun
      metadata:
        generateName: ci-run-
      spec:
        pipelineRef:
          name: ci-pipeline
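        params:
          - name: GIT_URL
            value: $(tt.params.git-repo-url)
          - name: IMAGE
            value: image-registry.openshift-image-registry.svc:5000/myapp/app:$(tt.params.git-revision)
          - name: GIT_MANIFEST_URL   # required by the Pipeline; same repo the Argo CD Application in Step 4 points at
            value: https://github.com/my-org/manifests.git
        workspaces:                  # the Pipeline declares these, so every run must bind them
          - name: source
            volumeClaimTemplate:
              spec:
                accessModes: [ReadWriteOnce]
                resources:
                  requests:
                    storage: 1Gi     # size is an assumption
          - name: dockerconfig
            secret:
              secretName: registry-credentials   # hypothetical Secret name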

		

Expose the EventListener as an OCP Route so GitHub/GitLab can reach it:

			
oc expose svc el-git-push-listener -n cicd
# Then add the route URL as a webhook in GitHub/GitLab

Step 4 — Argo CD watches and deploys

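Once the manifests repo is updated by Tekton, Argo CD detects the change on its next poll of the repository (every few minutes by default, or sooner if a Git webhook is configured). With syncPolicy.automated set and prune: true and selfHeal: true, it then syncs without manual intervention and deploys the new revision.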

			
# Argo CD Application — dev environment
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-dev
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/manifests.git
    targetRevision: main
    path: environments/dev          # Kustomize overlay for dev
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp-dev
  syncPolicy:
    automated:
      prune: true       # remove resources deleted from Git
      selfHeal: true    # revert manual changes to the cluster
    syncOptions:
      - CreateNamespace=true
---
# Promotion to staging requires a PR merge (no auto-deploy to prod)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-staging
  namespace: openshift-gitops
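spec:
  project: default
  source:
    repoURL: https://github.com/my-org/manifests.git
    path: environments/staging
    targetRevision: staging         # separate branch = manual promotion
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp-staging        # namespace name assumed by analogy with myapp-dev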
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

		

The GitOps repo layout Tekton writes to

			
manifests-repo/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
└── environments/
    ├── dev/
    │   └── kustomization.yaml     ← Tekton updates image tag here
    ├── staging/
    │   └── kustomization.yaml     ← promoted via PR merge
    └── prod/
        └── kustomization.yaml     ← promoted via PR merge + approval

		

Promotion flow (dev → staging → prod)

Once the pipeline finishes successfully, the image reference in the manifests repo is updated and automatically deployed to the dev environment by Argo CD. To promote to staging, a pull request is generated targeting the staging branch. Merging that PR triggers Argo CD to sync the staging environment. Production follows the same pattern with an additional approval gate.

The promotion task in Tekton creates a PR automatically:

			
- name: promote-to-staging
  runAfter: [update-manifest]
  taskRef:
    name: github-open-pr       # from Tekton Hub
  params:
    - name: REPO_FULL_NAME
      value: my-org/manifests
    - name: HEAD
      value: feature/new-image-$(params.git-revision)   # assumes the Pipeline also declares a git-revision param
    - name: BASE
      value: staging
    - name: TITLE
      value: "Promote $(params.IMAGE) to staging"

		

Putting it all together — the complete flow

Step	Actor	Action
1	Developer	`git push` to source repo
2	GitHub/GitLab	Sends webhook to Tekton EventListener
3	Tekton	Clones, tests, builds image with Buildah/S2I
4	Tekton	Scans image with Trivy / ACS
5	Tekton	Pushes image to OCP internal registry or Quay
6	Tekton	Updates image tag in manifests repo, opens PR to staging
7	Argo CD / Flux	Detects change in manifests repo, deploys to dev automatically
8	Team	Reviews and merges PR → staging auto-deploys
9	Team	Approves prod PR → production deploys

This pattern — Tekton handles CI, Argo CD / Flux handles CD, and Git is the only bridge between them — is the standard GitOps delivery model on OCP.

Mastering OpenShift on VMware and Bare Metal: Key Insights

April 19, 2026April 19, 2026 techhadoop OCP cloud, devops, kubernetes, openshift, technology

Administering OpenShift on VMware vSphere or Bare Metal is significantly more complex than cloud environments because you are responsible for the “underlay” (the physical or virtual infrastructure) as well as the “overlay” (OpenShift).

In a 2026 interview, expect a focus on automation, connectivity in restricted environments, and hardware lifecycle.

1. Installation & Provisioning (The Foundation)

Q1: Compare IPI vs. UPI in the context of VMware vSphere.

IPI (Installer-Provisioned Infrastructure): The installer has the vCenter credentials. It automatically creates the Folder, Virtual Machines, and Resource Pools. It also handles the VIP (Virtual IPs) for the API and Ingress via Keepalived.
UPI (User-Provisioned Infrastructure): You manually create the VMs, set up the Load Balancers (F5, HAProxy), and configure DNS.
Interview Tip: Mention that IPI is preferred for speed and “automated scaling,” but UPI is often mandatory in “Brownfield” environments where the networking team won’t give the installer full control over the VLANs.

Q2: How does OpenShift interact with physical hardware for Bare Metal?

Answer: It uses the Metal3 project and the Bare Metal Operator (BMO).

The admin provides the BMC (Baseboard Management Controller) details—like IPMI, iDRAC (Dell), or iLO (HP)—to OpenShift.
OpenShift uses these to remotely power on the server, PXE boot it, and install RHCOS (Red Hat Enterprise Linux CoreOS).

2. Infrastructure Operations

Q3: What is a “Disconnected” (Air-Gapped) Installation?

Answer: Common in on-prem data centers with high security.

The Problem: OpenShift usually pulls images from quay.io.
The Solution: You must set up a Local Mirror Registry (like Red Hat Quay or Sonatype Nexus).
Process: You use the oc mirror plugin to download all required images to a portable disk, move it inside the secure zone, and push them to your local registry. You then configure the cluster with an ImageDigestMirrorSet (or, on older releases, the now-deprecated ImageContentSourcePolicy) to redirect all image pulls to your local registry.

Q4: How do you handle storage on VMware vs. Bare Metal?

VMware: Use the vSphere CSI Driver. This allows OpenShift to talk to vCenter and dynamically provision .vmdk files as Persistent Volumes (PVs).
Bare Metal: You typically use local disks through the LVM Storage operator or the Local Storage Operator for fast local SSDs, or OpenShift Data Foundation (ODF), which is based on Ceph. ODF is a common choice on-prem because it provides S3-compatible object, block, and file storage within the cluster itself.

3. High Availability & Networking

Q5: On Bare Metal, how do you handle Load Balancing for the API and Ingress?

Answer: Since there is no “AWS ELB” on-prem, you have two choices:

External: Use a physical appliance like an F5 Big-IP or a pair of HAProxy nodes managed by your team.
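Internal (MetalLB): Use the MetalLB Operator. It allows you to assign a range of IPs from your corporate network to Services of type LoadBalancer, for example one that fronts the OpenShift Router, so they behave like a cloud load balancer. The API endpoint still needs the external option above or, on IPI installs, the built-in Keepalived VIP.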

Q6: What happens if a Master (Control Plane) node dies in a Bare Metal cluster?

Answer: Quorum: You need 3 Masters so that etcd keeps quorum when one of them fails. If one dies, the cluster survives. If two die, etcd loses quorum and the API can no longer accept writes.

Recovery: On Bare Metal, recovery is largely manual. In outline, you remove the failed member from etcd (etcdctl member remove), delete the old Node object, re-provision the host with RHCOS, and approve its CSRs; the cluster-etcd-operator then scales etcd onto the replacement node.

4. Advanced Troubleshooting

Q7: A worker node is “NotReady” on VMware. What is your first check?

Answer: Beyond the logs, I check the VMware Tools status and Time Sync.

If the ESXi host and the VM have a clock drift (common if NTP is misconfigured), the certificates for the Kubelet will fail to validate, and the node will go NotReady.
I would also check the MachineConfigPool (MCP). If the node is stuck in “Updating,” it might be failing to pull an OS image from the internal registry.

Q8: What is “Assisted Installer”?

Answer: It’s the modern way to install OpenShift on-prem. It provides a web-based GUI that generates a “Discovery ISO.” You boot your physical servers with this ISO; they “check in” to the portal, and you can then click “Install” to deploy the whole cluster without writing complex YAML files.

Technical “Buzzwords” for 2026:

OVN-Kubernetes: The default network plugin (replaces OpenShift SDN).
LVM Storage: Used for high-performance databases on bare metal.
Red Hat Advanced Cluster Management (RHACM): If the company has multiple on-prem clusters, they will use this to manage them all from one dashboard.

Debugging etcd is the highest level of OpenShift administration. If etcd is healthy, the cluster is healthy; if etcd is failing, the API will be sluggish or completely unresponsive.

Here is the technical deep-dive on how to diagnose and fix etcd on-premise.

1. Checking the High-Level Status

Before diving into logs, check if the Etcd Operator is happy. If the operator is degraded, it usually means it’s struggling to manage the quorum.

			
# Check the status of the etcd cluster operator
oc get clusteroperator etcd
# Check the status of the individual etcd pods
oc get pods -n openshift-etcd -l app=etcd

2. Testing Quorum and Health (The `etcdctl` way)

In OpenShift 4.x, etcd runs as Static Pods on the master nodes. To run diagnostic commands, you must use a helper script or exec into the container.

The “Is it alive?” check:

			
# Get a list of etcd members and their health
oc rsh -n openshift-etcd etcd-master-0 etcdctl endpoint health --cluster -w table

The Performance check (Disk Latency):

On-premise (especially VMware), Disk I/O latency is the #1 killer of etcd. If your storage is slow, etcd will lose quorum.

			
# Run etcd's built-in performance check (it generates write load, so avoid it on a cluster that is already struggling)
oc rsh -n openshift-etcd etcd-master-0 etcdctl check perf

Interview Pro-Tip: Mention that etcd expects the 99th percentile of WAL fsync latency to stay below 10 ms. If it’s higher, your underlying VMware datastore or Bare Metal disks are too slow for an enterprise cluster.
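As a rough illustration of what that threshold means, the Python sketch below times write-plus-fsync pairs on a directory and compares the 99th percentile with 10 ms. It is not etcd's own benchmark and not a substitute for a proper disk test with a tool such as fio. The function names, the 200-write sample size, and the 2 KiB payload are all assumptions chosen for the example.

```python
import os
import tempfile
import time


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty sequence of numbers."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def measure_fsync_ms(directory, writes=200, size=2048):
    """Time write+fsync pairs in `directory`; returns latencies in milliseconds."""
    latencies = []
    payload = b"\0" * size
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(writes):
            start = time.perf_counter()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
            latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def disk_fast_enough(latencies_ms, threshold_ms=10.0):
    """True when the 99th percentile latency is below the threshold."""
    return p99(latencies_ms) < threshold_ms
```

Running disk_fast_enough(measure_fsync_ms("/var/lib/etcd")) on a control plane node would give a first impression only; the path is an example, and a busy node will skew the numbers.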

3. Investigating Logs

If a pod is crashing, check the logs specifically for “leader” issues or “wal” (Write Ahead Log) errors.

			
# View the last 100 lines of logs from a specific member
oc logs -n openshift-etcd etcd-master-0 -c etcd --tail=100

What to look for:

"lost leader": Indicates network instability between master nodes.
"apply entries took too long": Indicates slow disk or high CPU pressure on the master node.
"database space exceeded": The 8GB quota has been reached (requires a defrag).

4. Critical Recovery: The “Master Node Replacement”

If a master node (e.g., master-1) hardware fails permanently on Bare Metal, you must follow these steps to restore the cluster health:

Remove the ghost member: Tell etcd to stop looking for the dead node.
oc rsh -n openshift-etcd etcd-master-0 etcdctl member list
oc rsh -n openshift-etcd etcd-master-0 etcdctl member remove <dead-member-id>
Clean up the Node object:
oc delete node master-1
Re-provision: Boot the new hardware with the RHCOS ISO. If using IPI, the Machine API might do this for you. If UPI, you must manually trigger the CSR (Certificate Signing Request) approval.
Approve CSRs: The new master won’t join until you approve its certificates:
oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve

5. Compaction and Defragmentation

Over time, etcd keeps versions of objects. If the database grows too large, the cluster will stop accepting writes (Error: mvcc: database space exceeded).

The Fix:

OpenShift normally handles this automatically, but as an admin, you might need to force it:

			
# Defragment every member (--cluster); on a busy cluster, defragment one member at a time instead, leader last
oc rsh -n openshift-etcd etcd-master-0 etcdctl defrag --cluster

The “Final Boss” Interview Question:

“We lost 2 out of 3 master nodes. The API is down. How do you recover?”

The Answer:

Since quorum is lost (etcd needs $\lfloor n/2 \rfloor + 1$ of $n$ members, so 2 of 3), you must perform a Single Master Recovery.
Stop the etcd service on the remaining healthy master.
Run the cluster-restore.sh script (shipped on the control plane nodes in OpenShift 4; very early 4.x releases called it etcd-snapshot-restore.sh) using a previous backup.
This forces the remaining master to become a “New Cluster” of one.
Once the API is back up, you re-join the other two nodes as brand-new members.
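The quorum arithmetic behind this answer is easy to check. The following Python sketch is only a worked example of the majority rule; the function names are made up for illustration.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority remains."""
    return members - quorum(members)


def has_quorum(members: int, healthy: int) -> bool:
    """True when the healthy members still form a majority."""
    return healthy >= quorum(members)


for n in (1, 3, 5):
    print(n, quorum(n), fault_tolerance(n))
# 1 1 0
# 3 2 1
# 5 3 2
```

With 3 members, has_quorum(3, 2) is True and has_quorum(3, 1) is False, which is why losing two masters forces a restore from backup rather than a simple member replacement.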

Since OpenShift 4.12, OVN-Kubernetes has been the default network provider for new installations, replacing the older OpenShift SDN. For an on-premise administrator, understanding this is vital because it changes how traffic flows from your physical switches into your pods.

1. OVN-Kubernetes Architecture

Unlike the old SDN which used Open vSwitch (OVS) in a basic way, OVN (Open Virtual Network) brings a distributed logical router and switch to every node.

Geneve Encap: OVN uses Geneve (Generic Network Virtualization Encapsulation) instead of VXLAN to tunnel traffic between nodes. It’s more flexible and allows for more metadata.
The Gateway: Every node has a “Gateway” that handles traffic entering and exiting the cluster. On-premise, this is where your physical network interface (e.g., eno1 or ens192) meets the virtual world.

2. On-Premise Networking Challenges

Q1: How does OpenShift handle “External” IPs on-prem?

In the cloud, you have a LoadBalancer service. On-prem, you don’t.

The Admin Solution: MetalLB.

As an admin, you configure a MetalLB Operator with an IP address pool from your actual data center VLAN. When a developer creates a Service of type LoadBalancer, MetalLB uses ARP (Layer 2) or BGP (Layer 3) to announce that IP address to your physical routers.

Q2: What is the “Ingress VIP” and “API VIP”?

During a VMware/Bare Metal IPI install, you are asked for two IPs:

API VIP: The floating IP used to talk to the control plane (Port 6443).
Ingress VIP: The floating IP for all application traffic (Ports 80/443).

Mechanism: OpenShift uses Keepalived and HAProxy internally to float these IPs between the master nodes (for API) or worker nodes (for Ingress). If the node holding the IP fails, it “floats” to another node in seconds.

3. Troubleshooting the Network

If pods can’t talk to each other, follow this “inside-out” path:

Step 1: Check the Cluster Network Operator (CNO)

If the CNO is degraded, the entire network is unstable.

oc get clusteroperator network

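Step 2: Check Connectivity Between Pods

OpenShift 4 runs its own connectivity checks in the openshift-network-diagnostics namespace; list their results first:

oc get podnetworkconnectivitycheck -n openshift-network-diagnostics

To trace a specific pod-to-pod path across nodes on OVN-Kubernetes, use the ovnkube-trace utility.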

Step 3: Inspect the OVN Database

Since OVN stores the network state in a database (Northbound and Southbound DBs), you can check if the logical flows are actually created.

			
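# Get the logs of the OVN-Kubernetes control plane pods
# (labelled app=ovnkube-master up to 4.13; from 4.14 use app=ovnkube-control-plane)
oc logs -n openshift-ovn-kubernetes -l app=ovnkube-master --all-containers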

4. Key Concepts for Interview Scenarios

Scenario: “Applications are slow only when talking to external databases.”

Likely Culprit: MTU Mismatch. Explanation: Geneve encapsulation adds 100 bytes of overhead to every packet. If your physical network uses the standard MTU (1500) and pods also send 1500-byte packets, the encapsulated packets exceed the physical MTU and get fragmented or dropped, causing a massive performance hit.
The Fix: Ensure the cluster MTU is set to 1400 (1500 – 100) or enable Jumbo Frames (9000) on your physical switches.
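The MTU arithmetic can be written down as a quick check. This Python sketch uses the 100-byte Geneve overhead quoted above; the function names are hypothetical.

```python
GENEVE_OVERHEAD = 100  # bytes, the figure used in the text above


def cluster_mtu(physical_mtu: int, overhead: int = GENEVE_OVERHEAD) -> int:
    """Largest pod-network MTU that still fits inside the physical MTU."""
    if physical_mtu <= overhead:
        raise ValueError("physical MTU must exceed the encapsulation overhead")
    return physical_mtu - overhead


def fits_without_fragmentation(pod_mtu: int, physical_mtu: int,
                               overhead: int = GENEVE_OVERHEAD) -> bool:
    """True when an encapsulated full-size pod packet fits the physical MTU."""
    return pod_mtu + overhead <= physical_mtu


print(cluster_mtu(1500))  # 1400
print(cluster_mtu(9000))  # 8900
```

fits_without_fragmentation(1500, 1500) is False, which is the misconfiguration described in the scenario, while fits_without_fragmentation(1400, 1500) is True.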

Scenario: “How do you isolate traffic between two departments on the same cluster?”

The Answer: NetworkPolicies. OVN-Kubernetes supports standard Kubernetes NetworkPolicy objects. By default, all pods can talk to all pods. I would implement a “Deny-All” default policy in each namespace and then explicitly allow traffic only between required microservices.

Summary for Administrator Interview

Feature	OpenShift SDN (Old)	OVN-Kubernetes (New/Standard)
Encapsulation	VXLAN	Geneve
Network Policy	Limited	Fully Featured (Egress/Ingress)
Hybrid Cloud	Hard to implement	Designed for it (IPsec support)
Windows Support	No	Yes

Essential OpenShift Q&A: Architecture, Security & Workflow

April 19, 2026May 18, 2026 techhadoop OCP ai, cloud, devops, kubernetes, technology

In an OpenShift interview, the questions typically fall into three categories: Architecture/Concepts, Security (SCCs/RBAC), and Developer Workflow (S2I/Builds).

Here is a curated list of the most common and high-impact questions for 2026.

1. Core Architecture & Concepts

Q1: What is the fundamental difference between OpenShift and Kubernetes?

Answer: While Kubernetes is an open-source orchestration engine, OpenShift is a downstream, enterprise-grade distribution of Kubernetes by Red Hat.

The “Plus” Factor: OpenShift includes everything in Kubernetes but adds a built-in container registry, integrated CI/CD pipelines (Tekton), a developer-friendly web console, and enhanced security defaults.
Security: By default, OpenShift forbids containers from running as root, whereas vanilla Kubernetes is “open” by default.

Q2: What is an OpenShift “Project” vs. a Kubernetes “Namespace”?

Answer: A Project is simply an abstraction on top of a Kubernetes Namespace.

It adds metadata and facilitates Self-Service: users can request projects via the CLI (oc new-project) or Web Console.
Cluster admins can customize the project request template so that default Resource Quotas and Limit Ranges are applied automatically to every new project, which keeps a single user from exhausting the cluster.

Q3: Explain the role of the Router (HAProxy) in OpenShift.

Answer: In vanilla Kubernetes, you typically install an Ingress Controller (like NGINX). In OpenShift, the Router (based on HAProxy) is a core component. It provides the external entry point for traffic, mapping an external URL (a Route) to an internal Service.

2. Developer & Build Workflow

Q4: What is Source-to-Image (S2I) and why is it used?

Answer: S2I is a toolkit that allows developers to provide only their source code (via a Git URL). OpenShift then:

Detects the language (Java, Python, Node, etc.).
Injects the code into a “Builder Image.”
Assembles the final application image.

Benefit: Developers don’t need to know how to write a Dockerfile or manage base images, ensuring consistent security patches at the base layer.

Q5: What is a `BuildConfig`?

Answer: A BuildConfig is the definition of the entire build process. It specifies:

Source: Where the code is (Git).
Strategy: How to build it (S2I, Docker, or Pipeline).
Output: Where to push the resulting image (internal registry).
Triggers: Events that start a build (e.g., a code commit or an update to the base image).

3. Security & Operations

Q6: What are Security Context Constraints (SCCs)?

Answer: SCCs are one of the most important security features in OpenShift. They control what actions a pod can perform.

Restricted SCC: The default (restricted-v2 since OpenShift 4.11). It prevents pods from running as root and limits access to the host filesystem.
Anyuid SCC: Often used when migrating legacy Docker images that must run as a specific user.
Privileged SCC: Full access (usually reserved for infra components like logging or monitoring).

Q7: How does OpenShift handle Persistent Storage?

Answer: OpenShift uses the Persistent Volume (PV) and Persistent Volume Claim (PVC) model.

An administrator provisions PVs (storage chunks).
A developer requests storage via a PVC.
OpenShift uses Storage Classes to dynamically provision storage on the fly (e.g., on AWS EBS or VMware vSphere) when a PVC is created.

4. Scenario-Based “Pro” Question

Q8: “A pod is failing with a `CrashLoopBackOff`. How do you troubleshoot it in OpenShift?”

Answer: Walk through these 4 steps to show you have hands-on experience:

Check Status: oc get pods to see the status.
Examine Logs: oc logs <pod_name> (use --previous if the container already restarted).
Inspect Events: oc describe pod <pod_name> to look for failed mounts, scheduling issues, or “Back-off” events.
Debug Session: Use oc debug pod/<pod_name> to launch a terminal inside a clone of the failing pod to inspect the filesystem and environment variables.

5. Rapid-Fire Command Cheat Sheet

Task	Command
Login	`oc login <api-url>`
Create App	`oc new-app https://github.com/user/repo`
Scale App	`oc scale --replicas=3 deployment/my-app`
Expose Service	`oc expose svc/my-service`
View Resources	`oc get all`
Check SCCs	`oc get scc`

For the Administrator track, the interview will shift away from “how to deploy an app” toward Cluster Health, Lifecycle Management, and Infrastructure Stability.

In OpenShift 4.x (the modern standard), the “Operator-focused” architecture is the star of the show. Here are the heavy-hitting admin questions you should be ready for.

1. The Operator Framework

Q1: What is the “Operator Pattern” and why is it central to OpenShift 4?

Answer: In OpenShift 4, the entire cluster is managed by Operators. An Operator is a custom controller that encodes human operational knowledge into software.

The Loop: It constantly monitors the Actual State of a component (like the Internal Registry or Monitoring stack) and compares it to the Desired State. If they differ, the Operator automatically fixes it.
Cluster Version Operator (CVO): This is the “Master Operator” that manages the updates of the cluster itself, ensuring all core components are at the correct version.
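The loop described above can be sketched in a few lines of Python. This is a toy model of the reconcile pattern, not operator SDK code; the state dictionaries and the reconcile function are assumptions made for illustration.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Bring `actual` in line with `desired`; return the actions taken."""
    actions = []
    # Create or correct anything that differs from the desired state
    for key, want in desired.items():
        if actual.get(key) != want:
            actions.append(f"set {key}={want}")
            actual[key] = want
    # Remove anything that is no longer desired
    for key in list(actual):
        if key not in desired:
            actions.append(f"remove {key}")
            del actual[key]
    return actions


desired = {"replicas": 2, "image": "v2"}
actual = {"replicas": 1, "image": "v2", "debug": True}
print(reconcile(desired, actual))  # ['set replicas=2', 'remove debug']
print(reconcile(desired, actual))  # []
```

The second call returns an empty list because the loop is idempotent: once actual matches desired there is nothing left to do, which is why an operator can safely run the same comparison over and over.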

Q2: How do you perform a Cluster Upgrade in OpenShift 4?

Answer: Upgrades are managed via the Cluster Version Operator (CVO).

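Process: You typically set the update channel (e.g., oc adm upgrade channel stable-4.14) and then trigger the upgrade via the console or the CLI: oc adm upgrade --to-latest=true (oc adm upgrade on its own only shows the available updates).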
Mechanism: The CVO orchestrates the update of every operator in the cluster. The Machine Config Operator (MCO) handles the rolling reboot of nodes to update the underlying Red Hat Enterprise Linux CoreOS (RHCOS).

2. Infrastructure & Nodes

Q3: What is the Machine Config Operator (MCO)?

Answer: The MCO is one of the most important components for an admin. It treats the underlying nodes like “cattle, not pets.”

It manages the operating system (RHCOS) itself.
If you need to change a kernel parameter, add an SSH key, or change an NTP setting across 50 nodes, you create a MachineConfig object. The MCO then applies that change and, where a reboot is needed, drains and reboots nodes one at a time within each pool to keep workloads available.

Q4: Explain the difference between IPI and UPI installation.

Answer: IPI (Installer-Provisioned Infrastructure): Full automation. The OpenShift installer has credentials to your cloud (AWS, Azure, etc.) and creates the VMs, VPCs, and Load Balancers for you.

UPI (User-Provisioned Infrastructure): The admin manually creates the infrastructure (VMs, networking, storage). You then run the installer to “bootstrap” OpenShift onto those pre-existing resources. (Common in highly regulated or air-gapped environments).

3. Storage & Networking

Q5: How do you troubleshoot a Node that is in “NotReady” status?

Answer: I follow a systematic checklist:

Check Node Details: oc describe node <node_name> to look at the “Conditions” section (e.g., MemoryPressure, DiskPressure, or NetworkUnavailable).
Verify Kubelet: SSH into the node (or use oc debug node) and check the kubelet logs: journalctl -u kubelet.
Resource Usage: Check if the node has run out of PIDs or Disk space.
CSRs: If the node was recently added/rebooted, check if there are pending Certificate Signing Requests: oc get csr and approve them if necessary.

Q6: What is the “In-tree” to CSI migration?

Answer: Older versions of OpenShift used storage drivers built directly into the Kubernetes binary (“In-tree”). Modern OpenShift is moving to CSI (Container Storage Interface) drivers. As an admin, this means storage is now handled by standalone operators, allowing for easier updates without upgrading the whole cluster.

4. Security & Etcd

Q7: Why is the `etcd` backup critical, and how do you perform it?

Answer: etcd is the “brain” of the cluster; it stores every configuration and state. If etcd is lost, the cluster is dead.

Backup: Use the backup script that ships on every control plane node, run from a debug shell on one master: oc debug node/<master-node> -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup.
Strategy: Always take a backup before a cluster upgrade.

5. Monitoring & Logging

Q8: What stack does OpenShift use for Cluster Monitoring?

Answer: OpenShift comes with a pre-configured Prometheus and Alertmanager stack (managed by the Cluster Monitoring Operator), with dashboards built into the web console. Releases before 4.11 also bundled a read-only Grafana.

Note: Admins use this to monitor cluster health (CPU/Mem of nodes).
User Workload Monitoring: In newer versions, admins can enable “User Workload Monitoring” to allow developers to use the same Prometheus stack for their own applications without interfering with the cluster’s core monitoring.

Summary Checklist for your Interview

Tip: If they ask about a problem you can’t solve, always mention “Looking at the Operators.” In OpenShift 4, if something is broken, check oc get clusteroperators. If one shows DEGRADED=True, that is your smoking gun.
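If you want to script that check, the Python sketch below picks the degraded operators out of the tabular output. The sample text is illustrative, not captured from a real cluster, and the function name is hypothetical. Parsing oc get clusteroperators -o json would be more robust than splitting columns.

```python
# Illustrative sample in the column layout of `oc get clusteroperators`
SAMPLE = """\
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.14.1    True        False         False      3d
etcd             4.14.1    True        False         False      3d
network          4.14.1    True        False         True       5m      illustrative message text
"""


def degraded_operators(table: str):
    """Return the names of operators whose DEGRADED column reads True."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    col = header.index("DEGRADED")
    return [row.split()[0] for row in lines[1:] if row.split()[col] == "True"]


print(degraded_operators(SAMPLE))  # ['network']
```

The column split assumes every column before MESSAGE is populated, which is why the JSON output is the safer input for anything beyond a quick check.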

OCP can run in different environments, such as on-premise (VMware/Bare Metal) or as a managed service (ROSA/ARO).

Linux network

April 19, 2026April 19, 2026 techhadoop linux devops, docker, linux, security, technology

Linux network issues can come from a lot of places—misconfiguration, hardware problems, DNS failures, firewall rules, or even upstream network outages. The trick is to diagnose step-by-step instead of guessing. Here’s a practical workflow you can follow.

1. Check Basic Connectivity

Start simple—make sure the system is even connected.

ip a

Look for an IP address (not just 127.0.0.1)
Check if the interface is UP

ip link

Verify your interface (e.g., eth0, wlan0) is not DOWN

If it’s down:

sudo ip link set eth0 up

2. Test Network Reachability

Try pinging known addresses:

ping -c 4 8.8.8.8

If this fails → no internet or routing issue

Then test DNS:

ping -c 4 google.com

If IP ping works but domain fails → DNS problem
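The two pings above form a small decision tree, which can be scripted. The following is a sketch, not a definitive tool: the function names classify_connectivity and check_reachability are made up for this example, and it assumes ICMP is not filtered on your network (some networks drop ping even when everything else works).

```shell
# classify_connectivity IP_STATUS NAME_STATUS
# Hypothetical helper: each argument is a ping exit status (0 = replies received).
classify_connectivity() {
  if [ "$1" -ne 0 ]; then
    echo "no connectivity: check link, IP address, and default route"
  elif [ "$2" -ne 0 ]; then
    echo "IP works but name fails: DNS problem"
  else
    echo "connectivity and DNS look fine"
  fi
}

# check_reachability
# Runs the same two pings as above and classifies the combined result.
check_reachability() {
  ip_status=0
  ping -c 4 8.8.8.8 >/dev/null 2>&1 || ip_status=$?
  name_status=0
  ping -c 4 google.com >/dev/null 2>&1 || name_status=$?
  classify_connectivity "$ip_status" "$name_status"
}
```

Usage: define both functions in your shell and run check_reachability.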

3. Check Routing Table

ip route

Look for a default route:

default via 192.168.1.1 dev eth0

If missing, add it:

sudo ip route add default via <gateway-ip>
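When checking many machines, the default route check can be reduced to a one-line filter. This sketch reads ip route output on standard input and prints the gateway of the first default route, returning a non-zero status if there is none. The function name default_gateway is made up for this example; routes without a via hop (for example some point-to-point links) are reported as missing.

```shell
# default_gateway
# Hypothetical helper: reads `ip route` output on stdin, prints the gateway
# address of the first default route, and fails if no such route exists.
default_gateway() {
  awk '
    $1 == "default" {
      for (i = 2; i < NF; i++) {
        if ($i == "via") { print $(i + 1); found = 1; exit }
      }
    }
    END { if (!found) exit 1 }
  '
}
```

Usage: ip route | default_gateway || echo "no default route"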

4. Verify DNS Configuration

Check:

cat /etc/resolv.conf

You should see something like:

nameserver 8.8.8.8

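If not, add a nameserver line temporarily (on systems where this file is generated by systemd-resolved or NetworkManager, manual edits are overwritten):

sudo nano /etc/resolv.conf

Or if using systemd-resolved:

resolvectl status

(On older releases the same command is systemd-resolve --status.)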
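To see at a glance which resolvers a machine is configured to use, the nameserver lines can be extracted directly. A minimal sketch follows; the function name list_nameservers is made up for this example. If the only entry is 127.0.0.53, the file points at the local systemd-resolved stub, and the real upstream servers are shown by resolvectl status instead.

```shell
# list_nameservers [FILE]
# Hypothetical helper: prints the address from every "nameserver" line in
# FILE (default: /etc/resolv.conf), one per line.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

Usage: list_nameservers with no arguments reads /etc/resolv.conf; empty output means no resolver is configured.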

5. Check Network Manager / Services

Depending on your system:

Using NetworkManager

			
nmcli device status
nmcli connection show

Restart:

sudo systemctl restart NetworkManager

Using systemd-networkd

			
networkctl status
sudo systemctl restart systemd-networkd

6. Inspect Firewall Rules

A firewall can silently block traffic.

sudo iptables -L -n

Or if using UFW:

sudo ufw status

Temporarily disable (for testing only; re-enable afterwards with sudo ufw enable):

sudo ufw disable

7. Check Open Ports

ss -tulnp

This shows:

Listening services
Ports in use
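A common follow-up question is simply "is anything listening on port N?". The sketch below answers it from ss output. It assumes the usual column layout, where the fifth column is Local Address:Port, and the function name is_listening is made up for this example.

```shell
# is_listening PORT
# Hypothetical helper: reads `ss -tuln` output on stdin and succeeds if any
# socket has PORT as its local port.
is_listening() {
  awk -v port="$1" '
    NR > 1 {
      # The local port is whatever follows the last colon in column five,
      # which also handles IPv6 forms such as [::]:22.
      n = split($5, parts, ":")
      if (parts[n] == port) { found = 1; exit }
    }
    END { exit found ? 0 : 1 }
  '
}
```

Usage: ss -tuln | is_listening 22 && echo "something is listening on 22"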

8. Test with traceroute

See where packets stop:

traceroute 8.8.8.8

If it stops early → local network issue
If it goes far but fails → ISP or remote issue

9. Check Logs

Logs often reveal the real problem:

			
journalctl -u NetworkManager
journalctl -xe
dmesg | grep -i network

10. Wireless-Specific Checks

If using Wi-Fi:

iwconfig

Scan networks:

sudo iw dev wlan0 scan

Reconnect:

nmcli device wifi connect "SSID" password "PASSWORD"

11. Hardware / Driver Issues

Check drivers:

			
lspci | grep -i network
lsmod | grep <driver>

If missing drivers, you may need to install firmware packages.

Common Problems & Quick Fixes

No IP address → DHCP issue → restart network service
Can ping IP but not domain → DNS misconfigured
Interface down → enable with ip link set
Connected but no internet → gateway missing
Wi-Fi connected but slow → interference or driver issue

On Ubuntu desktop, networking is usually managed by NetworkManager (Ubuntu Server typically uses netplan with systemd-networkd instead), so troubleshooting is a bit more standardized. Let’s narrow this down quickly with a focused Ubuntu workflow.

1. Check if NetworkManager is running

sudo systemctl status NetworkManager

If it’s not active:

sudo systemctl restart NetworkManager

2. See connection status

nmcli device status

You’ll see something like:

connected → good
disconnected → problem
unmanaged → config issue

To reconnect:

nmcli connection up <connection-name>
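The state check above can be filtered down to only the devices that need attention. This sketch reads nmcli's terse output (nmcli -t -f DEVICE,STATE device status, which prints device:state pairs) and lists every device that is not in the connected state, skipping the loopback interface. The function name problem_devices is made up for this example; states such as "connecting" are reported too, so a device that is still coming up will appear briefly.

```shell
# problem_devices
# Hypothetical helper: reads `nmcli -t -f DEVICE,STATE device status` on
# stdin and prints "device: state" for each device that is not connected.
problem_devices() {
  awk -F: '$1 != "lo" && $2 != "connected" { print $1 ": " $2 }'
}
```

Usage: nmcli -t -f DEVICE,STATE device status | problem_devices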

3. Check IP address

ip a

Look for something like:

inet 192.168.x.x

❌ No IP? Try renewing DHCP:

sudo dhclient -v

4. Test connectivity

ping -c 4 8.8.8.8

❌ fails → network/gateway issue
✅ works → move to DNS test:

ping -c 4 google.com

❌ fails → DNS problem

5. Check default gateway

ip route

You should see:

default via 192.168.x.1

If missing (substitute your own gateway’s address):

sudo ip route add default via 192.168.1.1

6. Fix DNS (very common on Ubuntu)

Check:

cat /etc/resolv.conf

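If it’s broken or empty (on Ubuntu it normally points at the systemd-resolved stub, nameserver 127.0.0.53), try:

sudo nano /etc/systemd/resolved.conf

Set, under the [Resolve] section: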

DNS=8.8.8.8 1.1.1.1

Then restart:

sudo systemctl restart systemd-resolved
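An alternative to editing the main file is a drop-in under /etc/systemd/resolved.conf.d/, which keeps the change separate and easy to remove. The sketch below writes one; the function name write_resolved_dropin and the file name dns.conf are made up for this example, and the DNS servers are the same two used above. It needs root when writing to the real directory.

```shell
# write_resolved_dropin [DIR]
# Hypothetical helper: writes a systemd-resolved drop-in that sets the DNS
# servers. DIR defaults to /etc/systemd/resolved.conf.d.
write_resolved_dropin() {
  dir=${1:-/etc/systemd/resolved.conf.d}
  mkdir -p "$dir" || return 1
  # The DNS= setting is only honored inside the [Resolve] section.
  cat > "$dir/dns.conf" <<'EOF'
[Resolve]
DNS=8.8.8.8 1.1.1.1
EOF
}
```

After writing the file, restart systemd-resolved as shown above and confirm the servers with resolvectl status.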

7. Wi-Fi specific (if applicable)

Check:

nmcli device wifi list

Reconnect:

nmcli device wifi connect "SSID" password "PASSWORD"

If Wi-Fi adapter missing:

lspci | grep -i network

8. Firewall check

Ubuntu often uses UFW:

sudo ufw status

Temporarily disable:

sudo ufw disable

9. Look at logs (very useful)

journalctl -u NetworkManager --no-pager | tail -50

Fast “reset everything” fix (works surprisingly often)

			
sudo systemctl restart NetworkManager
sudo systemctl restart systemd-resolved
sudo dhclient
