Troubleshooting Node-Specific Pod Traffic Failures

Scenario

Traffic works for pods on node A, but fails for pods on node B.

That usually points to a node-local OVN/OVS problem, not an app problem.

Example:

  • frontend on worker-1 can reach backend
  • same app on worker-2 cannot

That pattern is a huge clue.


How to debug it

1. Prove it’s node-specific

List pods and nodes:

oc get pods -A -o wide

Run the same network test from a pod on each node:

oc exec -it <good-pod> -- curl http://<target-pod-ip>:<port>
oc exec -it <bad-pod> -- curl http://<target-pod-ip>:<port>

If one node always works and another always fails, focus on the bad node.
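The two oc exec checks above can be wrapped in a small helper so you can run them side by side; the pod names, namespace, and target URL below are placeholders for your environment:

```shell
# Run the same connectivity test from a given pod and print a verdict.
# Pod name, namespace, and target URL are placeholders for your cluster.
check_reach() {
  pod="$1"; ns="$2"; url="$3"
  # --max-time keeps a hung connection from stalling the whole check
  if oc exec -n "$ns" "$pod" -- curl -s --max-time 5 -o /dev/null "$url"; then
    echo "$pod: OK"
  else
    echo "$pod: FAIL"
  fi
}

# Hypothetical usage, one pod per node:
#   check_reach frontend-on-worker-1 myapp http://10.128.2.15:8080
#   check_reach frontend-on-worker-2 myapp http://10.128.2.15:8080
```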


2. Check the OVN pod on the bad node

Find the ovnkube-node pod for that worker:

oc get pods -n openshift-ovn-kubernetes -o wide

Look for the pod scheduled on the failing node.

Then inspect:

oc describe pod -n openshift-ovn-kubernetes <ovnkube-node-pod>
oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod>

Things that matter:

  • restarts
  • readiness failures
  • DB connection errors
  • OVS/flow programming errors

If ovnkube-node is unhealthy there, that is often the root cause.
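To jump straight to the right pod instead of eyeballing the wide listing, the node name can be fed to a field selector (a standard oc/kubectl feature); worker-2 is a placeholder:

```shell
# Locate the ovnkube-node pod scheduled on a given node.
find_ovnkube_pod() {
  node="$1"
  oc get pods -n openshift-ovn-kubernetes \
     --field-selector spec.nodeName="$node" \
     -o name | grep ovnkube-node
}

# Hypothetical usage:
#   pod=$(find_ovnkube_pod worker-2)
#   oc logs -n openshift-ovn-kubernetes "$pod" --all-containers | tail -50
```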


3. Check node readiness and basic health

oc get node
oc describe node <bad-node>

Look for:

  • NotReady
  • memory/disk pressure
  • network-related events

Sometimes OVN is fine and the node itself is degraded.
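A quick way to read just the condition block without scrolling through the full describe output (Ready, MemoryPressure, DiskPressure, and PIDPressure are the standard kubelet condition types):

```shell
# Print each node condition as type=status, one per line.
node_conditions() {
  node="$1"
  oc get node "$node" \
     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# On a healthy node you expect Ready=True and all pressure conditions False.
```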


4. Inspect OVS on the bad node

Open a debug shell:

oc debug node/<bad-node>
chroot /host

Then:

ovs-vsctl show

You want to see expected bridges such as br-int.

Also useful:

ovs-ofctl -O OpenFlow13 dump-ports br-int
ovs-appctl bond/show

(OVN typically restricts br-int to OpenFlow 1.3+, so a plain ovs-ofctl call can fail version negotiation without the -O flag.)

Red flags:

  • missing br-int
  • interfaces missing
  • counters not increasing on expected ports

If OVS is broken on that node, pod traffic there will fail even while the rest of the cluster looks fine.
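The bridge check can be scripted with ovs-vsctl's br-exists and list-ports subcommands (both standard OVS); run it from the debug shell after chroot /host:

```shell
# Sanity-check the OVN integration bridge on this node.
check_br_int() {
  if ! ovs-vsctl br-exists br-int; then
    echo "br-int: MISSING"
    return 1
  fi
  ports=$(ovs-vsctl list-ports br-int | wc -l | tr -d ' ')
  echo "br-int: present, $ports ports"
}
```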


5. Check the node’s host networking

Still on the node:

ip addr
ip route
ip link

Look for:

  • missing routes
  • down interfaces
  • wrong MTU

A node can have OVN running, but if the host interface or route is wrong, the Geneve-encapsulated traffic between nodes will still fail.


6. Compare MTU with a working node

MTU mismatches are sneaky.

On both a good node and bad node:

ip link

Look at the main NIC and OVN-related interfaces.

Symptoms of MTU trouble:

  • DNS works sometimes
  • small pings work
  • larger curls/higher-volume traffic fail or hang

A quick test from a pod can help (-M do sets the Don't Fragment bit; Linux ping only):

ping -M do -s 1400 <target-ip>

If smaller packets work and larger ones fail, suspect MTU.
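A small sweep makes the cutoff obvious; this sketch assumes Linux ping and a reachable target IP:

```shell
# Try increasing ICMP payload sizes with the Don't Fragment bit set.
# Total packet size is payload + 28 bytes (IP + ICMP headers), and OVN's
# Geneve encapsulation also eats into the usable pod MTU.
mtu_sweep() {
  target="$1"
  for size in 1200 1300 1400 1450 1500; do
    if ping -M do -c 1 -W 2 -s "$size" "$target" >/dev/null 2>&1; then
      echo "payload $size: OK"
    else
      echo "payload $size: FAIL"
    fi
  done
}
```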


7. Check if pod wiring exists on the bad node

From the failing node’s ovnkube-node logs, check whether the affected pod sandbox/interface got programmed correctly.

Also inspect pods on that node:

oc get pods -A -o wide | grep <bad-node>

If all pods on that node fail, it is likely node OVN/OVS or host network.
If only one pod fails, it may be a pod-specific attachment/setup issue.
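OVN-Kubernetes records each pod's assigned IP and MAC in the k8s.ovn.org/pod-networks annotation, so a missing or empty annotation suggests the wiring step never completed; names below are placeholders:

```shell
# Show a pod's OVN network annotation (empty output = wiring never happened).
pod_wiring() {
  ns="$1"; pod="$2"
  oc get pod -n "$ns" "$pod" -o yaml | grep -A1 'k8s.ovn.org/pod-networks'
}
```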


8. Test service vs direct pod IP

From a failing pod:

curl http://<target-pod-ip>:<port>
curl http://<service-cluster-ip>:<port>
curl http://<service-name>.<namespace>.svc:<port>

Interpretation:

  • both fail → node/local OVN path likely broken
  • pod IP works, service fails → service/load-balancer programming problem
  • DNS name fails, ClusterIP works → DNS problem

This helps avoid blaming OVN for the wrong layer.
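The three checks can be folded into one function that reports the first broken layer; the URLs are placeholders you would probe via oc exec from the failing pod:

```shell
# Probe pod IP, then service ClusterIP, then service DNS name, and
# report the first layer that fails.
classify_failure() {
  pod_ip_url="$1"; svc_ip_url="$2"; svc_dns_url="$3"
  curl -s --max-time 5 -o /dev/null "$pod_ip_url"  || { echo "pod-ip-broken"; return; }
  curl -s --max-time 5 -o /dev/null "$svc_ip_url"  || { echo "service-broken"; return; }
  curl -s --max-time 5 -o /dev/null "$svc_dns_url" || { echo "dns-broken"; return; }
  echo "all-ok"
}
```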


9. Check for node-local firewall or host changes

On the bad node, inspect whether something changed outside OpenShift:

iptables -S
nft list ruleset
systemctl status ovs-vswitchd
systemctl status ovsdb-server

Note that on OpenShift, OVS runs as a host systemd service, but ovn-controller runs as a container inside the ovnkube-node pod; there is no ovn-controller systemd unit to check. Inspect that pod's containers instead.

A manual host change, bad firewall rule, or failed service can break just one node.


10. Restart scope carefully

If evidence points clearly to the bad node’s OVN stack, a targeted recovery step is safer than broad cluster changes.

Typical sequence:

  • cordon/drain the bad node if workloads are impacted
  • restart or recover the bad node’s OVN/OVS components
  • verify traffic before uncordoning

Avoid random restarts cluster-wide unless you’ve ruled out a local issue.
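The sequence above, sketched as commands (standard oc, but the exact drain flags and the safety of deleting the ovnkube-node pod should be checked against your cluster and change process; its DaemonSet recreates it):

```shell
# Targeted recovery for one bad node; verify traffic before uncordoning.
recover_node() {
  node="$1"
  oc adm cordon "$node"
  oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data
  # Assumed label app=ovnkube-node; verify with: oc get pods --show-labels
  oc delete pod -n openshift-ovn-kubernetes -l app=ovnkube-node \
     --field-selector spec.nodeName="$node"
  # ...re-run the connectivity test here, and only then:
  oc adm uncordon "$node"
}
```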


What this usually turns out to be

Most common causes:

  • ovnkube-node unhealthy on one node
  • broken or stale OVS state on that node
  • host NIC / route / MTU mismatch
  • node-specific firewall or kernel/network issue
  • the node recently rebooted or partially lost connectivity to OVN DB

Fast triage checklist

When traffic fails only on one node, I’d do this in order:

oc get pods -A -o wide
oc get pods -n openshift-ovn-kubernetes -o wide
oc logs -n openshift-ovn-kubernetes <ovnkube-node-on-bad-node>
oc debug node/<bad-node>
chroot /host
ovs-vsctl show
ip route
ip link
systemctl status ovs-vswitchd
oc logs -n openshift-ovn-kubernetes <ovnkube-node-on-bad-node> -c ovn-controller

That usually gets you very close.


Mental model

When only one node is broken:

  • cluster-wide policy is less likely
  • app config is less likely
  • service config is less likely
  • node-local data plane is most likely

So think:
bad node → ovnkube-node → OVS → host NIC/route/MTU


Here’s a realistic example:

  • pods on worker-2 cannot reach anything off-node
  • pods on worker-1 are fine
  • ovnkube-node on worker-2 shows repeated connection/programming errors
  • ovs-vsctl show on worker-2 is missing expected state

That strongly suggests the fix is on worker-2, not in the app or service definitions.
