Troubleshooting Node-Specific Pod Traffic Failures

Scenario

Traffic works for pods on node A, but fails for pods on node B.

That usually points to a node-local OVN/OVS problem, not an app problem.

Example:

  • frontend on worker-1 can reach backend
  • same app on worker-2 cannot

That pattern is a huge clue.


How to debug it

1. Prove it’s node-specific

List pods and nodes:

oc get pods -A -o wide

Run the same network test from a pod on each node:

oc exec -it <good-pod> -- curl http://<target-pod-ip>:<port>
oc exec -it <bad-pod> -- curl http://<target-pod-ip>:<port>

If one node always works and another always fails, focus on the bad node.
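The two oc exec checks above can be wrapped in a small helper so you can run them side by side; the pod names, namespace, and target URL below are placeholders for your environment:

```shell
# Run the same connectivity test from a given pod and print a verdict.
# Pod name, namespace, and target URL are placeholders for your cluster.
check_reach() {
  pod="$1"; ns="$2"; url="$3"
  # --max-time keeps a hung connection from stalling the whole check
  if oc exec -n "$ns" "$pod" -- curl -s --max-time 5 -o /dev/null "$url"; then
    echo "$pod: OK"
  else
    echo "$pod: FAIL"
  fi
}

# Hypothetical usage, one pod per node:
#   check_reach frontend-on-worker-1 myapp http://10.128.2.15:8080
#   check_reach frontend-on-worker-2 myapp http://10.128.2.15:8080
```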


2. Check the OVN pod on the bad node

Find the ovnkube-node pod for that worker:

oc get pods -n openshift-ovn-kubernetes -o wide

Look for the pod scheduled on the failing node.

Then inspect:

oc describe pod -n openshift-ovn-kubernetes <ovnkube-node-pod>
oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod>

Things that matter:

  • restarts
  • readiness failures
  • DB connection errors
  • OVS/flow programming errors

If ovnkube-node is unhealthy there, that is often the root cause.
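To jump straight to the right pod instead of eyeballing the wide listing, the node name can be fed to a field selector (a standard oc/kubectl feature); worker-2 is a placeholder:

```shell
# Locate the ovnkube-node pod scheduled on a given node.
find_ovnkube_pod() {
  node="$1"
  oc get pods -n openshift-ovn-kubernetes \
     --field-selector spec.nodeName="$node" \
     -o name | grep ovnkube-node
}

# Hypothetical usage:
#   pod=$(find_ovnkube_pod worker-2)
#   oc logs -n openshift-ovn-kubernetes "$pod" --all-containers | tail -50
```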


3. Check node readiness and basic health

oc get node
oc describe node <bad-node>

Look for:

  • NotReady
  • memory/disk pressure
  • network-related events

Sometimes OVN is fine and the node itself is degraded.
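A quick way to read just the condition block without scrolling through the full describe output (Ready, MemoryPressure, DiskPressure, and PIDPressure are the standard kubelet condition types):

```shell
# Print each node condition as type=status, one per line.
node_conditions() {
  node="$1"
  oc get node "$node" \
     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# On a healthy node you expect Ready=True and all pressure conditions False.
```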


4. Inspect OVS on the bad node

Open a debug shell:

oc debug node/<bad-node>
chroot /host

Then:

ovs-vsctl show

You want to see expected bridges such as br-int.

Also useful:

ovs-ofctl -O OpenFlow13 dump-ports br-int
ovs-appctl bond/show

(OVN typically restricts br-int to OpenFlow 1.3+, so a plain ovs-ofctl call can fail version negotiation without the -O flag.)

Red flags:

  • missing br-int
  • interfaces missing
  • counters not increasing on expected ports

If OVS is broken on that node, pod traffic there will fail even while the rest of the cluster looks fine.
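The bridge check can be scripted with ovs-vsctl's br-exists and list-ports subcommands (both standard OVS); run it from the debug shell after chroot /host:

```shell
# Sanity-check the OVN integration bridge on this node.
check_br_int() {
  if ! ovs-vsctl br-exists br-int; then
    echo "br-int: MISSING"
    return 1
  fi
  ports=$(ovs-vsctl list-ports br-int | wc -l | tr -d ' ')
  echo "br-int: present, $ports ports"
}
```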


5. Check the node’s host networking

Still on the node:

ip addr
ip route
ip link

Look for:

  • missing routes
  • down interfaces
  • wrong MTU

A node can have OVN running, but if the host interface or route is wrong, the Geneve-encapsulated traffic between nodes will still fail.


6. Compare MTU with a working node

MTU mismatches are sneaky.

On both a good node and bad node:

ip link

Look at the main NIC and OVN-related interfaces.

Symptoms of MTU trouble:

  • DNS works sometimes
  • small pings work
  • larger curls/higher-volume traffic fail or hang

A quick test from a pod can help (-M do sets the Don't Fragment bit; Linux ping only):

ping -M do -s 1400 <target-ip>

If smaller packets work and larger ones fail, suspect MTU.
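A small sweep makes the cutoff obvious; this sketch assumes Linux ping and a reachable target IP:

```shell
# Try increasing ICMP payload sizes with the Don't Fragment bit set.
# Total packet size is payload + 28 bytes (IP + ICMP headers), and OVN's
# Geneve encapsulation also eats into the usable pod MTU.
mtu_sweep() {
  target="$1"
  for size in 1200 1300 1400 1450 1500; do
    if ping -M do -c 1 -W 2 -s "$size" "$target" >/dev/null 2>&1; then
      echo "payload $size: OK"
    else
      echo "payload $size: FAIL"
    fi
  done
}
```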


7. Check if pod wiring exists on the bad node

From the failing node’s ovnkube-node logs, check whether the affected pod sandbox/interface got programmed correctly.

Also inspect pods on that node:

oc get pods -A -o wide | grep <bad-node>

If all pods on that node fail, it is likely node OVN/OVS or host network.
If only one pod fails, it may be a pod-specific attachment/setup issue.
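OVN-Kubernetes records each pod's assigned IP and MAC in the k8s.ovn.org/pod-networks annotation, so a missing or empty annotation suggests the wiring step never completed; names below are placeholders:

```shell
# Show a pod's OVN network annotation (empty output = wiring never happened).
pod_wiring() {
  ns="$1"; pod="$2"
  oc get pod -n "$ns" "$pod" -o yaml | grep -A1 'k8s.ovn.org/pod-networks'
}
```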


8. Test service vs direct pod IP

From a failing pod:

curl http://<target-pod-ip>:<port>
curl http://<service-cluster-ip>:<port>
curl http://<service-name>.<namespace>.svc:<port>

Interpretation:

  • both fail → node/local OVN path likely broken
  • pod IP works, service fails → service/load-balancer programming problem
  • DNS name fails, ClusterIP works → DNS problem

This helps avoid blaming OVN for the wrong layer.
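The three checks can be folded into one function that reports the first broken layer; the URLs are placeholders you would probe via oc exec from the failing pod:

```shell
# Probe pod IP, then service ClusterIP, then service DNS name, and
# report the first layer that fails.
classify_failure() {
  pod_ip_url="$1"; svc_ip_url="$2"; svc_dns_url="$3"
  curl -s --max-time 5 -o /dev/null "$pod_ip_url"  || { echo "pod-ip-broken"; return; }
  curl -s --max-time 5 -o /dev/null "$svc_ip_url"  || { echo "service-broken"; return; }
  curl -s --max-time 5 -o /dev/null "$svc_dns_url" || { echo "dns-broken"; return; }
  echo "all-ok"
}
```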


9. Check for node-local firewall or host changes

On the bad node, inspect whether something changed outside OpenShift:

iptables -S
nft list ruleset
systemctl status ovs-vswitchd
systemctl status ovsdb-server

Note that on OpenShift, OVS runs as a host systemd service, but ovn-controller runs as a container inside the ovnkube-node pod; there is no ovn-controller systemd unit to check. Inspect that pod's containers instead.

A manual host change, bad firewall rule, or failed service can break just one node.


10. Restart scope carefully

If evidence points clearly to the bad node’s OVN stack, a targeted recovery step is safer than broad cluster changes.

Typical sequence:

  • cordon/drain the bad node if workloads are impacted
  • restart or recover the bad node’s OVN/OVS components
  • verify traffic before uncordoning

Avoid random restarts cluster-wide unless you’ve ruled out a local issue.
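The sequence above, sketched as commands (standard oc, but the exact drain flags and the safety of deleting the ovnkube-node pod should be checked against your cluster and change process; its DaemonSet recreates it):

```shell
# Targeted recovery for one bad node; verify traffic before uncordoning.
recover_node() {
  node="$1"
  oc adm cordon "$node"
  oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data
  # Assumed label app=ovnkube-node; verify with: oc get pods --show-labels
  oc delete pod -n openshift-ovn-kubernetes -l app=ovnkube-node \
     --field-selector spec.nodeName="$node"
  # ...re-run the connectivity test here, and only then:
  oc adm uncordon "$node"
}
```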


What this usually turns out to be

Most common causes:

  • ovnkube-node unhealthy on one node
  • broken or stale OVS state on that node
  • host NIC / route / MTU mismatch
  • node-specific firewall or kernel/network issue
  • the node recently rebooted or partially lost connectivity to OVN DB

Fast triage checklist

When traffic fails only on one node, I’d do this in order:

oc get pods -A -o wide
oc get pods -n openshift-ovn-kubernetes -o wide
oc logs -n openshift-ovn-kubernetes <ovnkube-node-on-bad-node>
oc debug node/<bad-node>
chroot /host
ovs-vsctl show
ip route
ip link
systemctl status ovs-vswitchd
oc logs -n openshift-ovn-kubernetes <ovnkube-node-on-bad-node> -c ovn-controller

That usually gets you very close.


Mental model

When only one node is broken:

  • cluster-wide policy is less likely
  • app config is less likely
  • service config is less likely
  • node-local data plane is most likely

So think:
bad node → ovnkube-node → OVS → host NIC/route/MTU


Here’s a realistic example:

  • pods on worker-2 cannot reach anything off-node
  • pods on worker-1 are fine
  • ovnkube-node on worker-2 shows repeated connection/programming errors
  • ovs-vsctl show on worker-2 is missing expected state

That strongly suggests the fix is on worker-2, not in the app or service definitions.
