Scenario
Traffic works for pods on node A, but fails for pods on node B.
That usually points to a node-local OVN/OVS problem, not an app problem.
Example:
frontend on worker-1 can reach backend; the same app on worker-2 cannot.
That pattern is a huge clue.
How to debug it
1. Prove it’s node-specific
List pods and nodes:
oc get pods -A -o wide
Run the same network test from a pod on each node:
oc exec -it <good-pod> -- curl http://<target-pod-ip>:<port>
oc exec -it <bad-pod> -- curl http://<target-pod-ip>:<port>
If one node always works and another always fails, focus on the bad node.
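Once you have run the curl test from a pod on each node a few times, it helps to summarize the results per node. This is a minimal sketch, not part of any OpenShift tooling: it assumes you have recorded each attempt as a line of the hypothetical form `<node> ok` or `<node> fail`, and it flags nodes that never succeeded.

```shell
# Flag nodes whose test traffic always fails.
# Input lines (hypothetical format you collect yourself): "<node> ok" or "<node> fail"
flag_bad_nodes() {
  awk '
    $2 == "ok"   { ok[$1]++ }
    $2 == "fail" { fail[$1]++ }
    END {
      for (n in fail)
        if (!(n in ok)) print n   # node never succeeded -> suspect
    }'
}

# Example run against sample results:
printf '%s\n' \
  "worker-1 ok" \
  "worker-1 ok" \
  "worker-2 fail" \
  "worker-2 fail" | flag_bad_nodes
# prints: worker-2
```

If a node shows up here consistently, that is the node to focus on in the next steps.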
2. Check the OVN pod on the bad node
Find the ovnkube-node pod for that worker:
oc get pods -n openshift-ovn-kubernetes -o wide
Look for the pod scheduled on the failing node.
Then inspect:
oc describe pod -n openshift-ovn-kubernetes <ovnkube-node-pod>
oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod>
Things that matter:
- restarts
- readiness failures
- DB connection errors
- OVS/flow programming errors
If ovnkube-node is unhealthy there, that is often the root cause.
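Scanning a saved log dump for the failure signatures listed above can be scripted. This is a rough sketch, assuming you first save the logs with `oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod> > /tmp/ovnkube.log`; the grep pattern is an illustrative guess at common wording, not an exhaustive or official list of OVN error strings.

```shell
# Count lines in an ovnkube-node log dump that look like DB-connection or
# flow-programming failures (pattern is an illustrative assumption).
scan_ovn_log() {
  grep -Eic 'connection refused|failed to connect|transaction error|error.*(flow|ofctrl|br-int)' "$1"
}

# Example against a synthetic log snippet:
cat > /tmp/ovnkube-sample.log <<'EOF'
I0101 ovnkube started
E0101 failed to connect to OVN SB DB: connection refused
E0101 error adding flow to br-int
EOF
scan_ovn_log /tmp/ovnkube-sample.log
# prints: 2
```

A non-zero, growing count on one node while the other nodes stay clean is a strong signal.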
3. Check node readiness and basic health
oc get nodes
oc describe node <bad-node>
Look for:
- NotReady status
- memory/disk pressure
- network-related events
Sometimes OVN is fine and the node itself is degraded.
4. Inspect OVS on the bad node
Open a debug shell:
oc debug node/<bad-node>
chroot /host
Then:
ovs-vsctl show
You want to see expected bridges such as br-int.
Also useful:
ovs-ofctl dump-ports br-int
ovs-appctl bond/show
Red flags:
- missing br-int
- missing interfaces
- counters not increasing on expected ports
If OVS is broken on that node, pod traffic there will fail even while the rest of the cluster looks fine.
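The br-int check can be wrapped in a tiny helper so it works both live and against saved output. This is a sketch: on the node you would pipe real output into it with `ovs-vsctl show | check_br_int`; the sample below is trimmed, illustrative output, not a full healthy dump.

```shell
# Verify the OVN integration bridge appears in `ovs-vsctl show` output
# read from stdin.
check_br_int() {
  if grep -q 'Bridge br-int' -; then
    echo "br-int present"
  else
    echo "br-int MISSING"
  fi
}

# Example against a trimmed sample of healthy output:
check_br_int <<'EOF'
    Bridge br-int
        fail_mode: secure
        Port ovn-k8s-mp0
EOF
# prints: br-int present
```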
5. Check the node’s host networking
Still on the node:
ip addr
ip route
ip link
Look for:
- missing routes
- down interfaces
- wrong MTU
A node can have OVN running, but if the host interface or route is wrong, encapsulated traffic will still fail.
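Two of these host checks are easy to automate against saved output, which also makes it simple to diff a bad node against a good one. A minimal sketch, assuming you saved `ip route` and `ip -br link` output from the node into files:

```shell
# Sanity-check saved host networking output: default route present,
# at least one interface administratively UP.
check_host_net() {
  routes=$1
  links=$2
  grep -q '^default ' "$routes" && echo "default route: ok" || echo "default route: MISSING"
  grep -Eq ' UP ' "$links"      && echo "uplink: ok"        || echo "uplink: DOWN?"
}

# Example against sample output files:
cat > /tmp/routes.txt <<'EOF'
default via 10.0.0.1 dev ens3
10.0.0.0/24 dev ens3 proto kernel
EOF
cat > /tmp/links.txt <<'EOF'
ens3             UP             52:54:00:aa:bb:cc <BROADCAST,MULTICAST,UP,LOWER_UP>
EOF
check_host_net /tmp/routes.txt /tmp/links.txt
```

Run it with files from both the good and the bad node; any line that differs is worth chasing.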
6. Compare MTU with a working node
MTU mismatches are sneaky.
On both a good node and bad node:
ip link
Look at the main NIC and OVN-related interfaces.
Symptoms of MTU trouble:
- DNS works sometimes
- small pings work
- larger curls/higher-volume traffic fail or hang
A quick test from a pod can help:
ping -M do -s 1400 <target-ip>
If smaller packets work and larger ones fail, suspect MTU.
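The arithmetic behind that test is worth spelling out. `ping -s` sets the ICMP payload size, and the packet on the wire is that payload plus 8 bytes of ICMP header and 20 bytes of IPv4 header. With the default Geneve overlay, OpenShift sets the cluster network MTU to the host MTU minus 100 bytes of encapsulation overhead. A quick sketch of the numbers for a typical 1500-byte host NIC:

```shell
# MTU arithmetic for the ping test above (default Geneve overlay).
host_mtu=1500
overlay_overhead=100                      # OVN-Kubernetes Geneve overhead
cluster_mtu=$((host_mtu - overlay_overhead))
max_ping_payload=$((cluster_mtu - 28))    # 20 bytes IPv4 + 8 bytes ICMP

echo "cluster MTU: $cluster_mtu"
echo "largest ping -M do -s payload that should pass: $max_ping_payload"
# prints:
# cluster MTU: 1400
# largest ping -M do -s payload that should pass: 1372
```

So on a default setup, `ping -M do -s 1372` should succeed from a pod while `-s 1400` will not; if the cutoff you observe is lower than the arithmetic predicts, something in the path has a smaller MTU than expected.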
7. Check if pod wiring exists on the bad node
From the failing node’s ovnkube-node logs, check whether the affected pod sandbox/interface got programmed correctly.
Also inspect pods on that node:
oc get pods -A -o wide | grep <bad-node>
If all pods on that node fail, it is likely node OVN/OVS or host network.
If only one pod fails, it may be a pod-specific attachment/setup issue.
8. Test service vs direct pod IP
From a failing pod:
curl http://<target-pod-ip>:<port>
curl http://<service-cluster-ip>:<port>
Interpretation:
- both fail → node/local OVN path likely broken
- pod IP works, service fails → service/load-balancer programming problem
- DNS name fails, ClusterIP works → DNS problem
This helps avoid blaming OVN for the wrong layer.
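The interpretation table above can be captured as a tiny decision helper. This is only a sketch of the logic: the three arguments are the outcomes (`ok` or `fail`) of curling the pod IP, the service ClusterIP, and the service DNS name.

```shell
# Map the three curl outcomes to the likely broken layer.
diagnose() {
  pod_ip=$1
  cluster_ip=$2
  dns_name=$3
  if [ "$pod_ip" = fail ] && [ "$cluster_ip" = fail ]; then
    echo "node-local OVN path likely broken"
  elif [ "$pod_ip" = ok ] && [ "$cluster_ip" = fail ]; then
    echo "service/load-balancer programming problem"
  elif [ "$cluster_ip" = ok ] && [ "$dns_name" = fail ]; then
    echo "DNS problem"
  else
    echo "connectivity looks fine at this layer"
  fi
}

diagnose fail fail fail   # node-local OVN path likely broken
diagnose ok fail ok       # service/load-balancer programming problem
diagnose ok ok fail       # DNS problem
```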
9. Check for node-local firewall or host changes
On the bad node, inspect whether something changed outside OpenShift:
iptables -S
nft list ruleset
systemctl status ovs-vswitchd
systemctl status ovn-controller
A manual host change, bad firewall rule, or failed service can break just one node.
10. Restart scope carefully
If evidence points clearly to the bad node’s OVN stack, a targeted recovery step is safer than broad cluster changes.
Typical sequence:
- cordon/drain the bad node if workloads are impacted
- restart or recover the bad node’s OVN/OVS components
- verify traffic before uncordoning
Avoid random restarts cluster-wide unless you’ve ruled out a local issue.
What this usually turns out to be
Most common causes:
- ovnkube-node unhealthy on one node
- broken or stale OVS state on that node
- host NIC / route / MTU mismatch
- node-specific firewall or kernel/network issue
- the node recently rebooted or partially lost connectivity to OVN DB
Fast triage checklist
When traffic fails only on one node, I’d do this in order:
oc get pods -A -o wide
oc get pods -n openshift-ovn-kubernetes -o wide
oc logs -n openshift-ovn-kubernetes <ovnkube-node-on-bad-node>
oc debug node/<bad-node>
chroot /host
ovs-vsctl show
ip route
ip link
systemctl status ovs-vswitchd
systemctl status ovn-controller
That usually gets you very close.
Mental model
When only one node is broken:
- cluster-wide policy is less likely
- app config is less likely
- service config is less likely
- node-local data plane is most likely
So think:
bad node → ovnkube-node → OVS → host NIC/route/MTU
Here’s a realistic example:
- pods on worker-2 cannot reach anything off-node
- pods on worker-1 are fine
- ovnkube-node on worker-2 shows repeated connection/programming errors
- ovs-vsctl show on worker-2 is missing expected state
That strongly suggests the fix is on worker-2, not in the app or service definitions.