Let’s walk through a realistic, production-style OVN debugging scenario in
OpenShift Container Platform using OVN-Kubernetes.
Scenario
A frontend pod cannot reach a backend service
You have:
- frontend pod
- backend pod
- backend-service (ClusterIP)
And:
curl http://backend-service
fails
Step-by-step debugging (real flow)
Step 1: Check if backend pod is healthy
oc get pods -o wide
You want:
- Backend pod = Running
- Has an IP (e.g., 10.128.2.15)
If pod is not running → stop here (not an OVN issue)
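The phase-and-IP check can be done in one line with jsonpath; the pod name "backend" is a placeholder from this scenario, substitute your actual pod name:

```shell
# Prints the pod phase and its assigned IP in one shot.
# "backend" is a placeholder pod name from the scenario.
oc get pod backend -o jsonpath='{.status.phase} {.status.podIP}{"\n"}'
```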
Step 2: Test direct pod-to-pod connectivity
From frontend pod:
oc exec -it frontend -- curl http://10.128.2.15
Outcomes:
Case A: This FAILS
→ Problem is networking (OVN / policy / routing)
Case B: This WORKS
→ Networking is fine → problem is service layer
Branch A: Pod-to-pod FAILS (OVN issue)
Step 3A: Check NetworkPolicies
oc get networkpolicy -A
Look for anything like:
- Deny all ingress
- Missing allow rules
Quick test:
Create temporary allow-all policy
If it suddenly works → root cause = NetworkPolicy
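A minimal temporary allow-all ingress policy looks like this; the namespace "app" is a placeholder for wherever the backend pod runs, and the policy should be deleted as soon as the test is done:

```yaml
# TEMPORARY allow-all ingress policy for debugging only -- remove afterwards.
# Namespace "app" is a placeholder; use the backend pod's namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: debug-allow-all
  namespace: app
spec:
  podSelector: {}        # selects every pod in the namespace
  policyTypes:
  - Ingress
  ingress:
  - {}                   # empty rule = allow all ingress
```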
Step 4A: Check node-level OVN
Find nodes:
oc get pods -o wide
Then:
oc get pods -n openshift-ovn-kubernetes -o wide
Check:
- Is ovnkube-node running on both nodes?
- Any restarts?
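A sketch of that check, assuming the pod names "frontend" and "backend" from this scenario and a `<node>` placeholder:

```shell
# Which nodes host the two application pods?
oc get pod frontend backend -o wide

# Is the ovnkube-node pod on that node Running, and has it restarted?
oc get pods -n openshift-ovn-kubernetes -o wide \
  --field-selector spec.nodeName=<node>
```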
Step 5A: Test OVS health
oc debug node/<node>
chroot /host
ovs-vsctl show
Look for:
- br-int bridge present
- Proper interfaces attached

Missing interfaces = OVN not wiring pods correctly
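To see whether a specific pod got wired up, list the ports on br-int (run on the node, after `oc debug node/<node>` and `chroot /host`):

```shell
# Every pod on this node should have a corresponding port on br-int;
# a missing port means OVN never attached that pod.
ovs-vsctl list-ports br-int
```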
Step 6A: Check OVN logs
oc logs -n openshift-ovn-kubernetes <ovnkube-node>
Common errors:
- Flow install failures
- DB sync issues
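A quick way to surface those errors, assuming the ovnkube-node pod name is known; the `-c ovnkube-node` container name may differ between OpenShift versions, so check `oc get pod <pod> -o jsonpath='{.spec.containers[*].name}'` first:

```shell
# <ovnkube-node-pod> is a placeholder; container name varies by version.
oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod> -c ovnkube-node \
  | grep -iE 'error|fail|sync' | tail -20
</imports>
```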
Branch B: Pod-to-pod WORKS, Service FAILS
This is VERY common and often misunderstood.
Step 3B: Check service
oc get svc backend-service -o wide
Check:
- ClusterIP exists
- Correct port
Step 4B: Check endpoints
oc get endpoints backend-service
If EMPTY:
→ Service is not linked to pods
Root cause:
- Wrong selector labels
Fix:
selector:
  app: backend
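A sketch of how to compare the two sides, assuming the `app: backend` label from the fix above:

```shell
# What does the service actually select?
oc get svc backend-service -o jsonpath='{.spec.selector}{"\n"}'

# What labels does the pod actually carry?
oc get pod backend --show-labels

# If the selector matches, this returns the backend pod (and the
# endpoints object will be populated):
oc get pods -l app=backend
```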
Step 5B: Test service IP directly
oc exec -it frontend -- curl http://<ClusterIP>:<port>
Fails but pod IP works:
→ OVN load-balancing issue
Step 6B: Check OVN load balancer
On node:
ovn-nbctl lb-list
You should see:
- Service IP mapped to pod IPs
If missing → OVN not programming service
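In recent OVN-Kubernetes versions `ovn-nbctl` is typically run inside an ovnkube pod rather than directly on the host; a sketch, where the pod and container names (`nbdb` here) are assumptions to verify against your cluster:

```shell
# Pod and container names are placeholders -- check your cluster first.
oc exec -n openshift-ovn-kubernetes <ovnkube-pod> -c nbdb -- \
  ovn-nbctl lb-list | grep <ClusterIP>
```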
Bonus: DNS check (often confused with OVN)
From frontend:
nslookup backend-service
If fails:
→ DNS issue, NOT OVN
Check:
oc get pods -n openshift-dns
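A fuller DNS check resolves both the short name and the FQDN; the namespace "app" is a placeholder:

```shell
# Short-name resolution depends on the pod's search domains;
# the FQDN form should always resolve if DNS is healthy.
oc exec -it frontend -- nslookup backend-service
oc exec -it frontend -- nslookup backend-service.app.svc.cluster.local

# Are the cluster DNS pods themselves healthy?
oc get pods -n openshift-dns
```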
Real root cause examples (from production)
Case 1: Wrong labels
- Service selector doesn’t match pod
→ No endpoints → service fails
Case 2: NetworkPolicy blocking traffic
- Default deny policy applied
→ Pods isolated
Case 3: OVN desync
- Pod exists but not in OVN DB
→ No routing
Case 4: Node issue
- Only pods on one node fail
→ ovnkube-node broken there
Case 5: MTU mismatch
- Small packets work, large fail
→ Very tricky to spot
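MTU mismatches can be confirmed with a large, don't-fragment ping, assuming the frontend image ships an iputils-style ping that supports `-M do`; the payload size 1372 assumes a 1400-byte overlay MTU (payload = MTU minus 28 bytes of IP/ICMP headers), so adjust to your cluster:

```shell
# Small pings succeed but this fails? Suspect an MTU mismatch.
# -M do sets "don't fragment"; 1372 + 28 header bytes = 1400 MTU.
oc exec -it frontend -- ping -c 3 -M do -s 1372 10.128.2.15
```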
The mental model (this is what experts use)
When debugging:
- Pod IP → works?
- ❌ → OVN / policy / routing
- ✅ → go to service layer
- Service endpoints exist?
- ❌ → labels problem
- ✅ → OVN load balancing
- DNS works?
- ❌ → DNS, not OVN
Pro move (what senior engineers do)
Spin up a debug pod:
oc run debug --image=busybox -it --rm -- sh
Then test:
ping, wget, nslookup (busybox ships wget rather than curl)
This removes app complexity completely.
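From inside the debug pod's shell, each layer can be tested independently; the IP, port, and service name below are placeholders from this scenario:

```shell
# L3 reachability to the backend pod IP
ping -c 3 10.128.2.15

# HTTP to the pod (busybox has wget, not curl); port 8080 is a placeholder
wget -qO- http://10.128.2.15:8080

# DNS resolution of the service name
nslookup backend-service
```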