Debugging OVN Issues in OpenShift

Let’s walk through a realistic, production-style OVN debugging scenario in
OpenShift Container Platform using OVN-Kubernetes.


Scenario

A frontend pod cannot reach a backend service

You have:

  • frontend pod
  • backend pod
  • backend-service (ClusterIP)

From the frontend pod, this fails:

curl http://backend-service


Step-by-step debugging (real flow)

Step 1: Check if backend pod is healthy

oc get pods -o wide

You want:

  • Backend pod = Running
  • Has an IP (e.g., 10.128.2.15)

If the pod is not running → stop here (not an OVN issue)


Step 2: Test direct pod-to-pod connectivity

From frontend pod:

oc exec -it frontend -- curl http://10.128.2.15

Outcomes:

Case A: This FAILS

→ Problem is networking (OVN / policy / routing)

Case B: This WORKS

→ Networking is fine → problem is service layer


Branch A: Pod-to-pod FAILS (OVN issue)

Step 3A: Check NetworkPolicies

oc get networkpolicy -A

Look for anything like:

  • Deny all ingress
  • Missing allow rules

Quick test:
Create temporary allow-all policy

If it suddenly works → root cause = NetworkPolicy
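A minimal allow-all policy for this test might look like the following (the namespace is an assumption — adjust to yours, and delete the policy after the experiment):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: temp-allow-all
  namespace: my-app        # assumption: replace with your namespace
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - {}                   # empty rule = allow all ingress
  egress:
    - {}                   # empty rule = allow all egress
```

Because NetworkPolicies are additive, this allow-all overrides any default-deny policy in the namespace for the duration of the test.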


Step 4A: Check node-level OVN

Find which nodes the frontend and backend pods are on (NODE column):

oc get pods -o wide

Then:

oc get pods -n openshift-ovn-kubernetes -o wide

Check:

  • Is ovnkube-node Running on both the frontend's and backend's nodes?
  • Any recent restarts?

Step 5A: Test OVS health

oc debug node/<node>
chroot /host
ovs-vsctl show

Look for:

  • br-int bridge
  • Proper interfaces

Missing interfaces = OVN not wiring pods correctly
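As an abbreviated, illustrative sketch, healthy output contains the br-int bridge plus one veth-backed port per local pod (names and versions below are made up — yours will differ):

```
Bridge br-int
    fail_mode: secure
    Port br-int
        Interface br-int
            type: internal
    Port "a1b2c3d4e5f6a7b"       # one port per pod on this node
        Interface "a1b2c3d4e5f6a7b"
ovs_version: "3.1.x"
```

If the pod you are debugging has no corresponding port here, OVN never wired it up on this node.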


Step 6A: Check OVN logs

oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod> -c ovnkube-controller

(The pod has several containers; the container name varies by OpenShift version, so list them with oc get pod ... -o jsonpath if this one is not present.)

Common errors:

  • Flow install failures
  • DB sync issues

Branch B: Pod-to-pod WORKS, Service FAILS

This is VERY common and often misunderstood.


Step 3B: Check service

oc get svc backend-service -o wide

Check:

  • ClusterIP exists
  • Correct port

Step 4B: Check endpoints

oc get endpoints backend-service

If EMPTY:

→ Service is not linked to pods

Root cause:

  • Wrong selector labels

Fix: make the Service selector match the pod's labels:

selector:
  app: backend
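As a sketch, assuming the backend pods carry the label app: backend, a matching Service would look like this (ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: backend-service
spec:
  selector:
    app: backend         # must match the pods' labels exactly
  ports:
    - port: 80           # the ClusterIP port clients use
      targetPort: 8080   # the port the backend container listens on
```

Any mismatch between spec.selector and the pod labels (even a typo) yields zero endpoints, and the service silently routes nowhere.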

Step 5B: Test service IP directly

curl http://<ClusterIP>:<port>

Fails but pod IP works:

→ OVN load-balancing issue


Step 6B: Check OVN load balancer

On node:

ovn-nbctl lb-list

You should see:

  • Service IP mapped to pod IPs

If missing → OVN not programming service


Bonus: DNS check (often confused with OVN)

From frontend:

nslookup backend-service

(If frontend is in a different namespace, use the FQDN: backend-service.<namespace>.svc.cluster.local — the short name only resolves within the same namespace.)

If fails:

→ DNS issue, NOT OVN

Check:

oc get pods -n openshift-dns

Real root cause examples (from production)

Case 1: Wrong labels

  • Service selector doesn’t match pod
    → No endpoints → service fails

Case 2: NetworkPolicy blocking traffic

  • Default deny policy applied
    → Pods isolated

Case 3: OVN desync

  • Pod exists but not in OVN DB
    → No routing

Case 4: Node issue

  • Only pods on one node fail
    → ovnkube-node broken on that node

Case 5: MTU mismatch

  • Small packets work, large fail
    → Very tricky to spot
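One way to confirm an MTU problem is a non-fragmenting ping sized to the suspected MTU: if that fails while a default-size ping succeeds, you have found it. A sketch of the arithmetic (1400 is an assumed OVN-Kubernetes overlay MTU — on many clusters it is the host MTU minus encapsulation overhead; check yours in the cluster network config):

```shell
# ICMP payload = MTU - 20 (IP header) - 8 (ICMP header)
MTU=1400                   # assumption: your overlay MTU may differ
PAYLOAD=$((MTU - 28))
# -M do forbids fragmentation, so an oversized packet fails loudly
echo "ping -M do -s $PAYLOAD -c 3 <backend-pod-ip>"
```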

The mental model (this is what experts use)

When debugging:

  1. Pod IP → works?
    • ❌ → OVN / policy / routing
    • ✅ → go to service layer
  2. Service endpoints exist?
    • ❌ → labels problem
    • ✅ → OVN load balancing
  3. DNS works?
    • ❌ → DNS, not OVN
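The checklist above can be sketched as a tiny shell helper (illustrative only — you feed it the results of your manual tests as "ok" or "fail" and it tells you where to look next):

```shell
#!/bin/sh
# Encode the triage tree: arguments are the outcomes of the manual
# tests, in order: pod-IP reachable, endpoints present, DNS resolves.
triage() {
  pod_ip=$1 endpoints=$2 dns=$3
  if [ "$pod_ip" = "fail" ]; then
    echo "OVN / NetworkPolicy / routing"
  elif [ "$endpoints" = "fail" ]; then
    echo "service selector labels"
  elif [ "$dns" = "fail" ]; then
    echo "DNS (openshift-dns), not OVN"
  else
    echo "OVN load balancing"
  fi
}

triage fail ok ok     # → OVN / NetworkPolicy / routing
triage ok fail ok     # → service selector labels
triage ok ok fail     # → DNS (openshift-dns), not OVN
triage ok ok ok       # → OVN load balancing
```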

Pro move (what senior engineers do)

Spin up a throwaway debug pod:

oc run debug --rm -it --image=busybox --restart=Never -- sh

Then test from inside it:

  • ping <pod-ip>
  • wget -O- http://<service> (busybox ships wget, not curl)
  • nslookup <service>

This removes app complexity completely.