Debugging ClusterIP Issues in OVN Kubernetes

Scenario

Service works via pod IP, but fails via ClusterIP (service name/IP)

Environment:

  • frontend pod calling backend pod
  • Direct call works: curl http://10.128.2.15:8080 ✅
  • Service call fails: curl http://backend-service ❌

What this means (important insight)

If pod IP works but service fails, then:

  • Pod networking (OVN routing) is working
  • The problem is in the service load-balancing layer inside OVN-Kubernetes


Mental model (diagram)

frontend pod ─────────────────────────────▶ backend pod (10.128.2.15:8080)  ✅
frontend pod ──▶ ClusterIP (OVN load balancer) ──▶ backend pod              ❌

Interpretation:

  • Pod → Pod = direct routing (works)
  • Pod → Service = goes through OVN load balancer (broken here)

Step-by-step debugging

Step 1: Confirm endpoints exist

oc get endpoints backend-service

If EMPTY:

Root cause = wrong labels

Example:

# Service selector
selector:
  app: backend

But the pod has:

labels:
  app: api   # ❌ mismatch

Fix labels → service starts working instantly
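
A minimal corrected pair, assuming the pod's `app: backend` label is the source of truth (names and image below follow the example above; the image is a placeholder):

```yaml
# The Service selector and the pod labels must match exactly,
# or the endpoints list stays empty.
apiVersion: v1
kind: Service
metadata:
  name: backend-service
spec:
  selector:
    app: backend          # must equal the pod's label below
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: backend
  labels:
    app: backend          # was "api" — the mismatch that emptied the endpoints
spec:
  containers:
    - name: backend
      image: example/backend    # placeholder image
      ports:
        - containerPort: 8080
```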


Step 2: Verify service definition

oc get svc backend-service -o yaml

Check:

  • correct port
  • correct targetPort

Common mistake:

port: 80          # what clients connect to via the ClusterIP
targetPort: 8080  # must match the container's listening port
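
The wiring, as a sketch using the values from the example above:

```yaml
# client → Service port 80 → targetPort 8080 → containerPort 8080
spec:
  ports:
    - port: 80            # the port the ClusterIP exposes
      targetPort: 8080    # must line up with containerPort in the pod
```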

Step 3: Test ClusterIP directly

curl <ClusterIP>:<port>

Results:

  • ❌ fails → OVN load balancer issue
  • ✅ works → DNS issue instead
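
When the ClusterIP test fails, curl's exit code helps split "nothing routed there" from "routed but refused". A small sketch (`probe` is a hypothetical helper, not part of oc/kubectl; exit 7 = connection refused, 28 = timed out):

```shell
# Interpret the ClusterIP test by curl exit code.
probe() {
  curl -sS --max-time 3 -o /dev/null "$1" 2>/dev/null
  case $? in
    0)  echo "reachable" ;;
    7)  echo "refused (check targetPort / backend listening)" ;;
    28) echo "timeout (suspect OVN load balancer or NetworkPolicy)" ;;
    *)  echo "other failure" ;;
  esac
}

# Usage from a pod or debug shell:
#   probe "http://<ClusterIP>:<port>"
```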

Step 4: Check DNS (don’t skip this)

From pod:

nslookup backend-service

If it fails:

→ The problem is not OVN
→ Check the cluster DNS pods:

oc get pods -n openshift-dns
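
Before blaming DNS entirely, try the fully qualified service name — a failing short name with a working FQDN points at the pod's search domains, not the DNS service. A sketch (`svc_fqdn` is a hypothetical helper; `cluster.local` is the default cluster domain, adjust if yours differs):

```shell
# Build the standard in-cluster DNS name for a service.
svc_fqdn() {
  local name="$1" ns="${2:-default}"
  printf '%s.%s.svc.cluster.local' "$name" "$ns"
}

# From the frontend pod (namespace assumed to be "default"):
#   nslookup "$(svc_fqdn backend-service)"
```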

Step 5: Inspect OVN load balancer

On a node:

oc debug node/<node>
chroot /host

Then:

ovn-nbctl lb-list

You should see something like:

VIP: 172.30.0.10:80 → 10.128.2.15:8080

If missing:

OVN didn’t program the service
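
A quick way to check for the mapping without eyeballing the whole table — a sketch (`lb_has_vip` is a hypothetical helper that scans `ovn-nbctl lb-list` output on stdin for a literal VIP:port string):

```shell
# Succeed only if the given VIP:port appears in lb-list output.
lb_has_vip() {
  grep -qF "$1"
}

# On the node (inside `oc debug node/<node>` + `chroot /host`):
#   ovn-nbctl lb-list | lb_has_vip 172.30.0.10:80 || echo "VIP not programmed"
```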


Step 6: Check OVN logs

oc logs -n openshift-ovn-kubernetes <ovnkube-master>

Look for:

  • load balancer sync errors
  • endpoint update failures
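
To avoid scrolling the full log, a small filter sketch (the keywords are assumptions about what a sync failure would mention, not an exact log format):

```shell
# Keep only log lines mentioning load balancers or endpoints
# that also look like errors.
lb_errors() {
  grep -iE 'load ?balancer|endpoint' | grep -iE 'error|fail'
}

# Usage:
#   oc logs -n openshift-ovn-kubernetes <ovnkube-master> | lb_errors
```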

Step 7: Check kube-proxy replacement

In OpenShift Container Platform, OVN-Kubernetes replaces kube-proxy.

So if service routing is broken, the logic lives in OVN, not in iptables —
there are no kube-proxy iptables rules to inspect.


Real root causes (from production)

1. Label mismatch (MOST COMMON)

  • Service selector doesn’t match pod
    → no endpoints → service dead

2. Wrong port/targetPort

  • Service pointing to wrong container port
    → connection refused

3. OVN load balancer not programmed

  • OVN DB out of sync
    → ClusterIP has no backend mapping

4. NetworkPolicy blocking service traffic

  • Pod allows direct IP but blocks service path
    (less common but happens)

5. DNS issue (misdiagnosed often)

  • Service name fails, ClusterIP works

Fast debugging logic (this is gold)

When pod IP works but service fails:

  1. Endpoints exist?
    • ❌ → labels problem
  2. ClusterIP works?
    • ❌ → OVN load balancing
  3. DNS works?
    • ❌ → DNS issue
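
The decision tree above as a function — a sketch, where `ok`/`fail` are the outcomes of the three checks, gathered manually or by a wrapper script:

```shell
# Map the three check results to the layer to investigate.
triage() {
  local endpoints="$1" clusterip="$2" dns="$3"
  if   [ "$endpoints" = fail ]; then echo "labels problem (no endpoints)"
  elif [ "$clusterip" = fail ]; then echo "OVN load-balancing problem"
  elif [ "$dns"       = fail ]; then echo "DNS problem"
  else echo "service path looks healthy"
  fi
}

# triage fail ok ok   → labels problem (no endpoints)
```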

Pro tip (what experts do fast)

From a debug pod:

oc run debug --image=busybox -it --rm -- sh

Run:

nslookup backend-service
curl <ClusterIP>
curl <pod-IP>

This instantly isolates:

  • DNS
  • service
  • networking

Key takeaway

  • Pod IP = routing layer (OVN switching)
  • Service IP = OVN load balancer layer
  • If one works and the other doesn’t → you know exactly where to look