Debugging ClusterIP Issues in OVN Kubernetes

Scenario

Service works via pod IP, but fails via ClusterIP (service name/IP)

Environment:

  • frontend pod calling backend pod
  • Direct call works: curl http://10.128.2.15:8080 ✅
  • Service call fails: curl http://backend-service ❌

What this means (important insight)

If pod IP works but service fails, then:

  • Pod networking (OVN routing) is working
  • The problem is in the service load-balancing layer inside OVN-Kubernetes


Mental model (diagram)

frontend pod ─────────────────────────────▶ backend pod (10.128.2.15:8080)  ✅
frontend pod ──▶ ClusterIP (OVN load balancer) ──▶ backend pod              ❌

Interpretation:

  • Pod → Pod = direct routing (works)
  • Pod → Service = goes through OVN load balancer (broken here)

Step-by-step debugging

Step 1: Confirm endpoints exist

oc get endpoints backend-service

If EMPTY:

Root cause = wrong labels

Example:

# Service selector
selector:
  app: backend

But the pod has:

labels:
  app: api   # ❌ mismatch

Fix labels → service starts working instantly
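
A minimal corrected pair, assuming the pod's `app: backend` label is the source of truth (names and image below follow the example above; the image is a placeholder):

```yaml
# The Service selector and the pod labels must match exactly,
# or the endpoints list stays empty.
apiVersion: v1
kind: Service
metadata:
  name: backend-service
spec:
  selector:
    app: backend          # must equal the pod's label below
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: backend
  labels:
    app: backend          # was "api" — the mismatch that emptied the endpoints
spec:
  containers:
    - name: backend
      image: example/backend    # placeholder image
      ports:
        - containerPort: 8080
```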


Step 2: Verify service definition

oc get svc backend-service -o yaml

Check:

  • correct port
  • correct targetPort

Common mistake:

port: 80          # what clients connect to via the ClusterIP
targetPort: 8080  # must match the container's listening port
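
The wiring, as a sketch using the values from the example above:

```yaml
# client → Service port 80 → targetPort 8080 → containerPort 8080
spec:
  ports:
    - port: 80            # the port the ClusterIP exposes
      targetPort: 8080    # must line up with containerPort in the pod
```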

Step 3: Test ClusterIP directly

curl <ClusterIP>:<port>

Results:

  • ❌ fails → OVN load balancer issue
  • ✅ works → DNS issue instead
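
When the ClusterIP test fails, curl's exit code helps split "nothing routed there" from "routed but refused". A small sketch (`probe` is a hypothetical helper, not part of oc/kubectl; exit 7 = connection refused, 28 = timed out):

```shell
# Interpret the ClusterIP test by curl exit code.
probe() {
  curl -sS --max-time 3 -o /dev/null "$1" 2>/dev/null
  case $? in
    0)  echo "reachable" ;;
    7)  echo "refused (check targetPort / backend listening)" ;;
    28) echo "timeout (suspect OVN load balancer or NetworkPolicy)" ;;
    *)  echo "other failure" ;;
  esac
}

# Usage from a pod or debug shell:
#   probe "http://<ClusterIP>:<port>"
```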

Step 4: Check DNS (don’t skip this)

From pod:

nslookup backend-service

If it fails:

→ The problem is not OVN
→ Check the cluster DNS pods:

oc get pods -n openshift-dns
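
Before blaming DNS entirely, try the fully qualified service name — a failing short name with a working FQDN points at the pod's search domains, not the DNS service. A sketch (`svc_fqdn` is a hypothetical helper; `cluster.local` is the default cluster domain, adjust if yours differs):

```shell
# Build the standard in-cluster DNS name for a service.
svc_fqdn() {
  local name="$1" ns="${2:-default}"
  printf '%s.%s.svc.cluster.local' "$name" "$ns"
}

# From the frontend pod (namespace assumed to be "default"):
#   nslookup "$(svc_fqdn backend-service)"
```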

Step 5: Inspect OVN load balancer

On a node:

oc debug node/<node>
chroot /host

Then:

ovn-nbctl lb-list

You should see something like:

VIP: 172.30.0.10:80 → 10.128.2.15:8080

If missing:

OVN didn’t program the service
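
A quick way to check for the mapping without eyeballing the whole table — a sketch (`lb_has_vip` is a hypothetical helper that scans `ovn-nbctl lb-list` output on stdin for a literal VIP:port string):

```shell
# Succeed only if the given VIP:port appears in lb-list output.
lb_has_vip() {
  grep -qF "$1"
}

# On the node (inside `oc debug node/<node>` + `chroot /host`):
#   ovn-nbctl lb-list | lb_has_vip 172.30.0.10:80 || echo "VIP not programmed"
```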


Step 6: Check OVN logs

oc logs -n openshift-ovn-kubernetes <ovnkube-master>

Look for:

  • load balancer sync errors
  • endpoint update failures
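
To avoid scrolling the full log, a small filter sketch (the keywords are assumptions about what a sync failure would mention, not an exact log format):

```shell
# Keep only log lines mentioning load balancers or endpoints
# that also look like errors.
lb_errors() {
  grep -iE 'load ?balancer|endpoint' | grep -iE 'error|fail'
}

# Usage:
#   oc logs -n openshift-ovn-kubernetes <ovnkube-master> | lb_errors
```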

Step 7: Check kube-proxy replacement

In OpenShift Container Platform, OVN-Kubernetes replaces kube-proxy.

So if service routing is broken, the logic lives in OVN, not in iptables —
there are no kube-proxy iptables rules to inspect.


Real root causes (from production)

1. Label mismatch (MOST COMMON)

  • Service selector doesn’t match pod
    → no endpoints → service dead

2. Wrong port/targetPort

  • Service pointing to wrong container port
    → connection refused

3. OVN load balancer not programmed

  • OVN DB out of sync
    → ClusterIP has no backend mapping

4. NetworkPolicy blocking service traffic

  • Pod allows direct IP but blocks service path
    (less common but happens)

5. DNS issue (misdiagnosed often)

  • Service name fails, ClusterIP works

Fast debugging logic (this is gold)

When pod IP works but service fails:

  1. Endpoints exist?
    • ❌ → labels problem
  2. ClusterIP works?
    • ❌ → OVN load balancing
  3. DNS works?
    • ❌ → DNS issue
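
The decision tree above as a function — a sketch, where `ok`/`fail` are the outcomes of the three checks, gathered manually or by a wrapper script:

```shell
# Map the three check results to the layer to investigate.
triage() {
  local endpoints="$1" clusterip="$2" dns="$3"
  if   [ "$endpoints" = fail ]; then echo "labels problem (no endpoints)"
  elif [ "$clusterip" = fail ]; then echo "OVN load-balancing problem"
  elif [ "$dns"       = fail ]; then echo "DNS problem"
  else echo "service path looks healthy"
  fi
}

# triage fail ok ok   → labels problem (no endpoints)
```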

Pro tip (what experts do fast)

From a debug pod:

oc run debug --image=busybox -it --rm -- sh

Run:

nslookup backend-service
curl <ClusterIP>
curl <pod-IP>

This instantly isolates:

  • DNS
  • service
  • networking

Key takeaway

  • Pod IP = routing layer (OVN switching)
  • Service IP = OVN load balancer layer
  • If one works and the other doesn’t → you know exactly where to look