Debugging DNS Issues in OpenShift Pods

DNS works for some pods but not others. This symptom is tricky because it often looks like an OVN problem, but much of the time the real cause is the DNS path, a wrong-namespace lookup, or the pod's DNS configuration.

In OpenShift, the DNS Operator manages CoreDNS for pod and service name resolution, and CoreDNS runs as the dns-default daemon set in the openshift-dns namespace. Pods rely on kubelet-provided DNS settings in /etc/resolv.conf to reach those DNS servers. (Red Hat Documentation)

Scenario

Some pods can resolve service names, but others cannot.

Examples:

  • Pod A: nslookup backend-service succeeds
  • Pod B: nslookup backend-service fails (NXDOMAIN or timeout)

That usually means one of these:

  • the failing pod has bad DNS settings,
  • the query is being made from the wrong namespace,
  • only some nodes can reach the DNS pods,
  • or the DNS pods themselves are unhealthy on part of the cluster. (Red Hat Documentation)

Diagram

                +------------------------------+
                |        failing pod           |
                |  /etc/resolv.conf            |
                |  nameserver -> DNS service   |
                +--------------+---------------+
                               |
                               v
                    +---------------------+
                    |   CoreDNS /         |
                    |   dns-default pods  |
                    |   in openshift-dns  |
                    +----------+----------+
                               |
                 resolves svc/pod names from cluster state
                               |
                               v
                    +---------------------+
                    |  Service / Pod DNS  |
                    |  records            |
                    +---------------------+

Where it breaks:
1) Pod resolv.conf is wrong
2) Pod queries wrong namespace
3) Pod/node cannot reach dns-default
4) dns-default pods unhealthy
5) Name exists, but target service/endpoints are wrong


How to debug it

1. Prove it is DNS and not general networking

From a good pod and a bad pod, test both DNS and direct IP access:

oc exec -it <good-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- curl http://<service-cluster-ip>:<port>
oc exec -it <bad-pod> -- curl http://<pod-ip>:<port>

If IP-based access works but nslookup fails, that points strongly to DNS rather than OVN datapath routing. Kubernetes service and pod discovery are meant to work through DNS records. (Kubernetes)

2. Check the failing pod’s /etc/resolv.conf

This is one of the fastest checks:

oc exec -it <bad-pod> -- cat /etc/resolv.conf

A normal pod DNS config should include a cluster DNS nameserver and search domains such as the pod's namespace, svc.cluster.local, and cluster.local; Kubernetes also documents options ndots:5 as typical. If those entries are missing or unusual, the pod's DNS setup is wrong. (Kubernetes)
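For reference, a healthy pod's /etc/resolv.conf usually looks like the sketch below. The values are assumptions for illustration: frontend stands in for the pod's own namespace, and 172.30.0.10 is a commonly seen default for the cluster DNS service IP, but your cluster may differ.

```
search frontend.svc.cluster.local svc.cluster.local cluster.local
nameserver 172.30.0.10
options ndots:5
```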

3. Make sure the pod is querying the right namespace

A very common false alarm:

oc exec -it <bad-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- nslookup backend-service.<namespace>

Kubernetes says unqualified service names are resolved relative to the pod’s own namespace. So backend-service from namespace frontend will not find a service that lives in namespace backend unless you query backend-service.backend. (Kubernetes)
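The search-domain mechanics above can be sketched in a few lines of shell. This only simulates the expansion order the resolver follows; the namespace frontend is a hypothetical example. With ndots:5, any name containing fewer than five dots is tried against each search domain before being tried as an absolute name.

```shell
# Simulate how the resolver expands the short name "backend-service"
# for a pod in the (hypothetical) namespace "frontend".
name="backend-service"
for domain in frontend.svc.cluster.local svc.cluster.local cluster.local; do
  echo "try: $name.$domain"
done
# Only after the search list is exhausted is the bare name tried:
echo "try: $name."
```

The first candidate tried is backend-service.frontend.svc.cluster.local, which is why the short name resolves from the service's own namespace but not from a different one.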

4. Check whether the DNS pods are healthy

In OpenShift, look at the DNS operator and DNS pods:

oc get clusteroperator dns
oc get pods -n openshift-dns
oc get pods -n openshift-dns-operator

Red Hat documents that the DNS Operator manages CoreDNS, and that CoreDNS runs as the dns-default daemon set. If those pods are crashlooping, pending, or missing on expected nodes, pods may lose name resolution. (Red Hat Documentation)

5. Check whether only some nodes are affected

If only pods on one worker fail DNS, compare node placement:

oc get pods -A -o wide | grep <failing-node>
oc get pods -n openshift-dns -o wide

Red Hat notes that DNS remains available to all pods as long as DNS pods are running on some nodes and the nodes without DNS pods still have network connectivity to nodes that do run them. So "only pods on node X fail DNS" often means node-to-DNS connectivity is broken rather than CoreDNS being globally broken. (Red Hat Documentation)
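A quick way to apply that rule: collect the node names of the failing pods and check whether they collapse to a single node. The node names below are made up for illustration; in practice you would paste in the NODE column from the oc get pods -o wide output above.

```shell
# Hypothetical node names; replace with the NODE column of the failing pods.
failing_nodes="worker-2
worker-2
worker-2"

# Count distinct nodes among the failures.
distinct=$(printf '%s\n' "$failing_nodes" | sort -u | wc -l)
if [ "$distinct" -eq 1 ]; then
  echo "all failing pods share one node: suspect that node's path to DNS"
else
  echo "failures span $distinct nodes: suspect DNS itself"
fi
```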

6. Test from a clean debug pod

This removes app-side noise:

oc run dns-debug --image=registry.k8s.io/e2e-test-images/agnhost:2.39 -it --rm --command -- sh
nslookup kubernetes.default
nslookup backend-service.<namespace>
cat /etc/resolv.conf

Kubernetes recommends creating a simple test pod and using nslookup kubernetes.default as a baseline DNS test. (Kubernetes)

7. Check DNS service reachability from the bad pod

If you know the DNS service IP from /etc/resolv.conf, test whether the pod can even reach it. If the DNS nameserver is unreachable from only some pods or nodes, the issue is likely network path to DNS, not DNS records themselves. This is an inference from the Kubernetes debug flow and OpenShift’s note about node connectivity to DNS pods. (Kubernetes)
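A sketch of that check, run from inside the bad pod. The sample resolv.conf below is a hypothetical stand-in so the extraction step is visible; in the real pod you would read /etc/resolv.conf directly, and the commented dig line assumes a debug image that ships dig (a dnsutils-style image).

```shell
# Use a sample file so the parsing step is visible; in the bad pod,
# point this at /etc/resolv.conf instead.
cat > /tmp/sample-resolv.conf <<'EOF'
search frontend.svc.cluster.local svc.cluster.local cluster.local
nameserver 172.30.0.10
options ndots:5
EOF

# Extract the first nameserver the pod would actually use.
DNS_IP=$(awk '/^nameserver/ {print $2; exit}' /tmp/sample-resolv.conf)
echo "cluster DNS IP: $DNS_IP"

# From the bad pod, query that server directly (needs dig in the image):
#   dig +time=2 @"$DNS_IP" kubernetes.default.svc.cluster.local
```

If the direct query times out from pods on one node but works from another, the problem is the network path to the DNS pods, not the DNS records.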

8. Check logs from the DNS pods

If the DNS pods are up but resolution still fails:

oc logs -n openshift-dns <dns-default-pod>

If you are testing a workaround, Red Hat documents that the DNS Operator can be set to Unmanaged, but they also note you cannot upgrade while it remains unmanaged. (Red Hat Documentation)

What this usually turns out to be

Most common causes:

  • Wrong namespace lookup: querying service instead of service.namespace. (Kubernetes)
  • Bad pod DNS config: strange or missing nameserver/search domains in /etc/resolv.conf. (Kubernetes)
  • DNS pods unhealthy: dns-default issues in openshift-dns. (Red Hat Documentation)
  • Node-specific connectivity issue: pods on one node cannot reach DNS pods running elsewhere. (Red Hat Documentation)
  • Service confusion: DNS resolves, but the target service or endpoints are wrong, making it look like DNS. Kubernetes DNS only gives you the name-to-record mapping; the service still has to be valid. (Kubernetes)

Fast triage sequence

oc exec -it <bad-pod> -- cat /etc/resolv.conf
oc exec -it <bad-pod> -- nslookup kubernetes.default
oc exec -it <bad-pod> -- nslookup <service>.<namespace>
oc get clusteroperator dns
oc get pods -n openshift-dns -o wide
oc logs -n openshift-dns <dns-default-pod>

Mental model

When DNS fails only for some pods:

  • if all traffic is broken, think OVN/node networking
  • if IP access works but names fail, think DNS
  • if short names fail but FQDN works, think namespace/search path
  • if only one node’s pods fail, think node-to-dns connectivity
