Kubernetes DNS Resolution Failure
Causes and Fixes
DNS resolution failures in Kubernetes occur when pods cannot resolve service names, external hostnames, or cluster DNS entries. This typically manifests as connection errors mentioning 'name resolution failed' or 'no such host' and is most commonly caused by issues with CoreDNS or the pod's DNS configuration.
Symptoms
- Application logs show 'could not resolve host' or 'name resolution failed'
- nslookup or dig commands inside pods return SERVFAIL or timeout
- Pods can reach IPs directly but not hostnames
- Service-to-service communication fails while pod-to-pod by IP works
- CoreDNS pods are crashing, restarting, or in error state
Step-by-Step Troubleshooting
DNS resolution is fundamental to Kubernetes networking — services discover each other by name, and applications rely on DNS for both internal and external communication. When DNS breaks, nearly everything breaks. This guide systematically diagnoses DNS failures from the pod level up through CoreDNS and upstream resolvers.
1. Confirm the DNS Failure
First, verify that DNS is actually failing from inside a pod. Launch a debug pod with DNS tools.
kubectl run dns-debug --image=busybox:1.36 --restart=Never -- sleep 3600
kubectl exec dns-debug -- nslookup kubernetes.default
If this times out or returns SERVFAIL, DNS is broken. Try a few different queries to narrow down the scope.
# Test cluster service resolution
kubectl exec dns-debug -- nslookup kubernetes.default.svc.cluster.local
# Test external resolution
kubectl exec dns-debug -- nslookup google.com
# Test with a specific DNS server (the kube-dns ClusterIP)
kubectl exec dns-debug -- nslookup kubernetes.default 10.96.0.10
If cluster names fail but external names work (or vice versa), the issue is more specific. If everything fails, the problem is likely with CoreDNS availability or network connectivity to it.
2. Check CoreDNS Pod Health
CoreDNS handles all cluster DNS. Verify its pods are running.
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS deployment
kubectl get deployment -n kube-system coredns
# Check logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
Common log errors include:
- HINFO: read udp ... i/o timeout — upstream DNS is unreachable
- plugin/loop: Loop ... detected — DNS loop causing CoreDNS to crash
- connection refused — upstream resolver is down
If CoreDNS pods are in CrashLoopBackOff, check for the loop detection issue.
kubectl logs -n kube-system <coredns-pod> --previous | grep -i loop
3. Check the kube-dns Service
The kube-dns Service provides the ClusterIP that pods use as their DNS server.
kubectl get service kube-dns -n kube-system
# Verify endpoints exist
kubectl get endpoints kube-dns -n kube-system
The service should have endpoints pointing to the CoreDNS pod IPs. If the endpoints list is empty, CoreDNS pods are not ready, and DNS queries have nowhere to go.
4. Check the Pod's DNS Configuration
Verify what DNS configuration the affected pod is using.
kubectl exec <pod-name> -- cat /etc/resolv.conf
A typical resolv.conf looks like:
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
Verify that:
- The nameserver IP matches the kube-dns Service ClusterIP
- The search domains are correct for the pod's namespace
- The ndots value is appropriate (default is 5)
If the resolv.conf is wrong, check the kubelet configuration.
# On the node, check kubelet's DNS settings
ps aux | grep kubelet | grep -oP 'cluster-dns[= ]\S+'
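On clusters where the kubelet reads a configuration file (the usual setup on kubeadm-based clusters, typically at /var/lib/kubelet/config.yaml), the DNS settings live there rather than on the command line. A sketch of the relevant fields — the file path varies by distribution:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
  - 10.96.0.10          # must match the kube-dns Service ClusterIP
clusterDomain: cluster.local

If clusterDNS here does not match the kube-dns Service ClusterIP, every pod the kubelet creates will get a broken resolv.conf.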
5. Check CoreDNS Configuration (Corefile)
The CoreDNS configuration may have errors.
kubectl get configmap coredns -n kube-system -o yaml
Verify the Corefile has the correct structure. Key things to check:
- The kubernetes plugin is configured with the correct cluster domain (usually cluster.local)
- The forward directive points to valid upstream DNS servers
- There are no syntax errors
A common working Corefile structure:

.:53 {
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
If CoreDNS is configured to forward to /etc/resolv.conf on the node, and the node's resolv.conf points back to the cluster DNS (creating a loop), CoreDNS will detect the loop and crash. Fix this by pointing forward to explicit upstream DNS servers.
# Edit the ConfigMap to fix the forward target
kubectl edit configmap coredns -n kube-system
# Change: forward . /etc/resolv.conf
# To: forward . 8.8.8.8 8.8.4.4
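If you manage CoreDNS declaratively (e.g. via GitOps) rather than with kubectl edit, the same fix lands in the ConfigMap manifest. A sketch of the relevant portion — your cluster's Corefile may carry additional plugins such as errors or health:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . 8.8.8.8 8.8.4.4   # explicit upstreams instead of /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }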
6. Test Network Connectivity to CoreDNS
Even if CoreDNS is running, network issues can prevent pods from reaching it.
# Get the CoreDNS pod IPs
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
# Test direct connectivity from a debug pod
kubectl exec dns-debug -- nc -uzv <coredns-pod-ip> 53
# Test connectivity to the kube-dns service IP
kubectl exec dns-debug -- nc -uzv 10.96.0.10 53
If you cannot reach CoreDNS, check for NetworkPolicies that might be blocking DNS traffic.
# List NetworkPolicies in the pod's namespace
kubectl get networkpolicy -n <namespace>
# Check if any policy blocks egress to kube-system or port 53
kubectl describe networkpolicy -n <namespace>
If a NetworkPolicy blocks DNS, add an egress rule allowing UDP and TCP port 53 to the kube-dns service.
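A minimal egress rule for this looks roughly like the following. It is a sketch: the policy name and namespace are hypothetical, and it assumes the kube-system namespace carries the kubernetes.io/metadata.name label (set automatically on Kubernetes 1.21+):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: my-app          # hypothetical; use the affected pod's namespace
spec:
  podSelector: {}            # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

Allow both UDP and TCP: most queries use UDP, but responses larger than 512 bytes fall back to TCP.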
7. Check for DNS Rate Limiting or Overload
In high-traffic clusters, CoreDNS can become a bottleneck.
# Check CoreDNS resource usage
kubectl top pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS metrics if available
kubectl exec -n kube-system <coredns-pod> -- sh -c 'wget -qO- http://localhost:9153/metrics' | grep coredns_dns_requests_total
If CoreDNS is CPU-throttled or memory-constrained, scale it up.
# Scale CoreDNS
kubectl scale deployment coredns -n kube-system --replicas=4
# Or adjust resource limits
kubectl set resources deployment coredns -n kube-system --requests=cpu=200m,memory=256Mi --limits=cpu=500m,memory=512Mi
For large clusters, consider deploying NodeLocal DNSCache to offload queries from CoreDNS and reduce cross-node DNS traffic.
8. Fix the ndots Issue
The default ndots:5 setting means any name with fewer than 5 dots will have each search domain appended before trying the absolute name. This causes excessive DNS queries for external names.
# A lookup for "api.example.com" generates these queries:
# api.example.com.default.svc.cluster.local
# api.example.com.svc.cluster.local
# api.example.com.cluster.local
# api.example.com (finally the actual query)
If this is causing performance issues, set ndots to a lower value in your pod spec.
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
9. Restart CoreDNS
If configuration changes were made, restart CoreDNS to pick them up.
kubectl rollout restart deployment coredns -n kube-system
kubectl rollout status deployment coredns -n kube-system
10. Verify DNS Is Working
After fixing the issue, confirm DNS resolution works end-to-end.
# Test cluster DNS
kubectl exec dns-debug -- nslookup kubernetes.default.svc.cluster.local
# Test cross-namespace service resolution
kubectl exec dns-debug -- nslookup <service-name>.<namespace>.svc.cluster.local
# Test external DNS
kubectl exec dns-debug -- nslookup google.com
# Clean up debug pod
kubectl delete pod dns-debug
All queries should return valid responses with IP addresses. If cluster DNS works but external DNS does not, the issue is with the upstream forwarder configuration. If both work, DNS resolution is fully restored.
How to Explain This in an Interview
I would explain how Kubernetes DNS works end-to-end: the kubelet configures each pod's /etc/resolv.conf to point at the kube-dns ClusterIP, CoreDNS watches the API server for Service and Endpoint changes, and queries are resolved using search domains that allow short names like 'myservice' to resolve within the same namespace. I'd discuss the ndots setting (default 5) and how it affects query behavior, the role of the kubernetes plugin in CoreDNS's Corefile, and how to debug DNS at each layer — from the pod's resolv.conf to CoreDNS logs to upstream resolvers. I'd also mention NodeLocal DNSCache as a solution for DNS scalability.
Prevention
- Deploy CoreDNS with at least 2 replicas and a PodDisruptionBudget
- Monitor CoreDNS metrics (cache hits, errors, latency) with Prometheus
- Use NodeLocal DNSCache to reduce CoreDNS load and improve latency
- Configure appropriate ndots value to reduce unnecessary DNS queries
- Ensure NetworkPolicies always allow egress to kube-dns
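For the first prevention item, a PodDisruptionBudget for CoreDNS might look like this — a sketch, and note that kubeadm clusters already ship two CoreDNS replicas and some managed distributions maintain a PDB for you:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coredns-pdb
  namespace: kube-system
spec:
  minAvailable: 1            # keep at least one CoreDNS pod up during node drains
  selector:
    matchLabels:
      k8s-app: kube-dns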