How Do You Troubleshoot Network Issues in Kubernetes?
Kubernetes network troubleshooting follows a systematic approach: verify DNS resolution, test Pod-to-Pod connectivity, check Service endpoints, inspect NetworkPolicies, and examine CNI plugin and kube-proxy health. Tools like kubectl exec, nslookup, curl, and tcpdump are essential.
Detailed Answer
A Systematic Troubleshooting Framework
Network issues in Kubernetes can originate from many layers. A structured approach prevents wasted time:
- DNS resolution - Can the Pod resolve names?
- Pod-to-Pod - Can Pods reach each other by IP?
- Pod-to-Service - Does the Service route to healthy endpoints?
- Ingress/Egress - Can traffic enter and leave the cluster?
- NetworkPolicy - Are policies blocking expected traffic?
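The layers above can be sketched as a small triage helper. This is an illustrative sketch, not a standard tool: it assumes a debug Pod named `netdebug` (created as in Step 1 below) and wraps each layer check in a function.

```shell
# Illustrative triage helpers; assumes a Pod named "netdebug" exists
# in the current namespace (e.g. created from nicolaka/netshoot).

check_dns() {
  # Layer 1: can Pods resolve cluster names?
  kubectl exec netdebug -- nslookup kubernetes.default.svc.cluster.local
}

check_pod_to_pod() {
  # Layer 2: can Pods reach each other by IP? ($1 = target Pod IP)
  kubectl exec netdebug -- ping -c 3 "$1"
}

check_service() {
  # Layer 3: does the Service have endpoints? ($1 = Service name)
  kubectl get endpoints "$1" -o jsonpath='{.subsets[*].addresses[*].ip}' | grep -q . \
    || { echo "no endpoints for $1"; return 1; }
}
```

Running the checks in this order narrows the failing layer before reaching for packet captures.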
Step 1: Verify DNS Resolution
DNS failures are among the most common networking issues. Start here.
# Deploy a debug Pod with networking tools
kubectl run netdebug --image=nicolaka/netshoot --rm -it --restart=Never -- bash
# Inside the debug Pod:
# Test cluster DNS
nslookup kubernetes.default.svc.cluster.local
# Test a specific Service
nslookup my-service.my-namespace.svc.cluster.local
# Test external DNS
nslookup google.com
# Check the resolv.conf
cat /etc/resolv.conf
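Inside a Pod, a healthy resolv.conf points at the kube-dns ClusterIP and carries the cluster search domains. The values below are common defaults and will vary per cluster:

```
search my-namespace.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5
```

If the nameserver does not match the ClusterIP of the `kube-dns` Service, the Pod's DNS is misconfigured at the kubelet level. Note that `ndots:5` causes short names to be tried against each search domain first, which can amplify CoreDNS load.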
If DNS fails, check CoreDNS:
# Are CoreDNS Pods running?
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# Verify CoreDNS Service exists
kubectl get svc kube-dns -n kube-system
# Check the Corefile for misconfigurations
kubectl get configmap coredns -n kube-system -o yaml
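For comparison, a stock Corefile looks roughly like the following (exact plugins vary by distribution); deviations here, such as a broken `forward` upstream or a missing `kubernetes` block, are a frequent root cause:

```
.:53 {
    errors
    health { lameduck 5s }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
```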
Step 2: Test Pod-to-Pod Connectivity
# Get Pod IPs
kubectl get pods -o wide
# From the debug Pod, ping another Pod by IP
ping -c 3 10.244.1.15
# Test TCP connectivity
nc -zv 10.244.1.15 8080
# Traceroute to see the path
traceroute 10.244.1.15
If Pod-to-Pod fails:
# Check CNI plugin health
kubectl get pods -n kube-system | grep -E 'calico|cilium|flannel|weave'
# Check CNI plugin logs
kubectl logs -n kube-system -l k8s-app=calico-node --tail=50
# On the node, verify the CNI configuration exists
ls /etc/cni/net.d/
cat /etc/cni/net.d/10-calico.conflist
# Check node routing tables
ip route show
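For reference, a valid conflist is a JSON file describing the plugin chain. The exact contents depend on the CNI; the sketch below shows the rough shape for Calico with fields abridged:

```json
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    { "type": "calico", "datastore_type": "kubernetes", "ipam": { "type": "calico-ipam" } },
    { "type": "portmap", "capabilities": { "portMappings": true } }
  ]
}
```

An empty `/etc/cni/net.d/` directory explains Pods stuck in ContainerCreating: the kubelet cannot set up Pod networking without it.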
Step 3: Verify Service Routing
# Check the Service exists and has a ClusterIP
kubectl get svc my-service -o wide
# Verify endpoints are populated
kubectl get endpoints my-service
kubectl get endpointslices -l kubernetes.io/service-name=my-service
# Test the Service from inside the cluster
kubectl exec netdebug -- curl -s http://my-service.default:80
# If no endpoints, check that Pod labels match the Service selector
kubectl get pods -l app=my-app --show-labels
kubectl get svc my-service -o jsonpath='{.spec.selector}'
A Service with zero endpoints typically means no Pods match the selector, or the matching Pods are not in Ready state.
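The selector-to-label relationship can be illustrated with a minimal pair of manifests (names and image are illustrative):

```yaml
# The Service selector must equal the Pod template labels
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app          # must match the Pod labels exactly
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app      # a typo here leaves the Service with zero endpoints
    spec:
      containers:
        - name: app
          image: my-app:latest
          ports:
            - containerPort: 8080
```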
Step 4: Check NetworkPolicies
# List all NetworkPolicies in the namespace
kubectl get networkpolicies -n my-namespace
# Describe a specific policy
kubectl describe networkpolicy my-policy -n my-namespace
# Check if the CNI plugin supports NetworkPolicy
# (Flannel alone does not; Calico and Cilium do)
A common mistake is applying an ingress NetworkPolicy without realizing it blocks all traffic not explicitly allowed:
# This policy blocks ALL ingress except from app=frontend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
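A related pitfall on the egress side: once any egress policy selects a Pod, all egress not explicitly allowed is dropped, including DNS. A sketch of a companion policy that re-allows DNS to CoreDNS (port 53 is standard; the `k8s-app: kube-dns` label matches the default CoreDNS deployment):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```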
Step 5: Packet Capture with tcpdump
For deep debugging, capture packets inside a Pod's network namespace:
# Option 1: Use kubectl debug (Kubernetes 1.25+)
kubectl debug -it my-pod --image=nicolaka/netshoot --target=my-container -- tcpdump -i eth0 -nn port 80
# Option 2: Use nsenter on the node
# Find the Pod's PID
CONTAINER_ID=$(crictl ps --name my-container -q)
PID=$(crictl inspect $CONTAINER_ID | jq .info.pid)
# Enter the network namespace and run tcpdump
nsenter -t $PID -n tcpdump -i eth0 -nn -c 20 port 80
Step 6: Check kube-proxy
# Verify kube-proxy is running
kubectl get pods -n kube-system -l k8s-app=kube-proxy
# Check kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50
# Verify kube-proxy mode
kubectl logs -n kube-system -l k8s-app=kube-proxy | grep "Using"
# On the node: check iptables rules (if kube-proxy is in iptables mode)
iptables -t nat -L KUBE-SERVICES -n | grep <service-cluster-ip>
# On the node: check IPVS rules (if in IPVS mode)
ipvsadm -Ln | grep <service-cluster-ip>
Common Issues and Fixes
| Symptom | Likely Cause | Fix |
|---|---|---|
| Pods stuck in ContainerCreating | CNI plugin not installed or crashed | Install/restart CNI DaemonSet |
| DNS resolution fails | CoreDNS not running or misconfigured | Check CoreDNS Pods and Corefile |
| Service has no endpoints | Pod labels do not match selector | Fix labels or selector |
| Intermittent timeouts | conntrack table full | Increase nf_conntrack_max |
| Cross-node Pods unreachable | Firewall blocking overlay traffic | Open VXLAN/BGP ports |
| External traffic blocked | Missing SNAT/masquerade rules | Check kube-proxy and CNI config |
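For the conntrack row, pressure on the table can be estimated from two kernel counters. This is an illustrative sketch; the 80% threshold and the example limit are conventions, not fixed values:

```shell
# On a node, read the two counters:
#   count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
#   max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)

conntrack_usage() {
  # $1 = current entries, $2 = table max; prints utilisation in percent
  echo $(( 100 * $1 / $2 ))
}

# Sustained usage above ~80% suggests raising the limit, e.g.:
#   sysctl -w net.netfilter.nf_conntrack_max=262144
```

Intermittent timeouts under load, paired with "nf_conntrack: table full, dropping packet" in the node's kernel log, confirm this diagnosis.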
Essential Debugging Images
Keep these images handy for troubleshooting:
# netshoot - comprehensive networking tools
kubectl run debug --image=nicolaka/netshoot --rm -it --restart=Never -- bash
# Tools available: curl, wget, ping, traceroute, dig, nslookup,
# tcpdump, iperf, netstat, ss, ip, nmap, iftop
A well-prepared SRE maintains runbooks for each of these failure scenarios and knows which layer to investigate based on the symptoms observed.
Why Interviewers Ask This
Interviewers want to see a structured troubleshooting methodology and hands-on familiarity with the tools needed to diagnose real-world networking failures.
Common Follow-Up Questions
Key Takeaways
- Always start with DNS, as it is the most common failure point
- Use ephemeral debug containers or netshoot images for troubleshooting
- Check CNI plugin health and kube-proxy logs when connectivity fails