How Do You Troubleshoot Network Issues in Kubernetes?

advanced | networking · devops · sre · CKA
TL;DR

Kubernetes network troubleshooting follows a systematic approach: verify DNS resolution, test Pod-to-Pod connectivity, check Service endpoints, inspect NetworkPolicies, and examine CNI plugin and kube-proxy health. Tools like kubectl exec, nslookup, curl, and tcpdump are essential.

Detailed Answer

A Systematic Troubleshooting Framework

Network issues in Kubernetes can originate from many layers. A structured approach prevents wasted time:

  1. DNS resolution - Can the Pod resolve names?
  2. Pod-to-Pod - Can Pods reach each other by IP?
  3. Pod-to-Service - Does the Service route to healthy endpoints?
  4. Ingress/Egress - Can traffic enter and leave the cluster?
  5. NetworkPolicy - Are policies blocking expected traffic?

Step 1: Verify DNS Resolution

DNS failures are the most common networking issue. Start here.

# Deploy a debug Pod with networking tools
kubectl run netdebug --image=nicolaka/netshoot --rm -it --restart=Never -- bash

# Inside the debug Pod:
# Test cluster DNS
nslookup kubernetes.default.svc.cluster.local

# Test a specific Service
nslookup my-service.my-namespace.svc.cluster.local

# Test external DNS
nslookup google.com

# Check the resolv.conf
cat /etc/resolv.conf

If DNS fails, check CoreDNS:

# Are CoreDNS Pods running?
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Verify CoreDNS Service exists
kubectl get svc kube-dns -n kube-system

# Check the Corefile for misconfigurations
kubectl get configmap coredns -n kube-system -o yaml
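For reference, a typical default Corefile looks like the sketch below (the exact plugin set varies by distribution). External lookups depend on the `forward` line pointing at working upstream resolvers:

```
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
```

If external names fail but cluster names resolve, the `forward` target (here, the node's /etc/resolv.conf) is the first thing to suspect.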

Step 2: Test Pod-to-Pod Connectivity

# Get Pod IPs
kubectl get pods -o wide

# From the debug Pod, ping another Pod by IP
# (some CNIs or policies drop ICMP; if ping fails, try the TCP test below)
ping -c 3 10.244.1.15

# Test TCP connectivity
nc -zv 10.244.1.15 8080

# Traceroute to see the path
traceroute 10.244.1.15

If Pod-to-Pod fails:

# Check CNI plugin health
kubectl get pods -n kube-system | grep -E 'calico|cilium|flannel|weave'

# Check CNI plugin logs
kubectl logs -n kube-system -l k8s-app=calico-node --tail=50

# On the node, verify the CNI configuration exists
ls /etc/cni/net.d/
cat /etc/cni/net.d/10-calico.conflist

# Check node routing tables
ip route show

Step 3: Verify Service Routing

# Check the Service exists and has a ClusterIP
kubectl get svc my-service -o wide

# Verify endpoints are populated
kubectl get endpoints my-service
kubectl get endpointslices -l kubernetes.io/service-name=my-service

# Test the Service from inside the cluster
kubectl exec netdebug -- curl -s http://my-service.default:80

# If no endpoints, check that Pod labels match the Service selector
kubectl get pods -l app=my-app --show-labels
kubectl get svc my-service -o jsonpath='{.spec.selector}'

A Service with zero endpoints typically means no Pods match the selector, or the matching Pods are not in Ready state.
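As a sketch of the correct relationship, the Service selector below matches the Pod template labels, so endpoints populate (names and images are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app          # must match the Pod labels exactly
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app      # a typo here (e.g. "myapp") leaves the Service with zero endpoints
    spec:
      containers:
        - name: web
          image: nginx
          ports:
            - containerPort: 8080
```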

Step 4: Check NetworkPolicies

# List all NetworkPolicies in the namespace
kubectl get networkpolicies -n my-namespace

# Describe a specific policy
kubectl describe networkpolicy my-policy -n my-namespace

# Check if the CNI plugin supports NetworkPolicy
# (Flannel alone does not; Calico and Cilium do)

A common mistake is applying an ingress NetworkPolicy without realizing that, for the Pods it selects, all ingress traffic not explicitly allowed is denied:

# This policy blocks ALL ingress to app=backend Pods except
# from app=frontend Pods in the same namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
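For comparison, the standard default-deny baseline looks like the sketch below: an empty podSelector selects every Pod in the namespace, and listing Ingress in policyTypes with no ingress rules denies all inbound traffic.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}        # selects all Pods in the namespace
  policyTypes:
    - Ingress            # no ingress rules listed, so all ingress is denied
```

If this policy exists in a namespace, every workload there needs an explicit allow rule before it can receive traffic.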

Step 5: Packet Capture with tcpdump

For deep debugging, capture packets inside a Pod's network namespace:

# Option 1: Use kubectl debug (Kubernetes 1.25+)
kubectl debug -it my-pod --image=nicolaka/netshoot --target=my-container -- tcpdump -i eth0 -nn port 80

# Option 2: Use nsenter on the node
# Find the Pod's PID
CONTAINER_ID=$(crictl ps --name my-container -q)
PID=$(crictl inspect $CONTAINER_ID | jq .info.pid)

# Enter the network namespace and run tcpdump
nsenter -t $PID -n tcpdump -i eth0 -nn -c 20 port 80

Step 6: Check kube-proxy

# Verify kube-proxy is running
kubectl get pods -n kube-system -l k8s-app=kube-proxy

# Check kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50

# Verify kube-proxy mode
kubectl logs -n kube-system -l k8s-app=kube-proxy | grep "Using"

# Check iptables rules (if in iptables mode)
iptables -t nat -L KUBE-SERVICES -n | grep <service-cluster-ip>

# Check IPVS rules (if in IPVS mode)
ipvsadm -Ln | grep <service-cluster-ip>

Common Issues and Fixes

| Symptom | Likely Cause | Fix |
|---|---|---|
| Pods stuck in ContainerCreating | CNI plugin not installed or crashed | Install/restart the CNI DaemonSet |
| DNS resolution fails | CoreDNS not running or misconfigured | Check CoreDNS Pods and the Corefile |
| Service has no endpoints | Pod labels do not match the selector | Fix the labels or selector |
| Intermittent timeouts | conntrack table full | Increase nf_conntrack_max |
| Cross-node Pods unreachable | Firewall blocking overlay traffic | Open VXLAN/BGP ports |
| External traffic blocked | Missing SNAT/masquerade rules | Check kube-proxy and CNI config |
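For the conntrack row, a minimal sketch of the node-side check (the /proc paths are the standard Linux sysctl locations; the optional file arguments exist only so the logic can be exercised off-node):

```shell
#!/bin/sh
# Report conntrack table usage; a count near the max explains
# intermittent timeouts and dropped connections.
conntrack_usage() {
    # Default to the standard sysctl files; accept overrides for testing.
    max_file="${1:-/proc/sys/net/netfilter/nf_conntrack_max}"
    count_file="${2:-/proc/sys/net/netfilter/nf_conntrack_count}"
    max=$(cat "$max_file")
    count=$(cat "$count_file")
    echo "conntrack: $count / $max entries in use"
}

# On a node, run: conntrack_usage
# To raise the limit: sysctl -w net.netfilter.nf_conntrack_max=<new-value>
```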

Essential Debugging Images

Keep these images handy for troubleshooting:

# netshoot - comprehensive networking tools
kubectl run debug --image=nicolaka/netshoot --rm -it --restart=Never -- bash

# Tools available: curl, wget, ping, traceroute, dig, nslookup,
# tcpdump, iperf, netstat, ss, ip, nmap, iftop

A well-prepared SRE maintains runbooks for each of these failure scenarios and knows which layer to investigate based on the symptoms observed.

Why Interviewers Ask This

Interviewers want to see a structured troubleshooting methodology and hands-on familiarity with the tools needed to diagnose real-world networking failures.

Common Follow-Up Questions

A Pod can reach other Pods but not external services. What do you check?
Check CoreDNS forward configuration, node-level DNS resolution, NAT/masquerade rules, and whether the node has internet access.
How do you capture network traffic inside a Pod?
Use kubectl debug or an ephemeral container with tcpdump, or use nsenter on the node to enter the Pod's network namespace.
What causes intermittent DNS failures in Kubernetes?
Common causes include conntrack table exhaustion, CoreDNS resource limits being too low, and race conditions with ndots search domain expansion.
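The ndots expansion mentioned above can be illustrated with a small sketch. The search domains mirror a typical Pod resolv.conf for the "default" namespace; the function is purely illustrative of resolver behavior, not a real resolver:

```shell
#!/bin/sh
# Illustrate how a resolver with ndots:5 (the Kubernetes default)
# expands a short name through the resolv.conf search path before
# trying it as an absolute name -- one lookup of "my-service"
# produces up to four candidate query names.
expand_query() {
    name="$1"; shift
    ndots=5
    # Count the dots in the name.
    dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
    if [ "$dots" -lt "$ndots" ]; then
        # Fewer dots than ndots: try each search domain first.
        for domain in "$@"; do
            printf '%s.%s\n' "$name" "$domain"
        done
    fi
    # Finally, the name is tried as-is (absolute).
    printf '%s.\n' "$name"
}

# Typical Pod search path for the "default" namespace:
expand_query my-service default.svc.cluster.local svc.cluster.local cluster.local
```

Multiplied across every A and AAAA lookup an application makes, this expansion is why undersized CoreDNS deployments or a near-full conntrack table surface as intermittent DNS failures.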

Key Takeaways

  • Always start with DNS, as it is the most common failure point
  • Use ephemeral debug containers or netshoot images for troubleshooting
  • Check CNI plugin health and kube-proxy logs when connectivity fails