How Do You Troubleshoot Network Issues in Kubernetes?

advanced | networking · devops · sre · CKA
TL;DR

Kubernetes network troubleshooting follows a systematic approach: verify DNS resolution, test Pod-to-Pod connectivity, check Service endpoints, inspect NetworkPolicies, and examine CNI plugin and kube-proxy health. Tools like kubectl exec, nslookup, curl, and tcpdump are essential.

Detailed Answer

A Systematic Troubleshooting Framework

Network issues in Kubernetes can originate from many layers. A structured approach prevents wasted time:

  1. DNS resolution - Can the Pod resolve names?
  2. Pod-to-Pod - Can Pods reach each other by IP?
  3. Pod-to-Service - Does the Service route to healthy endpoints?
  4. Ingress/Egress - Can traffic enter and leave the cluster?
  5. NetworkPolicy - Are policies blocking expected traffic?

Step 1: Verify DNS Resolution

DNS failures are the most common networking issue. Start here.

# Deploy a debug Pod with networking tools
kubectl run netdebug --image=nicolaka/netshoot --rm -it --restart=Never -- bash

# Inside the debug Pod:
# Test cluster DNS
nslookup kubernetes.default.svc.cluster.local

# Test a specific Service
nslookup my-service.my-namespace.svc.cluster.local

# Test external DNS
nslookup google.com

# Check the resolv.conf
cat /etc/resolv.conf

If DNS fails, check CoreDNS:

# Are CoreDNS Pods running?
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Verify CoreDNS Service exists
kubectl get svc kube-dns -n kube-system

# Check the Corefile for misconfigurations
kubectl get configmap coredns -n kube-system -o yaml
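For reference, a typical default Corefile looks like the sketch below (the exact plugin set varies by distribution). External lookups depend on the `forward` line pointing at working upstream resolvers:

```
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
```

If external names fail but cluster names resolve, the `forward` target (here, the node's /etc/resolv.conf) is the first thing to suspect.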

Step 2: Test Pod-to-Pod Connectivity

# Get Pod IPs
kubectl get pods -o wide

# From the debug Pod, ping another Pod by IP
# (some CNIs or policies drop ICMP; if ping fails, try the TCP test below)
ping -c 3 10.244.1.15

# Test TCP connectivity
nc -zv 10.244.1.15 8080

# Traceroute to see the path
traceroute 10.244.1.15

If Pod-to-Pod fails:

# Check CNI plugin health
kubectl get pods -n kube-system | grep -E 'calico|cilium|flannel|weave'

# Check CNI plugin logs
kubectl logs -n kube-system -l k8s-app=calico-node --tail=50

# On the node, verify the CNI configuration exists
ls /etc/cni/net.d/
cat /etc/cni/net.d/10-calico.conflist

# Check node routing tables
ip route show

Step 3: Verify Service Routing

# Check the Service exists and has a ClusterIP
kubectl get svc my-service -o wide

# Verify endpoints are populated
kubectl get endpoints my-service
kubectl get endpointslices -l kubernetes.io/service-name=my-service

# Test the Service from inside the cluster
kubectl exec netdebug -- curl -s http://my-service.default:80

# If no endpoints, check that Pod labels match the Service selector
kubectl get pods -l app=my-app --show-labels
kubectl get svc my-service -o jsonpath='{.spec.selector}'

A Service with zero endpoints typically means no Pods match the selector, or the matching Pods are not in Ready state.
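As a sketch of the correct relationship, the Service selector below matches the Pod template labels, so endpoints populate (names and images are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app          # must match the Pod labels exactly
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app      # a typo here (e.g. "myapp") leaves the Service with zero endpoints
    spec:
      containers:
        - name: web
          image: nginx
          ports:
            - containerPort: 8080
```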

Step 4: Check NetworkPolicies

# List all NetworkPolicies in the namespace
kubectl get networkpolicies -n my-namespace

# Describe a specific policy
kubectl describe networkpolicy my-policy -n my-namespace

# Check if the CNI plugin supports NetworkPolicy
# (Flannel alone does not; Calico and Cilium do)

A common mistake is applying an ingress NetworkPolicy without realizing that, for the Pods it selects, all ingress traffic not explicitly allowed is denied:

# This policy blocks ALL ingress to app=backend Pods except
# from app=frontend Pods in the same namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
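For comparison, the standard default-deny baseline looks like the sketch below: an empty podSelector selects every Pod in the namespace, and listing Ingress in policyTypes with no ingress rules denies all inbound traffic.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}        # selects all Pods in the namespace
  policyTypes:
    - Ingress            # no ingress rules listed, so all ingress is denied
```

If this policy exists in a namespace, every workload there needs an explicit allow rule before it can receive traffic.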

Step 5: Packet Capture with tcpdump

For deep debugging, capture packets inside a Pod's network namespace:

# Option 1: Use kubectl debug (Kubernetes 1.25+)
kubectl debug -it my-pod --image=nicolaka/netshoot --target=my-container -- tcpdump -i eth0 -nn port 80

# Option 2: Use nsenter on the node
# Find the Pod's PID
CONTAINER_ID=$(crictl ps --name my-container -q)
PID=$(crictl inspect $CONTAINER_ID | jq .info.pid)

# Enter the network namespace and run tcpdump
nsenter -t $PID -n tcpdump -i eth0 -nn -c 20 port 80

Step 6: Check kube-proxy

# Verify kube-proxy is running
kubectl get pods -n kube-system -l k8s-app=kube-proxy

# Check kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50

# Verify kube-proxy mode
kubectl logs -n kube-system -l k8s-app=kube-proxy | grep "Using"

# Check iptables rules (if in iptables mode)
iptables -t nat -L KUBE-SERVICES -n | grep <service-cluster-ip>

# Check IPVS rules (if in IPVS mode)
ipvsadm -Ln | grep <service-cluster-ip>

Common Issues and Fixes

| Symptom | Likely Cause | Fix |
|---|---|---|
| Pods stuck in ContainerCreating | CNI plugin not installed or crashed | Install/restart the CNI DaemonSet |
| DNS resolution fails | CoreDNS not running or misconfigured | Check CoreDNS Pods and the Corefile |
| Service has no endpoints | Pod labels do not match the selector | Fix the labels or selector |
| Intermittent timeouts | conntrack table full | Increase nf_conntrack_max |
| Cross-node Pods unreachable | Firewall blocking overlay traffic | Open VXLAN/BGP ports |
| External traffic blocked | Missing SNAT/masquerade rules | Check kube-proxy and CNI config |
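For the conntrack row, a minimal sketch of the node-side check (the /proc paths are the standard Linux sysctl locations; the optional file arguments exist only so the logic can be exercised off-node):

```shell
#!/bin/sh
# Report conntrack table usage; a count near the max explains
# intermittent timeouts and dropped connections.
conntrack_usage() {
    # Default to the standard sysctl files; accept overrides for testing.
    max_file="${1:-/proc/sys/net/netfilter/nf_conntrack_max}"
    count_file="${2:-/proc/sys/net/netfilter/nf_conntrack_count}"
    max=$(cat "$max_file")
    count=$(cat "$count_file")
    echo "conntrack: $count / $max entries in use"
}

# On a node, run: conntrack_usage
# To raise the limit: sysctl -w net.netfilter.nf_conntrack_max=<new-value>
```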

Essential Debugging Images

Keep these images handy for troubleshooting:

# netshoot - comprehensive networking tools
kubectl run debug --image=nicolaka/netshoot --rm -it --restart=Never -- bash

# Tools available: curl, wget, ping, traceroute, dig, nslookup,
# tcpdump, iperf, netstat, ss, ip, nmap, iftop

A well-prepared SRE maintains runbooks for each of these failure scenarios and knows which layer to investigate based on the symptoms observed.

Why Interviewers Ask This

Interviewers want to see a structured troubleshooting methodology and hands-on familiarity with the tools needed to diagnose real-world networking failures.

Common Follow-Up Questions

A Pod can reach other Pods but not external services. What do you check?
Check CoreDNS forward configuration, node-level DNS resolution, NAT/masquerade rules, and whether the node has internet access.
How do you capture network traffic inside a Pod?
Use kubectl debug or an ephemeral container with tcpdump, or use nsenter on the node to enter the Pod's network namespace.
What causes intermittent DNS failures in Kubernetes?
Common causes include conntrack table exhaustion, CoreDNS resource limits being too low, and race conditions with ndots search domain expansion.
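The ndots expansion mentioned above can be illustrated with a small sketch. The search domains mirror a typical Pod resolv.conf for the "default" namespace; the function is purely illustrative of resolver behavior, not a real resolver:

```shell
#!/bin/sh
# Illustrate how a resolver with ndots:5 (the Kubernetes default)
# expands a short name through the resolv.conf search path before
# trying it as an absolute name -- one lookup of "my-service"
# produces up to four candidate query names.
expand_query() {
    name="$1"; shift
    ndots=5
    # Count the dots in the name.
    dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
    if [ "$dots" -lt "$ndots" ]; then
        # Fewer dots than ndots: try each search domain first.
        for domain in "$@"; do
            printf '%s.%s\n' "$name" "$domain"
        done
    fi
    # Finally, the name is tried as-is (absolute).
    printf '%s.\n' "$name"
}

# Typical Pod search path for the "default" namespace:
expand_query my-service default.svc.cluster.local svc.cluster.local cluster.local
```

Multiplied across every A and AAAA lookup an application makes, this expansion is why undersized CoreDNS deployments or a near-full conntrack table surface as intermittent DNS failures.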

Key Takeaways

  • Always start with DNS, as it is the most common failure point
  • Use ephemeral debug containers or netshoot images for troubleshooting
  • Check CNI plugin health and kube-proxy logs when connectivity fails