How Do You Tune Kubernetes Network Performance?

advanced | networking | SRE | platform engineer | CKA
TL;DR

Kubernetes network performance tuning involves optimizing kube-proxy mode (IPVS over iptables at scale), tuning DNS (lowering ndots), configuring MTU correctly, using eBPF-based CNIs, and addressing conntrack table exhaustion.

Detailed Answer

Network performance in Kubernetes is affected by multiple layers: the CNI plugin, kube-proxy mode, DNS resolution, kernel parameters, and application-level configuration. Tuning each layer can dramatically improve throughput and latency.

1. kube-proxy Mode: iptables vs. IPVS

The default iptables mode processes rules linearly. With thousands of Services, every new connection walks through thousands of rules:

# iptables: O(n) — 10,000 Services = 10,000+ rules to walk
# IPVS:     O(1) — hash table lookup regardless of Service count

Switch to IPVS for clusters with >1000 Services:

# kube-proxy configuration (set in the kube-proxy ConfigMap)
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "lc"        # least-connection; "rr" (round-robin) is the default
  syncPeriod: "30s"
  minSyncPeriod: "2s"

Or use eBPF-based kube-proxy replacement with Cilium, bypassing both iptables and IPVS:

# Cilium with kube-proxy replacement
helm install cilium cilium/cilium \
  --set kubeProxyReplacement=true

2. DNS Performance

DNS is often the hidden bottleneck. The default ndots:5 setting causes excessive queries for external names.
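To see where the amplification comes from, here is a minimal sketch of the resolver's search-list expansion (the search list assumes a Pod in the `default` namespace; any name with fewer dots than `ndots` is tried against every search domain before being tried as-is):

```shell
# Sketch: search-list expansion for an external name under ndots:5
name="api.example.com"
search="default.svc.cluster.local svc.cluster.local cluster.local"
ndots=5

dots=$(( $(printf %s "$name" | tr -cd '.' | wc -c) ))
tried=""
if [ "$dots" -lt "$ndots" ]; then
  # Too few dots: every search domain is tried (and NXDOMAINs) first
  for domain in $search; do
    tried="$tried $name.$domain"
  done
fi
tried="$tried $name."            # the absolute name, tried last
echo "queries per lookup: $(echo $tried | wc -w)"
```

With ndots:5, this single lookup generates four DNS queries, three of which are guaranteed to fail; with ndots:2 the external name is resolved in one query.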

Reduce ndots

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"

Scale CoreDNS

# CoreDNS HPA for automatic scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coredns
  namespace: kube-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Use NodeLocal DNSCache

NodeLocal DNSCache runs a DNS cache on every node, reducing latency and CoreDNS load. The upstream manifest ships with __PILLAR__ placeholders that must be substituted before applying; the values below are typical examples (link-local listen address, cluster domain, kube-dns Service IP):

wget https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
sed -i "s/__PILLAR__LOCAL__DNS__/169.254.20.10/g; s/__PILLAR__DNS__DOMAIN__/cluster.local/g; s/__PILLAR__DNS__SERVER__/10.96.0.10/g" nodelocaldns.yaml
kubectl apply -f nodelocaldns.yaml

3. Conntrack Table Tuning

Every connection through a Service creates a conntrack entry. When the table is full, new connections are silently dropped.

# Check current conntrack usage
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max

# Increase conntrack table size (rule of thumb: buckets = max / 4)
sysctl -w net.netfilter.nf_conntrack_max=1048576
sysctl -w net.netfilter.nf_conntrack_buckets=262144
# On kernels where nf_conntrack_buckets is read-only, set the hash size
# via the module parameter instead:
# echo 262144 > /sys/module/nf_conntrack/parameters/hashsize

Symptoms of conntrack exhaustion:

  • Intermittent connection failures
  • DNS resolution timeouts (UDP conntrack)
  • "nf_conntrack: table full, dropping packet" in dmesg
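Note that kube-proxy sizes the table itself at startup as max(conntrack-min, conntrack-max-per-core × CPU cores), so manual sysctl changes can be overwritten. A sketch of that arithmetic, with values assumed from kube-proxy's documented flag defaults:

```shell
# Sketch of kube-proxy conntrack sizing: max(floor, per_core * cores)
cores=16                 # hypothetical node size
per_core=32768           # assumed --conntrack-max-per-core default
floor=131072             # assumed --conntrack-min default
computed=$((per_core * cores))
if [ "$computed" -gt "$floor" ]; then max=$computed; else max=$floor; fi
echo "nf_conntrack_max: $max"
```

On this hypothetical 16-core node the result is 524288; if kube-proxy manages conntrack on your nodes, raise its flags rather than writing sysctls by hand.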

4. MTU Configuration

Incorrect MTU causes packet fragmentation or drops, especially with overlay networks:

Host MTU:     1500
VXLAN overhead: 50 bytes
Pod MTU:      1450 (1500 - 50)

Host MTU:     9000 (jumbo frames)
VXLAN overhead: 50 bytes
Pod MTU:      8950
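The same subtraction applies to other encapsulations; a quick sketch, with overhead values that are typical IPv4 figures from CNI documentation (they grow with IPv6 or extra headers):

```shell
# Pod MTU = host MTU minus encapsulation overhead (typical IPv4 figures)
host_mtu=1500
vxlan_mtu=$((host_mtu - 50))       # VXLAN: ~50 bytes
ipip_mtu=$((host_mtu - 20))        # IP-in-IP: 20 bytes
wireguard_mtu=$((host_mtu - 60))   # WireGuard: 60 bytes
echo "VXLAN=$vxlan_mtu IPIP=$ipip_mtu WireGuard=$wireguard_mtu"
```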

Configure MTU in your CNI:

# Calico MTU configuration (operator-managed install)
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    mtu: 1450
# Cilium MTU configuration
# helm install cilium cilium/cilium --set mtu=1450

5. Kernel Parameter Tuning

Key sysctl parameters for high-throughput clusters:

# Increase socket buffer sizes
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# Increase connection backlog
sysctl -w net.core.somaxconn=32768
sysctl -w net.core.netdev_max_backlog=16384

# TCP tuning
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
sysctl -w net.ipv4.tcp_slow_start_after_idle=0
sysctl -w net.ipv4.tcp_tw_reuse=1

# Increase local port range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

Apply via a DaemonSet for consistency:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sysctl-tuner
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: sysctl-tuner
  template:
    metadata:
      labels:
        app: sysctl-tuner
    spec:
      hostPID: true
      hostNetwork: true
      initContainers:
        - name: sysctl
          image: busybox:1.36
          command: ["sh", "-c"]
          args:
            - |
              sysctl -w net.core.somaxconn=32768
              sysctl -w net.netfilter.nf_conntrack_max=1048576
          securityContext:
            privileged: true
          resources:
            requests:
              cpu: "10m"
              memory: "16Mi"
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "5m"
              memory: "8Mi"

6. CNI Performance Considerations

| CNI Mode | Throughput | Latency | Best For |
|----------|-----------|---------|----------|
| VXLAN overlay | Lower | Higher | Multi-subnet clusters |
| Direct routing (BGP) | Higher | Lower | Same-subnet or BGP-capable networks |
| eBPF | Highest | Lowest | Modern kernels (5.10+) |
| Host networking | Native | Native | Latency-critical workloads (bypass CNI) |

For latency-critical workloads, consider hostNetwork: true to bypass the CNI entirely, at the cost of port conflicts and reduced isolation.
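A minimal sketch of such a Pod (names are hypothetical); note dnsPolicy, which keeps cluster DNS resolution working despite hostNetwork:

```yaml
# Hypothetical latency-critical Pod using the node's network namespace
apiVersion: v1
kind: Pod
metadata:
  name: fast-path
spec:
  hostNetwork: true                     # bypass the CNI data path
  dnsPolicy: ClusterFirstWithHostNet    # keep resolving cluster Services
  containers:
    - name: app
      image: myapp:1.0
      ports:
        - containerPort: 8080   # binds on the node itself: one such Pod per port per node
```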

Monitoring Network Performance

# Check for dropped packets
kubectl exec <pod> -- netstat -s | grep -i drop

# Monitor conntrack
watch -n 1 'sysctl net.netfilter.nf_conntrack_count'

# Test latency between Pods
kubectl exec pod-a -- ping -c 5 <pod-b-ip>

# Benchmark throughput (run "iperf3 -s" in pod-b first)
kubectl exec pod-a -- iperf3 -c <pod-b-ip> -t 30

Why Interviewers Ask This

Network performance problems in Kubernetes are subtle and often misdiagnosed. This question tests your ability to identify and resolve performance bottlenecks at the infrastructure level.

Common Follow-Up Questions

How does conntrack table exhaustion manifest?
You see random connection failures, DNS timeouts, and packet drops. Check with conntrack -C and compare against the nf_conntrack_max sysctl value.
What is the impact of MTU misconfiguration?
If the overlay MTU is too high, packets are silently fragmented or dropped, causing mysterious connection hangs. The overlay MTU should be the host MTU minus the encapsulation overhead.
When should you switch from iptables to IPVS kube-proxy mode?
When you have more than 1000 Services. iptables rule processing is O(n) while IPVS uses hash tables for O(1) lookup.

Key Takeaways

  • Switch to IPVS mode when you exceed 1000 Services to avoid iptables performance degradation.
  • Lower ndots from 5 to 2-3 to reduce DNS query amplification for external names.
  • Size conntrack tables based on your connection volume to prevent dropped connections.
