What Are the Differences Between iptables and IPVS Mode in kube-proxy?

advanced | networking, devops, sre, CKA
TL;DR

iptables mode uses sequential rule chains with O(n) lookup time, while IPVS mode uses kernel-level hash tables with O(1) lookup. IPVS supports multiple load-balancing algorithms and handles thousands of Services efficiently, making it the better choice for large production clusters.

Detailed Answer

How iptables Mode Works Internally

In iptables mode, kube-proxy creates a chain of NAT rules for each Service. The rule structure follows this pattern:

  1. A top-level KUBE-SERVICES chain matches on the Service ClusterIP and port.
  2. It jumps to a KUBE-SVC-* chain for that Service.
  3. The KUBE-SVC-* chain uses the statistic module with probability-based matching to select a backend Pod.
  4. Each backend has a KUBE-SEP-* chain that performs DNAT to the Pod IP.
# Inspect the full iptables chain for a Service
iptables -t nat -L KUBE-SERVICES -n --line-numbers
iptables -t nat -L KUBE-SVC-XXXXXXXXXXXXXXXX -n

# Example output for a Service with 3 endpoints:
# Chain KUBE-SVC-XXXXXXXXXXXXXXXX
#  1  statistic mode random probability 0.33333  -> KUBE-SEP-AAAA
#  2  statistic mode random probability 0.50000  -> KUBE-SEP-BBBB
#  3                                             -> KUBE-SEP-CCCC

The problem is that every packet must traverse these chains sequentially. With 5,000 Services averaging 3 endpoints each, that is 15,000+ rules. Rule updates require a full rewrite of the chain, which locks the iptables table and can cause latency spikes.
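The probability cascade in the example output above can be sketched in Python (a hypothetical simulation, not kube-proxy code): rule i matches with probability 1/(n − i), so even though each rule's probability differs, the overall choice of backend is uniform.

```python
import random
from collections import Counter

def iptables_select(backends):
    """Mimic the iptables statistic-module chain: rule i is evaluated
    only if rules 0..i-1 did not match, and matches with probability
    1/(n - i). Net effect: each backend is picked uniformly."""
    n = len(backends)
    for i, backend in enumerate(backends):
        if random.random() < 1.0 / (n - i):
            return backend
    return backends[-1]  # last rule has probability 1.0, so this is unreachable

counts = Counter(
    iptables_select(["KUBE-SEP-AAAA", "KUBE-SEP-BBBB", "KUBE-SEP-CCCC"])
    for _ in range(30_000)
)
# Each backend should receive roughly 10,000 of the 30,000 trials.
print(counts)
```

With three backends the per-rule probabilities come out as 1/3, 1/2, and 1, matching the example chain above.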

How IPVS Mode Works Internally

IPVS (IP Virtual Server) operates at the kernel's transport layer (Layer 4). It maintains a hash table of virtual servers (Service ClusterIPs) mapped to real servers (Pod IPs). Packet matching is O(1) regardless of the number of Services.
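The difference between the two lookup strategies can be shown with a toy sketch (hypothetical Python, not the kernel data structures): iptables scans rules in order, while IPVS probes a hash table keyed on (VIP, port).

```python
def sequential_lookup(rules, vip, port):
    """iptables-style: scan every rule until one matches -> O(n)."""
    for rule_vip, rule_port, backends in rules:
        if (rule_vip, rule_port) == (vip, port):
            return backends
    return None

def hash_lookup(table, vip, port):
    """IPVS-style: one hash probe regardless of table size -> O(1)."""
    return table.get((vip, port))

# 5,000 toy "Services", one backend each
rules = [(f"10.96.0.{i}", 80, [f"10.244.1.{i}:8080"]) for i in range(1, 5001)]
table = {(vip, port): backends for vip, port, backends in rules}

# Both strategies resolve the same backend; only the cost differs.
assert sequential_lookup(rules, "10.96.0.4999", 80) == hash_lookup(table, "10.96.0.4999", 80)
```

For the last Service in the list, the sequential version touches nearly 5,000 rules; the hash version touches one bucket.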

# View IPVS virtual servers
ipvsadm -Ln

# Example output:
# Prot LocalAddress:Port Scheduler Flags
#   -> RemoteAddress:Port   Forward Weight ActiveConn InActConn
# TCP  10.96.45.12:80 rr
#   -> 10.244.1.5:8080      Masq    1      0          0
#   -> 10.244.2.8:8080      Masq    1      0          0
#   -> 10.244.3.12:8080     Masq    1      0          0

IPVS creates a dummy interface called kube-ipvs0 and binds all Service ClusterIPs to it. This ensures the kernel recognizes the IPs as local, allowing IPVS to intercept the traffic.

# View the kube-ipvs0 interface
ip addr show kube-ipvs0

# Example output shows all ClusterIPs bound to this interface
# inet 10.96.0.1/32 scope global kube-ipvs0
# inet 10.96.0.10/32 scope global kube-ipvs0
# inet 10.96.45.12/32 scope global kube-ipvs0

Performance Comparison

| Metric | iptables | IPVS |
|---|---|---|
| Rule lookup | O(n) sequential | O(1) hash table |
| Rule update | Full chain rewrite | Incremental update |
| 1,000 Services sync | ~200ms | ~20ms |
| 10,000 Services sync | ~5s | ~100ms |
| Connection tracking | conntrack | conntrack |
| CPU under load | Higher at scale | Consistent |

Benchmarks consistently show that iptables rule sync time grows linearly with Service count, while IPVS remains nearly constant.

Load-Balancing Algorithms

iptables mode only supports random selection with equal probability. IPVS supports several algorithms:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "lc"  # least connections

Available schedulers:

  • rr (round-robin): Distributes evenly in rotation. Default.
  • lc (least connections): Sends to the backend with fewest active connections. Best for long-lived connections.
  • dh (destination hashing): Hashes the destination IP. Useful for caching proxies.
  • sh (source hashing): Hashes the source IP. Provides session affinity.
  • sed (shortest expected delay): Accounts for backend weight and active connections.
  • nq (never queue): Sends to an idle backend if one exists; otherwise falls back to sed.
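As a rough illustration (hypothetical Python, not the kernel scheduler code), the two most common schedulers reduce to:

```python
from itertools import cycle

class RoundRobin:
    """rr: hand out backends in strict rotation."""
    def __init__(self, backends):
        self._ring = cycle(backends)

    def pick(self):
        return next(self._ring)

class LeastConnections:
    """lc: pick the backend with the fewest active connections."""
    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def pick(self):
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1  # caller would decrement when the connection closes
        return backend

rr = RoundRobin(["pod-a", "pod-b"])
print([rr.pick() for _ in range(4)])  # alternates: pod-a, pod-b, pod-a, pod-b

lc = LeastConnections(["pod-a", "pod-b"])
lc.active["pod-a"] = 5  # pod-a is already handling long-lived connections
print(lc.pick())        # pod-b
```

This is why lc outperforms rr for long-lived connections: rr keeps piling work onto a backend that is still busy, while lc steers new connections to the least-loaded one.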

Switching from iptables to IPVS

# 1. Load required kernel modules on ALL nodes
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
modprobe nf_conntrack

# Make persistent across reboots
cat >> /etc/modules-load.d/ipvs.conf <<EOF
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
EOF

# 2. Install ipvsadm for debugging (optional)
apt-get install -y ipvsadm

# 3. Update kube-proxy ConfigMap
kubectl edit configmap kube-proxy -n kube-system
# Change mode: "" to mode: "ipvs"

# 4. Restart kube-proxy
kubectl rollout restart daemonset kube-proxy -n kube-system

# 5. Verify IPVS mode is active
kubectl logs -n kube-system -l k8s-app=kube-proxy | grep "Using ipvs"

# 6. Clean up stale iptables rules
# Recent kube-proxy versions remove rules left by the other mode on startup;
# on older versions, run kube-proxy --cleanup on each node (or reboot)

Connection Tracking and Edge Cases

Both modes rely on the Linux conntrack (connection tracking) subsystem. In clusters with very high connection rates, the conntrack table can fill up, causing packet drops. Monitor and tune as needed:

# Check conntrack table usage
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Increase if needed
sysctl -w net.netfilter.nf_conntrack_max=262144
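The two counters above translate directly into an alerting rule. A small helper (hypothetical, for illustration; the values would come from the /proc files shown above) makes the threshold logic explicit:

```python
def conntrack_usage(count, maximum, warn_at=0.8):
    """Return the conntrack table usage ratio and whether it crosses
    the warning threshold. On a node, count and maximum would be read
    from /proc/sys/net/netfilter/nf_conntrack_count and _max."""
    ratio = count / maximum
    return ratio, ratio >= warn_at

ratio, warn = conntrack_usage(240_000, 262_144)
print(f"conntrack table {ratio:.0%} full, warning={warn}")
```

Packet drops from a full table show up as hard-to-diagnose intermittent connection failures, so alerting before the table fills is worth the small amount of tooling.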

When to Use Which

Stick with iptables when the cluster has fewer than 1,000 Services and you want the simplest, most battle-tested configuration.

Switch to IPVS when you have thousands of Services, need specific load-balancing algorithms (like least-connections), or observe high kube-proxy CPU usage and sync times.

Consider eBPF (via Cilium) when you want to eliminate both iptables and IPVS overhead entirely and gain advanced observability at the same time.

Monitoring kube-proxy Performance

# Key metrics to watch (exposed on :10249/metrics)
# kubeproxy_sync_proxy_rules_duration_seconds - time to sync all rules
# kubeproxy_sync_proxy_rules_iptables_total - number of iptables rules
# kubeproxy_sync_proxy_rules_last_timestamp_seconds - last sync time

curl http://localhost:10249/metrics | grep kubeproxy_sync

If sync_proxy_rules_duration_seconds consistently exceeds 1 second, it is time to switch to IPVS or eBPF.
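To turn the raw metrics into one actionable number, a small parser can compute the average sync time from the standard Prometheus histogram _sum and _count series (a hypothetical helper; a real setup would use a Prometheus rate() query instead):

```python
def average_sync_duration(metrics_text):
    """Compute average kube-proxy rule-sync time from Prometheus
    exposition text, using the histogram's _sum and _count series."""
    total = count = None
    for line in metrics_text.splitlines():
        if line.startswith("kubeproxy_sync_proxy_rules_duration_seconds_sum"):
            total = float(line.split()[-1])
        elif line.startswith("kubeproxy_sync_proxy_rules_duration_seconds_count"):
            count = float(line.split()[-1])
    return total / count if total is not None and count else None

sample = """\
kubeproxy_sync_proxy_rules_duration_seconds_sum 12.5
kubeproxy_sync_proxy_rules_duration_seconds_count 50
"""
print(average_sync_duration(sample))  # 0.25 seconds per sync on average
```

An average near or above 1 second is the signal, per the guidance above, to consider IPVS or eBPF.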

Why Interviewers Ask This

Interviewers ask this to test deep knowledge of Linux networking internals and the ability to optimize Kubernetes cluster performance at scale.

Common Follow-Up Questions

How do you migrate a running cluster from iptables to IPVS mode?
Ensure the IPVS kernel modules are loaded on every node, update the kube-proxy ConfigMap to set mode: ipvs, then restart the kube-proxy DaemonSet.
What kernel modules does IPVS mode require?
ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh, and nf_conntrack. These must be loaded on every node.
Does IPVS mode still use iptables at all?
Yes, IPVS mode still uses a small number of iptables rules for masquerading (SNAT) and filtering, but the bulk of Service routing uses IPVS.

Key Takeaways

  • IPVS provides O(1) Service lookup versus O(n) for iptables
  • IPVS supports round-robin, least-connections, and other scheduling algorithms
  • iptables mode is sufficient for clusters with fewer than ~1,000 Services