Kubernetes Networking Explained — From Pods to Policies

The Networking Model

Kubernetes networking rests on three fundamental rules:

  1. Every pod gets its own IP address. No NAT between pods.
  2. Every pod can reach every other pod using that pod's IP, regardless of which node it's on.
  3. The IP a pod sees for itself is the same IP other pods use to reach it. No address translation surprises.

These rules seem simple, but implementing them across multiple physical or virtual machines is the hard part. That's where CNI plugins come in.

Pod Networking: What Actually Happens

When a pod is created on a node, here's the networking setup:

  Node 1 (10.0.1.5)                   Node 2 (10.0.1.6)
  ┌──────────────────────┐             ┌──────────────────────┐
  │  ┌───────┐ ┌───────┐ │             │  ┌───────┐ ┌───────┐ │
  │  │Pod A  │ │Pod B  │ │             │  │Pod C  │ │Pod D  │ │
  │  │10.244 │ │10.244 │ │             │  │10.244 │ │10.244 │ │
  │  │.1.2   │ │.1.3   │ │             │  │.2.2   │ │.2.3   │ │
  │  └───┬───┘ └───┬───┘ │             │  └───┬───┘ └───┬───┘ │
  │      │         │     │             │      │         │     │
  │  ┌───┴─────────┴───┐ │             │  ┌───┴─────────┴───┐ │
  │  │   cbr0 bridge   │ │             │  │   cbr0 bridge   │ │
  │  │   10.244.1.1    │ │             │  │   10.244.2.1    │ │
  │  └────────┬────────┘ │             │  └────────┬────────┘ │
  │           │          │             │           │          │
  │       eth0│          │             │       eth0│          │
  └───────────┼──────────┘             └───────────┼──────────┘
              │                                    │
              └──────── Physical Network ──────────┘

Each node gets a subnet from the cluster's pod CIDR (e.g., 10.244.0.0/16). Node 1 might get 10.244.1.0/24, Node 2 gets 10.244.2.0/24. Each pod on a node gets an IP from that node's subnet.
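
The carve-up is plain CIDR arithmetic. A quick sanity check in Python (illustrative only, using the stdlib ipaddress module; the 10.244.0.0/16 pod CIDR is the example from above):

```python
import ipaddress

# Split the cluster pod CIDR into per-node /24 subnets,
# mirroring how per-node podCIDRs are allocated.
cluster_cidr = ipaddress.ip_network("10.244.0.0/16")
node_subnets = list(cluster_cidr.subnets(new_prefix=24))

print(node_subnets[1])   # Node 1's subnet: 10.244.1.0/24
print(node_subnets[2])   # Node 2's subnet: 10.244.2.0/24
print(node_subnets[1].num_addresses - 2)  # usable pod IPs per node: 254
```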

Within a single node, each pod connects to a virtual bridge through a veth pair, so pod-to-pod traffic never leaves the host. Across nodes, the CNI plugin handles routing — whether through overlay networks (VXLAN, Geneve), direct routing (BGP), or cloud-native mechanisms (VPC routes).

Inside a Pod's Network Namespace

All containers in a pod share the same network namespace. This means they share:

  • The same IP address
  • The same port space (two containers in the same pod can't both bind port 8080)
  • The same loopback interface (containers can talk to each other via localhost)

This is implemented using a "pause container" (also called the infrastructure container) that holds the network namespace alive. Application containers join this namespace.

# The pause container shows up as the pod "sandbox"; list sandboxes:
crictl pods | grep my-pod

# List the app containers in the pod
crictl ps | grep my-pod

# On a node, inspect the network namespace
crictl inspect <container-id> | grep -i pid
nsenter -t <pid> -n ip addr

CNI Plugins: How Pod Networking Is Implemented

The Container Network Interface (CNI) is a specification that defines how networking is set up for containers. When the kubelet creates a pod, it calls the CNI plugin to:

  1. Create a network interface for the pod
  2. Assign an IP address
  3. Set up routes so the pod can reach other pods and the outside world

Major CNI Plugins

Calico: One of the most widely deployed CNI plugins. Uses BGP for routing by default (no overlay overhead), supports network policies natively, and can integrate with Istio for service mesh policy. Calico is a common default choice when you need network policies.

# Check Calico status
kubectl get pods -n calico-system
calicoctl node status
calicoctl get ippool -o wide

Cilium: Uses eBPF instead of iptables for packet processing. This gives it significantly better performance at scale and advanced features like transparent encryption, L7 policy enforcement, and deep observability (Hubble). Increasingly popular for clusters that need performance or advanced security.

# Check Cilium status
cilium status
cilium connectivity test
hubble observe --pod my-namespace/my-pod

Flannel: The simplest option. Uses VXLAN overlay by default. No built-in network policy support (you'd pair it with Calico for that). Good for learning environments and simple clusters.

AWS VPC CNI: Used on EKS. Assigns pods real VPC IP addresses from the node's subnet. This means pods are directly routable within the VPC — no overlay. The tradeoff is IP address consumption: each node can only support a limited number of pods based on the number of ENIs and IPs per ENI for that instance type.

Overlay vs Direct Routing

Overlay networks (VXLAN, Geneve) encapsulate pod traffic inside UDP packets between nodes. The physical network doesn't need to know about pod IPs — it just sees normal node-to-node traffic. This works everywhere but adds encapsulation overhead (~50 bytes per packet).
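
Those ~50 bytes are just the sum of the extra headers VXLAN wraps around each frame; checking the arithmetic in Python:

```python
# VXLAN encapsulation overhead per packet:
outer_ethernet = 14  # outer MAC header
outer_ip = 20        # outer IPv4 header
outer_udp = 8        # UDP header (VXLAN commonly uses port 4789)
vxlan_header = 8     # VXLAN header carrying the 24-bit VNI

overhead = outer_ethernet + outer_ip + outer_udp + vxlan_header
print(overhead)  # 50 bytes

# Practical consequence: with a 1500-byte physical MTU, the inner
# (pod-facing) MTU must shrink to avoid fragmentation.
print(1500 - overhead)  # 1450, a common pod interface MTU with VXLAN
```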

Direct routing (BGP with Calico, native VPC routing on cloud) injects pod CIDR routes into the network. No encapsulation overhead, but requires the network infrastructure to cooperate. On-prem, this means your routers need to accept BGP peering from cluster nodes. In cloud, the CNI plugin manages VPC route tables.

Service Networking

Pods are ephemeral — they come and go, and their IPs change. Services provide a stable abstraction on top.

ClusterIP: Internal Services

The default Service type. Kubernetes assigns a virtual IP (the ClusterIP) from the service CIDR (e.g., 10.96.0.0/12). This IP exists only in iptables/IPVS rules — there's no network interface backing it.

apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  selector:
    app: backend
  ports:
  - port: 80           # Service port — what clients connect to
    targetPort: 8080    # Container port — where the app listens

When a pod sends traffic to 10.96.45.23:80 (the ClusterIP), kube-proxy's iptables rules intercept it and DNAT to one of the backing pod IPs (e.g., 10.244.1.5:8080). The client pod never knows it was redirected.

# See the endpoints backing a service
kubectl get endpoints backend

# Trace the iptables rules (run on a node)
iptables -t nat -L KUBE-SERVICES -n | grep backend
iptables -t nat -L KUBE-SEP-XXXXX -n  # individual endpoint rules

NodePort: Exposing on Every Node

NodePort allocates a port (30000-32767 by default) on every node in the cluster. Traffic hitting <any-node-ip>:<nodeport> gets forwarded to the Service's ClusterIP, then to a backing pod.

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: NodePort
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080     # optional — auto-assigned if omitted

The traffic flow: Client → Node:30080 → iptables DNAT → Pod:8080

Important detail: by default, the target pod might be on a different node than the one that received the request. This causes an extra network hop. To avoid this:

spec:
  externalTrafficPolicy: Local  # Only route to pods on the receiving node

This preserves the client's source IP but means traffic can only be routed to nodes that have a backing pod running.

LoadBalancer: Cloud Integration

On cloud providers, the LoadBalancer type creates an external load balancer (AWS ELB/NLB, GCP Cloud Load Balancing, Azure LB) that routes to NodePorts behind the scenes.

apiVersion: v1
kind: Service
metadata:
  name: web-public
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"  # AWS-specific
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
  - port: 443
    targetPort: 8443

# Check the external IP once provisioned
kubectl get svc web-public
# NAME        TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)
# web-public  LoadBalancer   10.96.33.12    a1b2c3.elb...    443:31234/TCP

Headless Services: Direct Pod Access

Sometimes you need to reach specific pods (e.g., database replicas in a StatefulSet). A headless service (clusterIP: None) skips the virtual IP and returns pod IPs directly via DNS.

apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None
  selector:
    app: postgres
  ports:
  - port: 5432

DNS query for postgres.default.svc.cluster.local returns individual pod IPs instead of a single ClusterIP. With a StatefulSet, you also get per-pod DNS records: postgres-0.postgres.default.svc.cluster.local.

kube-proxy Modes In Detail

iptables Mode

Creates a chain of iptables rules for each Service. For a Service with 3 endpoints:

KUBE-SERVICES chain:
  → match 10.96.45.23 → jump to KUBE-SVC-XXXXX

KUBE-SVC-XXXXX chain:
  → 33% probability → jump to KUBE-SEP-AAA (pod 1)
  → 50% probability → jump to KUBE-SEP-BBB (pod 2)
  → 100% probability → jump to KUBE-SEP-CCC (pod 3)

KUBE-SEP-AAA:
  → DNAT to 10.244.1.5:8080

The probability math ensures equal distribution: 1/3, then 1/2 of the remaining 2/3, then all of the remaining 1/3.
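
A short simulation (a Python sketch, not real kube-proxy code) confirms that these cascading probabilities produce a uniform spread:

```python
import random

def pick_endpoint():
    # Mirror the iptables chain: try each rule in order with its
    # configured probability; the last rule always matches.
    if random.random() < 1/3:
        return "pod-1"   # KUBE-SEP-AAA, probability 1/3
    if random.random() < 1/2:
        return "pod-2"   # KUBE-SEP-BBB, 1/2 of the remaining 2/3
    return "pod-3"       # KUBE-SEP-CCC, all of the remaining 1/3

counts = {"pod-1": 0, "pod-2": 0, "pod-3": 0}
for _ in range(30_000):
    counts[pick_endpoint()] += 1
print(counts)  # each count lands close to 10,000
```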

Limitation: With thousands of services, the iptables rule count explodes, and rule evaluation becomes O(n) per packet. This is the main reason large clusters switch to IPVS.

IPVS Mode

Uses Linux kernel IPVS, which is a transport-layer load balancer built into the kernel. Instead of sequential iptables rules, IPVS uses hash tables for O(1) lookups.

# Enable IPVS mode in kube-proxy
# In the kube-proxy ConfigMap:
# mode: "ipvs"
# ipvs:
#   scheduler: "rr"  # round-robin, lc (least connections), sh (source hash)

# View IPVS rules
ipvsadm -Ln

IPVS mode supports multiple load balancing algorithms that iptables mode can't offer. Choose based on your needs:

  • rr (round-robin): Simple and predictable
  • lc (least connections): Better for long-lived connections
  • sh (source hash): Session affinity by client IP
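
To see why the choice matters, here is a toy model in Python (illustrative only; the real IPVS schedulers live in the kernel): round-robin ignores how loaded a backend is, while least-connections steers new connections away from busy ones.

```python
# Toy schedulers: round-robin cycles blindly, least-connections tracks load.
def round_robin(backends, state):
    state["i"] = (state.get("i", -1) + 1) % len(backends)
    return backends[state["i"]]

def least_connections(backends, active):
    return min(backends, key=lambda b: active[b])

backends = ["pod-a", "pod-b", "pod-c"]
active = {"pod-a": 9, "pod-b": 0, "pod-c": 4}  # long-lived conns piled on pod-a

state = {}
print([round_robin(backends, state) for _ in range(4)])  # cycles regardless of load
print(least_connections(backends, active))               # pod-b, the idle backend
```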

DNS: CoreDNS

CoreDNS runs as a Deployment in the kube-system namespace and provides DNS resolution for the entire cluster. Every pod is configured to use CoreDNS as its DNS server (via /etc/resolv.conf).

DNS Record Formats

# Services
<service>.<namespace>.svc.cluster.local

# Examples:
backend.default.svc.cluster.local          → ClusterIP
backend.production.svc.cluster.local       → ClusterIP

# Pods (by IP with dashes)
10-244-1-5.default.pod.cluster.local       → Pod IP

# StatefulSet pods (via headless service)
postgres-0.postgres.default.svc.cluster.local  → Pod IP of postgres-0
postgres-1.postgres.default.svc.cluster.local  → Pod IP of postgres-1
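
Both record styles are purely mechanical to construct: dots in the pod IP become dashes, and StatefulSet pods prepend their name to the headless service. A small Python sketch (the helper names are made up for illustration):

```python
def pod_ip_record(ip, namespace):
    # 10.244.1.5 in "default" -> 10-244-1-5.default.pod.cluster.local
    return f"{ip.replace('.', '-')}.{namespace}.pod.cluster.local"

def statefulset_record(pod, headless_svc, namespace):
    # Stable per-pod name behind a headless service
    return f"{pod}.{headless_svc}.{namespace}.svc.cluster.local"

print(pod_ip_record("10.244.1.5", "default"))
print(statefulset_record("postgres-0", "postgres", "default"))
```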

How DNS Resolution Works in a Pod

# Inside a pod, check the DNS configuration
cat /etc/resolv.conf
# nameserver 10.96.0.10        ← CoreDNS ClusterIP
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5

# The search domains mean you can use short names:
curl backend          # resolves via search domains
curl backend.default  # resolves via search domains
curl backend.default.svc.cluster.local  # fully qualified

The ndots:5 option is important and often causes confusion. It means that any name with fewer than 5 dots will have search domains appended before trying the name as-is. A lookup for api.example.com (2 dots, less than 5) will first try api.example.com.default.svc.cluster.local, then api.example.com.svc.cluster.local, then api.example.com.cluster.local, then finally api.example.com.

For external DNS-heavy workloads, this generates 4x the DNS queries. You can optimize this:

spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"

Or always use fully qualified domain names with a trailing dot: api.example.com.
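
The resolver's behavior is easy to model: count the dots, and if there are fewer than ndots, try the search domains first; a trailing dot marks the name as fully qualified and skips the search list entirely. A Python sketch of the lookup order:

```python
def lookup_order(name, search_domains, ndots=5):
    # A name ending in "." is already fully qualified: no expansion.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{d}" for d in search_domains]
    # Fewer dots than ndots: search domains first, then the name as-is.
    if name.count(".") < ndots:
        return expanded + [name]
    return [name] + expanded

search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for attempt in lookup_order("api.example.com", search):
    print(attempt)
# Four attempts; "api.example.com." would resolve in one.
```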

CoreDNS Configuration

CoreDNS uses a Corefile stored in a ConfigMap:

kubectl get configmap coredns -n kube-system -o yaml

.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf {
        max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}

The kubernetes plugin handles all cluster DNS. The forward plugin sends external queries to the node's upstream DNS servers.

Ingress: HTTP/HTTPS Routing

While Services handle L4 (TCP/UDP) load balancing, Ingress provides L7 (HTTP/HTTPS) routing. An Ingress resource defines rules; an Ingress controller implements them.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/limit-rps: "10"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - app.example.com
    secretName: app-tls-cert
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend-service
            port:
              number: 80

This single definition routes app.example.com/api/* to one service and app.example.com/* to another, with TLS termination.

Common Ingress Controllers

  • NGINX Ingress Controller: The most widely used. Configures NGINX instances from Ingress resources. Supports annotations for rate limiting, CORS, authentication, and more.
  • Traefik: Auto-discovery, built-in Let's Encrypt, middleware chains.
  • HAProxy Ingress: High performance, TCP passthrough support.
  • Cloud-native: AWS ALB Ingress Controller, GCE Ingress Controller — create cloud load balancers directly.

# Check Ingress resources
kubectl get ingress
kubectl describe ingress app-ingress

# View the ingress controller's logs for debugging
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

Gateway API: The Future of Ingress

Gateway API is the successor to Ingress, designed to address its limitations. It introduces multiple resource types for separation of concerns:

# Infrastructure team creates the Gateway
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
  namespace: infra
spec:
  gatewayClassName: cilium
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - name: wildcard-cert
    allowedRoutes:
      namespaces:
        from: All
---
# Application team creates HTTPRoutes in their namespace
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-route
  namespace: app-team
spec:
  parentRefs:
  - name: main-gateway
    namespace: infra
  hostnames:
  - "api.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v2
    backendRefs:
    - name: api-v2
      port: 80
      weight: 90
    - name: api-v3
      port: 80
      weight: 10    # canary: send 10% to v3

Gateway API advantages over Ingress:

  • Role-oriented design: Infrastructure teams manage Gateways, app teams manage Routes
  • Portable: Standard resource types work across controllers
  • Expressive: Built-in support for traffic splitting, header matching, redirects
  • Multi-protocol: Supports HTTP, HTTPS, TCP, TLS, gRPC natively

Network Policies: Controlling Traffic Flow

By default, all pods can talk to all other pods. Network Policies let you restrict this.

Default Deny Everything

Start with a deny-all policy, then explicitly allow what's needed:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}    # matches all pods in the namespace
  policyTypes:
  - Ingress
  - Egress

With this in place, pods in production can't send or receive any traffic (including DNS). Add back what you need:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: web
    ports:
    - protocol: TCP
      port: 8080
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to: []
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

Cross-Namespace Policies

Allow traffic from a specific namespace using namespaceSelector:

spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          purpose: monitoring
      podSelector:
        matchLabels:
          app: prometheus

This allows only pods labeled app: prometheus in namespaces labeled purpose: monitoring to reach the selected pods. Note the single dash — both selectors must match (AND logic). Two dashes would mean OR logic:

  # OR logic — either condition matches
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          purpose: monitoring
    - podSelector:
        matchLabels:
          app: prometheus

This difference — single element with both selectors vs two elements — is one of the most common mistakes in Network Policies.
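
The rule itself is compact: selectors inside a single "from" element are ANDed, while separate elements in the list are ORed. A Python sketch modeling just that logic (not the real API machinery; labels are the examples from above):

```python
def peer_matches(peer, pod_labels, ns_labels):
    # Within one "from" element, every selector present must match (AND).
    ns_sel = peer.get("namespaceSelector")
    pod_sel = peer.get("podSelector")
    if ns_sel and not all(ns_labels.get(k) == v for k, v in ns_sel.items()):
        return False
    if pod_sel and not all(pod_labels.get(k) == v for k, v in pod_sel.items()):
        return False
    return True

def allowed(from_list, pod_labels, ns_labels):
    # Elements of the "from" list are ORed together.
    return any(peer_matches(p, pod_labels, ns_labels) for p in from_list)

# Single element with both selectors -> AND:
and_rule = [{"namespaceSelector": {"purpose": "monitoring"},
             "podSelector": {"app": "prometheus"}}]
# Two elements -> OR:
or_rule = [{"namespaceSelector": {"purpose": "monitoring"}},
           {"podSelector": {"app": "prometheus"}}]

# A prometheus pod sitting in a non-monitoring namespace:
print(allowed(and_rule, {"app": "prometheus"}, {"purpose": "dev"}))  # False
print(allowed(or_rule,  {"app": "prometheus"}, {"purpose": "dev"}))  # True
```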

Troubleshooting Kubernetes Networking

Problem: Pod Can't Reach a Service

# 1. Verify the Service has endpoints
kubectl get endpoints my-service
# If ENDPOINTS is <none>, check the selector matches pod labels

# 2. Verify from inside a pod
kubectl exec -it debug-pod -- nslookup my-service
kubectl exec -it debug-pod -- curl -v my-service:80

# 3. Check if it's a DNS issue
kubectl exec -it debug-pod -- nslookup my-service.default.svc.cluster.local
kubectl exec -it debug-pod -- cat /etc/resolv.conf

# 4. Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 5. Bypass DNS to isolate the issue
kubectl get svc my-service -o jsonpath='{.spec.clusterIP}'
kubectl exec -it debug-pod -- curl -v <cluster-ip>:80

Problem: Pod Can't Reach External Services

# 1. Check if DNS resolution works for external names
kubectl exec -it debug-pod -- nslookup google.com

# 2. Check if Network Policies are blocking egress
kubectl get networkpolicy -n <namespace>

# 3. Check if the node can reach the internet
# (this rules out underlying network issues)

# 4. Check if the pod has the right egress rules
kubectl describe networkpolicy -n <namespace>

Problem: Intermittent Connection Failures

# 1. Check if pods are cycling (endpoints changing)
kubectl get pods -w

# 2. Check readiness probe failures
kubectl describe pod <pod-name> | grep -A5 Readiness

# 3. Check for conntrack table overflow (on node)
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# 4. Check kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50

The Debug Container Approach

Keep a lightweight debug image available:

# Run an ephemeral debug container in an existing pod
kubectl debug -it my-pod --image=nicolaka/netshoot --target=my-container

# Or run a standalone debug pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash

# Inside netshoot, you have all the tools:
# tcpdump, dig, nslookup, curl, wget, ping, traceroute,
# iperf, netstat, ss, ip, iptables, nmap

DNS Debugging Specifically

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns -f

# Look at CoreDNS metrics (the CoreDNS image is distroless, so exec won't
# work; port-forward the metrics port and scrape it locally)
kubectl port-forward -n kube-system <coredns-pod> 9153:9153 &
curl -s http://localhost:9153/metrics | grep coredns_dns_requests_total

# Run a DNS lookup with verbose output
kubectl exec debug-pod -- dig +search +all my-service

# Test if the CoreDNS pod itself can reach upstream DNS
kubectl exec -n kube-system <coredns-pod> -- nslookup google.com 8.8.8.8

Key Takeaways for Interviews

  1. Know the model: Every pod gets an IP, pods can reach each other without NAT, and Services provide stable endpoints on top.
  2. Understand kube-proxy: Know the difference between iptables and IPVS modes and when you'd choose one over the other.
  3. DNS is critical: Most service discovery issues are DNS issues. Understand ndots, search domains, and how CoreDNS resolves cluster-internal names.
  4. Network Policies are additive: If no policy selects a pod, all traffic is allowed. Once any policy selects a pod, only explicitly allowed traffic gets through.
  5. Troubleshoot systematically: DNS first, then endpoints, then connectivity, then network policies.

Related Topics