How Do You Achieve Zero-Downtime Deployments in Kubernetes?

Level: advanced | Tags: deployments, devops, sre, CKA, CKAD
TL;DR

Zero-downtime deployments require a combination of rolling updates with maxUnavailable: 0, readiness probes, graceful shutdown handling, PodDisruptionBudgets, and preStop hooks. No single setting achieves it -- you need all layers working together.

Detailed Answer

Achieving true zero-downtime deployments in Kubernetes is more nuanced than setting strategy: RollingUpdate. It requires careful configuration at multiple layers: the Deployment strategy, health probes, graceful shutdown, network propagation, and disruption budgets.

The Five Pillars of Zero Downtime

  1. Rolling update with maxUnavailable: 0
  2. Readiness probes
  3. Graceful shutdown with preStop hooks
  4. PodDisruptionBudgets
  5. Application-level connection draining

Pillar 1 -- Rolling Update Strategy

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

With maxUnavailable: 0, Kubernetes never terminates an old Pod until a new one is fully Ready. This guarantees the total number of available Pods never drops below the desired count.
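The availability guarantee can be checked with a toy simulation -- a deliberately simplified model of the rollout arithmetic, not the real Deployment controller:

```python
# Simplified model: each step surges new Pods up to replicas + maxSurge,
# waits for them to become Ready, then terminates only as many old Pods as
# the availability floor (replicas - maxUnavailable) allows.
def rollout_steps(replicas, max_surge, max_unavailable):
    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0:
        new += min(max_surge, replicas + max_surge - (old + new))    # surge up
        old -= min(old, (old + new) - (replicas - max_unavailable))  # scale down
        steps.append((old, new))
    return steps

# With replicas=3, maxSurge=1, maxUnavailable=0, availability never dips below 3:
print(rollout_steps(3, 1, 0))  # → [(3, 0), (2, 1), (1, 2), (0, 3)]
```

At every step the total `old + new` stays at or above the replica count, which is exactly what maxUnavailable: 0 buys you.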

Pillar 2 -- Readiness Probes

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 1

Without a readiness probe, Kubernetes considers a Pod ready the moment its containers start. Traffic will hit an application that is still initializing, causing errors.
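A minimal sketch of such a /ready endpoint (the handler class and the startup_done flag are illustrative, not part of the manifest above): the probe returns 503 until initialization finishes, so the kubelet keeps the Pod out of Service endpoints.

```python
import threading
from http.server import BaseHTTPRequestHandler

# Flipped once startup work (cache warm-up, DB pools, config load) completes.
startup_done = threading.Event()

def readiness_status(path: str) -> int:
    """Status code the probe endpoint should return for a given path."""
    if path != "/ready":
        return 404
    return 200 if startup_done.is_set() else 503

class ReadyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(readiness_status(self.path))
        self.end_headers()
```

The important property is that readiness reflects real initialization state, not merely "the process is up".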

Pillar 3 -- Graceful Shutdown

This is where most teams miss a critical detail. When Kubernetes terminates a Pod, two things happen in parallel:

  1. The Pod is removed from Service endpoints.
  2. The kubelet starts the termination sequence: it runs the preStop hook (if any), then sends SIGTERM.

The problem is that kube-proxy and ingress controllers may take a few seconds to update their routing rules. During this window, traffic can still be sent to the terminating Pod.

The solution is a preStop hook that introduces a short delay:

spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: web-app
      image: web-app:2.0
      ports:
        - containerPort: 8080
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]

The timeline becomes:

t=0s      Pod removed from endpoints + preStop hook starts (parallel)
t=0-10s   preStop sleep -- app still serving, routing updates propagate
t=10s     preStop finishes, app receives SIGTERM, begins graceful shutdown
t=10-60s  App drains connections and exits
t=60s     SIGKILL if still running (the grace period includes the preStop time)

The 10-second sleep ensures routing rules are updated before the application starts shutting down.
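Note that the preStop hook runs inside the grace period, so the sleep eats into the application's drain budget. A quick sanity check, using the values assumed in the manifest above:

```python
# terminationGracePeriodSeconds covers preStop execution + shutdown combined.
grace_period = 60    # terminationGracePeriodSeconds
pre_stop_sleep = 10  # preStop "sleep 10"
drain_budget = grace_period - pre_stop_sleep
print(f"App has {drain_budget}s to drain after SIGTERM")
```

If your application needs longer than this remainder to drain, raise terminationGracePeriodSeconds rather than shortening the sleep.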

Pillar 4 -- PodDisruptionBudgets

Rolling updates are managed by the Deployment controller. But other operations can disrupt Pods too: node drains, cluster autoscaler scale-down, cluster upgrades. (A direct kubectl delete pod also disrupts Pods, but it bypasses PDBs, which only govern the Eviction API.)

A PodDisruptionBudget (PDB) prevents too many Pods from being disrupted simultaneously:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2              # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: web-app

With minAvailable: 2 and 3 replicas, at most 1 Pod can be voluntarily disrupted at a time. This protects against:

  • kubectl drain during node maintenance
  • Cluster autoscaler removing nodes
  • Spot/preemptible instance termination, when a node termination handler drains the node via the Eviction API
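The eviction arithmetic is simple -- a back-of-envelope sketch of the budget calculation, not the eviction API itself: the number of permitted voluntary disruptions is the healthy Pod count minus minAvailable.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """How many voluntary evictions the PDB will permit right now."""
    return max(0, healthy_pods - min_available)

print(allowed_disruptions(3, 2))  # one Pod may be evicted at a time
print(allowed_disruptions(2, 2))  # further evictions are blocked
```

This is why a drain proceeds one Pod at a time against this PDB: each eviction drops the healthy count to 2, and the next eviction waits until a replacement Pod is Ready.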

Pillar 5 -- Application-Level Connection Draining

Your application must handle SIGTERM gracefully:

# Python example with graceful shutdown
import signal
import sys
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class MyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

server = HTTPServer(('0.0.0.0', 8080), MyHandler)

def graceful_shutdown(signum, frame):
    print("Received SIGTERM, draining connections...")
    # shutdown() blocks until serve_forever() returns, so it must run in a
    # separate thread -- calling it directly here would deadlock the main thread.
    threading.Thread(target=server.shutdown).start()

signal.signal(signal.SIGTERM, graceful_shutdown)
server.serve_forever()  # returns once shutdown() completes
server.server_close()   # close the listening socket; in-flight requests are done
sys.exit(0)

The key behaviors:

  • Stop accepting new connections after receiving SIGTERM.
  • Finish processing in-flight requests within the grace period.
  • Exit cleanly so Kubernetes does not need to send SIGKILL.

The Complete Zero-Downtime Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  revisionHistoryLimit: 5
  progressDeadlineSeconds: 300
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: web-app
          image: web-app:2.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          startupProbe:
            httpGet:
              path: /healthz
              port: 8080
            failureThreshold: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
            failureThreshold: 5
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app

The Endpoint Propagation Race Condition

This is the most commonly overlooked issue. Here is the detailed sequence when a Pod is terminated:

1. API server marks Pod for deletion
2. kubelet receives watch event → runs preStop hook, then sends SIGTERM
3. endpoints controller receives watch event → removes Pod from Endpoints
4. kube-proxy receives Endpoints update → updates iptables/ipvs rules
5. Ingress controller receives Endpoints update → updates upstream list

Steps 2-5 happen concurrently and asynchronously. Without the preStop sleep, the application might shut down before kube-proxy finishes updating its rules. This causes connection refused errors or 502s for a brief window.

The sleep in the preStop hook gives all components time to propagate the endpoint removal.

Testing Zero Downtime

Use a load testing tool during deployment to verify:

# Terminal 1: Generate continuous traffic and watch the stream of status codes
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" http://web-app.default.svc.cluster.local/
done

# Terminal 2: Trigger a deployment update
kubectl set image deployment/web-app web-app=web-app:2.1

If you see any non-200 responses during the rollout, one of the five pillars is misconfigured.

Additional Considerations

minReadySeconds: Adds a delay after a Pod is Ready before it counts as Available. Useful as an additional safety buffer:

spec:
  minReadySeconds: 10

Topology spread constraints: Ensure Pods are spread across nodes and zones so a single node failure does not take down the service:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web-app

Summary

True zero-downtime deployments require all five pillars working together: a rolling update with maxUnavailable: 0, readiness probes to gate traffic, preStop hooks to handle the endpoint propagation race condition, PodDisruptionBudgets to protect against voluntary disruptions, and application-level graceful shutdown. Missing any one of these layers can cause brief outages during otherwise well-configured deployments.

Why Interviewers Ask This

Zero-downtime deployment is a production requirement for most organizations. This question tests whether you understand the full stack of configurations needed, not just the basics.

Common Follow-Up Questions

Why do you still see errors during a rolling update even with readiness probes?
There is a race condition between the Pod being removed from endpoints and kube-proxy updating iptables rules. During that window, new requests can still be routed to the terminating Pod. A preStop hook with a short sleep solves this.
How do PodDisruptionBudgets help with zero downtime?
PDBs prevent voluntary disruptions (node drains, cluster upgrades) from taking down too many Pods at once. They complement maxUnavailable by protecting against operations outside the Deployment controller.
What role does connection draining play?
When a Pod is terminating, it should stop accepting new connections and finish processing existing ones within the terminationGracePeriodSeconds window. This prevents dropped requests.

Key Takeaways

  • Zero downtime requires multiple overlapping configurations, not just a rolling update.
  • The preStop hook solves the race condition between endpoint removal and iptables updates.
  • PodDisruptionBudgets protect against voluntary disruptions like node drains.

Related Questions