How Do You Monitor and Troubleshoot a Deployment Rollout?

intermediate | deployments, devops, sre, CKA, CKAD
TL;DR

Use kubectl rollout status to watch a rollout in real time. Combine it with kubectl describe deployment and kubectl get events to diagnose stuck or failed rollouts caused by image pull errors, resource limits, or failing health checks.

Detailed Answer

When a deployment rollout does not go as planned, you need to quickly determine what went wrong and decide whether to fix forward or roll back. Kubernetes provides several tools for monitoring and troubleshooting rollouts.

Monitoring a Rollout in Real Time

# Watch the rollout progress
kubectl rollout status deployment/web-app

# Output during a healthy rollout:
# Waiting for deployment "web-app" rollout to finish: 1 of 3 updated replicas are available...
# Waiting for deployment "web-app" rollout to finish: 2 of 3 updated replicas are available...
# deployment "web-app" successfully rolled out

The command exits with status 0 when the rollout succeeds and non-zero when it fails or times out, which makes it ideal for CI/CD pipelines:

kubectl apply -f deployment.yaml
kubectl rollout status deployment/web-app --timeout=300s || {
  echo "Rollout failed! Rolling back..."
  kubectl rollout undo deployment/web-app
  exit 1
}

Deployment Conditions

Kubernetes tracks three conditions on every Deployment:

kubectl get deployment web-app -o jsonpath='{.status.conditions[*]}' | jq .

| Condition | Meaning |
|---|---|
| Available | Minimum required Pods are ready and have been available for minReadySeconds. |
| Progressing | The rollout is making progress (creating or deleting Pods). |
| ReplicaFailure | The controller could not create new Pods (quota exceeded, invalid spec, etc.). |

A healthy Deployment has Available=True and Progressing=True. A stuck rollout typically shows Progressing=True with reason ReplicaSetUpdated but no new Pods becoming Ready.
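To block on a condition directly (for example in a smoke-test script), kubectl wait can poll it without parsing output; the deployment name here is the article's example web-app:

```shell
# Wait until the Deployment reports Available=True, or fail after 2 minutes
kubectl wait --for=condition=Available deployment/web-app --timeout=120s

# Print only the Progressing condition's reason (e.g. ReplicaSetUpdated)
kubectl get deployment web-app \
  -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}'
```

Like kubectl rollout status, kubectl wait exits non-zero on timeout, so it composes well with shell error handling.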

Diagnosing a Stuck Rollout

Step 1 -- Check Deployment status

kubectl describe deployment web-app

Look at the Conditions and Events sections:

Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    ReplicaSetUpdated

Events:
  Type    Reason             Age   Message
  ----    ------             ----  -------
  Normal  ScalingReplicaSet  2m    Scaled up replica set web-app-8d9f7e0b2 to 1

Step 2 -- Check ReplicaSet status

kubectl get replicasets -l app=web-app
NAME                  DESIRED   CURRENT   READY   AGE
web-app-7c8e6d9a1     3         3         3       1d    # old, still running
web-app-8d9f7e0b2     1         1         0       2m    # new, not ready

The new ReplicaSet has 1 Pod created but 0 Ready -- the Pod is failing.

Step 3 -- Check the failing Pod

# Find the new Pod
kubectl get pods -l app=web-app --sort-by=.metadata.creationTimestamp

# Describe the failing Pod
kubectl describe pod web-app-8d9f7e0b2-xyz99

# Check container logs
kubectl logs web-app-8d9f7e0b2-xyz99
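If the container is restarting, the current log stream may be empty or truncated; the logs from the previous container instance are often more informative. The Pod name below is the example's placeholder:

```shell
# Logs from the previous (crashed) container instance
kubectl logs web-app-8d9f7e0b2-xyz99 --previous

# If the Pod runs multiple containers, name one explicitly
kubectl logs web-app-8d9f7e0b2-xyz99 -c web-app --previous
```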

Common Rollout Failure Causes

Image Pull Errors

Events:
  Warning  Failed     1m  kubelet  Failed to pull image "web-app:typo": ...
  Warning  Failed     1m  kubelet  Error: ImagePullBackOff

Fix: Correct the image name or tag, ensure the image exists, verify imagePullSecrets.
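For private registries, a missing or misconfigured pull secret is a frequent culprit. A sketch of creating one and attaching it to the Pod template (the registry URL, username, and secret name are placeholder values):

```shell
# Create a registry credential secret (all values are placeholders)
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=ci-bot \
  --docker-password='<token>'

# Reference the secret from the Deployment's Pod template
kubectl patch deployment web-app -p \
  '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'
```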

Insufficient Resources

Events:
  Warning  FailedScheduling  1m  default-scheduler  0/5 nodes are available:
  5 Insufficient cpu.

Fix: Reduce resource requests, add nodes, or scale down other workloads.
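To see how much allocatable capacity each node has left, and to lower the new Pods' requests without editing YAML by hand, something like the following works against the example deployment:

```shell
# Per-node summary of requested vs. allocatable resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# Lower the CPU request on the example container
kubectl set resources deployment/web-app -c web-app --requests=cpu=100m
```

Note that changing requests triggers a new rollout of its own.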

Failing Readiness Probe

Events:
  Warning  Unhealthy  1m  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503

Fix: Check application startup, verify the probe endpoint path and port, increase initialDelaySeconds.
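To check whether the endpoint itself is healthy, you can port-forward to the failing Pod and hit the probe path directly; the path and port below match the example manifest, and the Pod name is a placeholder:

```shell
# Forward local port 8080 to the failing Pod
kubectl port-forward pod/web-app-8d9f7e0b2-xyz99 8080:8080 &

# Hit the probe endpoint the same way the kubelet does
curl -i http://localhost:8080/healthz
```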

CrashLoopBackOff

Events:
  Warning  BackOff  1m  kubelet  Back-off restarting failed container

Fix: Check kubectl logs for application errors. Common causes: missing environment variables, config map errors, database connection failures.
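The previous exit reason and code are recorded on the Pod's status and often pinpoint the failure (for example, OOMKilled versus an application exit code). The Pod name is the example placeholder:

```shell
# Exit reason and code of the last terminated container instance
kubectl get pod web-app-8d9f7e0b2-xyz99 -o \
  jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
```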

Using progressDeadlineSeconds

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  progressDeadlineSeconds: 300
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: web-app:2.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5

If no progress is made for 300 seconds, the Deployment condition changes:

Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    False   ProgressDeadlineExceeded

Kubernetes does not automatically roll back. The failed condition is a signal for external automation or alerting.
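One way to turn that condition into automation is to read the Progressing condition's reason and roll back (or alert) when the deadline has been exceeded; a minimal sketch using the example deployment:

```shell
# Read the Progressing condition's reason from the Deployment status
reason=$(kubectl get deployment web-app \
  -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}')

if [ "$reason" = "ProgressDeadlineExceeded" ]; then
  echo "Progress deadline exceeded, rolling back..."
  kubectl rollout undo deployment/web-app
fi
```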

Pausing and Resuming Rollouts

# Pause the rollout (useful for making multiple changes)
kubectl rollout pause deployment/web-app

# Make several changes without triggering multiple rollouts
kubectl set image deployment/web-app web-app=web-app:2.1
kubectl set env deployment/web-app LOG_LEVEL=debug
kubectl set resources deployment/web-app -c web-app --limits=cpu=500m,memory=512Mi

# Resume to trigger a single rollout with all changes
kubectl rollout resume deployment/web-app
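While paused, spec changes accumulate without creating a new ReplicaSet. You can confirm the paused state before resuming:

```shell
# Prints 'true' while the rollout is paused (empty output when not paused)
kubectl get deployment web-app -o jsonpath='{.spec.paused}'
```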

CI/CD Integration Pattern

#!/bin/bash
set -euo pipefail

DEPLOYMENT="web-app"
TIMEOUT="300s"

echo "Applying deployment..."
kubectl apply -f deployment.yaml

echo "Waiting for rollout to complete..."
if ! kubectl rollout status deployment/${DEPLOYMENT} --timeout=${TIMEOUT}; then
  echo "FAILED: Rollout did not complete within ${TIMEOUT}"

  echo "Deployment status:"
  kubectl get deployment ${DEPLOYMENT} -o wide

  echo "Pod status:"
  kubectl get pods -l app=${DEPLOYMENT} --sort-by=.metadata.creationTimestamp

  echo "Recent events:"
  kubectl get events --sort-by=.lastTimestamp --field-selector involvedObject.kind=Deployment,involvedObject.name=${DEPLOYMENT}

  echo "Rolling back..."
  kubectl rollout undo deployment/${DEPLOYMENT}
  kubectl rollout status deployment/${DEPLOYMENT} --timeout=${TIMEOUT}

  exit 1
fi

echo "Rollout complete."

Summary

Monitoring and troubleshooting rollouts requires a systematic approach: check the Deployment conditions, examine the new ReplicaSet, then inspect the failing Pods. The kubectl rollout status command is your primary monitoring tool, progressDeadlineSeconds automates failure detection, and kubectl rollout pause/resume lets you batch multiple changes into a single rollout. Building these checks into your CI/CD pipeline ensures failed deployments are caught and reverted automatically.

Why Interviewers Ask This

Debugging a stuck deployment is a common on-call task. Interviewers want to see that you have a systematic approach to diagnosing rollout failures rather than guessing.

Common Follow-Up Questions

What does progressDeadlineSeconds do?
It sets the maximum time a Deployment has to make progress before it is considered failed. The default is 600 seconds. When exceeded, the Deployment condition changes to Progressing=False.
What is the difference between a stalled rollout and a failed rollout?
A stalled rollout is still in progress but not making headway (e.g., new Pods keep failing readiness checks). A failed rollout has exceeded progressDeadlineSeconds and been marked as failed by the controller.
How do you pause and resume a rollout?
Use kubectl rollout pause deployment/<name> and kubectl rollout resume deployment/<name>. Pausing lets you make multiple changes before triggering a single rollout.

Key Takeaways

  • kubectl rollout status is the primary tool for monitoring rollout progress.
  • Deployment conditions (Available, Progressing, ReplicaFailure) reveal the root cause.
  • progressDeadlineSeconds automates failure detection for stuck rollouts.

Related Questions