How Do You Troubleshoot Kubernetes Deployment Issues?

intermediate | deployments · devops · sre · backend developer · CKA · CKAD
TL;DR

Troubleshooting Deployment issues involves checking rollout status, inspecting Pod events, reviewing container logs, and verifying resource availability. Common problems include image pull errors, crash-looping containers, insufficient resources, and failed health checks.

Detailed Answer

When a Deployment is not behaving as expected, a systematic approach helps you identify the root cause quickly. Here is a step-by-step troubleshooting methodology.

Step 1: Check Rollout Status

kubectl rollout status deployment/web
# Output examples:
# "deployment "web" successfully rolled out"
# "Waiting for deployment "web" rollout to finish: 1 out of 3 new replicas have been updated..."
# "error: deployment "web" exceeded its progress deadline"

If the rollout is stuck, the progressDeadlineSeconds (default 600s) may have been exceeded.
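The deadline is tunable per Deployment via spec.progressDeadlineSeconds. A minimal sketch (values illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  progressDeadlineSeconds: 300  # report ProgressDeadlineExceeded after 5 min without progress
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myapp:2.0
```

Note that exceeding the deadline only flips the Progressing condition to False; Kubernetes does not roll back automatically.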

Step 2: Inspect the Deployment

kubectl describe deployment web

Key sections to examine:

  • Conditions: Look for Available, Progressing, and ReplicaFailure
  • Events: Shows scaling decisions and errors
  • Replicas: Compare desired, updated, ready, and available counts

Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    False   ProgressDeadlineExceeded

Step 3: Check ReplicaSets

kubectl get rs -l app=web
# NAME         DESIRED   CURRENT   READY   AGE
# web-abc123   3         3         3       2d    (old - stable)
# web-def456   1         1         0       5m    (new - not ready)

If the new ReplicaSet has Pods that are not ready, drill into those Pods.

Step 4: Inspect Pods

kubectl get pods -l app=web
kubectl describe pod web-def456-xyz

Common Pod Error States

ImagePullBackOff / ErrImagePull

Events:
  Warning  Failed   kubelet  Failed to pull image "myapp:latest": rpc error
  Warning  Failed   kubelet  Error: ImagePullBackOff

Causes and fixes:

  • Wrong image name or tag → verify image exists in the registry
  • Missing imagePullSecrets → create and attach the secret
  • Private registry authentication → check the docker-registry secret

# Verify image exists
docker manifest inspect myapp:2.0

# Check imagePullSecrets
kubectl get pod web-xyz -o jsonpath='{.spec.imagePullSecrets}'

# Create a pull secret
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass
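The secret must then be referenced from the Pod spec (or attached to the ServiceAccount). A sketch of the relevant Pod template fragment, using the regcred secret created above:

```yaml
    spec:
      imagePullSecrets:
        - name: regcred           # must exist in the same namespace as the Pod
      containers:
        - name: web
          image: registry.example.com/myapp:2.0
```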

CrashLoopBackOff

Events:
  Warning  BackOff  kubelet  Back-off restarting failed container

Debug steps:

# View the crash logs (--previous shows logs from the last terminated container)
kubectl logs web-xyz --previous

# Check if it was OOMKilled
kubectl describe pod web-xyz | grep -A 3 "Last State"
# Last State:  Terminated
#   Reason:    OOMKilled
#   Exit Code: 137

# Try running the container interactively
kubectl run debug --image=myapp:2.0 --rm -it -- /bin/sh
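If the Last State shows OOMKilled (exit code 137), the container exceeded its memory limit; the fix is either raising the limit or fixing the leak. A sketch of the container resources block, with illustrative values:

```yaml
    spec:
      containers:
        - name: web
          image: myapp:2.0
          resources:
            requests:
              memory: "256Mi"
            limits:
              memory: "512Mi"   # raise if the app legitimately needs more memory
```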

Pending Pods

kubectl describe pod web-xyz
# Events:
#   Warning  FailedScheduling  0/5 nodes are available:
#   3 Insufficient cpu, 2 node(s) had untolerated taint

Common causes:

  • Insufficient cluster resources → check node allocatable vs. requests
  • Taints without matching tolerations → check node taints
  • PVC not bound → check PV availability
  • Node selector or affinity mismatch → verify node labels

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check node taints
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
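If the Pods are meant to land on tainted nodes, add a matching toleration to the Pod template. A sketch assuming a hypothetical dedicated=web:NoSchedule taint:

```yaml
    spec:
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "web"
          effect: "NoSchedule"   # must match the taint's key, value, and effect
```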

Step 5: Check Health Probes

Failed readiness probes prevent Pods from receiving traffic and stall rollouts:

kubectl describe pod web-xyz | grep -A 10 "Readiness"
# Readiness probe failed: HTTP probe failed with statuscode: 503

Fixes:

  • Increase initialDelaySeconds if the app needs time to start
  • Check that the probe endpoint actually returns a success status (any code from 200 to 399 passes)
  • Verify the probe port matches the container port
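The probe settings live on the container spec. A sketch with illustrative values, assuming the app exposes a /healthz endpoint on port 8080:

```yaml
        - name: web
          image: myapp:2.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080              # must match the port the app actually listens on
            initialDelaySeconds: 15   # give the app time to start
            periodSeconds: 5
            failureThreshold: 3
```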

Step 6: Review Events Cluster-Wide

# All events sorted by time
kubectl get events --sort-by='.lastTimestamp' -A

# Events for a specific namespace
kubectl get events -n production --field-selector reason=FailedScheduling

Deployment Troubleshooting Flowchart

Deployment issue
├── kubectl rollout status → Stuck?
│   ├── Yes → Check new ReplicaSet Pods
│   │   ├── Pending → Resource/scheduling issue
│   │   ├── CrashLoopBackOff → Check logs --previous
│   │   ├── ImagePullBackOff → Check image name/secrets
│   │   └── Running but not Ready → Check readiness probe
│   └── No → Rollout succeeded, issue is elsewhere
├── Wrong version running?
│   └── Check image tag on running Pods
└── Pods running but not receiving traffic?
    └── Check Service selector matches Pod labels
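For the last branch, the Service's selector must match the Pod labels exactly, or the Service's Endpoints will be empty. A sketch of a matching pair (names and ports illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web          # must match the labels on the Pods...
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: web-xyz
  labels:
    app: web          # ...set here (via the Deployment's Pod template)
spec:
  containers:
    - name: web
      image: myapp:2.0
```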

Useful Troubleshooting Commands Summary

# Quick health overview
kubectl get deployment,rs,pods -l app=web

# Rollout history
kubectl rollout history deployment web

# Rollback if needed
kubectl rollout undo deployment web

# Watch Pods in real-time
kubectl get pods -l app=web -w

# Get YAML of a running Pod for comparison
kubectl get pod web-xyz -o yaml

Why Interviewers Ask This

Deployment troubleshooting is one of the most practical skills tested in interviews. It demonstrates your ability to systematically diagnose production issues under pressure.

Common Follow-Up Questions

What does ImagePullBackOff mean and how do you fix it?
The kubelet cannot pull the container image. Common causes are a wrong image name or tag, a missing imagePullSecrets entry, or an unreachable registry. Check the image name and ensure the pull secret exists.

How do you debug a CrashLoopBackOff Pod?
Check kubectl logs --previous to see the crash output. Common causes are missing environment variables, failed database connections, misconfigured entrypoints, or OOMKilled containers.

What does a stalled rollout indicate?
A rollout that does not progress usually means new Pods are failing readiness probes or cannot be scheduled. Check kubectl rollout status and kubectl describe on the stuck Pods.

Key Takeaways

  • Always start with kubectl rollout status, then drill down into describe and logs.
  • ImagePullBackOff and CrashLoopBackOff are the two most common Pod failure patterns.
  • A stuck rollout is often caused by readiness probe failures on the new ReplicaSet's Pods.
