Kubernetes DeadlineExceeded
Causes and Fixes
DeadlineExceeded means a Kubernetes resource exceeded its configured time limit. It most commonly applies to Jobs that run longer than their activeDeadlineSeconds, Deployments whose rollouts exceed progressDeadlineSeconds, and individual pods that run past the activeDeadlineSeconds set in their pod spec.
Symptoms
- Job status shows 'DeadlineExceeded' condition
- Deployment shows 'ProgressDeadlineExceeded' in conditions
- kubectl describe job shows 'Job was active longer than specified deadline'
- Pods are terminated when the deadline is reached
- Rollout appears stuck and eventually reports failure
Common Causes
- activeDeadlineSeconds set lower than the Job's actual run time
- A new application version that crashes or fails its readiness probe, stalling the rollout
- Wrong image references causing ImagePullBackOff
- Insufficient cluster resources leaving new pods Pending
- Slow image pulls or application startup exceeding the default 600s progress deadline
Step-by-Step Troubleshooting
1. Determine the Context
DeadlineExceeded applies to different resources. Identify which one is affected.
# Check Jobs
kubectl get jobs -n <namespace>
kubectl describe job <job-name> -n <namespace>
# Check Deployments
kubectl rollout status deployment/<deploy-name> -n <namespace>
kubectl describe deployment <deploy-name> -n <namespace>
2. Troubleshoot Job DeadlineExceeded
For Jobs, the activeDeadlineSeconds sets a hard time limit on the entire Job.
# Check the Job's deadline configuration
kubectl get job <job-name> -o jsonpath='{.spec.activeDeadlineSeconds}'
# Check how long the Job ran
kubectl describe job <job-name> | grep -E "Start Time|Completion Time|Active Deadline"
# Check Job conditions
kubectl get job <job-name> -o jsonpath='{.status.conditions}' | jq .
The condition will show:
{
  "type": "Failed",
  "status": "True",
  "reason": "DeadlineExceeded",
  "message": "Job was active longer than specified deadline"
}
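If you monitor many Jobs, the failure reason can be pulled out in a script. A sketch using plain grep/cut (the sample JSON stands in for the live kubectl output above):

```shell
# Sample condition JSON, standing in for:
#   kubectl get job <job-name> -o jsonpath='{.status.conditions}'
conditions='[{"type":"Failed","status":"True","reason":"DeadlineExceeded","message":"Job was active longer than specified deadline"}]'

# Extract the failure reason without needing jq
reason=$(echo "$conditions" | grep -o '"reason":"[^"]*"' | cut -d'"' -f4)
echo "$reason"   # DeadlineExceeded
```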
Fix: Increase the deadline or optimize the workload.
apiVersion: batch/v1
kind: Job
metadata:
  name: long-running-job
spec:
  activeDeadlineSeconds: 7200  # 2 hours; if unset, the Job has no deadline
  template:
    spec:
      containers:
      - name: worker
        image: myapp:v1
        command: ["./process.sh"]
      restartPolicy: OnFailure
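When choosing a value, a common rule of thumb is the benchmarked run time plus a generous buffer. A minimal sketch, assuming a measured run time of 4800 seconds (the benchmark figure is a placeholder for your own measurement):

```shell
# Benchmark figure is an assumption; substitute your own measurement.
BENCHMARK_SECONDS=4800

# Add a 50% buffer so routine variance does not kill the Job.
DEADLINE=$(( BENCHMARK_SECONDS * 3 / 2 ))
echo "activeDeadlineSeconds: ${DEADLINE}"   # activeDeadlineSeconds: 7200
```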
3. Troubleshoot Deployment ProgressDeadlineExceeded
For Deployments, progressDeadlineSeconds (default: 600s) controls how long Kubernetes waits for a rollout to make progress.
# Check rollout status
kubectl rollout status deployment/<deploy-name>
# Check deployment conditions
kubectl get deployment <deploy-name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="Progressing")'
# Check the progress deadline
kubectl get deployment <deploy-name> -o jsonpath='{.spec.progressDeadlineSeconds}'
The condition will show:
{
  "type": "Progressing",
  "status": "False",
  "reason": "ProgressDeadlineExceeded",
  "message": "ReplicaSet \"app-5b9f\" has timed out progressing."
}
4. Investigate Why the Rollout Stalled
Check the new ReplicaSet's pods to understand why they are not becoming ready.
# Find the new ReplicaSet
kubectl get rs -n <namespace> --sort-by='.metadata.creationTimestamp' | tail -3
# Check pods in the new ReplicaSet
kubectl get pods -n <namespace> -l app=<app-label> --sort-by='.metadata.creationTimestamp'
# Describe a pod from the new ReplicaSet
kubectl describe pod <new-pod-name>
Common reasons the rollout stalls:
- Pods stuck in Pending (insufficient resources)
- Pods in CrashLoopBackOff (application error in new version)
- Pods in ImagePullBackOff (wrong image reference)
- Pods running but readiness probe failing (application not healthy)
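With many replicas, a quick tally of pod states makes the dominant failure mode obvious at a glance. A sketch that counts the STATUS column of `kubectl get pods` output (the sample text stands in for live cluster output):

```shell
# Sample output, standing in for:
#   kubectl get pods -n <namespace> -l app=<app-label>
pods='NAME         READY   STATUS             RESTARTS   AGE
app-5b9f-1   0/1     ImagePullBackOff   0          5m
app-5b9f-2   0/1     ImagePullBackOff   0          5m
app-5b9f-3   0/1     Pending            0          5m'

# Count pods per STATUS (column 3), skipping the header row
echo "$pods" | awk 'NR > 1 {count[$3]++} END {for (s in count) print count[s], s}'
```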
5. Check Readiness Probes
If pods are running but not ready, the readiness probe is failing.
# Check readiness probe config
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].readinessProbe}' | jq .
# Check if the pod is ready
kubectl get pod <pod-name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="Ready")'
# Test the readiness endpoint manually
kubectl exec <pod-name> -- wget -q -O - http://localhost:8080/readyz
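If the endpoint responds but the pod still never becomes ready, compare the probe's timing fields against how long the application actually needs. A typical HTTP readiness probe looks like this (the /readyz path and port 8080 are assumptions carried over from the manual test above; tune the values to your app):

```yaml
readinessProbe:
  httpGet:
    path: /readyz      # assumed health endpoint
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3  # pod is marked NotReady after 3 consecutive failures
```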
6. Roll Back the Deployment
If the new version is broken, roll back.
# Roll back to the previous version
kubectl rollout undo deployment/<deploy-name>
# Roll back to a specific revision
kubectl rollout history deployment/<deploy-name>
kubectl rollout undo deployment/<deploy-name> --to-revision=3
# Verify the rollback
kubectl rollout status deployment/<deploy-name>
7. Adjust the Progress Deadline
If the rollout is legitimate but slow (e.g., large image pulls, slow startup), increase the deadline.
kubectl patch deployment <deploy-name> \
  -p '{"spec":{"progressDeadlineSeconds": 1200}}'
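The same setting can be made declaratively in the manifest, so it survives future applies (a sketch; the Deployment name and value are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deploy               # illustrative name
spec:
  progressDeadlineSeconds: 1200 # allow up to 20 minutes without rollout progress
  # ...rest of the Deployment spec unchanged
```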
For applications with slow startups, also consider startup probes:
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 60  # allows up to 600s for startup (60 × 10s)
  periodSeconds: 10
8. Clean Up Failed Jobs
Failed Jobs remain in the cluster. Clean them up to free resources.
# Delete failed jobs
kubectl delete jobs --field-selector status.successful=0 -n <namespace>
# Set TTL to auto-clean completed/failed Jobs
apiVersion: batch/v1
kind: Job
metadata:
  name: cleanup-job
spec:
  ttlSecondsAfterFinished: 3600  # delete 1 hour after the Job finishes
  activeDeadlineSeconds: 600
  template:
    spec:
      containers:
      - name: worker
        image: myapp:v1
      restartPolicy: Never
9. Verify the Fix
# For Jobs: confirm the Job completes within the deadline
kubectl get job <job-name> -w
# For Deployments: confirm the rollout completes
kubectl rollout status deployment/<deploy-name>
kubectl get deployment <deploy-name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="Progressing")'
The Job should show Complete status, or the Deployment should show Progressing: True with reason NewReplicaSetAvailable.
How to Explain This in an Interview
I would distinguish between the two main contexts: Job activeDeadlineSeconds (which terminates the Job's pods after the deadline) and Deployment progressDeadlineSeconds (which marks the rollout as failed but does not roll back automatically). For Jobs, I would discuss setting appropriate deadlines based on expected run times plus buffer. For Deployments, I would explain that the progress deadline is a monitoring mechanism — if a rollout stalls, you need to investigate why pods are not becoming ready and decide whether to roll back manually.
Prevention
- Set realistic activeDeadlineSeconds for Jobs based on benchmarked run times
- Monitor Deployment rollouts with progressDeadlineSeconds alerts
- Use readiness probes so rollouts detect unhealthy pods
- Pre-pull images on nodes to reduce startup time
- Ensure cluster has enough headroom for rolling updates
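For the image pre-pull suggestion above, one common pattern is a DaemonSet that pulls the application image on every node ahead of the rollout. A sketch, assuming the image ships a shell (image names and labels are illustrative):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-myapp
spec:
  selector:
    matchLabels:
      app: prepull-myapp
  template:
    metadata:
      labels:
        app: prepull-myapp
    spec:
      initContainers:
      - name: prepull
        image: myapp:v2                    # the image to cache on each node
        command: ["sh", "-c", "true"]      # exit immediately; the pull is the point
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # tiny placeholder to keep the pod alive
```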