Kubernetes Back-off Restarting Failed Container
Causes and Fixes
The 'Back-off restarting failed container' event indicates that a container has failed and the kubelet is waiting before restarting it, using an exponential backoff delay. This is the mechanism behind the CrashLoopBackOff status and means the container keeps crashing after each restart attempt.
Symptoms
- Pod events show 'Back-off restarting failed container' message
- Pod status may show CrashLoopBackOff
- Container restart count keeps increasing
- Backoff delay increases: 10s, 20s, 40s, up to 5 minutes
- kubectl describe pod shows increasing restart intervals
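To spot crash-looping pods quickly, you can filter `kubectl get pods` output for the CrashLoopBackOff status. The sample output below is illustrative (hypothetical pod names) so the filter can be tried without a cluster:

```shell
# Sample `kubectl get pods` output (hypothetical pod names)
pods='NAME        READY   STATUS             RESTARTS   AGE
web-7d4b9   0/1     CrashLoopBackOff   8          10m
api-5f6c2   1/1     Running            0          10m'

# Against a real cluster: kubectl get pods -A | grep CrashLoopBackOff
echo "$pods" | grep CrashLoopBackOff
```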
Common Causes
- Application error on startup (unhandled exception, bad config, missing environment variable)
- Memory limit exceeded, leading to an OOM kill (exit code 137)
- A misconfigured liveness probe repeatedly killing a slow-starting container
- An invalid entrypoint or non-executable binary (exit codes 126/127)
- A wrong restart policy for a one-shot task that exits successfully
Step-by-Step Troubleshooting
1. Check Pod Events and Restart Count
kubectl describe pod <pod-name> -n <namespace>
Look at the Events section:
Warning BackOff Back-off restarting failed container
And check the container status:
Restart Count: 8
Last State: Terminated
Exit Code: 1
Started: ...
Finished: ...
2. Check the Exit Code
The exit code tells you why the container is failing.
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
| Exit Code | Meaning | Common Cause |
|-----------|---------|--------------|
| 0 | Success | Container ran to completion (wrong restart policy) |
| 1 | Application error | Unhandled exception, config error |
| 126 | Permission denied | Binary not executable |
| 127 | Command not found | Invalid entrypoint |
| 137 | SIGKILL (OOM) | Memory limit exceeded |
| 139 | SIGSEGV | Segmentation fault |
| 143 | SIGTERM | Graceful shutdown |
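Exit codes above 128 encode the fatal signal as code minus 128, which is where 137, 139, and 143 in the table come from. A quick shell check confirms the signal numbers:

```shell
# Exit codes > 128 mean the container was killed by a signal (code - 128)
for code in 137 139 143; do
  echo "exit $code => signal $((code - 128))"
done
# Prints signals 9 (SIGKILL), 11 (SIGSEGV), and 15 (SIGTERM)
```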
3. Check Previous Container Logs
The most important diagnostic step is reading the logs from the crashed container.
# Logs from the previous (crashed) container
kubectl logs <pod-name> --previous
# Logs from a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name> --previous
# If the container crashes too quickly, get all available logs
kubectl logs <pod-name> --previous --timestamps
4. Check Liveness Probe Configuration
A misconfigured liveness probe can cause repeated kills.
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].livenessProbe}' | jq .
Signs of a probe issue:
- initialDelaySeconds is too short for the app to start
- timeoutSeconds is too short for the health endpoint to respond
- failureThreshold is too low
Fix by adding a startup probe or adjusting liveness probe timings:
startupProbe:
httpGet:
path: /healthz
port: 8080
failureThreshold: 30
periodSeconds: 10
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 0
periodSeconds: 10
failureThreshold: 3
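The startup probe above gives the application up to failureThreshold × periodSeconds to come up before the liveness probe takes over. A quick sanity check of that budget:

```shell
# Maximum startup time allowed by the startup probe above
failureThreshold=30
periodSeconds=10
echo "startup budget: $((failureThreshold * periodSeconds))s"
```

With these values the app gets a 300-second (5-minute) startup window, while the liveness probe still reacts within about 30 seconds once the app is up.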
5. Check Restart Policy
If the container exits with code 0 but keeps restarting, the restart policy may be wrong.
kubectl get pod <pod-name> -o jsonpath='{.spec.restartPolicy}'
- Always (default for Deployments): restarts the container regardless of exit code
- OnFailure: only restarts on a non-zero exit code
- Never: never restarts
For one-shot tasks that should run to completion, use a Job:
apiVersion: batch/v1
kind: Job
metadata:
name: my-task
spec:
template:
spec:
restartPolicy: OnFailure
containers:
- name: task
image: myapp:v1
command: ["./run-migration.sh"]
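Jobs also have their own retry cap, spec.backoffLimit (6 by default). As a sketch, a fragment like this merged into the Job spec above marks the Job as failed after three attempts instead of retrying indefinitely:

```yaml
spec:
  backoffLimit: 3  # mark the Job failed after 3 retries (default is 6)
```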
6. Understanding the Backoff Timer
The kubelet uses exponential backoff when restarting a failed container:
Attempt 1: restart after 10 seconds
Attempt 2: restart after 20 seconds
Attempt 3: restart after 40 seconds
Attempt 4: restart after 80 seconds
Attempt 5: restart after 160 seconds
Attempt 6+: restart after 300 seconds (5-minute cap)
The backoff timer resets after the container runs successfully for 10 minutes. Because of the 5-minute cap, during debugging you may have to wait up to 5 minutes for the next restart attempt.
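The doubling-with-cap schedule above can be reproduced in a few lines of shell (the 10-second base and 300-second cap are the kubelet defaults described in this section):

```shell
# Reproduce the kubelet's restart backoff: double from 10s, cap at 300s
delay=10
for attempt in 1 2 3 4 5 6 7; do
  echo "Attempt $attempt: ${delay}s"
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
```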
To force an immediate restart:
# Delete the pod (a Deployment will recreate it with reset backoff)
kubectl delete pod <pod-name>
7. Debug with an Ephemeral Container
If the container crashes too quickly to exec into it, use an ephemeral container.
# Attach a debug container to the pod
kubectl debug <pod-name> -it --image=busybox --target=<container-name> -- sh
# Or run a copy of the pod with a different command
kubectl debug <pod-name> -it --copy-to=debug-pod --container=<container-name> -- sh
8. Debug by Overriding the Command
Create a debug version of the pod that sleeps instead of running the crashing command.
kubectl run debug-pod --image=<same-image> --restart=Never --command -- sleep 3600
kubectl exec -it debug-pod -- sh
# Inside the container, try running the original command manually
/app/start.sh
This lets you see the error output interactively and inspect the environment.
9. Fix and Verify
Apply the appropriate fix and verify the backoff clears.
# Fix the deployment
kubectl set image deployment/<deploy-name> <container>=<fixed-image>
# Or fix the config issue
kubectl edit deployment <deploy-name>
# Watch for successful start
kubectl get pods -w
# Verify no more backoff events
kubectl describe pod <pod-name> | grep -i backoff
The pod should start, remain running, and the restart count should stop incrementing.
How to Explain This in an Interview
I would explain that the back-off mechanism is the kubelet's way of avoiding resource waste when a container keeps failing. The delay starts at 10 seconds and doubles up to a cap of 5 minutes. The back-off resets after the container runs successfully for 10 minutes. I would describe how to debug this by checking the exit code and previous container logs, and I would discuss the relationship between the back-off event, the CrashLoopBackOff status, and the container's restart policy.
Prevention
- Implement health checks and graceful startup in applications
- Use startup probes for slow-starting applications
- Set appropriate resource limits to avoid OOM kills
- Use init containers to wait for dependencies
- Use Jobs for one-shot tasks instead of Deployments
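As a sketch of the init-container pattern from the list above, the following fragment (with a hypothetical `db` Service name) blocks the main container from starting until its dependency resolves in cluster DNS:

```yaml
spec:
  initContainers:
    - name: wait-for-db
      image: busybox
      # Loop until the (hypothetical) `db` Service resolves in cluster DNS
      command: ["sh", "-c", "until nslookup db; do echo waiting for db; sleep 2; done"]
  containers:
    - name: app
      image: myapp:v1
```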