Kubernetes Back-off Restarting Failed Container

Causes and Fixes

The 'Back-off restarting failed container' event indicates that a container has failed and the kubelet is waiting before restarting it, using an exponential backoff delay. This is the mechanism behind the CrashLoopBackOff status and means the container keeps crashing after each restart attempt.

Symptoms

  • Pod events show 'Back-off restarting failed container' message
  • Pod status may show CrashLoopBackOff
  • Container restart count keeps increasing
  • Backoff delay increases: 10s, 20s, 40s, up to 5 minutes
  • kubectl describe pod shows increasing restart intervals

Common Causes

1. Application crashes on startup
The application exits with a non-zero code immediately after starting. Check logs with kubectl logs --previous to see the crash output.

2. Liveness probe kills the container
An overly aggressive liveness probe terminates the container before it can become healthy. Increase initialDelaySeconds or use a startup probe.

3. Container runs a one-shot command
The container runs a command that completes and exits successfully (exit code 0) but restartPolicy is Always. Use a Job instead of a Deployment.

4. Out of memory
The container is OOM-killed (exit code 137) each time it starts. Increase memory limits.

5. Missing runtime dependencies
The application cannot connect to a required database or service and exits on failure. Use init containers to wait for dependencies.
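For the out-of-memory case, the fix is raising the container's memory limit. A minimal sketch of the relevant fields (the name my-app, the image tag, and the sizes are placeholders; tune them to the application's real footprint):

```yaml
# Hypothetical container spec fragment; adjust sizes to the app's actual usage
containers:
  - name: my-app
    image: myapp:v1
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "512Mi"   # raise this if the container keeps exiting with code 137
```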

Step-by-Step Troubleshooting

1. Check Pod Events and Restart Count

kubectl describe pod <pod-name> -n <namespace>

Look at the Events section:

Warning  BackOff  Back-off restarting failed container

And check the container status:

Restart Count:  8
Last State:     Terminated
  Exit Code:    1
  Started:      ...
  Finished:     ...

2. Check the Exit Code

The exit code tells you why the container is failing.

kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

| Exit Code | Meaning | Common Cause |
|-----------|---------|--------------|
| 0 | Success | Container ran to completion (wrong restart policy) |
| 1 | Application error | Unhandled exception, config error |
| 126 | Permission denied | Binary not executable |
| 127 | Command not found | Invalid entrypoint |
| 137 | SIGKILL (OOM) | Memory limit exceeded |
| 139 | SIGSEGV | Segmentation fault |
| 143 | SIGTERM | Graceful shutdown |
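Exit codes above 128 encode a fatal signal (exit code = 128 + signal number), which is how 137 maps to SIGKILL (signal 9) and 139 to SIGSEGV (signal 11). A quick sanity check:

```shell
# Decode a >128 exit code into its signal number (137 -> signal 9, SIGKILL)
code=137
if [ "$code" -gt 128 ]; then
  echo "exit code $code = 128 + signal $(( code - 128 ))"
fi
```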

3. Check Previous Container Logs

The most important diagnostic step is reading the logs from the crashed container.

# Logs from the previous (crashed) container
kubectl logs <pod-name> --previous

# Logs from a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name> --previous

# If the container crashes too quickly, get all available logs
kubectl logs <pod-name> --previous --timestamps

4. Check Liveness Probe Configuration

A misconfigured liveness probe can cause repeated kills.

kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].livenessProbe}' | jq .

Signs of a probe issue:

  • initialDelaySeconds is too short for the app to start
  • timeoutSeconds is too short for the health endpoint to respond
  • failureThreshold is too low

Fix by adding a startup probe or adjusting liveness probe timings:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 10
  failureThreshold: 3
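With the values above, the startup probe gives the application failureThreshold × periodSeconds to come up before the kubelet starts killing it; the liveness probe only takes over once the startup probe succeeds:

```shell
# Maximum startup window allowed by the startup probe above
failureThreshold=30
periodSeconds=10
echo "$(( failureThreshold * periodSeconds ))s"   # 300s, i.e. 5 minutes
```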

5. Check Restart Policy

If the container exits with code 0 but keeps restarting, the restart policy may be wrong.

kubectl get pod <pod-name> -o jsonpath='{.spec.restartPolicy}'

  • Always (the default, and the only policy a Deployment allows): restarts the container regardless of exit code
  • OnFailure: restarts only on a non-zero exit code
  • Never: never restarts

For one-shot tasks that should run to completion, use a Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: my-task
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: task
          image: myapp:v1
          command: ["./run-migration.sh"]

6. Understanding the Backoff Timer

The kubelet uses exponential backoff when restarting a failed container:

Attempt 1: restart after 10 seconds
Attempt 2: restart after 20 seconds
Attempt 3: restart after 40 seconds
Attempt 4: restart after 80 seconds
Attempt 5: restart after 160 seconds
Attempt 6+: restart after 300 seconds (5-minute cap)
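The doubling-with-cap schedule above can be reproduced in a few lines of shell (a sketch of the kubelet's behavior, not its actual code):

```shell
# Exponential backoff: start at 10s, double after each failure, cap at 300s
delay=10
for attempt in 1 2 3 4 5 6; do
  echo "Attempt $attempt: restart after ${delay}s"
  delay=$(( delay * 2 ))
  [ "$delay" -gt 300 ] && delay=300
done
```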

The backoff timer resets after the container runs successfully for 10 minutes. This means that during debugging, you may need to wait up to 5 minutes for the next restart attempt.

To force an immediate restart:

# Delete the pod (a Deployment will recreate it with reset backoff)
kubectl delete pod <pod-name>

7. Debug with an Ephemeral Container

If the container crashes too quickly to exec into it, use an ephemeral container.

# Attach a debug container to the pod
kubectl debug <pod-name> -it --image=busybox --target=<container-name> -- sh

# Or run a copy of the pod with a different command
kubectl debug <pod-name> -it --copy-to=debug-pod --container=<container-name> -- sh

8. Debug by Overriding the Command

Create a debug version of the pod that sleeps instead of running the crashing command.

kubectl run debug-pod --image=<same-image> --restart=Never --command -- sleep 3600
kubectl exec -it debug-pod -- sh

# Inside the container, try running the original command manually
/app/start.sh

This lets you see the error output interactively and inspect the environment.

9. Fix and Verify

Apply the appropriate fix and verify the backoff clears.

# Fix the deployment
kubectl set image deployment/<deploy-name> <container>=<fixed-image>

# Or fix the config issue
kubectl edit deployment <deploy-name>

# Watch for successful start
kubectl get pods -w

# Verify no more backoff events
kubectl describe pod <pod-name> | grep -i backoff

The pod should start, remain running, and the restart count should stop incrementing.

How to Explain This in an Interview

I would explain that the back-off mechanism is the kubelet's way of avoiding resource waste when a container keeps failing. The delay starts at 10 seconds and doubles up to a cap of 5 minutes. The back-off resets after the container runs successfully for 10 minutes. I would describe how to debug this by checking the exit code and previous container logs, and I would discuss the relationship between the back-off event, the CrashLoopBackOff status, and the container's restart policy.

Prevention

  • Implement health checks and graceful startup in applications
  • Use startup probes for slow-starting applications
  • Set appropriate resource limits to avoid OOM kills
  • Use init containers to wait for dependencies
  • Use Jobs for one-shot tasks instead of Deployments
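For the missing-dependency case, an init container can block startup until the dependency answers, so the main container never enters the crash loop. A sketch under assumed names (the service postgres, its port, and the images are placeholders):

```yaml
# Hypothetical pod spec fragment: wait for a database before starting the app
spec:
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      command: ["sh", "-c", "until nc -z postgres 5432; do echo waiting for db; sleep 2; done"]
  containers:
    - name: app
      image: myapp:v1
```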

Related Errors