Kubernetes Liveness Probe Failed
Causes and Fixes
A liveness probe failure means the kubelet has determined that a container is no longer healthy. When a liveness probe fails for the configured failureThreshold number of consecutive times, the kubelet kills the container and restarts it according to the pod's restart policy. This is one of the most common causes of container restarts in production.
Symptoms
- Container restarts repeatedly with the restart reason showing 'Liveness probe failed'
- kubectl describe pod shows 'Liveness probe failed: ...' events
- Container restart count keeps incrementing
- Pod events show 'killing container with id ... : failed liveness probe'
- Application experiences brief downtime during each restart cycle
Step-by-Step Troubleshooting
Liveness probe failures trigger container restarts, which can cascade into CrashLoopBackOff if the underlying issue is not resolved. This guide walks through determining whether the probe is correctly identifying an unhealthy application or if the probe itself is misconfigured.
1. Check Pod Events for Probe Failure Details
Start by examining the exact probe failure messages.
kubectl describe pod <pod-name>
Look in the Events section for entries like:
Warning Unhealthy Liveness probe failed: HTTP probe failed with statuscode: 503
Warning Unhealthy Liveness probe failed: Get "http://10.244.1.5:8080/healthz": dial tcp 10.244.1.5:8080: connect: connection refused
Warning Unhealthy Liveness probe failed: command "cat /tmp/healthy" returned exit code 1
The message tells you the probe type (HTTP, TCP, exec) and the specific failure.
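For reference, the three mechanisms map onto the probe spec like this illustrative fragment (paths, ports, and the command are example values; a container defines at most one livenessProbe, so the alternatives are commented out):

```yaml
# Each container may define one livenessProbe using one of three mechanisms.
# All paths, ports, and commands below are illustrative.
livenessProbe:            # HTTP: success = status code in the 200-399 range
  httpGet:
    path: /healthz
    port: 8080
# livenessProbe:          # TCP: success = connection established
#   tcpSocket:
#     port: 8080
# livenessProbe:          # exec: success = command exits with status 0
#   exec:
#     command: ["cat", "/tmp/healthy"]
```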
2. Check the Probe Configuration
Understand how the liveness probe is configured.
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].livenessProbe}' | jq .
Key parameters to note:
- httpGet/tcpSocket/exec: The probe mechanism
- initialDelaySeconds: How long to wait before the first probe
- periodSeconds: How often to probe (default 10)
- timeoutSeconds: How long to wait for a response (default 1)
- failureThreshold: How many consecutive failures before restart (default 3)
- successThreshold: How many consecutive successes mark the probe as passing (must be 1 for a liveness probe)
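Put together, these parameters sit in the container spec like this hypothetical fragment (container name, image, endpoint, and timing values are all illustrative):

```yaml
# Illustrative container spec; path, port, and timings are example values
containers:
  - name: web
    image: example/web:1.0
    livenessProbe:
      httpGet:                 # probe mechanism: HTTP GET
        path: /healthz
        port: 8080
      initialDelaySeconds: 15  # wait before the first probe
      periodSeconds: 10        # probe every 10s (the default)
      timeoutSeconds: 3        # fail a probe that takes longer than 3s
      failureThreshold: 3      # restart after 3 consecutive failures (the default)
```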
3. Test the Probe Endpoint Manually
Verify whether the probe endpoint is actually working.
# For HTTP probes
kubectl exec <pod-name> -- curl -s -o /dev/null -w "%{http_code}" http://localhost:<port><path>
# For example
kubectl exec <pod-name> -- curl -s http://localhost:8080/healthz
# For TCP probes, check if the port is open (/dev/tcp is a bash feature, so use bash, not sh)
kubectl exec <pod-name> -- bash -c 'exec 3<>/dev/tcp/localhost/<port> && echo open'
# For exec probes, run the command manually
kubectl exec <pod-name> -- <probe-command>
If the endpoint works when tested manually, the issue is timing (probe runs when app is temporarily busy). If it consistently fails, the probe configuration or the application's health endpoint is wrong.
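To distinguish a timing problem from a consistently broken endpoint, it helps to sample the endpoint's latency distribution and compare it against timeoutSeconds. The sketch below stands up a local stub server so it is self-contained; in practice you would point `url` at `http://localhost:<port><path>` from inside the pod. The stub server, loop count, and timeout value are all illustrative:

```python
# Sketch: measure a health endpoint's latency distribution to see whether
# occasional slow responses could exceed the probe's timeoutSeconds.
import http.server
import threading
import time
import urllib.request

class Healthz(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):  # silence per-request logging
        pass

# Local stand-in for the real health endpoint; port 0 = OS picks a free port
server = http.server.HTTPServer(("127.0.0.1", 0), Healthz)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/healthz"

latencies = []
for _ in range(20):
    start = time.monotonic()
    urllib.request.urlopen(url).read()
    latencies.append(time.monotonic() - start)

timeout_seconds = 1  # the probe's timeoutSeconds (kubelet default)
slow = [l for l in latencies if l > timeout_seconds]
print(f"max latency: {max(latencies):.3f}s, responses over timeout: {len(slow)}")
```

If any samples land near or above timeoutSeconds, the probe will intermittently fail even though the application is healthy, which points at step 5 (resource pressure) or step 6 (probe tuning).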
4. Check if the Application Is Genuinely Unhealthy
Look at application logs around the time of probe failures.
# Current logs
kubectl logs <pod-name> --tail=100
# Previous container logs (from before the restart)
kubectl logs <pod-name> --previous --tail=100
# Follow logs in real time
kubectl logs <pod-name> -f
Look for:
- Stack traces or error messages
- Deadlock indicators
- Memory exhaustion messages
- Connection pool exhaustion
- Thread pool saturation
5. Check Resource Pressure
Resource contention can cause probe timeouts even when the application is fundamentally healthy.
# Check container resource usage
kubectl top pod <pod-name> --containers
# Check resource limits
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources}' | jq .
# Check if the container was OOM killed
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
If the container is consistently near its CPU limit, it may be throttled during probe execution. Increase CPU limits or make the probe timeout more generous.
6. Fix Overly Aggressive Probe Settings
If the probe is too aggressive for the application's behavior, adjust the parameters.
kubectl patch deployment <deployment-name> -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "<container-name>",
          "livenessProbe": {
            "httpGet": {
              "path": "/healthz",
              "port": 8080
            },
            "initialDelaySeconds": 30,
            "periodSeconds": 15,
            "timeoutSeconds": 5,
            "failureThreshold": 5
          }
        }]
      }
    }
  }
}'
Guidelines for tuning:
- timeoutSeconds: Set well above the health endpoint's worst-case response time (2-3x its typical slow response is a reasonable start)
- periodSeconds: 10-30 seconds is reasonable for most applications
- failureThreshold: 3-5 consecutive failures before declaring unhealthy
- initialDelaySeconds: Set to the maximum expected startup time (or use a startup probe instead)
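These settings determine the worst-case delay before the kubelet restarts a hung container, which is worth sanity-checking when tuning. The sketch below is a simplified model (it ignores probe scheduling jitter); the sample values mirror the patch in this step:

```python
# Worst-case seconds between a container becoming unhealthy and its restart:
# from the first failing probe, (failureThreshold - 1) full probe periods
# elapse, plus the timeout consumed by the final failing probe.
def worst_case_restart_delay(period_seconds, timeout_seconds, failure_threshold):
    return (failure_threshold - 1) * period_seconds + timeout_seconds

# Values from the patch in this step: periodSeconds=15, timeoutSeconds=5, failureThreshold=5
print(worst_case_restart_delay(15, 5, 5))   # 65 seconds after the first failure

# With the kubelet defaults (periodSeconds=10, timeoutSeconds=1, failureThreshold=3)
print(worst_case_restart_delay(10, 1, 3))   # 21 seconds
```

If that delay is too long for your availability goals, tighten periodSeconds or failureThreshold rather than timeoutSeconds, since a short timeout is what causes false positives under load.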
7. Add a Startup Probe
If the application takes a long time to start and the liveness probe kills it during startup, add a startup probe.
kubectl patch deployment <deployment-name> -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "<container-name>",
          "startupProbe": {
            "httpGet": {
              "path": "/healthz",
              "port": 8080
            },
            "periodSeconds": 10,
            "failureThreshold": 30
          }
        }]
      }
    }
  }
}'
The startup probe allows up to 300 seconds (30 failures x 10 seconds) for the application to start. The liveness probe does not run until the startup probe succeeds.
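In a container spec, the two probes sit side by side, as in this illustrative fragment (names, endpoint, and values are examples):

```yaml
# Illustrative container spec combining both probes; values are examples.
containers:
  - name: web
    image: example/web:1.0
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30   # up to 300s (30 x 10s) allowed for startup
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 15
      timeoutSeconds: 5
      failureThreshold: 3    # takes over only after the startup probe succeeds
```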
8. Fix the Health Endpoint
If the probe endpoint is wrong or checking the wrong thing, fix it.
# A good liveness probe should:
# 1. Return 200 if the application process is alive and can handle requests
# 2. NOT check external dependencies
# 3. Be lightweight and fast
# 4. Return non-200 only if a restart would help
# Bad: checks database connectivity
# GET /health -> checks DB, cache, queue -> 503 if any are down
# This restarts the pod when the database is down, which does not help
# Good: checks application process health
# GET /healthz -> checks internal state -> 200 if process is functional
Update your application's health endpoint to only check internal state, then update the probe path if needed.
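Following that rule, a minimal liveness endpoint might look like the sketch below (stdlib-only stand-in for a real framework; the port, the `healthy` flag, and the response bodies are all illustrative):

```python
# Minimal liveness endpoint sketch: reports only in-process state,
# never external dependencies like a database or cache.
import http.server
import threading
import urllib.request

app_state = {"healthy": True}  # flip to False only on unrecoverable internal faults

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz" and app_state["healthy"]:
            self.send_response(200)   # alive: a restart is not needed
        else:
            self.send_response(503)   # broken internal state: a restart would help
        self.end_headers()
    def log_message(self, *args):     # silence per-request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Quick self-check of the healthy path
status = urllib.request.urlopen(f"http://127.0.0.1:{port}/healthz").status
print(status)  # 200
```

The key design choice is that `app_state["healthy"]` is set only by the process itself (e.g. after detecting a deadlocked worker pool), so a failing probe always means a restart is the right remedy.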
9. Consider Removing the Liveness Probe
In some cases, a liveness probe is not needed at all. If the application exits on failure (which Kubernetes handles via the restart policy), a liveness probe adds complexity without benefit.
# Remove the liveness probe if the application self-exits on failure
kubectl patch deployment <deployment-name> --type=json -p='[{"op":"remove","path":"/spec/template/spec/containers/0/livenessProbe"}]'
Use a liveness probe only when the application can enter a broken state where it is running but not functional (deadlocks, resource leaks, corruption).
10. Verify Probe Is Working Correctly
After adjusting the probe, verify it functions as expected.
# Watch the pod for restarts
kubectl get pod <pod-name> -w
# Check that no liveness probe failures are occurring
kubectl describe pod <pod-name> | grep -iE "liveness|unhealthy"
# Monitor restart count (should stay stable)
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].restartCount}'
# After some time, verify no restarts have occurred
sleep 120 && kubectl get pod <pod-name>
The fix is successful when the container restart count stabilizes and no new Unhealthy or liveness probe failure events appear. If restarts continue, the application may be genuinely entering an unhealthy state that requires application-level debugging.
How to Explain This in an Interview
I would explain the three types of probes and their purposes: liveness (is the container alive?), readiness (can it serve traffic?), and startup (has it finished starting?). The critical rule is that liveness probes should only check the application's internal health — whether it can process requests — never external dependencies, since restarting a pod cannot fix a down database. I'd discuss the probe parameters (initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold, successThreshold) and how to tune them based on the application's observed behavior. Finally, I'd explain why startup probes were introduced in Kubernetes 1.16 to solve the slow-start problem, protecting containers from premature liveness kills during initialization.
Prevention
- Use startup probes for applications with variable startup times
- Never check external dependencies in liveness probes
- Set timeoutSeconds higher than the application's worst-case response time
- Monitor probe failure rates and adjust thresholds based on data
- Test probe configurations under load before production deployment