Kubernetes Liveness Probe Failed

Causes and Fixes

A liveness probe failure means the kubelet has decided the container is no longer healthy. When the probe fails failureThreshold consecutive times, the kubelet kills the container and restarts it according to the pod's restart policy. Misconfigured or failing liveness probes are among the most common causes of container restarts in production.
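Before digging in, it helps to know roughly how long a failure takes to trigger a restart. A minimal sketch of that arithmetic (the function name is mine, not a Kubernetes API, and the bound is approximate):

```python
# Sketch of the restart arithmetic described above: rough upper bound on how
# long a hung container can run before the kubelet kills it. Illustrative
# math only, not kubelet source code.

def worst_case_restart_delay(period_seconds: int = 10,
                             timeout_seconds: int = 1,
                             failure_threshold: int = 3) -> int:
    """After the app hangs, each probe fires within periodSeconds and times
    out after timeoutSeconds; the kill happens on the failureThreshold-th
    consecutive failure."""
    return failure_threshold * period_seconds + timeout_seconds

# With the defaults (period 10s, timeout 1s, threshold 3), a dead container
# can linger for roughly 31 seconds before being restarted.
print(worst_case_restart_delay())  # 31
```

Tightening failureThreshold or periodSeconds shortens this window but raises the risk of false-positive restarts, which is the tradeoff the rest of this guide navigates.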

Symptoms

  • Container restarts repeatedly with the restart reason showing 'Liveness probe failed'
  • kubectl describe pod shows 'Liveness probe failed: ...' events
  • Container restart count keeps incrementing
  • Pod events show 'killing container with id ... : failed liveness probe'
  • Application experiences brief downtime during each restart cycle

Common Causes

1. Application is genuinely unhealthy
The application has entered a deadlock, infinite loop, or corrupted state where it can no longer handle requests. The liveness probe correctly detects this and triggers a restart.

2. Probe configuration too aggressive
The probe's timeoutSeconds is too short, periodSeconds is too frequent, or failureThreshold is too low for the application's normal behavior, causing false positives during brief CPU spikes or garbage collection pauses.

3. Wrong probe endpoint or port
The liveness probe is configured to check a URL path or port that does not exist, always returns an error, or does not reflect the application's actual health.

4. Application slow to start
Without a startup probe, the liveness probe starts checking as soon as the container starts (after any initialDelaySeconds, which defaults to 0). If the application takes longer than that to initialize, the liveness probe kills it before it finishes starting, causing CrashLoopBackOff.

5. Resource contention
The container is under heavy CPU or memory pressure, causing the application to respond slowly. The liveness probe times out even though the application would eventually respond.

6. Dependency check in liveness probe
The liveness probe checks external dependencies (database, cache) instead of just the application's internal health. When the dependency is slow or down, the probe fails and the pod is unnecessarily restarted.
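Causes 1 and 6 together suggest what a good liveness signal looks like: it should reflect internal progress, not dependency reachability. One common pattern is a heartbeat that worker loops update; the sketch below is illustrative (the class name and thresholds are mine, not from any framework):

```python
# Sketch of a dependency-free liveness signal based on a worker heartbeat.
# If the worker deadlocks it stops beating and the probe fails, so a restart
# actually helps -- unlike probing a database, whose outage a restart
# cannot fix. All names here are hypothetical.
import time

class HeartbeatMonitor:
    """Worker loops call beat(); the /healthz handler calls is_alive()."""

    def __init__(self, max_age_seconds: float = 30.0):
        self.max_age_seconds = max_age_seconds
        self._last_beat = time.monotonic()

    def beat(self) -> None:
        # Called by the worker at the top of each iteration.
        self._last_beat = time.monotonic()

    def is_alive(self) -> bool:
        # Healthy as long as the worker has made progress recently.
        return (time.monotonic() - self._last_beat) < self.max_age_seconds

monitor = HeartbeatMonitor(max_age_seconds=30.0)
monitor.beat()
print(monitor.is_alive())  # True while the worker keeps beating
```

A probe built on a signal like this fails only when the process itself has stopped making progress, which is exactly the case where a restart helps.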

Step-by-Step Troubleshooting

Liveness probe failures trigger container restarts, which can cascade into CrashLoopBackOff if the underlying issue is not resolved. This guide walks through determining whether the probe is correctly identifying an unhealthy application or if the probe itself is misconfigured.

1. Check Pod Events for Probe Failure Details

Start by examining the exact probe failure messages.

kubectl describe pod <pod-name>

Look in the Events section for entries like:

Warning  Unhealthy  Liveness probe failed: HTTP probe failed with statuscode: 503
Warning  Unhealthy  Liveness probe failed: Get "http://10.244.1.5:8080/healthz": dial tcp 10.244.1.5:8080: connect: connection refused
Warning  Unhealthy  Liveness probe failed: command "cat /tmp/healthy" returned exit code 1

The message tells you the probe type (HTTP, TCP, exec) and the specific failure.

2. Check the Probe Configuration

Understand how the liveness probe is configured.

kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].livenessProbe}' | jq .

Key parameters to note:

  • httpGet/tcpSocket/exec: The probe mechanism
  • initialDelaySeconds: How long to wait before the first probe
  • periodSeconds: How often to probe (default 10)
  • timeoutSeconds: How long to wait for a response (default 1)
  • failureThreshold: How many consecutive failures before restart (default 3)
  • successThreshold: How many successes to consider it alive (always 1 for liveness)

3. Test the Probe Endpoint Manually

Verify whether the probe endpoint is actually working.

# For HTTP probes
kubectl exec <pod-name> -- curl -s -o /dev/null -w "%{http_code}" http://localhost:<port><path>

# For example
kubectl exec <pod-name> -- curl -s http://localhost:8080/healthz

# For TCP probes, check if the port is open (/dev/tcp is a bash feature;
# plain sh in minimal images may not support it)
kubectl exec <pod-name> -- bash -c 'cat < /dev/tcp/localhost/<port>'

# For exec probes, run the command manually
kubectl exec <pod-name> -- <probe-command>

If the endpoint works when tested manually, the issue is timing (probe runs when app is temporarily busy). If it consistently fails, the probe configuration or the application's health endpoint is wrong.

4. Check if the Application Is Genuinely Unhealthy

Look at application logs around the time of probe failures.

# Current logs
kubectl logs <pod-name> --tail=100

# Previous container logs (from before the restart)
kubectl logs <pod-name> --previous --tail=100

# Follow logs in real time
kubectl logs <pod-name> -f

Look for:

  • Stack traces or error messages
  • Deadlock indicators
  • Memory exhaustion messages
  • Connection pool exhaustion
  • Thread pool saturation

5. Check Resource Pressure

Resource contention can cause probe timeouts even when the application is fundamentally healthy.

# Check container resource usage
kubectl top pod <pod-name> --containers

# Check resource limits
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources}' | jq .

# Check if the container was OOM killed
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

If the container is consistently near its CPU limit, it may be throttled during probe execution. Increase CPU limits or make the probe timeout more generous.

6. Fix Overly Aggressive Probe Settings

If the probe is too aggressive for the application's behavior, adjust the parameters.

kubectl patch deployment <deployment-name> -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "<container-name>",
          "livenessProbe": {
            "httpGet": {
              "path": "/healthz",
              "port": 8080
            },
            "initialDelaySeconds": 30,
            "periodSeconds": 15,
            "timeoutSeconds": 5,
            "failureThreshold": 5
          }
        }]
      }
    }
  }
}'

Guidelines for tuning:

  • timeoutSeconds: Set to at least 2-3x the average response time of the health endpoint
  • periodSeconds: 10-30 seconds is reasonable for most applications
  • failureThreshold: 3-5 consecutive failures before declaring unhealthy
  • initialDelaySeconds: Set to the maximum expected startup time (or use a startup probe instead)
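The tuning guidelines above can be encoded as a quick sanity check. The numeric bounds below are this guide's rules of thumb, not Kubernetes-enforced limits, and the function name is hypothetical:

```python
# Quick sanity check encoding the tuning rules of thumb above. The bounds
# are this guide's suggestions, not official Kubernetes limits.

def probe_warnings(probe: dict, avg_endpoint_seconds: float) -> list:
    """Flag liveness probe settings likely to cause false-positive restarts.
    Missing keys fall back to the Kubernetes defaults (timeout 1, period 10,
    threshold 3)."""
    warnings = []
    timeout = probe.get("timeoutSeconds", 1)
    period = probe.get("periodSeconds", 10)
    threshold = probe.get("failureThreshold", 3)
    if timeout < 2 * avg_endpoint_seconds:
        warnings.append("timeoutSeconds below 2x average endpoint response time")
    if period < 10:
        warnings.append("periodSeconds under 10s may be too frequent")
    if threshold < 3:
        warnings.append("failureThreshold under 3 risks false-positive restarts")
    return warnings

# The defaults (timeout 1s) against a health endpoint averaging 0.8s:
print(probe_warnings({}, avg_endpoint_seconds=0.8))
# ['timeoutSeconds below 2x average endpoint response time']
```

Running a check like this against measured endpoint latency, rather than guessing, is the data-driven tuning the Prevention section recommends.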

7. Add a Startup Probe

If the application takes a long time to start and the liveness probe kills it during startup, add a startup probe.

kubectl patch deployment <deployment-name> -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "<container-name>",
          "startupProbe": {
            "httpGet": {
              "path": "/healthz",
              "port": 8080
            },
            "periodSeconds": 10,
            "failureThreshold": 30
          }
        }]
      }
    }
  }
}'

The startup probe allows up to 300 seconds (30 failures x 10 seconds) for the application to start. The liveness probe does not run until the startup probe succeeds.

8. Fix the Health Endpoint

If the probe endpoint is wrong or checking the wrong thing, fix it.

# A good liveness probe should:
# 1. Return 200 if the application process is alive and can handle requests
# 2. NOT check external dependencies
# 3. Be lightweight and fast
# 4. Return non-200 only if a restart would help

# Bad: checks database connectivity
# GET /health -> checks DB, cache, queue -> 503 if any are down
# This restarts the pod when the database is down, which does not help

# Good: checks application process health
# GET /healthz -> checks internal state -> 200 if process is functional

Update your application's health endpoint to only check internal state, then update the probe path if needed.
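A handler following these rules can be sketched with nothing but the Python standard library. Here "app_state" is a hypothetical stand-in for whatever internal health signal your application tracks; the key point is that no database or cache is consulted:

```python
# Minimal /healthz handler sketch using only the Python standard library.
# "app_state" is a hypothetical stand-in for the application's internal
# health signal; no external dependency is checked.
from http.server import BaseHTTPRequestHandler, HTTPServer

app_state = {"healthy": True}  # flipped by the app on unrecoverable errors

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # 200 while the process is functional; 503 only when a restart
            # would actually help
            code = 200 if app_state["healthy"] else 503
            body = b"ok" if code == 200 else b"unhealthy"
            self.send_response(code)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep probe traffic out of the application logs

# To serve: HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

The handler is cheap and fast, so it stays well under even the default 1-second timeoutSeconds, and it returns 503 only for states a restart can fix.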

9. Consider Removing the Liveness Probe

In some cases, a liveness probe is not needed at all. If the application exits on failure (which Kubernetes handles via the restart policy), a liveness probe adds complexity without benefit.

# Remove the liveness probe if the application self-exits on failure
kubectl patch deployment <deployment-name> --type=json -p='[{"op":"remove","path":"/spec/template/spec/containers/0/livenessProbe"}]'

Use a liveness probe only when the application can enter a broken state where it is running but not functional (deadlocks, resource leaks, corruption).

10. Verify Probe Is Working Correctly

After adjusting the probe, verify it functions as expected.

# Watch the pod for restarts
kubectl get pod <pod-name> -w

# Check that no liveness probe failures are occurring
kubectl describe pod <pod-name> | grep -i "liveness\|unhealthy"

# Monitor restart count (should stay stable)
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].restartCount}'

# After some time, verify no restarts have occurred
sleep 120 && kubectl get pod <pod-name>

The fix is successful when the container restart count stabilizes and no new Unhealthy or liveness probe failure events appear. If restarts continue, the application may be genuinely entering an unhealthy state that requires application-level debugging.

How to Explain This in an Interview

I would explain the three types of probes and their purposes: liveness (is the container alive?), readiness (can it serve traffic?), and startup (has it finished starting?). Liveness probes should check only the application's internal health, meaning whether it can process requests, never external dependencies, since restarting a pod cannot fix a down database. I'd discuss the probe parameters (initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold, successThreshold) and how to tune them based on the application's observed behavior. I'd also explain why startup probes were introduced in Kubernetes 1.16 to solve the slow-start problem: they hold off liveness checks during initialization so the container is not killed prematurely.

Prevention

  • Use startup probes for applications with variable startup times
  • Never check external dependencies in liveness probes
  • Set timeoutSeconds higher than the application's worst-case response time
  • Monitor probe failure rates and adjust thresholds based on data
  • Test probe configurations under load before production deployment

Related Errors