Kubernetes Startup Probe Failed
Causes and Fixes
A startup probe failure means the kubelet determined that a container did not start successfully within the allowed time. The startup probe runs before the liveness and readiness probes, which are held back until it succeeds. When it fails failureThreshold times in a row, the kubelet kills the container and, subject to the pod's restartPolicy, restarts it; repeated failures typically land the pod in CrashLoopBackOff. Startup probes were designed for slow-starting applications that need more time to initialize.
Symptoms
- Container is killed before the application finishes starting
- Pod events show 'Startup probe failed' followed by container restart
- Pod enters CrashLoopBackOff with startup probe failure as the root cause
- Application logs show incomplete initialization before the kill
- Container restart count increases with 'startup probe failed' in describe output
Common Causes
- The startup budget (failureThreshold x periodSeconds) is smaller than the application's real startup time
- The application genuinely fails during initialization (missing config, unreachable dependency, bad volume mount)
- The probe targets an HTTP endpoint that only becomes available late in startup
- CPU throttling during initialization stretches startup past the probe window
- The probe's timeoutSeconds is too short for an endpoint that responds slowly while the application warms up
Step-by-Step Troubleshooting
Startup probe failures kill containers before they finish initializing. The key diagnostic question is whether the application needs more time to start or whether it is genuinely failing during startup. This guide helps answer that question and fix the issue.
1. Check Pod Events for Startup Probe Failures
Examine the pod events to confirm the startup probe is the issue.
kubectl describe pod <pod-name>
Look for events like:
Warning Unhealthy Startup probe failed: HTTP probe failed with statuscode: 503
Warning Unhealthy Startup probe failed: Get "http://10.244.1.5:8080/healthz": dial tcp 10.244.1.5:8080: connect: connection refused
Followed by:
Normal Killing Container <name> failed startup probe, will be restarted
2. Check the Startup Probe Configuration
Understand the current startup budget.
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].startupProbe}' | jq .
Calculate the total startup budget:
- Total time allowed = failureThreshold x periodSeconds
- Example: failureThreshold=30, periodSeconds=10 = 300 seconds (5 minutes)
If the application needs more than this time to start, the probe window is too small.
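The arithmetic is simple enough to script. A minimal shell sketch using the example numbers above (in practice, read the values from the jsonpath query shown earlier):

```shell
# Total startup budget = failureThreshold x periodSeconds.
# Values below mirror the example above; in a real cluster read them from:
#   kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].startupProbe}'
failureThreshold=30
periodSeconds=10
budget=$((failureThreshold * periodSeconds))
echo "Startup budget: ${budget}s"   # prints "Startup budget: 300s"
```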
3. Check Application Logs Before the Kill
The previous container's logs show what the application was doing when it was killed.
# Check logs from the previous (killed) container
kubectl logs <pod-name> --previous --tail=100
# Note: --previous shows only the most recently terminated container's logs
Look for:
- Startup progress messages (how far did it get?)
- Error messages during initialization
- Slow operations (database migrations, cache loading, index building)
- Missing configuration or environment variables
4. Measure Actual Startup Time
Determine how long the application actually needs to start.
# Start the application without the probe killing it (increase the window temporarily)
kubectl patch deployment <deployment-name> -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "<container-name>",
          "startupProbe": {
            "httpGet": {
              "path": "/healthz",
              "port": 8080
            },
            "periodSeconds": 10,
            "failureThreshold": 60
          }
        }]
      }
    }
  }
}'
# Watch the pod and note when it becomes ready
kubectl get pod -l <selector> -w
# Check timestamps in application logs
kubectl logs <pod-name> | head -5 # First log entry
kubectl logs <pod-name> | grep -i "started\|ready\|listening" # Startup complete
The time between the first log entry and the "started" message is the actual startup time.
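If the application logs timestamps, the startup duration can be computed directly. A sketch assuming GNU date; the ISO 8601 timestamps are illustrative stand-ins for real log lines:

```shell
# Subtract the first log timestamp from the "started" timestamp (GNU date).
first="2024-05-01T10:00:03Z"    # timestamp of the first log entry
ready="2024-05-01T10:03:45Z"    # timestamp of the "started"/"ready" message
startup=$(( $(date -d "$ready" +%s) - $(date -d "$first" +%s) ))
echo "Measured startup time: ${startup}s"   # prints "Measured startup time: 222s"
```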
5. Increase the Startup Probe Budget
If the application legitimately needs more time, increase the probe's total budget.
kubectl patch deployment <deployment-name> -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "<container-name>",
          "startupProbe": {
            "httpGet": {
              "path": "/healthz",
              "port": 8080
            },
            "periodSeconds": 10,
            "failureThreshold": 60,
            "timeoutSeconds": 5
          }
        }]
      }
    }
  }
}'
This gives the application 600 seconds (10 minutes) to start. Set the budget to at least 2x the observed maximum startup time to account for variability (cold caches, slow storage, heavy load).
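For a permanent fix, commit the probe to the Deployment manifest instead of live-patching. A sketch of the relevant fragment (container name, path, and port are placeholders); the liveness probe shown alongside it only starts running after the startup probe succeeds:

```yaml
# Deployment fragment: 600s startup budget, then a tight liveness probe.
spec:
  template:
    spec:
      containers:
      - name: <container-name>
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10      # probe every 10s...
          failureThreshold: 60   # ...up to 60 times = 600s budget
          timeoutSeconds: 5
        livenessProbe:           # takes over once the startup probe succeeds
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
          failureThreshold: 3
```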
6. Use a TCP Probe Instead of HTTP
If the HTTP endpoint is not available until late in startup, switch to a TCP probe that just checks if the port is open.
kubectl patch deployment <deployment-name> -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "<container-name>",
          "startupProbe": {
            "tcpSocket": {
              "port": 8080
            },
            "periodSeconds": 5,
            "failureThreshold": 60
          }
        }]
      }
    }
  }
}'
TCP probes succeed as soon as the port is listening, which usually happens earlier in the startup process than when an HTTP health endpoint is fully functional.
7. Check Resource Availability During Startup
Startup often requires more resources than steady state (JVM class loading, cache warming, data loading).
# Check current resource limits
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources}' | jq .
# Check current usage; usage pinned at the CPU limit suggests throttling
kubectl top pod <pod-name> --containers
If the container is being throttled at its CPU limit during startup, initialization takes longer. Consider temporarily raising the limits, or use the Burstable QoS class by setting requests lower than limits so the container can burst during initialization.
kubectl set resources deployment <deployment-name> \
--requests=cpu=250m,memory=512Mi \
--limits=cpu=2000m,memory=2Gi
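kubectl top reports point-in-time usage, not throttling itself. On nodes using cgroup v2, the kernel's throttle counters in cpu.stat are more direct evidence. A sketch that parses a sample cpu.stat; the counter values are illustrative, and in a real pod the file is read via kubectl exec as shown in the comment:

```shell
# cgroup v2 exposes throttle counters in cpu.stat; in a real pod:
#   kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu.stat
# Sample output (illustrative values):
stat='usage_usec 182000000
nr_periods 1200
nr_throttled 840
throttled_usec 95000000'
throttled=$(echo "$stat" | awk '/^nr_throttled/ {print $2}')
periods=$(echo "$stat" | awk '/^nr_periods/ {print $2}')
# A high nr_throttled/nr_periods ratio means the CPU limit is slowing startup.
echo "CPU throttled in ${throttled} of ${periods} scheduler periods"
```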
8. Fix Application Startup Failures
If the logs show the application fails during startup (not just slow):
# Check for missing environment variables
kubectl exec <pod-name> -- env | sort
# Check for missing config files
kubectl exec <pod-name> -- ls -la /etc/config/
# Check for dependency connectivity (requires nc in the container image)
kubectl exec <pod-name> -- nc -zv <dependency-host> <port>
# Check for volume mount issues
kubectl exec <pod-name> -- ls -la /data/
Fix the underlying startup failure (missing config, unreachable dependency, etc.); once the application initializes cleanly, the startup probe will pass.
9. Consider Using an Exec Probe
For applications with complex startup requirements, an exec probe can run a custom script that checks multiple conditions.
startupProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      if [ -f /tmp/app-ready ]; then
        exit 0
      else
        exit 1
      fi
  periodSeconds: 5
  failureThreshold: 120
The application writes /tmp/app-ready when it finishes initialization. This allows precise control over when the startup probe succeeds.
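The marker-file handshake can be sanity-checked outside the cluster. A minimal shell simulation of the probe logic, using the same marker path as the example above:

```shell
# Simulate the exec probe: it passes only once the app writes its ready marker.
marker="/tmp/app-ready"
probe_check() { if [ -f "$marker" ]; then echo pass; else echo fail; fi; }
rm -f "$marker"
probe_check        # prints "fail": application still initializing
touch "$marker"    # the application writes this when startup completes
probe_check        # prints "pass": the startup probe would now succeed
rm -f "$marker"    # cleanup
```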
10. Verify Startup Probe Passes
After adjusting the probe or fixing the application, verify the pod starts successfully.
# Watch the pod
kubectl get pod -l <selector> -w
# Verify the container is running and not restarting
kubectl get pod <pod-name> -o custom-columns=NAME:.metadata.name,READY:.status.containerStatuses[0].ready,RESTARTS:.status.containerStatuses[0].restartCount,STARTED:.status.containerStatuses[0].started
# Check that the startup probe succeeded (started will be true)
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].started}'
# Confirm liveness and readiness probes are now active
kubectl describe pod <pod-name> | grep -E "Liveness|Readiness|Startup"
The startup probe has succeeded when the container's started field is true, the restart count stabilizes, and the pod transitions to Ready. Once the startup probe passes, it never runs again for that container — the liveness and readiness probes take over from that point.
How to Explain This in an Interview
I would explain that startup probes were introduced in Kubernetes 1.16 (GA in 1.20) to solve a fundamental problem: how to handle slow-starting applications without making liveness probes too lenient. Before startup probes, operators had to set a high initialDelaySeconds on liveness probes, which meant an application crash after startup could go undetected for a long time. With startup probes, the startup check can have a generous failure budget for initialization, and once it succeeds, the liveness probe takes over with tighter timings. I'd discuss how to calculate the right failureThreshold and periodSeconds (total startup budget = failureThreshold x periodSeconds), and how to choose between HTTP, TCP, and exec probes for startup checking.
Prevention
- Calculate startup probe budget based on actual maximum startup time plus buffer
- Monitor application startup times and adjust probes when they change
- Use a lightweight TCP probe for startup instead of HTTP if the endpoint is not available early
- Ensure containers have sufficient resources for the initialization phase
- Log startup progress so failures can be diagnosed from logs