# Kubernetes Exit Code 137: Causes and Fixes
Exit code 137 means the container process was killed by SIGKILL (signal 9). The formula is 128 + 9 = 137. This is most commonly caused by the Linux OOM killer terminating the process for exceeding its memory limit, but it can also result from kubectl delete, preemption, or a failed liveness probe.
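The 128 + N arithmetic can be checked directly in Python. This is an illustrative decoder, not part of any Kubernetes tooling; `signal_from_exit_code` is a name chosen here for the sketch:

```python
import signal

def signal_from_exit_code(exit_code):
    """Decode a shell-style exit code of 128 + N into signal N."""
    if exit_code <= 128:
        return None  # normal exit, not signal-related
    return signal.Signals(exit_code - 128)

print(signal_from_exit_code(137).name)  # SIGKILL
print(signal_from_exit_code(143).name)  # SIGTERM (the graceful-shutdown counterpart)
```

The same decoding explains exit code 143 (128 + 15, SIGTERM), which you will see when a container shuts down gracefully instead of being force-killed.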
## Symptoms
- Pod shows OOMKilled or Error with exit code 137
- kubectl describe pod shows exit code 137 in terminated state
- Container may show 'OOMKilled' as the termination reason
- Container restarts with CrashLoopBackOff after SIGKILL
- Node dmesg may show OOM killer messages
## Common Causes

- Container exceeds its memory limit and is killed by the kernel OOM killer (reported as OOMKilled)
- Failed liveness probe, after which the kubelet kills the container
- Manual deletion (`kubectl delete`) where the app ignores SIGTERM and is killed after the grace period
- Preemption by a higher-priority pod
- Node shutdown or node-level memory pressure

## Step-by-Step Troubleshooting
### 1. Confirm the Exit Code and Reason

```bash
kubectl describe pod <pod-name>
```

Check the termination reason:

```
Last State:   Terminated
  Reason:     OOMKilled
  Exit Code:  137
```

If the reason is OOMKilled, the container exceeded its memory limit. If the reason is Error, another source sent the SIGKILL.
### 2. Distinguish Between OOM and Other SIGKILL Sources
| Reason | Cause | How to Verify |
|--------|-------|--------------|
| OOMKilled | Memory limit exceeded | kubectl describe pod shows Reason: OOMKilled |
| Liveness probe | Probe failed | Events show "Killing" after "Unhealthy" |
| Manual delete | kubectl delete pod | Check audit logs or user activity |
| Preemption | Higher-priority pod | Events show "Preempted by" |
| Node shutdown | Node going down | Node shows NotReady around the same time |
```bash
# Check events for clues
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'
```
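The decision logic in the table can be sketched as a small triage helper. This is illustrative only: the strings mirror what `kubectl describe pod` and the events show, and `classify_sigkill` is a hypothetical function, not a real client:

```python
def classify_sigkill(reason, event_messages):
    """Map a pod's termination reason plus recent event messages to the
    likely SIGKILL source (sketch of the triage table, not an API)."""
    if reason == "OOMKilled":
        return "memory limit exceeded (cgroup OOM kill)"
    for msg in event_messages:
        if "failed liveness probe" in msg:
            return "liveness probe failure"
        if "Preempted by" in msg:
            return "preemption by a higher-priority pod"
    return "external SIGKILL (manual delete, node shutdown, ...)"

print(classify_sigkill("OOMKilled", []))
print(classify_sigkill("Error", ["Container app failed liveness probe, will be restarted"]))
```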
### 3. Troubleshoot OOMKilled (Most Common)

Check memory usage and limits:

```bash
# Check memory limits
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources.limits.memory}'

# Check current memory usage (if the pod is running)
kubectl top pod <pod-name> --containers

# Check node-level OOM events
NODE=$(kubectl get pod <pod-name> -o jsonpath='{.spec.nodeName}')
kubectl debug node/$NODE -it --image=ubuntu -- bash -c "dmesg | grep -i oom | tail -20"
```

Fix by increasing memory limits:

```yaml
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"
```
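When picking a new limit, a common rule of thumb is observed peak usage plus headroom. The sketch below assumes a 25% headroom factor (an arbitrary choice, not a Kubernetes default) and handles the binary suffixes Ki/Mi/Gi that `kubectl top` and resource specs use:

```python
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def parse_quantity(q):
    """Parse a Kubernetes binary-suffix quantity like '512Mi' into bytes."""
    for suffix, factor in UNITS.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)  # plain bytes

def suggest_limit(peak_usage, headroom=1.25):
    """Suggest a memory limit: observed peak plus headroom, rounded up to a whole Mi."""
    peak = parse_quantity(peak_usage)
    mi = 1024**2
    return f"{-(-int(peak * headroom) // mi)}Mi"

print(suggest_limit("800Mi"))  # 1000Mi
```

Real-world sizing should be based on peak usage over a representative window (a week of Prometheus data, for example), not a single `kubectl top` sample.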
### 4. Troubleshoot Liveness Probe Kills

If the pod events show liveness probe failures before the kill:

```
Warning  Unhealthy  Liveness probe failed: connection refused
Warning  Unhealthy  Liveness probe failed: connection refused
Warning  Unhealthy  Liveness probe failed: connection refused
Normal   Killing    Container app failed liveness probe, will be restarted
```

```bash
# Check liveness probe configuration
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].livenessProbe}' | jq .
```
Common fixes:

```yaml
# Add a startup probe for slow-starting apps
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

# Relax the liveness probe
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 5
```
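It helps to reason about how long a failing container survives before the kubelet restarts it. Roughly, that is the initial delay plus failureThreshold consecutive failed probes one period apart (an approximation that ignores probe timeouts and jitter):

```python
def seconds_until_kill(initial_delay, period, failure_threshold):
    """Approximate worst-case seconds from container start until a
    persistently failing liveness probe triggers a restart."""
    return initial_delay + period * failure_threshold

# With initialDelaySeconds: 30, periodSeconds: 10, failureThreshold: 5
print(seconds_until_kill(30, 10, 5))  # 80
```

If your app can legitimately take longer than this to become healthy, use a startup probe rather than inflating the liveness thresholds.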
### 5. Profile Memory Usage

For applications that are OOM-killed, understand where memory is being consumed.

Java:

```bash
# Check JVM heap settings
kubectl exec <pod-name> -- jcmd 1 VM.flags | grep -i heap

# Get heap usage
kubectl exec <pod-name> -- jcmd 1 GC.heap_info
```

Configure the JVM for containers:

```yaml
env:
  - name: JAVA_OPTS
    value: "-XX:MaxRAMPercentage=75.0 -XX:+UseContainerSupport -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
```
Go:

```bash
# If pprof is enabled
kubectl port-forward <pod-name> 6060:6060
go tool pprof http://localhost:6060/debug/pprof/heap
```
Node.js:

```yaml
env:
  - name: NODE_OPTIONS
    value: "--max-old-space-size=768"  # 75% of a 1Gi limit
```
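Both the JVM and Node.js settings follow the same pattern: give the runtime heap about 75% of the container limit and leave the rest for off-heap allocations. A quick sketch of that arithmetic (the 75% fraction is the rule of thumb used above, not a hard requirement):

```python
def heap_flag_for_limit(limit_mi, fraction=0.75):
    """Suggest a Node.js --max-old-space-size value (MiB) as a fraction
    of the container memory limit, leaving room for off-heap memory."""
    return f"--max-old-space-size={int(limit_mi * fraction)}"

print(heap_flag_for_limit(1024))  # --max-old-space-size=768
```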
### 6. Check for Memory Leaks

If memory usage grows over time until OOM:

```bash
# Monitor memory over time
watch kubectl top pod <pod-name> --containers

# Set up a Prometheus query to track the trend
# PromQL: container_memory_working_set_bytes{pod="<pod-name>"}
```
If memory grows linearly, the application has a memory leak. Use language-specific profiling tools to identify it.
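"Grows linearly" can be made concrete with a least-squares slope over the sampled working-set values. This is an illustrative sketch of the trend check, assuming samples taken at regular intervals (a Prometheus `deriv()` over the metric above does the same job in production):

```python
def memory_slope(samples):
    """Least-squares slope of evenly spaced memory samples (bytes per
    interval). A consistently positive slope over a long window
    suggests a leak; a flat or sawtooth pattern usually does not."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Steadily growing working set -> positive slope
print(memory_slope([100, 120, 140, 160, 180]) > 0)  # True
```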
### 7. Handle SIGTERM to Avoid SIGKILL

When pods are deleted, Kubernetes sends SIGTERM first, then SIGKILL after the grace period (default 30 seconds). If your app does not handle SIGTERM, it gets SIGKILL.

```python
# Python example
import signal
import sys

def handle_sigterm(signum, frame):
    print("Received SIGTERM, shutting down gracefully...")
    # Clean up resources here (close connections, flush buffers, ...)
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```
Increase the grace period if your app needs more time:

```yaml
spec:
  terminationGracePeriodSeconds: 60
```
### 8. Verify the Fix

```bash
# Watch the pod
kubectl get pods -w

# Monitor memory usage
kubectl top pod <pod-name> --containers

# Check for new OOM events
kubectl get events --field-selector reason=OOMKilling

# Verify the restart count is stable
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].restartCount}'
```
The pod should run stably within its memory limits.
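"Stable" here means the restart count from the jsonpath query above stops growing between polls. A trivial but explicit check, assuming you collect a few counts some minutes apart (`restarts_stable` is a hypothetical helper, not a kubectl feature):

```python
def restarts_stable(samples):
    """True if the pod's restartCount (monotonically non-decreasing)
    did not grow across successive polls, oldest to newest."""
    return all(later == earlier for earlier, later in zip(samples, samples[1:]))

print(restarts_stable([3, 3, 3]))  # True: no new restarts since the fix
print(restarts_stable([3, 4, 5]))  # False: still crashing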
## How to Explain This in an Interview
I would explain that 137 = 128 + 9, meaning SIGKILL, and the most common cause is OOM killing. I would distinguish between container-level OOM (the cgroup limit is exceeded, reported as OOMKilled) and node-level memory pressure (which causes pod eviction). For OOM, I would discuss right-sizing memory limits using metrics, detecting memory leaks with profiling tools, and using VPA for automated recommendations. For liveness probe kills, I would explain the difference between liveness and startup probes.
## Prevention
- Set appropriate memory limits based on profiled application usage
- Use VPA to get memory limit recommendations
- Handle SIGTERM gracefully so containers stop before SIGKILL
- Use startup probes instead of relying on liveness probe initialDelaySeconds
- Monitor container memory usage and alert at 80% of limit
- For JVM apps, set -XX:MaxRAMPercentage=75.0