Kubernetes OOMKilled
Causes and Fixes
OOMKilled means a container was terminated because it exceeded its memory limit. The Linux kernel's Out-Of-Memory (OOM) killer sends SIGKILL to a process in the container's cgroup (typically the one consuming the most memory), which surfaces as exit code 137 (128 + signal 9), and Kubernetes reports the termination reason as OOMKilled in the pod status.
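A quick local demonstration (no cluster needed) of where 137 comes from: a process killed with SIGKILL reports exit code 128 + 9, the same code Kubernetes records for an OOM-killed container.

```shell
# Kill a background process with SIGKILL and observe exit code 137.
sleep 30 &
pid=$!
kill -9 "$pid"
status=0
wait "$pid" || status=$?
echo "exit code: $status"   # prints: exit code: 137
```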
Symptoms
- Pod status shows OOMKilled in kubectl get pods output
- Container exit code is 137
- kubectl describe pod shows 'OOMKilled' as the termination reason
- Container restarts frequently with CrashLoopBackOff after OOM events
- Node dmesg shows 'oom-kill' entries for the container's cgroup
Common Causes
- Memory limit set below the application's real working set
- Memory leaks that grow usage over time
- A runtime (e.g. the JVM) sized without awareness of the container limit
- Traffic spikes or batch jobs that temporarily raise memory usage
- Unbounded in-process caches or buffers
Step-by-Step Troubleshooting
1. Confirm the OOMKill
First, verify that the container was actually OOM-killed and not terminated for another reason.
# Check pod status
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
# Look for reason: OOMKilled and exitCode: 137
kubectl describe pod <pod-name>
The output will show something like:
Last State:   Terminated
  Reason:     OOMKilled
  Exit Code:  137
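The check can also be scripted rather than eyeballed. A minimal sketch, using a hard-coded sample of the JSON the jsonpath query above returns (a live query needs a cluster):

```shell
# Classify a container's last terminated state from its JSON representation.
# The sample value below stands in for the output of:
#   kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
state='{"exitCode":137,"reason":"OOMKilled"}'
case "$state" in
  *'"reason":"OOMKilled"'*) verdict="OOM-killed" ;;
  *)                        verdict="terminated for another reason" ;;
esac
echo "container was $verdict"   # prints: container was OOM-killed
```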
2. Check Current Memory Limits
Compare the container's memory limit with what the application actually needs.
# View memory limits
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources}'
# View actual memory usage (requires metrics-server)
kubectl top pod <pod-name>
# View memory usage over time (requires Prometheus)
# PromQL: container_memory_working_set_bytes{pod="<pod-name>"}
If kubectl top shows usage near the limit before the OOM event, the limit is likely too low.
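Comparing `kubectl top` output against the limit is easier in bytes. A hypothetical helper (not part of kubectl) that converts Kubernetes memory quantities, shown with sample values:

```shell
# Convert a Kubernetes memory quantity (Ki/Mi/Gi) to bytes so that usage
# and limit can be compared numerically.
to_bytes() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$1" ;;  # assume a plain byte count
  esac
}

usage=$(to_bytes 900Mi)   # sample reading from `kubectl top`
limit=$(to_bytes 1Gi)     # sample limit from the pod spec
echo "usage is $(( usage * 100 / limit ))% of the limit"   # prints: usage is 87% of the limit
```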
3. Check Node-Level OOM Events
The Linux kernel logs OOM kill events in dmesg. Check the node where the pod was running.
# Find the node
kubectl get pod <pod-name> -o jsonpath='{.spec.nodeName}'
# Check dmesg on the node
kubectl debug node/<node-name> -it --image=ubuntu -- bash -c "dmesg | grep -i oom"
You will see entries like:
Memory cgroup out of memory: Killed process 12345 (java) total-vm:2048000kB, anon-rss:1024000kB
This confirms the OOM kill and shows exactly how much memory the process was using.
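The interesting number in that entry is anon-rss, the resident memory at kill time. A sketch that extracts it from the sample dmesg line above:

```shell
# Pull the resident set size out of a kernel OOM log line.
line='Memory cgroup out of memory: Killed process 12345 (java) total-vm:2048000kB, anon-rss:1024000kB'
rss_kb=$(printf '%s\n' "$line" | sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p')
echo "killed process had ${rss_kb} kB resident (~$(( rss_kb / 1024 )) MiB)"
# prints: killed process had 1024000 kB resident (~1000 MiB)
```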
4. Profile Application Memory Usage
Understanding your application's actual memory needs is critical for setting correct limits.
For Java applications (these examples assume the JVM runs as PID 1 in the container):
# Check JVM flags
kubectl exec <pod-name> -- jcmd 1 VM.flags
# Check heap usage
kubectl exec <pod-name> -- jcmd 1 GC.heap_info
# Create a heap dump before the OOM (if you can catch it in time)
kubectl exec <pod-name> -- jcmd 1 GC.heap_dump /tmp/heap.hprof
kubectl cp <pod-name>:/tmp/heap.hprof ./heap.hprof
For Node.js applications (note: kubectl exec starts a new Node process, so this shows that process's baseline usage; to see the running app's numbers, expose process.memoryUsage() from inside the app, e.g. via a debug endpoint):
# Check heap statistics of a fresh Node process
kubectl exec <pod-name> -- node -e "console.log(process.memoryUsage())"
For Go applications:
# If pprof is enabled
kubectl port-forward <pod-name> 6060:6060
curl http://localhost:6060/debug/pprof/heap > heap.prof
go tool pprof heap.prof
5. Increase Memory Limits
If the application genuinely needs more memory, increase the limit:
# Quick fix via patch
kubectl patch deployment <deploy-name> --type=json \
-p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"}]'
Or update the manifest:
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"
Sizing guidelines:
- Set requests to the application's typical working memory
- Set limits to 1.5x-2x the request to handle spikes
- For guaranteed QoS, set requests equal to limits
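The sizing rule above as arithmetic, with sample numbers: request = typical working memory, limit = 1.5x-2x the request.

```shell
# Derive a request/limit pair from an observed typical working set (sample value).
typical_mib=400
request_mib=$typical_mib
limit_low_mib=$(( request_mib * 3 / 2 ))   # 1.5x headroom
limit_high_mib=$(( request_mib * 2 ))      # 2x headroom
echo "requests: ${request_mib}Mi, limits: ${limit_low_mib}Mi-${limit_high_mib}Mi"
# prints: requests: 400Mi, limits: 600Mi-800Mi
```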
6. Fix JVM Memory Configuration
Java applications are especially prone to OOM-kills because the JVM uses memory beyond the heap (metaspace, thread stacks, native memory, etc.).
env:
  - name: JAVA_OPTS
    value: "-XX:MaxRAMPercentage=75.0 -XX:+UseContainerSupport"
resources:
  requests:
    memory: "1Gi"
  limits:
    memory: "1Gi"
Setting MaxRAMPercentage=75.0 caps the heap at 75% of the container's memory limit, leaving 25% for metaspace, thread stacks, and other native memory. The UseContainerSupport flag (on by default since JDK 10) makes the JVM size itself from the cgroup limit rather than the node's physical memory. Note that JAVA_OPTS is only a convention honored by many images' entrypoints; JDK_JAVA_OPTIONS is read by the java launcher itself since JDK 9.
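The split implied by MaxRAMPercentage=75.0 with a 1Gi limit, worked out numerically (sample numbers; actual non-heap usage varies by workload):

```shell
# Heap vs non-heap headroom under MaxRAMPercentage=75.0.
limit_mib=1024
heap_mib=$(( limit_mib * 75 / 100 ))
nonheap_mib=$(( limit_mib - heap_mib ))
echo "max heap: ${heap_mib}Mi, headroom for non-heap: ${nonheap_mib}Mi"
# prints: max heap: 768Mi, headroom for non-heap: 256Mi
```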
7. Use Vertical Pod Autoscaler for Recommendations
VPA can analyze historical usage and recommend appropriate resource values.
# Install VPA if not present
# Check VPA recommendations
kubectl get vpa <vpa-name> -o jsonpath='{.status.recommendation.containerRecommendations}'
Example VPA resource:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # Start with recommendations only
8. Set Up Memory Monitoring and Alerts
Prevent future OOM-kills by alerting before they happen.
# Prometheus alert rule
- alert: ContainerMemoryNearLimit
  expr: |
    container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is using >80% of its memory limit"
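The alert condition checked locally with sample values, to show when it fires: a 900Mi working set against a 1024Mi limit crosses the 80% threshold.

```shell
# Evaluate the >80%-of-limit alert condition with sample numbers.
working_set_mib=900
limit_mib=1024
pct=$(( working_set_mib * 100 / limit_mib ))
if [ "$pct" -gt 80 ]; then echo "alert: ${pct}% of limit"; fi
# prints: alert: 87% of limit
```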
9. Verify the Fix
After adjusting limits, monitor the pod to confirm it no longer gets OOM-killed.
# Watch pod status
kubectl get pods -w
# Monitor memory usage
kubectl top pod <pod-name> --containers
# Check no OOM events in the last hour
kubectl get events --field-selector reason=OOMKilling --sort-by='.lastTimestamp'
Quick Reference: OOMKilled vs Eviction
| Scenario | Trigger | Mechanism |
|----------|---------|-----------|
| OOMKilled | Container exceeds its memory limit | Kernel OOM killer, exit code 137 |
| Eviction | Node runs out of allocatable memory | Kubelet evicts pods based on QoS class |
OOMKilled is a container-level event, while eviction is a node-level event. Both indicate memory pressure but require different fixes.
How to Explain This in an Interview
I would explain that OOMKilled is the kernel's OOM killer terminating a process that exceeded its cgroup memory limit. Kubernetes sets these cgroup limits based on the container's resources.limits.memory. The key distinction is between the container being OOM-killed (exit code 137) and the node running out of memory (which triggers pod eviction instead). I would discuss how to right-size limits using metrics from Prometheus or VPA recommendations, and how to detect memory leaks using profiling tools.
Prevention
- Set memory requests equal to limits for guaranteed QoS class
- Use Vertical Pod Autoscaler (VPA) to recommend appropriate limits
- Monitor memory usage trends with Prometheus and set alerts at 80% of limit
- For JVM apps use -XX:MaxRAMPercentage=75.0 instead of fixed -Xmx
- Implement bounded caches with eviction policies in applications
- Load test applications to understand memory requirements under peak traffic