Kubernetes OOMKilled

Causes and Fixes

OOMKilled means a container was terminated because it exceeded its memory limit. The Linux kernel's Out-Of-Memory (OOM) killer sends SIGKILL to the process with the highest OOM score in the container's cgroup (usually the largest memory consumer), which surfaces as exit code 137 (128 + signal 9). Kubernetes reports this as OOMKilled in the pod status.

Symptoms

  • Pod status shows OOMKilled in kubectl get pods output
  • Container exit code is 137
  • kubectl describe pod shows 'OOMKilled' as the termination reason
  • Container restarts frequently with CrashLoopBackOff after OOM events
  • Node dmesg shows 'oom-kill' entries for the container's cgroup
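Exit code 137 is simply 128 plus the signal number (SIGKILL = 9). A minimal sketch makes the decoding explicit; `decode_exit_code` is a hypothetical helper, not a kubectl feature:

```shell
# Hypothetical helper: decode a container exit code into the signal that killed it.
# Codes above 128 mean "terminated by signal (code - 128)"; 137 = 128 + 9 (SIGKILL).
decode_exit_code() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "exited with status $code"
  fi
}

decode_exit_code 137   # killed by signal 9 (SIGKILL, typical of OOM kills)
decode_exit_code 143   # killed by signal 15 (SIGTERM, a normal graceful shutdown)
```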

Common Causes

1. Memory limit set too low
The container's memory limit is lower than what the application actually needs. Profile the application under load to determine correct limits.

2. Memory leak in the application
The application gradually consumes more memory over time until it hits the limit. Use profiling tools to detect and fix the leak.

3. JVM heap not aligned with container limits
Java applications may set -Xmx higher than the container memory limit, or fail to account for off-heap memory. Use -XX:MaxRAMPercentage instead.

4. Sudden traffic spike
A burst of requests causes the application to buffer more data in memory than usual, pushing it past the limit.

5. Large in-memory caches or datasets
The application loads large datasets or maintains unbounded caches. Implement cache eviction policies and size limits.

Step-by-Step Troubleshooting

1. Confirm the OOMKill

First, verify that the container was actually OOM-killed and not terminated for another reason.

# Check pod status
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

# Look for reason: OOMKilled and exitCode: 137
kubectl describe pod <pod-name>

The output will show something like:

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
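If you want to script this confirmation (for example in an incident runbook), the two fields can be pulled out with jsonpath and compared. `is_oom_killed` below is a hypothetical helper, shown here with hardcoded sample values:

```shell
# Hypothetical helper: true only when the last termination was an OOM kill.
# In practice, feed it the two fields from:
#   kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason} {.status.containerStatuses[0].lastState.terminated.exitCode}'
is_oom_killed() {
  local reason=$1 exit_code=$2
  [ "$reason" = "OOMKilled" ] && [ "$exit_code" = "137" ]
}

if is_oom_killed "OOMKilled" "137"; then
  echo "confirmed: container was OOM-killed"
fi
```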

2. Check Current Memory Limits

Compare the container's memory limit with what the application actually needs.

# View memory limits
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources}'

# View actual memory usage (requires metrics-server)
kubectl top pod <pod-name>

# View memory usage over time (requires Prometheus)
# PromQL: container_memory_working_set_bytes{pod="<pod-name>"}

If kubectl top shows usage near the limit before the OOM event, the limit is likely too low.
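As a rule of thumb, sustained usage above roughly 80% of the limit leaves little headroom for spikes. The arithmetic is simple; `pct_of_limit` is a hypothetical helper taking both values in bytes:

```shell
# Hypothetical helper: working-set usage as an integer percentage of the limit.
pct_of_limit() {
  local usage_bytes=$1 limit_bytes=$2
  echo $(( usage_bytes * 100 / limit_bytes ))
}

# e.g. a ~858 MiB working set against a 1 GiB (1073741824-byte) limit:
pct_of_limit 900000000 1073741824   # 83 -> little headroom, limit likely too low
```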

3. Check Node-Level OOM Events

The Linux kernel logs OOM kill events in dmesg. Check the node where the pod was running.

# Find the node
kubectl get pod <pod-name> -o jsonpath='{.spec.nodeName}'

# Check dmesg on the node
kubectl debug node/<node-name> -it --image=ubuntu -- bash -c "dmesg | grep -i oom"

You will see entries like:

Memory cgroup out of memory: Killed process 12345 (java) total-vm:2048000kB, anon-rss:1024000kB

This confirms the OOM kill and shows exactly how much memory the process was using.
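The anon-rss field in that line is the process's resident memory (in kB) at the moment it was killed. A quick parsing sketch, using the sample line from above:

```shell
# Extract anon-rss from an oom-kill log line (sample line from the text above).
line='Memory cgroup out of memory: Killed process 12345 (java) total-vm:2048000kB, anon-rss:1024000kB'
rss_kb=$(echo "$line" | grep -o 'anon-rss:[0-9]*' | cut -d: -f2)
echo "resident memory at kill time: $((rss_kb / 1024)) MiB"   # 1000 MiB
```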

4. Profile Application Memory Usage

Understanding your application's actual memory needs is critical for setting correct limits.

For Java applications:

# Check JVM flags
kubectl exec <pod-name> -- jcmd 1 VM.flags

# Check heap usage
kubectl exec <pod-name> -- jcmd 1 GC.heap_info

# Create a heap dump before the OOM (if you can catch it in time)
kubectl exec <pod-name> -- jcmd 1 GC.heap_dump /tmp/heap.hprof
kubectl cp <pod-name>:/tmp/heap.hprof ./heap.hprof

For Node.js applications:

# Note: this starts a *new* node process and reports that process's memory,
# not the application's. To measure the app itself, call process.memoryUsage()
# inside it (e.g. via a debug endpoint or the inspector).
kubectl exec <pod-name> -- node -e "console.log(process.memoryUsage())"

For Go applications:

# If pprof is enabled
kubectl port-forward <pod-name> 6060:6060
curl http://localhost:6060/debug/pprof/heap > heap.prof
go tool pprof heap.prof

5. Increase Memory Limits

If the application genuinely needs more memory, increase the limit:

# Quick fix via patch
kubectl patch deployment <deploy-name> --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"}]'

Or update the manifest:

resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"

Sizing guidelines:

  • Set requests to the application's typical working memory
  • Set limits to 1.5x-2x the request to handle spikes
  • For the Guaranteed QoS class, set requests equal to limits for every resource (CPU and memory) on every container

6. Fix JVM Memory Configuration

Java applications are especially prone to OOM-kills because the JVM uses memory beyond the heap (metaspace, thread stacks, native memory, etc.).

env:
  - name: JAVA_TOOL_OPTIONS  # read directly by the JVM; JAVA_OPTS only works if the entrypoint script forwards it
    value: "-XX:MaxRAMPercentage=75.0 -XX:+UseContainerSupport"
resources:
  requests:
    memory: "1Gi"
  limits:
    memory: "1Gi"

The MaxRAMPercentage=75.0 leaves 25% of the container's memory for non-heap usage. The UseContainerSupport flag (on by default since JDK 10) makes the JVM aware of container limits.
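The resulting heap ceiling is easy to sanity-check by hand; `max_heap_mib` below is a hypothetical helper that mirrors the percentage calculation (to see what the JVM actually picked, `java -XshowSettings:vm -version` inside the container prints the max heap):

```shell
# Hypothetical helper: max heap the JVM will use for a given container limit
# and MaxRAMPercentage value.
max_heap_mib() {
  local limit_mib=$1 percent=$2
  echo $(( limit_mib * percent / 100 ))
}

max_heap_mib 1024 75   # 768 -> ~768 MiB heap, ~256 MiB left for non-heap memory
```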

7. Use Vertical Pod Autoscaler for Recommendations

VPA can analyze historical usage and recommend appropriate resource values.

# VPA is not installed by default; deploy it from the kubernetes/autoscaler repo first
# Check VPA recommendations
kubectl get vpa <vpa-name> -o jsonpath='{.status.recommendation.containerRecommendations}'

Example VPA resource:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # Start with recommendations only

8. Set Up Memory Monitoring and Alerts

Prevent future OOM-kills by alerting before they happen.

# Prometheus alert rule
- alert: ContainerMemoryNearLimit
  expr: |
    (container_memory_working_set_bytes{container!=""} / container_spec_memory_limit_bytes{container!=""}) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is using >80% of its memory limit"

9. Verify the Fix

After adjusting limits, monitor the pod to confirm it no longer gets OOM-killed.

# Watch pod status
kubectl get pods -w

# Monitor memory usage
kubectl top pod <pod-name> --containers

# Check no OOM events in the last hour
kubectl get events --field-selector reason=OOMKilling --sort-by='.lastTimestamp'

Quick Reference: OOMKilled vs Eviction

| Scenario | Trigger | Mechanism |
|----------|---------|-----------|
| OOMKilled | Container exceeds its memory limit | Kernel OOM killer, exit code 137 |
| Eviction | Node runs out of allocatable memory | Kubelet evicts pods based on QoS class |

OOMKilled is a container-level event, while eviction is a node-level event. Both indicate memory pressure but require different fixes.
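The two cases live in different API fields: an evicted pod sets .status.reason to Evicted, while an OOM-killed container sets the reason on the container's lastState.terminated. A hypothetical classifier sketch, with hardcoded sample inputs:

```shell
# Hypothetical helper: route a memory-pressure termination to the right fix,
# given the pod-level reason and the container's last termination reason.
classify_termination() {
  local pod_reason=$1 container_reason=$2
  if [ "$pod_reason" = "Evicted" ]; then
    echo "node-level eviction: check node memory pressure and QoS classes"
  elif [ "$container_reason" = "OOMKilled" ]; then
    echo "container-level OOM: raise the limit or fix the leak"
  else
    echo "not a memory-pressure termination"
  fi
}

classify_termination "" "OOMKilled"   # container-level OOM: raise the limit or fix the leak
classify_termination "Evicted" ""     # node-level eviction: check node memory pressure and QoS classes
```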

How to Explain This in an Interview

I would explain that OOMKilled is the kernel's OOM killer terminating a process that exceeded its cgroup memory limit. Kubernetes sets these cgroup limits based on the container's resources.limits.memory. The key distinction is between the container being OOM-killed (exit code 137) and the node running out of memory (which triggers pod eviction instead). I would discuss how to right-size limits using metrics from Prometheus or VPA recommendations, and how to detect memory leaks using profiling tools.

Prevention

  • Set requests equal to limits (CPU and memory, on every container) to get the Guaranteed QoS class
  • Use Vertical Pod Autoscaler (VPA) to recommend appropriate limits
  • Monitor memory usage trends with Prometheus and set alerts at 80% of limit
  • For JVM apps use -XX:MaxRAMPercentage=75.0 instead of fixed -Xmx
  • Implement bounded caches with eviction policies in applications
  • Load test applications to understand memory requirements under peak traffic

Related Errors