Kubernetes DiskPressure
Causes and Fixes
DiskPressure is a node condition that indicates the node is running low on available disk space. When active, the kubelet stops accepting new pods, garbage collects unused images and dead containers, and may evict pods to reclaim disk space. This condition affects both the root filesystem and the container image filesystem.
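The eviction thresholds that trigger this condition are configurable in the kubelet. A sketch of the relevant fields, shown with the kubelet's documented defaults (other default signals such as memory.available and nodefs.inodesFree are omitted here):

```yaml
# KubeletConfiguration fragment: the disk eviction signals behind DiskPressure.
# The values below are the kubelet defaults.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  nodefs.available: "10%"    # root filesystem: logs, emptyDir, kubelet data
  imagefs.available: "15%"   # container runtime filesystem: images, writable layers
```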
Symptoms
- kubectl describe node shows DiskPressure condition as True
- Node has the taint node.kubernetes.io/disk-pressure:NoSchedule
- New pods cannot be scheduled to the affected node
- Pod events show eviction due to disk pressure
- Container image pulls may fail on the affected node
- kubectl get events shows 'NodeHasDiskPressure' warnings
Common Causes
- Container logs growing without rotation
- Accumulation of unused container images
- Pods writing heavily to emptyDir volumes or the container writable layer
- Evicted or failed pods leaving residual data behind
- Node disks sized too small for the workload
Step-by-Step Troubleshooting
1. Identify Nodes with DiskPressure
# Check all node conditions
kubectl get nodes -o custom-columns='NAME:.metadata.name,DISK_PRESSURE:.status.conditions[?(@.type=="DiskPressure")].status'
# Get details
kubectl describe node <node-name> | grep -A5 DiskPressure
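To turn the custom-columns output above into a plain list of affected nodes, a small helper can filter the table (the function name is illustrative, not part of kubectl):

```shell
# nodes_with_disk_pressure: read the NAME/DISK_PRESSURE table printed by
# the custom-columns command above on stdin, skip the header row, and
# print only the names of nodes whose DiskPressure condition is True.
nodes_with_disk_pressure() {
  awk 'NR > 1 && $2 == "True" { print $1 }'
}
```

Pipe the `kubectl get nodes -o custom-columns=...` command into it to get one node name per line, ready for a loop.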
2. Check Disk Usage on the Node
The debug pod mounts the node's root filesystem at /host, so run commands through chroot /host to inspect the node itself rather than the debug container.
# Overall disk usage
kubectl debug node/<node-name> -it --image=ubuntu -- chroot /host sh -c "df -h"
# Check specific directories
kubectl debug node/<node-name> -it --image=ubuntu -- chroot /host sh -c "du -sh /var/lib/containerd /var/log/pods /var/lib/kubelet 2>/dev/null"
Key directories:
- /var/lib/containerd or /var/lib/docker: container images and writable layers
- /var/log/pods: container log files
- /var/lib/kubelet: kubelet data, emptyDir volumes
3. Check Container Log Sizes
Container logs are often the biggest disk consumer.
# Find the largest log files
kubectl debug node/<node-name> -it --image=ubuntu -- chroot /host sh -c "find /var/log/pods -name '*.log' -exec ls -lhS {} + | head -20"
# Total log size
kubectl debug node/<node-name> -it --image=ubuntu -- chroot /host sh -c "du -sh /var/log/pods"
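Where the per-file listing above is too noisy, log usage can be totalled per pod directory instead. A sketch, run via chroot /host in the debug pod (the helper name is an illustrative addition, and GNU find's -printf is assumed, which standard Linux nodes provide):

```shell
# pod_log_usage: print total log bytes per pod directory, largest first.
# Defaults to /var/log/pods; pass a different root for testing.
pod_log_usage() {
  root="${1:-/var/log/pods}"
  find "$root" -name '*.log' -printf '%s %h\n' 2>/dev/null \
    | awk '{ t[$2] += $1 } END { for (d in t) printf "%d %s\n", t[d], d }' \
    | sort -rn
}
```

The output pairs a byte total with each pod directory, so the top line points at the pod writing the most log data.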
4. Check Container Image Disk Usage
# List images and their sizes
kubectl debug node/<node-name> -it --image=ubuntu -- chroot /host sh -c "crictl images --no-trunc"
# Check total image storage (containerd's default overlayfs snapshotter path)
kubectl debug node/<node-name> -it --image=ubuntu -- chroot /host sh -c "du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
5. Clean Up Unused Images
Trigger garbage collection of unused images.
# Remove unused images
kubectl debug node/<node-name> -it --image=ubuntu -- chroot /host sh -c "crictl rmi --prune"
# List images sorted by size (the 4th column) to spot large unused ones
kubectl debug node/<node-name> -it --image=ubuntu -- chroot /host sh -c "crictl images | sort -k4 -h -r | head -20"
6. Clean Up Dead Containers
# Remove stopped containers
kubectl debug node/<node-name> -it --image=ubuntu -- chroot /host sh -c "crictl rm \$(crictl ps -a -q --state exited)"
7. Delete Evicted Pods
Evicted pods leave behind residual data. Clean them up.
# Delete all evicted pods
kubectl get pods -A --field-selector=status.phase=Failed -o json | \
jq -r '.items[] | select(.status.reason=="Evicted") | "\(.metadata.namespace) \(.metadata.name)"' | \
while read ns name; do kubectl delete pod "$name" -n "$ns"; done
8. Set Log Rotation
Configure log rotation to prevent logs from consuming all disk space.
Kubelet log rotation settings (kubelet config):
containerLogMaxSize: "50Mi"
containerLogMaxFiles: 3
This limits each container log file to 50Mi and keeps at most 3 files per container, capping total log usage per container at roughly 150Mi.
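In context, these fields sit at the top level of the kubelet configuration file (typically /var/lib/kubelet/config.yaml on kubeadm-provisioned nodes; the path varies by distribution):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "50Mi"   # rotate a container log once it reaches 50Mi
containerLogMaxFiles: 3       # keep at most 3 log files per container
```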
9. Set emptyDir Size Limits
Prevent pods from using unlimited disk via emptyDir volumes.
volumes:
  - name: cache
    emptyDir:
      sizeLimit: "500Mi"
When the emptyDir exceeds the size limit, the pod is evicted. This protects the node from individual pods consuming all disk space.
10. Set Ephemeral Storage Requests and Limits
Kubernetes can track and limit total ephemeral storage usage per pod (logs + emptyDir + writable layer).
resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "2Gi"
When the pod exceeds the ephemeral storage limit, it is evicted.
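Putting the last two protections together, a minimal pod manifest might look like this (the pod and volume names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: disk-bounded-app
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
      resources:
        requests:
          ephemeral-storage: "1Gi"   # scheduler reserves this much node disk
        limits:
          ephemeral-storage: "2Gi"   # pod is evicted above this
      volumeMounts:
        - name: cache
          mountPath: /cache
  volumes:
    - name: cache
      emptyDir:
        sizeLimit: "500Mi"           # caps the scratch volume independently
```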
11. Configure Image Garbage Collection
Adjust the kubelet's image garbage collection thresholds.
# kubelet configuration
imageGCHighThresholdPercent: 85 # Start GC when disk is 85% full
imageGCLowThresholdPercent: 80 # Stop GC when disk drops to 80%
More aggressive settings for small disks:
imageGCHighThresholdPercent: 70
imageGCLowThresholdPercent: 60
12. Resize the Node Disk
If the node disk is too small for the workload, resize it.
# AWS: Resize the EBS volume
aws ec2 modify-volume --volume-id <vol-id> --size 200
# Then extend the partition and filesystem on the node
# (ext4 shown; use xfs_growfs for XFS; growpart comes from the cloud-utils/cloud-guest-utils package and must be present on the node)
kubectl debug node/<node-name> -it --image=ubuntu -- chroot /host sh -c "growpart /dev/xvda 1 && resize2fs /dev/xvda1"
For managed Kubernetes services, update the node group configuration to use larger disks and replace nodes.
13. Verify Resolution
# Check DiskPressure is cleared
kubectl describe node <node-name> | grep DiskPressure
# Should show: DiskPressure False
# Check taint is removed
kubectl describe node <node-name> | grep disk-pressure
# Should be empty
# Verify disk space
kubectl debug node/<node-name> -it --image=ubuntu -- chroot /host sh -c "df -h"
# Verify new pods can be scheduled
kubectl run test --image=busybox --restart=Never --command -- sleep 10
kubectl get pod test -o wide
kubectl delete pod test
The DiskPressure condition should clear automatically once available disk space rises above the eviction threshold.
How to Explain This in an Interview
I would explain that DiskPressure is monitored by the kubelet against two filesystems: nodefs (the node's root filesystem where kubelet stores logs and local data) and imagefs (where the container runtime stores images and writable layers). Default thresholds are nodefs.available < 10% and imagefs.available < 15%. I would describe the garbage collection mechanism (images are collected when disk usage exceeds the high threshold, starting with least recently used images), and how to prevent disk pressure by limiting log sizes, setting emptyDir sizeLimit, and using appropriate node disk sizes.
Prevention
- Set log rotation policies and maximum log file sizes
- Set sizeLimit on emptyDir volumes
- Configure image garbage collection thresholds appropriately
- Use ephemeral storage requests and limits on pods
- Monitor node disk usage and alert at 70% utilization
- Size node disks appropriately for the workload
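The 70% alerting suggestion above can be expressed as a Prometheus rule, assuming node_exporter metrics are being scraped (the group, alert name, and thresholds are illustrative):

```yaml
groups:
  - name: node-disk
    rules:
      - alert: NodeDiskUsageHigh
        # Fires when less than 30% of the root filesystem is available,
        # i.e. utilization is above 70%, sustained for 15 minutes.
        expr: |
          node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"} < 0.30
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 70% on {{ $labels.instance }}"
```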