Kubernetes Node Not Ready
Causes and Fixes
A node in NotReady status means the kubelet on that node has stopped reporting a healthy status to the API server. Pods on a NotReady node keep running but are no longer monitored, and new pods will not be scheduled there. Once the node stays NotReady past the eviction timeout (default 5 minutes — in current Kubernetes this is the default toleration for the node.kubernetes.io/not-ready taint, which replaced the pod-eviction-timeout flag), its pods are evicted.
Symptoms
- kubectl get nodes shows one or more nodes with STATUS NotReady
- Pods on the node stop receiving traffic (endpoints removed)
- New pods are not scheduled to the NotReady node
- After timeout, pods are evicted and rescheduled to healthy nodes
- kubectl describe node shows 'KubeletNotReady' or condition Ready=False
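A quick way to spot affected nodes is to filter the STATUS column of kubectl get nodes. A minimal sketch — the node names and sample output below are hypothetical; in a real cluster, replace the heredoc with `kubectl get nodes --no-headers`:

```shell
#!/bin/sh
# Sketch: print nodes whose STATUS column is not "Ready".
# Sample output is hypothetical; pipe in real output from:
#   kubectl get nodes --no-headers
awk '$2 != "Ready" { print $1, $2 }' <<'EOF'
node-1   Ready      control-plane   90d   v1.29.2
node-2   NotReady   <none>          90d   v1.29.2
node-3   Ready      <none>          90d   v1.29.2
EOF
```

Note that this also catches variants like Ready,SchedulingDisabled (a cordoned node), which is usually what you want in a sweep.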
Common Causes
- Kubelet stopped, crashed, or unable to reach the API server
- Container runtime (containerd or CRI-O) unresponsive ("PLEG is not healthy")
- Resource exhaustion: disk pressure, memory pressure, or PID exhaustion
- Expired kubelet or API server certificates
- Network partition between the node and the control plane
- Kernel panic, hardware failure, or the VM/machine being down
Step-by-Step Troubleshooting
1. Identify NotReady Nodes
# Check node status
kubectl get nodes
# Get detailed conditions for a NotReady node
kubectl describe node <node-name>
Look at the Conditions section:
Conditions:
  Type    Status   Reason            Message
  ----    ------   ------            -------
  Ready   False    KubeletNotReady   PLEG is not healthy
The Reason and Message tell you what is wrong.
2. Check if the Node is Reachable
# Ping the node (if you have network access)
ping <node-ip>
# SSH to the node
ssh <node-ip>
# If using a managed service, use the cloud provider's console
# AWS: aws ssm start-session --target <instance-id>
# GCP: gcloud compute ssh <instance-name>
If the node is unreachable, the issue is likely a network partition or the VM/machine is down.
3. Check the Kubelet Service
Once on the node:
# Check kubelet status
systemctl status kubelet
# If kubelet is not running, check the logs
journalctl -u kubelet --since "30 minutes ago" | tail -100
# Check for common kubelet errors
journalctl -u kubelet | grep -i "error\|failed\|unable" | tail -20
Common kubelet issues:
- Cannot reach API server: Certificate or network issue
- PLEG not healthy: Container runtime is unresponsive
- Disk pressure: Not enough disk space
If kubelet is stopped, restart it:
systemctl restart kubelet
systemctl status kubelet
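The grep step above can be turned into a rough classifier that tags log lines with their likely failure category. A sketch — the sample log lines are hypothetical; in practice, pipe in `journalctl -u kubelet --since "30 minutes ago"`:

```shell
#!/bin/sh
# Sketch: tag common kubelet failure signatures in log lines.
# Sample lines are hypothetical; pipe in real output from:
#   journalctl -u kubelet --since "30 minutes ago"
awk '
  /PLEG is not healthy/ { print "runtime:   " $0; next }
  /x509|certificate/    { print "certs:     " $0; next }
  /connection refused/  { print "apiserver: " $0; next }
' <<'EOF'
E0101 kubelet.go: PLEG is not healthy: pleg was last seen active 3m0s ago
E0101 kubelet.go: part of the bootstrap client certificate is expired
E0101 kubelet.go: failed to connect: connection refused
EOF
```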
4. Check the Container Runtime
The kubelet depends on the container runtime (containerd or CRI-O).
# Check containerd status
systemctl status containerd
# Check CRI-O status
systemctl status crio
# Check runtime responsiveness
crictl info
crictl ps
# If the runtime is down, restart it
systemctl restart containerd
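When scripting checks like the ones above, crictl needs the right CRI socket. A sketch that picks the socket based on which runtime is present — the paths are the common defaults for containerd and CRI-O, and the root-directory parameter exists only so the logic can be exercised off-node:

```shell
#!/bin/sh
# Sketch: choose the CRI socket for crictl based on which runtime's
# socket exists. Paths are the common defaults; adjust for your distro.
# The optional root prefix lets the logic be tested outside a real node.
runtime_socket() {
  root="${1:-}"
  if [ -e "$root/run/containerd/containerd.sock" ]; then
    echo "unix://$root/run/containerd/containerd.sock"
  elif [ -e "$root/var/run/crio/crio.sock" ]; then
    echo "unix://$root/var/run/crio/crio.sock"
  else
    echo "no CRI socket found" >&2
    return 1
  fi
}
# Example: crictl --runtime-endpoint "$(runtime_socket)" info
```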
5. Check Node Resources
Resource exhaustion (disk, memory, or PIDs) is a common cause of NotReady conditions.
# On the node, check disk space
df -h
# Check memory
free -h
# Check PIDs
ls /proc | grep -c '^[0-9]'
cat /proc/sys/kernel/pid_max
# Check system load
uptime
top -bn1 | head -20
Quick fixes for resource issues:
# Free disk space
crictl rmi --prune # Remove unused images
docker system prune -af # Only if using Docker (dockershim was removed in Kubernetes 1.24)
# Check for large log files
du -sh /var/log/pods/* | sort -h | tail -10
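The disk check can be automated against the kubelet's default image-GC high threshold (85% usage). A sketch — the df output below is hypothetical; in practice pipe in `df -P /var/lib/kubelet /var/lib/containerd`:

```shell
#!/bin/sh
# Sketch: flag filesystems above the kubelet's default image-GC high
# threshold (imageGCHighThresholdPercent, 85%). Sample output is
# hypothetical; pipe in real output from: df -P
awk -v limit=85 'NR > 1 {
  use = $5; sub(/%/, "", use)
  if (use + 0 > limit) print $6 " at " $5
}' <<'EOF'
Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/sda1        103079200 94833864   8245336      92% /
/dev/sdb1        515010816 25750540 489260276       5% /var/lib/kubelet
EOF
```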
6. Check Certificates
Expired certificates prevent the kubelet from communicating with the API server.
# Check kubelet certificate expiration
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
# Check API server CA
openssl x509 -in /etc/kubernetes/pki/ca.crt -noout -dates
# If using kubeadm, check all certificates
kubeadm certs check-expiration
If certificates are expired:
# If kubelet certificate auto-rotation is enabled, the kubelet requests
# new client certificates itself; restarting it triggers a fresh request
systemctl restart kubelet
# For kubeadm clusters, renew the control-plane certificates
# (note: this does not renew the kubelet client certificate)
kubeadm certs renew all
# Restart the control-plane components afterward so they load the new certs
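For alerting, the openssl output above can be converted into days-until-expiry. A sketch that parses openssl's `notAfter=` line, assuming GNU date is available — the example date is hypothetical; in practice feed it `openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate`:

```shell
#!/bin/sh
# Sketch: days until a certificate expires, from openssl's
# "notAfter=" line. Requires GNU date. The date below is a
# hypothetical example.
days_until_expiry() {
  end="${1#notAfter=}"
  end_s=$(date -d "$end" +%s)
  now_s=$(date +%s)
  echo $(( (end_s - now_s) / 86400 ))
}
days_until_expiry "notAfter=Jan  1 00:00:00 2031 GMT"
```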
7. Check Network Connectivity to API Server
# From the node, test API server connectivity
curl -k https://<api-server-ip>:6443/healthz
# Check the kubelet's kubeconfig for the API server endpoint
# (the endpoint lives in the kubeconfig, not in config.yaml)
grep server /etc/kubernetes/kubelet.conf
# Check DNS resolution of the API server hostname
# (names like kubernetes.default.svc.cluster.local resolve only inside
# pods via cluster DNS, not through the node's own resolver)
nslookup <api-server-hostname>
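The connectivity probe can be paired with a small classifier for curl's documented exit codes, which often pinpoint the failure mode faster than reading output. A sketch; the diagnosis strings are my own wording:

```shell
#!/bin/sh
# Sketch: map curl exit codes from the API-server health probe to a
# likely diagnosis. The codes (0, 6, 7, 28, 60) are curl's documented
# exit codes; the diagnosis wording is illustrative.
diagnose() {
  case "$1" in
    0)  echo "API server reachable" ;;
    6)  echo "DNS: cannot resolve API server hostname" ;;
    7)  echo "connection refused: API server down or port blocked" ;;
    28) echo "timeout: likely network partition or firewall" ;;
    60) echo "TLS: certificate verification failed (retry with -k to confirm)" ;;
    *)  echo "curl failed with exit code $1" ;;
  esac
}
# Example:
#   curl -sk --max-time 5 https://<api-server-ip>:6443/healthz >/dev/null
#   diagnose $?
```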
8. Check Kernel and Hardware
# Check for kernel panics or hardware errors
dmesg | tail -50
dmesg | grep -iE "error|panic|hardware|mce|memory"
# Check system journal for critical messages
journalctl -p err --since "1 hour ago"
If the node had a kernel panic, it may have rebooted. Check uptime and boot logs.
9. Drain and Replace the Node
If the node cannot be recovered, drain it and replace it.
# Cordon the node (prevent new scheduling)
kubectl cordon <node-name>
# Drain the node (evict all pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=120s
# Delete the node from the cluster (after draining)
kubectl delete node <node-name>
# Replace with a new node (cloud-specific)
# The Cluster Autoscaler may do this automatically
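The cordon/drain/delete sequence above can be wrapped in one function. A sketch with an overridable KUBECTL variable (a convention I'm introducing here, not a kubectl feature) so the sequence can be previewed with `KUBECTL=echo` before running it for real:

```shell
#!/bin/sh
# Sketch: cordon -> drain -> delete for an unrecoverable node.
# Set KUBECTL=echo to preview the commands without touching a cluster;
# this override variable is a local convention, not a kubectl feature.
replace_node() {
  node="$1"
  k="${KUBECTL:-kubectl}"
  $k cordon "$node" &&
  $k drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=120s &&
  $k delete node "$node"
}
# Preview: KUBECTL=echo replace_node node-2
```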
10. Monitor Node Recovery
If you fixed the issue (restarted kubelet, freed resources, etc.):
# Watch node status
kubectl get nodes -w
# Check the node transitions to Ready
kubectl describe node <node-name> | grep -A5 "Conditions"
# Verify pods are rescheduled
kubectl get pods -A -o wide | grep <node-name>
The node should transition from NotReady to Ready within 30-60 seconds of the kubelet resuming healthy operation. Evicted pods will be rescheduled by their controllers.
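The watch can be scripted as a bounded polling loop on the node's Ready condition. A sketch using kubectl's jsonpath filter syntax, again with an overridable KUBECTL variable (a local convention for testing, not a kubectl feature):

```shell
#!/bin/sh
# Sketch: poll until a node's Ready condition is True, or give up
# after a bounded number of tries. KUBECTL=<cmd> overrides kubectl
# so the loop can be exercised without a cluster.
wait_for_ready() {
  node="$1"; tries="${2:-30}"
  k="${KUBECTL:-kubectl}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$($k get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    [ "$status" = "True" ] && { echo "$node is Ready"; return 0; }
    i=$((i + 1)); sleep 2
  done
  echo "$node did not become Ready" >&2
  return 1
}
```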
How to Explain This in an Interview
I would explain the node heartbeat mechanism — the kubelet renews its node Lease (and reports status) roughly every 10 seconds by default. If the node controller does not see an update within the node-monitor-grace-period (default 40 seconds), it marks the node NotReady. After roughly another 5 minutes (the default toleration for the node.kubernetes.io/not-ready taint, historically the pod-eviction-timeout), it starts evicting pods. I would describe my debugging approach: first check whether the node is reachable (SSH), then check kubelet and container runtime status, then check resources and certificates. I would also mention the impact on workloads and the importance of PodDisruptionBudgets.
Prevention
- Monitor node status and set up alerts for NotReady conditions
- Use node auto-repair features in managed Kubernetes services
- Set up certificate auto-rotation for kubelet certificates
- Reserve resources for system daemons with --system-reserved
- Use node-problem-detector to catch issues before they cause NotReady
- Run redundant replicas across multiple nodes with anti-affinity