Kubernetes Node Not Ready

Causes and Fixes

A node in NotReady status means the kubelet on that node has stopped reporting a healthy status to the API server. Pods already on the node keep running, but the control plane loses visibility into them: they are removed from Service endpoints, and no new pods are scheduled there. After roughly five minutes (the classic pod-eviction-timeout; in current versions this is the default 300-second toleration of the node.kubernetes.io/not-ready taint), the pods are evicted and rescheduled.

Symptoms

  • kubectl get nodes shows one or more nodes with STATUS NotReady
  • Pods on the node stop receiving traffic (endpoints removed)
  • New pods are not scheduled to the NotReady node
  • After timeout, pods are evicted and rescheduled to healthy nodes
  • kubectl describe node shows 'KubeletNotReady' or condition Ready=False
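A quick way to surface only the unhealthy nodes is to print each node's Ready condition and filter out the healthy ones. A sketch, assuming kubectl is already pointed at the affected cluster:

```shell
# Print "<node> <Ready status>" per node, then keep any line whose status
# is not True (this catches both False and Unknown)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
  | awk '$2 != "True"'
```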

Common Causes

1. Kubelet process is down: The kubelet service has stopped, crashed, or cannot start. Check systemd status and the kubelet logs on the node.
2. Node is out of resources: The node has exhausted memory, disk, or PIDs, causing the kubelet to report NotReady. Check the node conditions for pressure taints.
3. Network connectivity issue: The node cannot reach the API server due to a network partition, firewall rules, or DNS failure.
4. Container runtime is down: containerd or CRI-O is not running, so the kubelet cannot manage containers and reports NotReady.
5. Certificate expired: The kubelet's client certificate has expired, so it cannot authenticate with the API server.
6. Node hardware failure: The underlying VM or bare-metal server has a hardware issue (disk failure, memory error, kernel panic).
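Resource exhaustion leaves a visible trail: the node controller turns pressure conditions into taints on the node. A quick grep over the describe output shows them (the node name is a placeholder):

```shell
# Pressure conditions surface as taints on the node; any match here points
# at memory, disk, or PID exhaustion (or an already-recognized NotReady state)
node=placeholder-node   # replace with the affected node's name
kubectl describe node "$node" \
  | grep -E 'node.kubernetes.io/(memory-pressure|disk-pressure|pid-pressure|not-ready|unreachable)' \
  || echo "no pressure taints reported (or kubectl unavailable)"
```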

Step-by-Step Troubleshooting

1. Identify NotReady Nodes

# Check node status
kubectl get nodes

# Get detailed conditions for a NotReady node
kubectl describe node <node-name>

Look at the Conditions section:

Conditions:
  Type    Status  Reason              Message
  ----    ------  ------              -------
  Ready   False   KubeletNotReady     PLEG is not healthy

The Reason and Message tell you what is wrong.

2. Check if the Node is Reachable

# Ping the node (if you have network access)
ping <node-ip>

# SSH to the node
ssh <node-ip>

# If using a managed service, use the cloud provider's console
# AWS: aws ssm start-session --target <instance-id>
# GCP: gcloud compute ssh <instance-name>

If the node is unreachable, the issue is likely a network partition or the VM/machine is down.

3. Check the Kubelet Service

Once on the node:

# Check kubelet status
systemctl status kubelet

# If kubelet is not running, check the logs
journalctl -u kubelet --since "30 minutes ago" | tail -100

# Check for common kubelet errors
journalctl -u kubelet | grep -i "error\|failed\|unable" | tail -20

Common kubelet issues:

  • Cannot reach API server: Certificate or network issue
  • PLEG not healthy: Container runtime is unresponsive
  • Disk pressure: Not enough disk space

If kubelet is stopped, restart it:

systemctl restart kubelet
systemctl status kubelet
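After the restart, it is worth polling briefly before trusting the fix, since a crash-looping kubelet can show as active for a moment. A plain-shell sketch (the `retry` helper is our own invention, not a systemd feature):

```shell
# Retry a command until it succeeds or attempts run out
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    i=$((i + 1))
    sleep 2
  done
  return 1
}

systemctl restart kubelet
if retry 5 systemctl is-active --quiet kubelet; then
  echo "kubelet is active again"
else
  echo "kubelet did not come back; check journalctl -u kubelet"
fi
```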

4. Check the Container Runtime

The kubelet depends on the container runtime (containerd or CRI-O).

# Check containerd status
systemctl status containerd

# Check CRI-O status
systemctl status crio

# Check runtime responsiveness
crictl info
crictl ps

# If the runtime is down, restart it
systemctl restart containerd

5. Check Node Resources

Resource exhaustion causes NotReady conditions.

# On the node, check disk space
df -h

# Check memory
free -h

# Check PIDs
ls /proc | grep -c '^[0-9]'
cat /proc/sys/kernel/pid_max

# Check system load
uptime
top -bn1 | head -20

Quick fixes for resource issues:

# Free disk space by removing unused container images
crictl rmi --prune
docker system prune -af  # only if the node still uses the legacy Docker runtime

# Check for large log files
du -sh /var/log/pods/* | sort -h | tail -10
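To make the disk check alertable, compare usage on the kubelet's filesystem against a threshold. The 85% below is an arbitrary example; the kubelet's default hard eviction for nodefs fires when less than 10% is free:

```shell
# Warn when the filesystem backing /var/lib/kubelet crosses a usage threshold
threshold=85
usage=$(df --output=pcent /var/lib/kubelet 2>/dev/null | tail -n 1 | tr -dc '0-9')
usage=${usage:-0}   # falls back to 0 if the path is missing
if [ "$usage" -ge "$threshold" ]; then
  echo "WARNING: /var/lib/kubelet at ${usage}% - expect DiskPressure soon"
else
  echo "disk usage ${usage}% is below ${threshold}%"
fi
```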

6. Check Certificates

Expired certificates prevent the kubelet from communicating with the API server.

# Check kubelet certificate expiration
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

# Check API server CA
openssl x509 -in /etc/kubernetes/pki/ca.crt -noout -dates

# If using kubeadm, check all certificates
kubeadm certs check-expiration

If certificates are expired:

# Rotate kubelet certificates (if auto-rotation is enabled)
# The kubelet will automatically request new certificates

# For kubeadm clusters
kubeadm certs renew all
systemctl restart kubelet
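openssl's `-checkend` flag turns the date inspection into a pass/fail test, which is convenient in a node health script. The certificate path below is the kubeadm default; adjust it for other installers:

```shell
# Exit 0 if the kubelet client certificate is valid for at least 7 more days
cert=/var/lib/kubelet/pki/kubelet-client-current.pem
if openssl x509 -in "$cert" -noout -checkend 604800 2>/dev/null; then
  echo "certificate valid for at least 7 more days"
else
  echo "certificate expires within 7 days, is expired, or is unreadable - renew it"
fi
```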

7. Check Network Connectivity to API Server

# From the node, test API server connectivity
curl -k https://<api-server-ip>:6443/healthz

# Check the kubelet's kubeconfig for the API server endpoint
# (the endpoint lives in the kubeconfig, not in config.yaml)
grep server /etc/kubernetes/kubelet.conf

# Check DNS resolution of the API server's hostname; cluster-internal names
# like kubernetes.default.svc.cluster.local normally resolve only from pods
nslookup <api-server-hostname>
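Transient packet loss looks different from a hard partition, so probe more than once with a short timeout. The endpoint below is a placeholder; substitute your control-plane address:

```shell
# Probe the API server health endpoint three times with a 5s connect timeout;
# consistent failures suggest a firewall rule or a network partition
apiserver="https://placeholder-apiserver:6443"   # replace with the real endpoint
for i in 1 2 3; do
  if curl -sk --connect-timeout 5 "$apiserver/healthz" 2>/dev/null | grep -q ok; then
    echo "attempt $i: reachable"
  else
    echo "attempt $i: unreachable"
  fi
  sleep 1
done
```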

8. Check Kernel and Hardware

# Check for kernel panics or hardware errors
dmesg | tail -50
dmesg | grep -iE "error|panic|hardware|mce|memory"

# Check system journal for critical messages
journalctl -p err --since "1 hour ago"

If the node had a kernel panic, it may have rebooted. Check uptime and boot logs.

9. Drain and Replace the Node

If the node cannot be recovered, drain it and replace it.

# Cordon the node (prevent new scheduling)
kubectl cordon <node-name>

# Drain the node (evict all pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=120s

# Delete the node from the cluster (after draining)
kubectl delete node <node-name>

# Replace with a new node (cloud-specific)
# The Cluster Autoscaler may do this automatically

10. Monitor Node Recovery

If you fixed the issue (restarted kubelet, freed resources, etc.):

# Watch node status
kubectl get nodes -w

# Check the node transitions to Ready
kubectl describe node <node-name> | grep -A5 "Conditions"

# Verify pods are rescheduled
kubectl get pods -A -o wide | grep <node-name>

The node should transition from NotReady to Ready within 30-60 seconds of the kubelet resuming healthy operation. Evicted pods will be rescheduled by their controllers.

How to Explain This in an Interview

I would explain the node heartbeat mechanism: the kubelet heartbeats to the API server every 10 seconds by default (via node status updates, or Lease object renewals in newer versions). If the node controller receives no heartbeat for 40 seconds (node-monitor-grace-period), it marks the node NotReady, and after roughly another 5 minutes (the pod eviction timeout, now the not-ready taint toleration) it begins evicting pods. I would then describe my debugging order: first check whether the node is reachable over SSH, then check kubelet and container runtime status, then check resources and certificates. I would also mention the impact on workloads and the importance of PodDisruptionBudgets during the resulting rescheduling.

Prevention

  • Monitor node status and set up alerts for NotReady conditions
  • Use node auto-repair features in managed Kubernetes services
  • Set up certificate auto-rotation for kubelet certificates
  • Reserve resources for system daemons with --system-reserved
  • Use node-problem-detector to catch issues before they cause NotReady
  • Run redundant replicas across multiple nodes with anti-affinity
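For the last point, a PodDisruptionBudget limits how many replicas a voluntary drain may take down at once. A minimal sketch (the name and selector are hypothetical):

```shell
# Keep at least 2 pods matching app=web running during voluntary disruptions
# such as kubectl drain; --dry-run=client prints the object without creating it
kubectl create poddisruptionbudget web-pdb \
  --selector=app=web \
  --min-available=2 \
  --dry-run=client -o yaml
```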

Related Errors