Kubernetes Node Not Ready
Causes and Fixes
A node in NotReady status means the kubelet on that node has stopped reporting a healthy status to the API server. Pods on a NotReady node keep running but are no longer monitored, and new pods will not be scheduled there. Once the node stays NotReady past the eviction timeout (default 5 minutes — in current Kubernetes this is the default toleration for the node.kubernetes.io/not-ready taint, which replaced the pod-eviction-timeout flag), its pods are evicted.
Symptoms
- kubectl get nodes shows one or more nodes with STATUS NotReady
- Pods on the node stop receiving traffic (endpoints removed)
- New pods are not scheduled to the NotReady node
- After timeout, pods are evicted and rescheduled to healthy nodes
- kubectl describe node shows 'KubeletNotReady' or condition Ready=False
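A quick way to spot affected nodes is to filter the STATUS column of kubectl get nodes. A minimal sketch — the node names and sample output below are hypothetical; in a real cluster, replace the heredoc with `kubectl get nodes --no-headers`:

```shell
#!/bin/sh
# Sketch: print nodes whose STATUS column is not "Ready".
# Sample output is hypothetical; pipe in real output from:
#   kubectl get nodes --no-headers
awk '$2 != "Ready" { print $1, $2 }' <<'EOF'
node-1   Ready      control-plane   90d   v1.29.2
node-2   NotReady   <none>          90d   v1.29.2
node-3   Ready      <none>          90d   v1.29.2
EOF
```

Note that this also catches variants like Ready,SchedulingDisabled (a cordoned node), which is usually what you want in a sweep.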
Common Causes
- Kubelet stopped, crashed, or unable to reach the API server
- Container runtime (containerd or CRI-O) unresponsive ("PLEG is not healthy")
- Resource exhaustion: disk pressure, memory pressure, or PID exhaustion
- Expired kubelet or API server certificates
- Network partition between the node and the control plane
- Kernel panic, hardware failure, or the VM/machine being down
Step-by-Step Troubleshooting
1. Identify NotReady Nodes
# Check node status
kubectl get nodes
# Get detailed conditions for a NotReady node
kubectl describe node <node-name>
Look at the Conditions section:
Conditions:
  Type    Status   Reason            Message
  ----    ------   ------            -------
  Ready   False    KubeletNotReady   PLEG is not healthy
The Reason and Message tell you what is wrong.
2. Check if the Node is Reachable
# Ping the node (if you have network access)
ping <node-ip>
# SSH to the node
ssh <node-ip>
# If using a managed service, use the cloud provider's console
# AWS: aws ssm start-session --target <instance-id>
# GCP: gcloud compute ssh <instance-name>
If the node is unreachable, the issue is likely a network partition or the VM/machine is down.
3. Check the Kubelet Service
Once on the node:
# Check kubelet status
systemctl status kubelet
# If kubelet is not running, check the logs
journalctl -u kubelet --since "30 minutes ago" | tail -100
# Check for common kubelet errors
journalctl -u kubelet | grep -i "error\|failed\|unable" | tail -20
Common kubelet issues:
- Cannot reach API server: Certificate or network issue
- PLEG not healthy: Container runtime is unresponsive
- Disk pressure: Not enough disk space
If kubelet is stopped, restart it:
systemctl restart kubelet
systemctl status kubelet
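The grep step above can be turned into a rough classifier that tags log lines with their likely failure category. A sketch — the sample log lines are hypothetical; in practice, pipe in `journalctl -u kubelet --since "30 minutes ago"`:

```shell
#!/bin/sh
# Sketch: tag common kubelet failure signatures in log lines.
# Sample lines are hypothetical; pipe in real output from:
#   journalctl -u kubelet --since "30 minutes ago"
awk '
  /PLEG is not healthy/ { print "runtime:   " $0; next }
  /x509|certificate/    { print "certs:     " $0; next }
  /connection refused/  { print "apiserver: " $0; next }
' <<'EOF'
E0101 kubelet.go: PLEG is not healthy: pleg was last seen active 3m0s ago
E0101 kubelet.go: part of the bootstrap client certificate is expired
E0101 kubelet.go: failed to connect: connection refused
EOF
```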
4. Check the Container Runtime
The kubelet depends on the container runtime (containerd or CRI-O).
# Check containerd status
systemctl status containerd
# Check CRI-O status
systemctl status crio
# Check runtime responsiveness
crictl info
crictl ps
# If the runtime is down, restart it
systemctl restart containerd
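When scripting checks like the ones above, crictl needs the right CRI socket. A sketch that picks the socket based on which runtime is present — the paths are the common defaults for containerd and CRI-O, and the root-directory parameter exists only so the logic can be exercised off-node:

```shell
#!/bin/sh
# Sketch: choose the CRI socket for crictl based on which runtime's
# socket exists. Paths are the common defaults; adjust for your distro.
# The optional root prefix lets the logic be tested outside a real node.
runtime_socket() {
  root="${1:-}"
  if [ -e "$root/run/containerd/containerd.sock" ]; then
    echo "unix://$root/run/containerd/containerd.sock"
  elif [ -e "$root/var/run/crio/crio.sock" ]; then
    echo "unix://$root/var/run/crio/crio.sock"
  else
    echo "no CRI socket found" >&2
    return 1
  fi
}
# Example: crictl --runtime-endpoint "$(runtime_socket)" info
```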
5. Check Node Resources
Resource exhaustion (disk, memory, or PIDs) is a common cause of NotReady conditions.
# On the node, check disk space
df -h
# Check memory
free -h
# Check PIDs
ls /proc | grep -c '^[0-9]'
cat /proc/sys/kernel/pid_max
# Check system load
uptime
top -bn1 | head -20
Quick fixes for resource issues:
# Free disk space
crictl rmi --prune # Remove unused images
docker system prune -af # Only if using Docker (dockershim was removed in Kubernetes 1.24)
# Check for large log files
du -sh /var/log/pods/* | sort -h | tail -10
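The disk check can be automated against the kubelet's default image-GC high threshold (85% usage). A sketch — the df output below is hypothetical; in practice pipe in `df -P /var/lib/kubelet /var/lib/containerd`:

```shell
#!/bin/sh
# Sketch: flag filesystems above the kubelet's default image-GC high
# threshold (imageGCHighThresholdPercent, 85%). Sample output is
# hypothetical; pipe in real output from: df -P
awk -v limit=85 'NR > 1 {
  use = $5; sub(/%/, "", use)
  if (use + 0 > limit) print $6 " at " $5
}' <<'EOF'
Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/sda1        103079200 94833864   8245336      92% /
/dev/sdb1        515010816 25750540 489260276       5% /var/lib/kubelet
EOF
```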
6. Check Certificates
Expired certificates prevent the kubelet from communicating with the API server.
# Check kubelet certificate expiration
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
# Check API server CA
openssl x509 -in /etc/kubernetes/pki/ca.crt -noout -dates
# If using kubeadm, check all certificates
kubeadm certs check-expiration
If certificates are expired:
# If kubelet certificate auto-rotation is enabled, the kubelet requests
# new client certificates itself; restarting it triggers a fresh request
systemctl restart kubelet
# For kubeadm clusters, renew the control-plane certificates
# (note: this does not renew the kubelet client certificate)
kubeadm certs renew all
# Restart the control-plane components afterward so they load the new certs
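For alerting, the openssl output above can be converted into days-until-expiry. A sketch that parses openssl's `notAfter=` line, assuming GNU date is available — the example date is hypothetical; in practice feed it `openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate`:

```shell
#!/bin/sh
# Sketch: days until a certificate expires, from openssl's
# "notAfter=" line. Requires GNU date. The date below is a
# hypothetical example.
days_until_expiry() {
  end="${1#notAfter=}"
  end_s=$(date -d "$end" +%s)
  now_s=$(date +%s)
  echo $(( (end_s - now_s) / 86400 ))
}
days_until_expiry "notAfter=Jan  1 00:00:00 2031 GMT"
```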
7. Check Network Connectivity to API Server
# From the node, test API server connectivity
curl -k https://<api-server-ip>:6443/healthz
# Check the kubelet's kubeconfig for the API server endpoint
# (the endpoint lives in the kubeconfig, not in config.yaml)
grep server /etc/kubernetes/kubelet.conf
# Check DNS resolution of the API server hostname
# (names like kubernetes.default.svc.cluster.local resolve only inside
# pods via cluster DNS, not through the node's own resolver)
nslookup <api-server-hostname>
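The connectivity probe can be paired with a small classifier for curl's documented exit codes, which often pinpoint the failure mode faster than reading output. A sketch; the diagnosis strings are my own wording:

```shell
#!/bin/sh
# Sketch: map curl exit codes from the API-server health probe to a
# likely diagnosis. The codes (0, 6, 7, 28, 60) are curl's documented
# exit codes; the diagnosis wording is illustrative.
diagnose() {
  case "$1" in
    0)  echo "API server reachable" ;;
    6)  echo "DNS: cannot resolve API server hostname" ;;
    7)  echo "connection refused: API server down or port blocked" ;;
    28) echo "timeout: likely network partition or firewall" ;;
    60) echo "TLS: certificate verification failed (retry with -k to confirm)" ;;
    *)  echo "curl failed with exit code $1" ;;
  esac
}
# Example:
#   curl -sk --max-time 5 https://<api-server-ip>:6443/healthz >/dev/null
#   diagnose $?
```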
8. Check Kernel and Hardware
# Check for kernel panics or hardware errors
dmesg | tail -50
dmesg | grep -iE "error|panic|hardware|mce|memory"
# Check system journal for critical messages
journalctl -p err --since "1 hour ago"
If the node had a kernel panic, it may have rebooted. Check uptime and boot logs.
9. Drain and Replace the Node
If the node cannot be recovered, drain it and replace it.
# Cordon the node (prevent new scheduling)
kubectl cordon <node-name>
# Drain the node (evict all pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=120s
# Delete the node from the cluster (after draining)
kubectl delete node <node-name>
# Replace with a new node (cloud-specific)
# The Cluster Autoscaler may do this automatically
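The cordon/drain/delete sequence above can be wrapped in one function. A sketch with an overridable KUBECTL variable (a convention I'm introducing here, not a kubectl feature) so the sequence can be previewed with `KUBECTL=echo` before running it for real:

```shell
#!/bin/sh
# Sketch: cordon -> drain -> delete for an unrecoverable node.
# Set KUBECTL=echo to preview the commands without touching a cluster;
# this override variable is a local convention, not a kubectl feature.
replace_node() {
  node="$1"
  k="${KUBECTL:-kubectl}"
  $k cordon "$node" &&
  $k drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=120s &&
  $k delete node "$node"
}
# Preview: KUBECTL=echo replace_node node-2
```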
10. Monitor Node Recovery
If you fixed the issue (restarted kubelet, freed resources, etc.):
# Watch node status
kubectl get nodes -w
# Check the node transitions to Ready
kubectl describe node <node-name> | grep -A5 "Conditions"
# Verify pods are rescheduled
kubectl get pods -A -o wide | grep <node-name>
The node should transition from NotReady to Ready within 30-60 seconds of the kubelet resuming healthy operation. Evicted pods will be rescheduled by their controllers.
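The watch can be scripted as a bounded polling loop on the node's Ready condition. A sketch using kubectl's jsonpath filter syntax, again with an overridable KUBECTL variable (a local convention for testing, not a kubectl feature):

```shell
#!/bin/sh
# Sketch: poll until a node's Ready condition is True, or give up
# after a bounded number of tries. KUBECTL=<cmd> overrides kubectl
# so the loop can be exercised without a cluster.
wait_for_ready() {
  node="$1"; tries="${2:-30}"
  k="${KUBECTL:-kubectl}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$($k get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    [ "$status" = "True" ] && { echo "$node is Ready"; return 0; }
    i=$((i + 1)); sleep 2
  done
  echo "$node did not become Ready" >&2
  return 1
}
```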
How to Explain This in an Interview
I would explain the node heartbeat mechanism — the kubelet renews its node Lease (and reports status) roughly every 10 seconds by default. If the node controller does not see an update within the node-monitor-grace-period (default 40 seconds), it marks the node NotReady. After roughly another 5 minutes (the default toleration for the node.kubernetes.io/not-ready taint, historically the pod-eviction-timeout), it starts evicting pods. I would describe my debugging approach: first check whether the node is reachable (SSH), then check kubelet and container runtime status, then check resources and certificates. I would also mention the impact on workloads and the importance of PodDisruptionBudgets.
Prevention
- Monitor node status and set up alerts for NotReady conditions
- Use node auto-repair features in managed Kubernetes services
- Set up certificate auto-rotation for kubelet certificates
- Reserve resources for system daemons with --system-reserved
- Use node-problem-detector to catch issues before they cause NotReady
- Run redundant replicas across multiple nodes with anti-affinity