Kubernetes Errors & Troubleshooting
51 common errors with causes, fixes, and how to discuss them in interviews.
The 'Back-off restarting failed container' event indicates that a container has failed and the kubelet is waiting before restarting it, using an exponential backoff delay. This is the mechanism behind the CrashLoopBackOff status and means the container keeps crashing after each restart attempt.
Connection Refused errors in Kubernetes occur when a TCP connection attempt to a service, pod, or API endpoint is actively rejected. Unlike timeouts where the connection hangs, a refused connection means the target is reachable but nothing is listening on the specified port, or the connection is being explicitly rejected.
ContainerCannotRun indicates the container runtime attempted to start the container but the process could not execute. This is typically caused by an invalid entrypoint binary, wrong executable format, or permission issues that prevent the container's main process from launching.
CrashLoopBackOff means a container in the Pod is repeatedly crashing and Kubernetes is restarting it with exponential backoff delays. It is the most common Pod error and indicates the container process is exiting with a non-zero code.
CreateContainerConfigError occurs when the kubelet cannot configure a container before starting it. This typically means a referenced ConfigMap, Secret, or ServiceAccount does not exist, or a volume mount is misconfigured. The container never starts because its configuration is invalid.
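A minimal sketch of the failure mode, assuming a hypothetical ConfigMap named app-config: if it does not exist, the pod sits in CreateContainerConfigError; marking the reference optional lets the container start without it.

```yaml
# Hypothetical pod spec: the envFrom reference fails with
# CreateContainerConfigError if "app-config" is missing,
# unless the reference is marked optional.
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
    - name: app
      image: nginx:1.25
      envFrom:
        - configMapRef:
            name: app-config   # must exist in the same namespace
            optional: true     # container starts even if it is missing
```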
CreateContainerError indicates the container runtime failed to create the container. Unlike CreateContainerConfigError which is a configuration issue, this error means the runtime itself encountered a problem during container creation, such as a failed volume mount, device plugin issue, or runtime bug.
DeadlineExceeded means a Kubernetes resource exceeded its configured time limit. This most commonly applies to Jobs or pods that exceed their activeDeadlineSeconds, or to Deployments that exceed their progressDeadlineSeconds during a rollout.
DiskPressure is a node condition that indicates the node is running low on available disk space. When active, the kubelet stops accepting new pods, garbage collects unused images and dead containers, and may evict pods to reclaim disk space. This condition affects both the root filesystem and the container image filesystem.
DNS resolution failures in Kubernetes occur when pods cannot resolve service names, external hostnames, or cluster DNS entries. This typically manifests as connection errors mentioning 'name resolution failed' or 'no such host' and is most commonly caused by issues with CoreDNS or the pod's DNS configuration.
Endpoints Not Found means a Kubernetes Service has no backing endpoints, so traffic sent to the Service has nowhere to go. The Endpoints object associated with the Service is either missing or empty, which causes connection failures for any client trying to reach the Service.
ErrImagePull indicates the kubelet's first attempt to pull a container image has failed. If the pull continues to fail, Kubernetes transitions the pod to ImagePullBackOff. This error surfaces immediately and points to issues with the image reference, registry credentials, or network access.
Exit code 0 means the container's main process terminated successfully. While this is not an error, it can cause unexpected behavior when a Deployment or ReplicaSet pod exits with code 0 and the restartPolicy is set to Always, resulting in a CrashLoopBackOff as Kubernetes continuously restarts a container that keeps 'completing' successfully.
Exit code 1 is the most common non-zero exit code and indicates a general application error. The container's main process encountered an unhandled exception, failed assertion, or explicit error exit. This is a catch-all code that applications use when something goes wrong during execution.
Exit code 126 means 'command cannot execute' — the file specified as the entrypoint or command exists but cannot be executed. This typically happens when a binary file lacks execute permissions, is in an incompatible format, or a shell script references an interpreter that cannot run it.
Exit code 127 means 'command not found' — the shell could not locate the binary or script specified as the container's entrypoint or command. This typically indicates a typo in the command, a missing binary in the container image, or an incorrect PATH configuration.
Exit code 128 means the container process exited with an invalid exit argument (for example, a status outside the valid 0-255 range), or the container runtime encountered a fatal error during execution. In practice, exit codes above 128 usually mean the process was killed by a signal (exit code = 128 + signal number), but exit code 128 itself suggests the runtime could not determine the actual exit status.
Exit code 137 means the container process was killed by SIGKILL (signal 9). The formula is 128 + 9 = 137. This is most commonly caused by the Linux OOM killer terminating the process for exceeding its memory limit, but it can also result from kubectl delete, preemption, or a failed liveness probe.
Exit code 139 means the container process received SIGSEGV (signal 11, segmentation fault). The formula is 128 + 11 = 139. A segmentation fault occurs when the process tries to access memory it is not allowed to, typically due to a bug in the application code, a corrupt binary, or incompatible native libraries.
Exit code 143 means the container process was terminated by SIGTERM (signal 15). The formula is 128 + 15 = 143. SIGTERM is the graceful termination signal sent by Kubernetes when a pod is deleted, scaled down, or rolling-updated. This exit code is normal during planned shutdowns but can indicate a problem if it occurs unexpectedly.
FailedAttachVolume is a warning event indicating that a volume could not be attached to the node where a pod is scheduled. This is common with cloud block storage (EBS, PD, Azure Disk) and occurs when the volume attachment operation fails at the infrastructure level, preventing the pod from starting.
FailedMount is a warning event indicating that a volume could not be mounted into a pod's container after being attached to the node. This occurs at the kubelet level during the volume setup phase and prevents the container from starting. Common causes include filesystem errors, wrong mount options, and permission issues.
FailedScheduling is a pod event indicating the Kubernetes scheduler could not find a suitable node to place the pod. The pod remains in Pending state until the scheduling constraints can be satisfied. This is one of the most common reasons pods fail to start.
A 403 Forbidden error in Kubernetes means the API server authenticated the request but the user, service account, or group does not have the necessary RBAC permissions to perform the requested action. The request is rejected because no Role, ClusterRole, RoleBinding, or ClusterRoleBinding grants the required access.
ImageInspectError occurs when the container runtime fails to inspect a container image after it has been pulled. This means the image was downloaded but the runtime cannot read its metadata, typically due to a corrupt image, incompatible image format, or a container runtime issue on the node.
ImagePullBackOff means Kubernetes tried to pull a container image and failed, and is now waiting with exponential backoff before retrying. This typically indicates the image does not exist, the tag is wrong, or the cluster lacks credentials to access a private registry.
502 Bad Gateway, 503 Service Unavailable, and 504 Gateway Timeout errors from a Kubernetes Ingress indicate that the ingress controller cannot successfully proxy traffic to the backend pods. These errors mean the controller received the request but failed to get a valid response from the upstream Service.
Ingress Not Routing occurs when an Ingress resource is created but traffic is not being directed to the expected backend Service. Requests either return 404, hit the wrong backend, or never reach the cluster. This is commonly caused by missing ingress controller, incorrect rules, or misconfigured annotations.
InvalidImageName means the container image reference in the pod spec is malformed and cannot be parsed by the container runtime. This is a syntax-level error — the image string does not conform to the expected format of [registry/]repository[:tag|@digest].
A liveness probe failure means the kubelet has determined that a container is no longer alive and healthy. When a liveness probe fails consecutively for the configured failureThreshold number of times, the kubelet kills the container and restarts it according to the pod's restart policy. This is the most common cause of container restarts in production.
MemoryPressure is a node condition that indicates the node is running low on available memory. When active, the kubelet begins evicting pods to reclaim memory, starting with BestEffort pods, then Burstable pods that exceed their requests. The node is also tainted to prevent new pods from being scheduled.
MinimumReplicasUnavailable is a Deployment condition indicating that the number of available replicas has fallen below the minimum required threshold. The Deployment's Available condition is set to False with this reason, meaning the application may not be serving traffic reliably.
Multi-Attach error occurs when a PersistentVolume backed by block storage (such as AWS EBS, GCP PD, or Azure Disk) is needed by a pod on one node but is still attached to a different node. Since most block storage devices can only be attached to one node at a time, the new attachment fails with a Multi-Attach error.
NetworkUnavailable is a node condition indicating that the network for the node is not correctly configured. This typically means the CNI plugin has not yet set up networking on the node, or the network plugin has failed. Pods scheduled on the node cannot communicate with the cluster network.
A node in NotReady status means the kubelet on that node has stopped reporting healthy status to the API server. Pods on a NotReady node continue running but are not monitored, and new pods will not be scheduled there. After the pod-eviction-timeout (default 5 minutes), pods on the node are evicted.
A node selector mismatch occurs when a pod specifies a nodeSelector with label requirements that no node in the cluster satisfies. The pod remains in Pending state because the scheduler cannot find any node with the required labels, even if nodes have available resources.
OOMKilled means a container was terminated because it exceeded its memory limit. The Linux kernel's Out-Of-Memory (OOM) killer sends SIGKILL (exit code 137) to the process consuming the most memory in the cgroup, and Kubernetes reports this as OOMKilled in the pod status.
PIDPressure is a node condition that indicates the node is running too many processes and is at risk of exhausting its process ID (PID) limit. When this condition is true, the kubelet starts evicting pods to reclaim PIDs.
A pod eviction occurs when the kubelet terminates pods to reclaim resources on a node that is under pressure. Evictions are triggered by node conditions like memory pressure, disk pressure, or PID pressure, and pods are selected for eviction based on their QoS class and resource usage.
A pod in the Pending state has been accepted by the cluster but is not yet running. This usually means the scheduler cannot find a suitable node due to insufficient resources, unsatisfied constraints, or missing dependencies like PersistentVolumeClaims.
PostStartHookError occurs when a container's postStart lifecycle hook fails. The postStart hook runs immediately after the container is created (concurrently with the main process), and if it fails or times out, Kubernetes kills the container. This results in a restart loop if the hook keeps failing.
ProgressDeadlineExceeded occurs when a Deployment fails to make progress within the specified progressDeadlineSeconds (default 600 seconds). The Deployment controller marks the rollout as failed, and the Deployment condition Progressing is set to False with this reason.
A PersistentVolumeClaim (PVC) stuck in Pending status means Kubernetes cannot find or provision a PersistentVolume (PV) that satisfies the claim's requirements. The pod referencing this PVC will remain in Pending state until the volume is bound.
A readiness probe failure means the kubelet has determined that a container is not ready to accept traffic. Unlike liveness probe failures which trigger restarts, readiness failures cause the pod to be removed from Service endpoints so it stops receiving traffic. The pod continues running but is marked as not ready until the probe passes again.
ReplicaFailure is a Deployment condition that indicates the Deployment's ReplicaSet was unable to create or delete pods as needed. This is set when the ReplicaSet controller encounters errors trying to manage the desired number of replicas, often due to quota limits, admission webhook rejections, or API errors.
RunContainerError means the container runtime successfully created the container but failed when trying to start it. This typically indicates problems with the container's entrypoint, command, working directory, or user configuration that prevent the process from launching.
Service Not Reachable errors occur when clients cannot connect to a Kubernetes Service, even though the backing pods may be running. This can manifest as connection timeouts, connection refused, or empty responses when trying to access a Service by its ClusterIP, DNS name, or NodePort.
Service account token issues occur when pods cannot authenticate to the Kubernetes API server using their mounted service account token. This can manifest as authentication failures, missing tokens, expired tokens, or tokens that do not match any known service account. Since Kubernetes 1.22, projected service account tokens with automatic expiration and rotation are the default, which introduced new failure modes.
A startup probe failure means the kubelet determined that a container did not start successfully within the allowed time. The startup probe runs before liveness and readiness probes. When it fails (after failureThreshold consecutive failures), the kubelet kills the container, which typically results in CrashLoopBackOff. Startup probes were designed for slow-starting applications that need more time to initialize.
A 'taint not tolerated' scheduling failure occurs when all nodes in the cluster have taints that the pod does not tolerate. Taints are applied to nodes to repel pods that lack corresponding tolerations. Without a matching toleration, the scheduler excludes the node, and if all nodes are tainted, the pod remains in Pending state.
A 401 Unauthorized error in Kubernetes means the API server could not authenticate the request. Unlike 403 Forbidden where the identity is known but lacks permissions, 401 means the identity itself could not be verified — the credentials are missing, invalid, or expired.
VolumeResizeFailed occurs when Kubernetes cannot expand a PersistentVolumeClaim to the requested larger size. This can happen at the storage backend level (the underlying volume cannot be resized), at the filesystem level (the filesystem on the volume cannot be expanded), or because volume expansion is not enabled for the StorageClass.
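For the StorageClass case, expansion must be switched on before the PVC is resized; a sketch with an example CSI provisioner:

```yaml
# allowVolumeExpansion must be true for PVCs of this class to be
# resized; the provisioner shown is one common CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable
provisioner: ebs.csi.aws.com   # example: AWS EBS CSI driver
allowVolumeExpansion: true
```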