Kubernetes ReplicaFailure
Causes and Fixes
ReplicaFailure is a Deployment condition that indicates the Deployment's ReplicaSet was unable to create or delete pods as needed. This is set when the ReplicaSet controller encounters errors trying to manage the desired number of replicas, often due to quota limits, admission webhook rejections, or API errors.
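Concretely, the condition surfaces in the Deployment's status.conditions array. The snippet below shows a sample of that JSON (the message text is illustrative, not captured from a real cluster) together with the jq filter you would point at the output of kubectl get deployment -o json to pull the condition out:

```shell
# Sample Deployment status carrying a ReplicaFailure condition
# (illustrative data; in practice pipe `kubectl get deployment <name> -o json` into jq)
cat <<'EOF' > /tmp/deploy-status.json
{
  "status": {
    "conditions": [
      {"type": "Available", "status": "True", "reason": "MinimumReplicasAvailable"},
      {
        "type": "ReplicaFailure",
        "status": "True",
        "reason": "FailedCreate",
        "message": "pods \"web-5d4f8\" is forbidden: exceeded quota: compute-quota, requested: pods=1, used: pods=10, limited: pods=10"
      }
    ]
  }
}
EOF

# Pull out only the ReplicaFailure condition
jq '.status.conditions[] | select(.type == "ReplicaFailure")' /tmp/deploy-status.json
```

The reason and message fields of this condition are where the rest of this guide's diagnosis starts.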
Symptoms
- Deployment condition ReplicaFailure is True in kubectl describe deployment
- ReplicaSet events show FailedCreate errors
- Desired replica count is never reached despite sufficient node resources
- kubectl get events shows 'Error creating: ...' messages from the ReplicaSet controller
- Pod count remains below the desired number indefinitely
Common Causes
- ResourceQuota limits reached in the namespace
- LimitRange constraints the pod spec does not satisfy
- Admission webhooks (OPA/Gatekeeper, Kyverno, or custom controllers) rejecting the request
- Pod Security admission rejecting pods that violate the namespace's enforced standard
- Missing referenced resources such as service accounts, ConfigMaps, Secrets, or PVCs
Step-by-Step Troubleshooting
ReplicaFailure means the ReplicaSet controller cannot create pods for your Deployment. Unlike most pod errors where pods exist but are unhealthy, here the pods are never created at all. This guide focuses on identifying what is rejecting the pod creation requests.
1. Check Deployment Conditions
Start by examining the Deployment's conditions to confirm ReplicaFailure.
kubectl describe deployment <deployment-name>
Look for the ReplicaFailure condition in the Conditions section. The message field often contains the specific rejection reason.
# Get the condition details programmatically
kubectl get deployment <deployment-name> -o jsonpath='{range .status.conditions[*]}{.type}: {.status} - {.message}{"\n"}{end}'
2. Check ReplicaSet Events
The ReplicaSet managed by the Deployment will have detailed events showing why pod creation failed.
# Find the ReplicaSet
kubectl get replicaset -l app=<app-label> --sort-by=.metadata.creationTimestamp
# Describe the most recent ReplicaSet
kubectl describe replicaset <replicaset-name>
Look at the Events section for entries with reason FailedCreate. The message will contain the specific error from the API server, such as:
- forbidden: exceeded quota
- admission webhook denied the request
- pods "name" is forbidden: violates PodSecurity
# Get all FailedCreate events in the namespace
kubectl get events --field-selector reason=FailedCreate --sort-by=.lastTimestamp
3. Check Resource Quotas
If the error mentions quota, check the namespace's ResourceQuota usage.
# List all quotas in the namespace
kubectl get resourcequota -n <namespace>
# Get detailed quota usage
kubectl describe resourcequota -n <namespace>
The output shows used versus hard limits for resources like pods, CPU, memory, and storage. If any resource is at its limit, new pods cannot be created.
# Check the pod's resource requests against remaining quota
kubectl get deployment <deployment-name> -o jsonpath='{.spec.template.spec.containers[0].resources}' | jq .
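The remaining headroom can be read off by diffing used against hard with jq. The quota JSON below is sample data standing in for the output of kubectl get resourcequota <quota-name> -o json:

```shell
# Sample ResourceQuota status (stand-in for: kubectl get resourcequota <quota-name> -o json)
cat <<'EOF' > /tmp/quota.json
{
  "status": {
    "hard": {"pods": "10", "requests.cpu": "4", "requests.memory": "8Gi"},
    "used": {"pods": "10", "requests.cpu": "2", "requests.memory": "3Gi"}
  }
}
EOF

# Print used vs hard for each tracked resource; any line where used equals hard blocks new pods
jq -r '.status as $s | $s.hard | keys[] | "\(.): used \($s.used[.]) / hard \($s.hard[.])"' /tmp/quota.json
```

In this sample the pods resource is exhausted (10 of 10 used), which is exactly the situation that produces a FailedCreate quota error.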
To resolve a quota issue, increase the quota (or ask the cluster administrator for an increase if you lack permission to change it), reduce the resource requests per pod, or scale down other workloads in the namespace.
# Example: increase the quota (requires appropriate RBAC)
kubectl patch resourcequota <quota-name> -n <namespace> -p '{"spec":{"hard":{"pods":"50","requests.cpu":"20","requests.memory":"40Gi"}}}'
4. Check LimitRange Constraints
A LimitRange can enforce minimum or maximum resource requirements that your pod spec may not meet.
kubectl get limitrange -n <namespace>
kubectl describe limitrange -n <namespace>
If the LimitRange requires minimum CPU of 100m but your pod does not specify a CPU request, the creation will fail. Either update the pod spec to comply with the LimitRange or adjust the LimitRange.
# Check what the pod template specifies
kubectl get deployment <deployment-name> -o jsonpath='{.spec.template.spec.containers[*].resources}' | jq .
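To compare quantities like these by hand, CPU values must first be normalized to a common unit, since "1" means one full core while "100m" means 100 millicores. A minimal sketch (to_millicores is a hypothetical helper, and the two quantities are example values, not read from a cluster):

```shell
# Convert a whole-core or millicore CPU quantity to millicores: "100m" -> 100, "1" -> 1000
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo "$(( $1 * 1000 ))" ;;
  esac
}

min_cpu=$(to_millicores "100m")   # example LimitRange minimum
req_cpu=$(to_millicores "50m")    # example CPU request from the pod template

# A request below the LimitRange minimum will cause FailedCreate
if [ "$req_cpu" -lt "$min_cpu" ]; then
  echo "request below LimitRange minimum: ${req_cpu}m < ${min_cpu}m"
fi
```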
5. Check Admission Webhooks
Admission webhooks can reject pod creation for policy reasons.
# List validating webhooks
kubectl get validatingwebhookconfigurations
# List mutating webhooks
kubectl get mutatingwebhookconfigurations
# Describe a specific webhook for its rules
kubectl describe validatingwebhookconfiguration <webhook-name>
If a webhook is rejecting pods, the FailedCreate event message will include the webhook name and its rejection reason. Common sources include OPA/Gatekeeper policies, Kyverno policies, and custom admission controllers.
# Check if a specific webhook is causing issues by looking at its failure policy
kubectl get validatingwebhookconfiguration <webhook-name> -o jsonpath='{.webhooks[*].failurePolicy}'
To debug further, check the logs of the webhook service.
# Find the webhook service
kubectl get validatingwebhookconfiguration <webhook-name> -o jsonpath='{.webhooks[0].clientConfig.service.name}'
# Check the webhook pod logs
kubectl logs -n <webhook-namespace> -l app=<webhook-label> --tail=100
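When a webhook is the culprit, its name appears in double quotes inside the FailedCreate message, which makes it easy to extract with sed. The message below is a fabricated Kyverno-style example, not real cluster output:

```shell
# Sample webhook-denial message as it would appear in a FailedCreate event (illustrative)
msg='Error creating: admission webhook "validate.kyverno.svc" denied the request: policy require-labels failed'

# Extract the webhook name between the double quotes
webhook=$(echo "$msg" | sed -n 's/.*admission webhook "\([^"]*\)".*/\1/p')
echo "$webhook"
```

The extracted name can then be matched against the validatingwebhookconfigurations listed above to find the owning service.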
6. Check Pod Security Admission
If the cluster uses Pod Security admission (stable since Kubernetes 1.25) to enforce the Pod Security Standards, the namespace's enforced level may reject pods that do not meet it.
# Check namespace labels for pod security enforcement
kubectl get namespace <namespace> -o jsonpath='{.metadata.labels}' | jq 'with_entries(select(.key | startswith("pod-security")))'
Labels like pod-security.kubernetes.io/enforce: restricted mean pods must meet the restricted security standard. If your pod runs as root, uses privileged containers, or does not set required security context fields, creation will be rejected.
# Check pod security context in the deployment
kubectl get deployment <deployment-name> -o jsonpath='{.spec.template.spec}' | jq '{securityContext, containers: [.containers[] | {name, securityContext}]}'
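If the security context is the problem, a patch adding the fields the restricted standard generally requires can fix it. The sketch below builds that patch with jq so it can be inspected before being fed to kubectl patch; the container name "app" is a placeholder you must replace with your container's actual name (strategic merge matches containers by name), and the exact required fields should be verified against the Pod Security Standards version your cluster enforces:

```shell
# Build a strategic-merge patch adding restricted-profile security settings
# ("app" is a placeholder container name -- substitute your own)
patch=$(jq -n '{
  spec: {template: {spec: {
    securityContext: {runAsNonRoot: true, seccompProfile: {type: "RuntimeDefault"}},
    containers: [{
      name: "app",
      securityContext: {allowPrivilegeEscalation: false, capabilities: {drop: ["ALL"]}}
    }]
  }}}
}')
echo "$patch" | jq -c .

# Then apply it with:
#   kubectl patch deployment <deployment-name> --type strategic -p "$patch"
```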
7. Verify Referenced Resources Exist
The pod template may reference resources that do not exist.
# Check if the service account exists
kubectl get serviceaccount <sa-name> -n <namespace>
# Check if referenced ConfigMaps exist
kubectl get configmap <configmap-name> -n <namespace>
# Check if referenced Secrets exist
kubectl get secret <secret-name> -n <namespace>
# Check if referenced PVCs exist
kubectl get pvc <pvc-name> -n <namespace>
If any referenced resource is missing, create it or update the Deployment to reference an existing resource.
8. Test Pod Creation Manually
To isolate whether the issue is with the pod spec or something else, try creating a pod manually using the Deployment's template.
# Build a standalone Pod manifest from the Deployment's pod template
kubectl get deployment <deployment-name> -o json | jq '{apiVersion: "v1", kind: "Pod", metadata: {name: "replica-debug", namespace: .metadata.namespace}, spec: .spec.template.spec}' > /tmp/test-pod.json
# Attempt to create the pod with server-side dry-run
kubectl create -f /tmp/test-pod.json --dry-run=server
The server-side dry-run sends the request through quota enforcement and every admission controller, returning any rejection error without actually persisting the pod.
9. Fix the Issue and Verify
After identifying and resolving the root cause, trigger a new rollout or wait for the ReplicaSet controller to retry.
# If you updated the deployment spec
kubectl apply -f deployment.yaml
# Force a new rollout; this adds a restartedAt annotation to the pod template
kubectl rollout restart deployment/<deployment-name>
# Watch the rollout
kubectl rollout status deployment/<deployment-name> --watch
10. Confirm Recovery
Verify that the ReplicaFailure condition has cleared and all replicas are running.
# Check deployment status
kubectl get deployment <deployment-name>
# Verify no ReplicaFailure condition
kubectl get deployment <deployment-name> -o jsonpath='{.status.conditions[?(@.type=="ReplicaFailure")]}'
# Confirm all pods are running
kubectl get pods -l app=<app-label>
# Check that no FailedCreate events are recent
kubectl get events --field-selector reason=FailedCreate --sort-by=.lastTimestamp
The Deployment is healthy when the ready replica count matches the desired count, the ReplicaFailure condition is gone or False, and no new FailedCreate events are appearing. If the issue was quota-related, consider setting up monitoring and alerts on quota usage to prevent recurrence.
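The quota alerting suggested above can be as simple as a threshold check in a cron script. A sketch with hard-coded sample numbers (a real script would pull the used and hard values from kubectl get resourcequota -o json, as in step 3):

```shell
# Warn when pod quota usage crosses a threshold (sample numbers, not live data)
used=9
hard=10
threshold=80   # percent

pct=$(( used * 100 / hard ))
if [ "$pct" -ge "$threshold" ]; then
  echo "WARNING: pod quota at ${pct}% (${used}/${hard})"
fi
```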
How to Explain This in an Interview
I would explain that ReplicaFailure is distinct from pod-level errors because the pods never actually get created. The issue is at the API level — the ReplicaSet controller's request to create pods is being rejected before any scheduling or container runtime involvement. I'd discuss how to diagnose this by checking ReplicaSet events for FailedCreate reasons, and I'd walk through common admission-time rejections including quotas, LimitRanges, admission webhooks, and PodSecurity. I'd also explain how this differs from ProgressDeadlineExceeded, where pods are created but fail to become ready.
Prevention
- Monitor namespace resource quotas and set alerts before limits are reached
- Test deployment manifests with kubectl apply --dry-run=server before deploying
- Review admission webhook configurations and their failure policies
- Ensure referenced service accounts, ConfigMaps, and Secrets exist before deploying
- Use CI/CD validation to catch manifest errors before they reach the cluster