Kubernetes ReplicaFailure
Causes and Fixes
ReplicaFailure is a Deployment condition that indicates the Deployment's ReplicaSet was unable to create or delete pods as needed. This is set when the ReplicaSet controller encounters errors trying to manage the desired number of replicas, often due to quota limits, admission webhook rejections, or API errors.
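Concretely, the condition surfaces in the Deployment's status.conditions array. The snippet below shows a sample of that JSON (the message text is illustrative, not captured from a real cluster) together with the jq filter you would point at the output of kubectl get deployment -o json to pull the condition out:

```shell
# Sample Deployment status carrying a ReplicaFailure condition
# (illustrative data; in practice pipe `kubectl get deployment <name> -o json` into jq)
cat <<'EOF' > /tmp/deploy-status.json
{
  "status": {
    "conditions": [
      {"type": "Available", "status": "True", "reason": "MinimumReplicasAvailable"},
      {
        "type": "ReplicaFailure",
        "status": "True",
        "reason": "FailedCreate",
        "message": "pods \"web-5d4f8\" is forbidden: exceeded quota: compute-quota, requested: pods=1, used: pods=10, limited: pods=10"
      }
    ]
  }
}
EOF

# Pull out only the ReplicaFailure condition
jq '.status.conditions[] | select(.type == "ReplicaFailure")' /tmp/deploy-status.json
```

The reason and message fields of this condition are where the rest of this guide's diagnosis starts.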
Symptoms
- Deployment condition ReplicaFailure is True in kubectl describe deployment
- ReplicaSet events show FailedCreate errors
- Desired replica count is never reached despite sufficient node resources
- kubectl get events shows 'Error creating: ...' messages from the ReplicaSet controller
- Pod count remains below the desired number indefinitely
Common Causes
- ResourceQuota limits reached in the namespace
- LimitRange constraints the pod spec does not satisfy
- Admission webhooks (OPA/Gatekeeper, Kyverno, or custom controllers) rejecting the request
- Pod Security admission rejecting pods that violate the namespace's enforced standard
- Missing referenced resources such as service accounts, ConfigMaps, Secrets, or PVCs
Step-by-Step Troubleshooting
ReplicaFailure means the ReplicaSet controller cannot create pods for your Deployment. Unlike most pod errors where pods exist but are unhealthy, here the pods are never created at all. This guide focuses on identifying what is rejecting the pod creation requests.
1. Check Deployment Conditions
Start by examining the Deployment's conditions to confirm ReplicaFailure.
kubectl describe deployment <deployment-name>
Look for the ReplicaFailure condition in the Conditions section. The message field often contains the specific rejection reason.
# Get the condition details programmatically
kubectl get deployment <deployment-name> -o jsonpath='{range .status.conditions[*]}{.type}: {.status} - {.message}{"\n"}{end}'
2. Check ReplicaSet Events
The ReplicaSet managed by the Deployment will have detailed events showing why pod creation failed.
# Find the ReplicaSet
kubectl get replicaset -l app=<app-label> --sort-by=.metadata.creationTimestamp
# Describe the most recent ReplicaSet
kubectl describe replicaset <replicaset-name>
Look at the Events section for entries with reason FailedCreate. The message will contain the specific error from the API server, such as:
- forbidden: exceeded quota
- admission webhook denied the request
- pods "name" is forbidden: violates PodSecurity
# Get all FailedCreate events in the namespace
kubectl get events --field-selector reason=FailedCreate --sort-by=.lastTimestamp
3. Check Resource Quotas
If the error mentions quota, check the namespace's ResourceQuota usage.
# List all quotas in the namespace
kubectl get resourcequota -n <namespace>
# Get detailed quota usage
kubectl describe resourcequota -n <namespace>
The output shows used versus hard limits for resources like pods, CPU, memory, and storage. If any resource is at its limit, new pods cannot be created.
# Check the pod's resource requests against remaining quota
kubectl get deployment <deployment-name> -o jsonpath='{.spec.template.spec.containers[0].resources}' | jq .
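The remaining headroom can be read off by diffing used against hard with jq. The quota JSON below is sample data standing in for the output of kubectl get resourcequota <quota-name> -o json:

```shell
# Sample ResourceQuota status (stand-in for: kubectl get resourcequota <quota-name> -o json)
cat <<'EOF' > /tmp/quota.json
{
  "status": {
    "hard": {"pods": "10", "requests.cpu": "4", "requests.memory": "8Gi"},
    "used": {"pods": "10", "requests.cpu": "2", "requests.memory": "3Gi"}
  }
}
EOF

# Print used vs hard for each tracked resource; any line where used equals hard blocks new pods
jq -r '.status as $s | $s.hard | keys[] | "\(.): used \($s.used[.]) / hard \($s.hard[.])"' /tmp/quota.json
```

In this sample the pods resource is exhausted (10 of 10 used), which is exactly the situation that produces a FailedCreate quota error.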
To resolve a quota issue, increase the quota (or ask the cluster administrator for an increase if you lack permission to change it), reduce the resource requests per pod, or scale down other workloads in the namespace.
# Example: increase the quota (requires appropriate RBAC)
kubectl patch resourcequota <quota-name> -n <namespace> -p '{"spec":{"hard":{"pods":"50","requests.cpu":"20","requests.memory":"40Gi"}}}'
4. Check LimitRange Constraints
A LimitRange can enforce minimum or maximum resource requirements that your pod spec may not meet.
kubectl get limitrange -n <namespace>
kubectl describe limitrange -n <namespace>
If the LimitRange requires minimum CPU of 100m but your pod does not specify a CPU request, the creation will fail. Either update the pod spec to comply with the LimitRange or adjust the LimitRange.
# Check what the pod template specifies
kubectl get deployment <deployment-name> -o jsonpath='{.spec.template.spec.containers[*].resources}' | jq .
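To compare quantities like these by hand, CPU values must first be normalized to a common unit, since "1" means one full core while "100m" means 100 millicores. A minimal sketch (to_millicores is a hypothetical helper, and the two quantities are example values, not read from a cluster):

```shell
# Convert a whole-core or millicore CPU quantity to millicores: "100m" -> 100, "1" -> 1000
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo "$(( $1 * 1000 ))" ;;
  esac
}

min_cpu=$(to_millicores "100m")   # example LimitRange minimum
req_cpu=$(to_millicores "50m")    # example CPU request from the pod template

# A request below the LimitRange minimum will cause FailedCreate
if [ "$req_cpu" -lt "$min_cpu" ]; then
  echo "request below LimitRange minimum: ${req_cpu}m < ${min_cpu}m"
fi
```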
5. Check Admission Webhooks
Admission webhooks can reject pod creation for policy reasons.
# List validating webhooks
kubectl get validatingwebhookconfigurations
# List mutating webhooks
kubectl get mutatingwebhookconfigurations
# Describe a specific webhook for its rules
kubectl describe validatingwebhookconfiguration <webhook-name>
If a webhook is rejecting pods, the FailedCreate event message will include the webhook name and its rejection reason. Common sources include OPA/Gatekeeper policies, Kyverno policies, and custom admission controllers.
# Check if a specific webhook is causing issues by looking at its failure policy
kubectl get validatingwebhookconfiguration <webhook-name> -o jsonpath='{.webhooks[*].failurePolicy}'
To debug further, check the logs of the webhook service.
# Find the webhook service
kubectl get validatingwebhookconfiguration <webhook-name> -o jsonpath='{.webhooks[0].clientConfig.service.name}'
# Check the webhook pod logs
kubectl logs -n <webhook-namespace> -l app=<webhook-label> --tail=100
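When a webhook is the culprit, its name appears in double quotes inside the FailedCreate message, which makes it easy to extract with sed. The message below is a fabricated Kyverno-style example, not real cluster output:

```shell
# Sample webhook-denial message as it would appear in a FailedCreate event (illustrative)
msg='Error creating: admission webhook "validate.kyverno.svc" denied the request: policy require-labels failed'

# Extract the webhook name between the double quotes
webhook=$(echo "$msg" | sed -n 's/.*admission webhook "\([^"]*\)".*/\1/p')
echo "$webhook"
```

The extracted name can then be matched against the validatingwebhookconfigurations listed above to find the owning service.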
6. Check Pod Security Admission
If the cluster uses Pod Security admission (stable since Kubernetes 1.25) to enforce the Pod Security Standards, the namespace's enforced level may reject pods that do not meet it.
# Check namespace labels for pod security enforcement
kubectl get namespace <namespace> -o jsonpath='{.metadata.labels}' | jq 'with_entries(select(.key | startswith("pod-security")))'
Labels like pod-security.kubernetes.io/enforce: restricted mean pods must meet the restricted security standard. If your pod runs as root, uses privileged containers, or does not set required security context fields, creation will be rejected.
# Check pod security context in the deployment
kubectl get deployment <deployment-name> -o jsonpath='{.spec.template.spec}' | jq '{securityContext, containers: [.containers[] | {name, securityContext}]}'
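If the security context is the problem, a patch adding the fields the restricted standard generally requires can fix it. The sketch below builds that patch with jq so it can be inspected before being fed to kubectl patch; the container name "app" is a placeholder you must replace with your container's actual name (strategic merge matches containers by name), and the exact required fields should be verified against the Pod Security Standards version your cluster enforces:

```shell
# Build a strategic-merge patch adding restricted-profile security settings
# ("app" is a placeholder container name -- substitute your own)
patch=$(jq -n '{
  spec: {template: {spec: {
    securityContext: {runAsNonRoot: true, seccompProfile: {type: "RuntimeDefault"}},
    containers: [{
      name: "app",
      securityContext: {allowPrivilegeEscalation: false, capabilities: {drop: ["ALL"]}}
    }]
  }}}
}')
echo "$patch" | jq -c .

# Then apply it with:
#   kubectl patch deployment <deployment-name> --type strategic -p "$patch"
```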
7. Verify Referenced Resources Exist
The pod template may reference resources that do not exist.
# Check if the service account exists
kubectl get serviceaccount <sa-name> -n <namespace>
# Check if referenced ConfigMaps exist
kubectl get configmap <configmap-name> -n <namespace>
# Check if referenced Secrets exist
kubectl get secret <secret-name> -n <namespace>
# Check if referenced PVCs exist
kubectl get pvc <pvc-name> -n <namespace>
If any referenced resource is missing, create it or update the Deployment to reference an existing resource.
8. Test Pod Creation Manually
To isolate whether the issue is with the pod spec or something else, try creating a pod manually using the Deployment's template.
# Build a standalone Pod manifest from the Deployment's pod template
kubectl get deployment <deployment-name> -o json | jq '{apiVersion: "v1", kind: "Pod", metadata: {name: "replica-debug", namespace: .metadata.namespace}, spec: .spec.template.spec}' > /tmp/test-pod.json
# Attempt to create the pod with server-side dry-run
kubectl create -f /tmp/test-pod.json --dry-run=server
The server-side dry-run sends the request through quota enforcement and every admission controller, returning any rejection error without actually persisting the pod.
9. Fix the Issue and Verify
After identifying and resolving the root cause, trigger a new rollout or wait for the ReplicaSet controller to retry.
# If you updated the deployment spec
kubectl apply -f deployment.yaml
# Force a new rollout; this adds a restartedAt annotation to the pod template
kubectl rollout restart deployment/<deployment-name>
# Watch the rollout
kubectl rollout status deployment/<deployment-name> --watch
10. Confirm Recovery
Verify that the ReplicaFailure condition has cleared and all replicas are running.
# Check deployment status
kubectl get deployment <deployment-name>
# Verify no ReplicaFailure condition
kubectl get deployment <deployment-name> -o jsonpath='{.status.conditions[?(@.type=="ReplicaFailure")]}'
# Confirm all pods are running
kubectl get pods -l app=<app-label>
# Check that no FailedCreate events are recent
kubectl get events --field-selector reason=FailedCreate --sort-by=.lastTimestamp
The Deployment is healthy when the ready replica count matches the desired count, the ReplicaFailure condition is gone or False, and no new FailedCreate events are appearing. If the issue was quota-related, consider setting up monitoring and alerts on quota usage to prevent recurrence.
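The quota alerting suggested above can be as simple as a threshold check in a cron script. A sketch with hard-coded sample numbers (a real script would pull the used and hard values from kubectl get resourcequota -o json, as in step 3):

```shell
# Warn when pod quota usage crosses a threshold (sample numbers, not live data)
used=9
hard=10
threshold=80   # percent

pct=$(( used * 100 / hard ))
if [ "$pct" -ge "$threshold" ]; then
  echo "WARNING: pod quota at ${pct}% (${used}/${hard})"
fi
```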
How to Explain This in an Interview
I would explain that ReplicaFailure is distinct from pod-level errors because the pods never actually get created. The issue is at the API level — the ReplicaSet controller's request to create pods is being rejected before any scheduling or container runtime involvement. I'd discuss how to diagnose this by checking ReplicaSet events for FailedCreate reasons, and I'd walk through common admission-time rejections including quotas, LimitRanges, admission webhooks, and PodSecurity. I'd also explain how this differs from ProgressDeadlineExceeded, where pods are created but fail to become ready.
Prevention
- Monitor namespace resource quotas and set alerts before limits are reached
- Test deployment manifests with kubectl apply --dry-run=server before deploying
- Review admission webhook configurations and their failure policies
- Ensure referenced service accounts, ConfigMaps, and Secrets exist before deploying
- Use CI/CD validation to catch manifest errors before they reach the cluster