How Do You Scale a Kubernetes Deployment?
Scale a Deployment by changing the replicas field using kubectl scale, kubectl edit, or kubectl apply. For automatic scaling, use a HorizontalPodAutoscaler that adjusts replicas based on CPU, memory, or custom metrics.
Detailed Answer
Scaling is one of the core capabilities that makes Kubernetes powerful. A Deployment can be scaled manually by adjusting the replica count or automatically using a HorizontalPodAutoscaler (HPA).
Manual Scaling
Using kubectl scale
# Scale to 5 replicas
kubectl scale deployment/web-app --replicas=5
# Verify the scaling
kubectl get deployment web-app
Using kubectl apply
Update the manifest and re-apply:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 5  # Changed from 3 to 5
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: web-app:1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
kubectl apply -f deployment.yaml
Changing replicas does not trigger a rollout. No new ReplicaSet is created. The existing ReplicaSet simply adjusts its Pod count.
Using kubectl patch
kubectl patch deployment web-app -p '{"spec":{"replicas":5}}'
What Happens When You Scale Up
- The Deployment controller updates the desired replica count on the current ReplicaSet.
- The ReplicaSet controller detects the difference between desired and actual Pod count.
- New Pods are created and scheduled to available nodes.
- The scheduler considers resource requests, node affinity, anti-affinity rules, and taints/tolerations.
- New Pods become Ready once their readiness probes pass.
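The reconciliation in the first two steps is a diff between desired and observed state. A minimal Python sketch of that calculation (illustrative only, not the actual controller code):

```python
def pods_to_create(desired: int, running: int) -> int:
    """How many new Pods the ReplicaSet controller must create.

    Returns 0 when the desired count is already met or exceeded
    (a surplus is handled by the scale-down path instead).
    """
    return max(0, desired - running)

# Scaling web-app from 3 to 5 replicas leaves a gap of 2 Pods to schedule.
print(pods_to_create(desired=5, running=3))  # -> 2
```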
What Happens When You Scale Down
- The ReplicaSet controller selects Pods to terminate.
- Selected Pods are removed from Service endpoints.
- Each Pod receives SIGTERM.
- The Pod has terminationGracePeriodSeconds (default 30s) to shut down cleanly.
- After the grace period, the Pod receives SIGKILL.
Automatic Scaling with HPA
The HorizontalPodAutoscaler watches metrics and adjusts the replica count automatically:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
This HPA:
- Keeps replicas between 2 and 20.
- Scales up when average CPU exceeds 70% or memory exceeds 80%.
- Scales up aggressively (50% more Pods per minute).
- Scales down conservatively (25% fewer Pods every 2 minutes) with a 5-minute stabilization window.
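Behind these decisions is the core HPA formula from the Kubernetes autoscaling documentation: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A quick Python sketch:

```python
import math

def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 5 replicas averaging 140% CPU against a 70% target -> double to 10.
print(desired_replicas(5, 140, 70))  # -> 10
```

Note that the behavior policies further cap each step (here, at most 50% more Pods per minute on the way up), and the stabilization windows delay when the new count takes effect.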
Prerequisites for HPA
- The Metrics Server must be installed in the cluster.
- Pods must have resource requests defined (CPU and/or memory). Without requests, the HPA cannot calculate utilization percentages.
# Create an HPA imperatively
kubectl autoscale deployment web-app --min=2 --max=20 --cpu-percent=70
# Check HPA status
kubectl get hpa web-app-hpa
# See detailed scaling decisions
kubectl describe hpa web-app-hpa
Scaling with Custom Metrics
For workloads where CPU/memory do not reflect load accurately, use custom metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 100
This scales based on a custom metric (http_requests_per_second) exposed through a metrics adapter like Prometheus Adapter.
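Pods-type metrics use the same scaling rule, just with the per-Pod average value in place of resource utilization. A hypothetical check with the numbers from this manifest:

```python
import math

def desired_from_average(current: int, observed_avg: float, target_avg: float,
                         min_replicas: int = 2, max_replicas: int = 50) -> int:
    """Pods-type metrics use the same rule: ceil(current * observed / target)."""
    desired = math.ceil(current * observed_avg / target_avg)
    return max(min_replicas, min(max_replicas, desired))

# 4 replicas each averaging 180 req/s against the 100 req/s target -> 8 replicas.
print(desired_from_average(4, 180, 100))  # -> 8
```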
Scaling to Zero
kubectl scale deployment/web-app --replicas=0
This terminates all Pods while preserving the Deployment. Useful for:
- Cost savings in non-production environments during off-hours.
- Maintenance windows where the application should not run.
Note that the HPA cannot scale a workload to zero by default. For event-driven workloads, KEDA (Kubernetes Event-Driven Autoscaling) can scale from zero based on queue depth, event count, or other triggers.
Best Practices
- Always set resource requests so the HPA and scheduler work correctly.
- Use PodDisruptionBudgets to prevent scaling down below a safe threshold.
- Configure scale-down stabilization to prevent flapping between replica counts.
- Set appropriate min and max replicas -- too low risks outages, too high wastes resources.
- Monitor HPA decisions with kubectl describe hpa and alert on sustained scaling at maxReplicas.
For example, a PodDisruptionBudget that keeps at least two Pods available during voluntary disruptions:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app
Summary
Kubernetes provides both manual scaling (via kubectl scale or manifest changes) and automatic scaling (via HPA). Manual scaling is immediate and does not trigger a rollout. Automatic scaling reacts to real-time metrics and adjusts replica count within configured bounds. Combining HPA with proper resource requests, PodDisruptionBudgets, and custom metrics gives you a production-ready scaling strategy.
Why Interviewers Ask This
Scaling is one of the primary reasons teams adopt Kubernetes. Interviewers want to know if you can handle both manual and automatic scaling and understand the mechanics behind it.
Key Takeaways
- kubectl scale is the fastest way to manually change replica count.
- HPA provides automatic scaling based on observed metrics.
- Scaling does not trigger a rollout -- it only changes the replica count on the current ReplicaSet.