How Do You Scale a Kubernetes Deployment?

beginner | deployments · devops · sre · CKA · CKAD
TL;DR

Scale a Deployment by changing the replicas field using kubectl scale, kubectl edit, or kubectl apply. For automatic scaling, use a HorizontalPodAutoscaler that adjusts replicas based on CPU, memory, or custom metrics.

Detailed Answer

Scaling is one of the core capabilities that makes Kubernetes powerful. A Deployment can be scaled manually by adjusting the replica count or automatically using a HorizontalPodAutoscaler (HPA).

Manual Scaling

Using kubectl scale

# Scale to 5 replicas
kubectl scale deployment/web-app --replicas=5

# Verify the scaling
kubectl get deployment web-app

Using kubectl apply

Update the manifest and re-apply:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 5    # Changed from 3 to 5
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: web-app:1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
kubectl apply -f deployment.yaml

Changing replicas does not trigger a rollout. No new ReplicaSet is created. The existing ReplicaSet simply adjusts its Pod count.
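A quick way to confirm this on a live cluster (assuming the web-app Deployment from above):

```shell
# Note the current rollout revisions
kubectl rollout history deployment/web-app

# Scale, then check again: the revision list should be unchanged,
# because only the Pod count on the current ReplicaSet changed
kubectl scale deployment/web-app --replicas=5
kubectl rollout history deployment/web-app

# The ReplicaSet list still shows the same current ReplicaSet
kubectl get rs -l app=web-app
```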

Using kubectl patch

kubectl patch deployment web-app -p '{"spec":{"replicas":5}}'

What Happens When You Scale Up

  1. The Deployment controller updates the desired replica count on the current ReplicaSet.
  2. The ReplicaSet controller detects the difference between desired and actual Pod count.
  3. New Pods are created and scheduled to available nodes.
  4. The scheduler considers resource requests, node affinity, anti-affinity rules, and taints/tolerations.
  5. New Pods become Ready once their readiness probes pass.
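You can watch these steps play out (label and names assume the web-app example):

```shell
# Watch Pods appear and transition Pending -> ContainerCreating -> Running
kubectl get pods -l app=web-app -w

# If a new Pod stays Pending, the scheduler's reasoning shows up in its events
kubectl describe pod <pending-pod-name> | tail -n 20
```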

What Happens When You Scale Down

  1. The ReplicaSet controller selects Pods to terminate.
  2. Selected Pods are removed from Service endpoints.
  3. Each Pod receives SIGTERM.
  4. The Pod has terminationGracePeriodSeconds (default 30s) to shut down cleanly.
  5. After the grace period, the Pod receives SIGKILL.
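To give the application more room for a clean shutdown, you can raise the grace period and add a preStop hook on the Pod template. A minimal sketch (the 60s grace period and 5s sleep are arbitrary example values):

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # default is 30s
      containers:
        - name: web-app
          image: web-app:1.0
          lifecycle:
            preStop:
              exec:
                # Brief pause so endpoint removal can propagate
                # before the process receives SIGTERM
                command: ["sh", "-c", "sleep 5"]
```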

Automatic Scaling with HPA

The HorizontalPodAutoscaler watches metrics and adjusts the replica count automatically:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120

This HPA:

  • Keeps replicas between 2 and 20.
  • Scales up when average utilization exceeds 70% of requested CPU or 80% of requested memory.
  • Scales up aggressively (by at most 50% of current replicas per minute).
  • Scales down conservatively (by at most 25% of current replicas every 2 minutes), with a 5-minute stabilization window.
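The scaling decision itself follows the HPA's core formula, desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), and the behavior policies above then cap the step size. A small illustration of the arithmetic:

```python
import math

def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """HPA core formula: ceil(currentReplicas * currentMetricValue / targetMetricValue).

    The real controller also skips changes inside a tolerance band
    (10% by default) before applying any behavior policies.
    """
    return math.ceil(current_replicas * current_value / target_value)

# 4 Pods averaging 90% CPU against a 70% target -> 6 replicas
print(desired_replicas(4, 90, 70))
```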

Prerequisites for HPA

  • The Metrics Server must be installed in the cluster.
  • Pods must have resource requests defined (CPU and/or memory). Without requests, the HPA cannot calculate utilization percentages.

# Create an HPA imperatively
kubectl autoscale deployment web-app --min=2 --max=20 --cpu-percent=70

# Check HPA status
kubectl get hpa web-app-hpa

# See detailed scaling decisions
kubectl describe hpa web-app-hpa

Scaling with Custom Metrics

For workloads where CPU/memory do not reflect load accurately, use custom metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 100

This scales based on a custom metric (http_requests_per_second) exposed through a metrics adapter like Prometheus Adapter.
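Before such an HPA can work, the custom metrics API must actually be served by an adapter. A quick check (the namespace, label, and metric name below are the assumptions from this example):

```shell
# Lists available custom metrics if an adapter
# (e.g. Prometheus Adapter) is installed
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1

# Query the example metric across the app's Pods
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second"
</imports>
```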

Scaling to Zero

kubectl scale deployment/web-app --replicas=0

This terminates all Pods while preserving the Deployment. Useful for:

  • Cost savings in non-production environments during off-hours.
  • Maintenance windows where the application should not run.
  • Event-driven scale-from-zero with KEDA (Kubernetes Event-Driven Autoscaling), which can scale back up from zero based on queue depth, event count, or other triggers.
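As a sketch of the KEDA route, a ScaledObject using the cron scaler to keep the app at zero outside working hours (the schedule and replica values are arbitrary examples):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: web-app-scaler
spec:
  scaleTargetRef:
    name: web-app        # the Deployment to scale
  minReplicaCount: 0     # allow scale to zero
  maxReplicaCount: 10
  triggers:
    - type: cron         # run only during working hours
      metadata:
        timezone: Etc/UTC
        start: "0 8 * * *"
        end: "0 18 * * *"
        desiredReplicas: "3"
```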

Best Practices

  1. Always set resource requests so the HPA and scheduler work correctly.
  2. Use PodDisruptionBudgets to prevent scaling down below a safe threshold.
  3. Configure scale-down stabilization to prevent flapping between replica counts.
  4. Set appropriate min and max replicas -- too low risks outages, too high wastes resources.
  5. Monitor HPA decisions with kubectl describe hpa and alert on sustained scaling at maxReplicas.

For example, a PodDisruptionBudget that keeps at least two Pods available during voluntary disruptions:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app

Summary

Kubernetes provides both manual scaling (via kubectl scale or manifest changes) and automatic scaling (via HPA). Manual scaling is immediate and does not trigger a rollout. Automatic scaling reacts to real-time metrics and adjusts replica count within configured bounds. Combining HPA with proper resource requests, PodDisruptionBudgets, and custom metrics gives you a production-ready scaling strategy.

Why Interviewers Ask This

Scaling is one of the primary reasons teams adopt Kubernetes. Interviewers want to know if you can handle both manual and automatic scaling and understand the mechanics behind it.

Common Follow-Up Questions

What happens to in-flight requests when scaling down?
Kubernetes sends SIGTERM to Pods being removed. If the app handles graceful shutdown and the terminationGracePeriodSeconds is configured, in-flight requests can complete before the Pod exits.

What is the difference between HPA and VPA?
HPA (Horizontal Pod Autoscaler) adds or removes Pods. VPA (Vertical Pod Autoscaler) adjusts CPU/memory requests on existing Pods. HPA scales out; VPA scales up.
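A minimal VPA object for comparison (API group per the VPA project; Auto mode lets it apply new requests by evicting and recreating Pods):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"   # "Off" only produces recommendations
```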

Can you scale a Deployment to zero replicas?
Yes. Setting replicas to 0 terminates all Pods while keeping the Deployment and ReplicaSet objects. This is useful for maintenance windows or cost savings.

Key Takeaways

  • kubectl scale is the fastest way to manually change replica count.
  • HPA provides automatic scaling based on observed metrics.
  • Scaling does not trigger a rollout -- it only changes the replica count on the current ReplicaSet.

Related Questions