How Does the Horizontal Pod Autoscaler (HPA) Work?

Level: intermediate | Tags: autoscaling, devops, sre, CKA
TL;DR

The HPA automatically scales the number of Pod replicas based on observed CPU, memory, or custom metrics. It periodically queries the Metrics API, computes the desired replica count using a target utilization formula, and updates the Deployment or StatefulSet accordingly.

Detailed Answer

How the HPA Works

The Horizontal Pod Autoscaler runs as a control loop in the kube-controller-manager. Every 15 seconds (configurable), it:

  1. Queries the Metrics API for current metric values.
  2. Computes the desired replica count based on the target value.
  3. Updates the scale subresource of the target Deployment or StatefulSet.

The scaling formula is:

desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))

For example, if 3 replicas are running with average CPU at 90% and the target is 50%:

desiredReplicas = ceil(3 * (90 / 50)) = ceil(5.4) = 6
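The arithmetic above can be sketched as a small shell helper (an illustrative calculator only, not the controller itself; the function name `hpa_desired` is made up for this example):

```shell
# ceil(currentReplicas * currentMetricValue / desiredMetricValue), via awk
hpa_desired() {
  awk -v cur="$1" -v metric="$2" -v target="$3" \
    'BEGIN { r = cur * metric / target; d = (r == int(r)) ? r : int(r) + 1; print d }'
}

hpa_desired 3 90 50   # 3 replicas at 90% CPU with a 50% target -> prints 6
```

Note that the real controller also skips scaling when the current/target ratio is within a tolerance (0.1 by default) and always clamps the result to the configured min/max replica bounds.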

Prerequisites

The HPA requires the Metrics Server to be installed:

# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify it is running
kubectl get deployment metrics-server -n kube-system

# Test the metrics API
kubectl top nodes
kubectl top pods

Pods must also have resource requests defined, since the HPA calculates utilization as a percentage of the request:

resources:
  requests:
    cpu: 200m    # HPA uses this as the 100% baseline
    memory: 256Mi
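Because the request is the 100% baseline, the utilization the HPA sees is simply actual usage divided by the requested amount. A rough sketch (the helper name `cpu_utilization` is invented for illustration, with both values in millicores):

```shell
# Utilization as the HPA computes it: usage / request * 100
cpu_utilization() {
  awk -v usage="$1" -v request="$2" 'BEGIN { printf "%d\n", usage / request * 100 }'
}

cpu_utilization 300 200   # a Pod using 300m against a 200m request -> prints 150
```

This is why utilization can exceed 100%: usage is capped by limits, not requests.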

Basic HPA Configuration

Using kubectl

# Create an HPA for a Deployment
kubectl autoscale deployment web-app \
  --cpu-percent=50 \
  --min=2 \
  --max=20

# Check HPA status
kubectl get hpa
# NAME      REFERENCE            TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# web-app   Deployment/web-app   35%/50%   2         20        3          5m

Using YAML (v2 API)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70

When multiple metrics are specified, the HPA computes the desired replica count for each metric and uses the highest value.
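For the HPA above, the "highest value wins" rule can be sketched like this (illustrative arithmetic only, with made-up current utilization figures):

```shell
# Desired count per metric: ceil(current * metricValue / target); HPA takes the max
desired() {
  awk -v c="$1" -v m="$2" -v t="$3" \
    'BEGIN { r = c * m / t; d = (r == int(r)) ? r : int(r) + 1; print d }'
}

cpu=$(desired 3 90 50)    # CPU at 90% vs. 50% target -> 6
mem=$(desired 3 60 70)    # memory at 60% vs. 70% target -> ceil(2.57) = 3
[ "$cpu" -gt "$mem" ] && echo "$cpu" || echo "$mem"   # the HPA scales to 6
```

Taking the maximum guarantees every metric stays at or below its target.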

Custom Metrics

For application-specific scaling (HTTP requests per second, queue depth), use custom metrics via the Prometheus Adapter:

# Install Prometheus Adapter
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus.monitoring.svc:9090

Once the adapter exposes the metric through the custom metrics API, reference it in the HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 100
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60

External Metrics

Scale based on metrics not tied to any Pod, such as a cloud message queue depth:

metrics:
  - type: External
    external:
      metric:
        name: sqs_queue_messages
        selector:
          matchLabels:
            queue: order-processing
      target:
        type: AverageValue
        averageValue: 10
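With an AverageValue target, the desired count works out to the total metric value divided by the per-Pod target, rounded up. A sketch of that arithmetic (the helper name `queue_replicas` and the sample queue depth are invented for illustration):

```shell
# desired = ceil(total metric value / target average per Pod)
queue_replicas() {
  awk -v total="$1" -v per_pod="$2" \
    'BEGIN { r = total / per_pod; d = (r == int(r)) ? r : int(r) + 1; print d }'
}

queue_replicas 45 10   # 45 queued messages, target 10 per Pod -> prints 5
```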

Scaling Behavior and Stabilization

The behavior field (autoscaling/v2) controls how fast the HPA scales up and down:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100          # Double the replicas
          periodSeconds: 60
        - type: Pods
          value: 5            # Or add 5 Pods
          periodSeconds: 60
      selectPolicy: Max       # Use whichever adds more Pods
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 10           # Remove at most 10% of replicas
          periodSeconds: 60
      selectPolicy: Min       # Use the most conservative policy

Key settings:

  • stabilizationWindowSeconds: How long to wait before applying a scale decision. Prevents flapping.
  • policies: Define the rate of change (by percentage or absolute count).
  • selectPolicy: Max (most aggressive), Min (most conservative), or Disabled.

Monitoring HPA Decisions

# View HPA details and events
kubectl describe hpa web-app

# Key events to watch:
# "New size: 6; reason: cpu resource utilization above target"
# "New size: 3; reason: All metrics below target"

# View HPA metrics
kubectl get hpa web-app -o yaml

# Check HPA conditions
kubectl get hpa web-app -o jsonpath='{.status.conditions[*].message}'

Common Issues

| Problem | Cause | Fix |
|---|---|---|
| TARGETS shows <unknown>/50% | Metrics Server not installed or Pods lack resource requests | Install Metrics Server and set resource requests |
| HPA never scales up | Target utilization is higher than actual usage | Lower the target percentage or check if requests are too high |
| HPA never scales down | Stabilization window is too long | Reduce scaleDown.stabilizationWindowSeconds |
| Flapping between replica counts | No stabilization window configured | Add behavior.scaleDown.stabilizationWindowSeconds |

HPA with VPA

The Vertical Pod Autoscaler (VPA) adjusts resource requests/limits, while HPA adjusts replica count. They should not both target the same metric (e.g., CPU). A common pattern is to use VPA in recommendation mode to right-size requests and HPA to scale based on custom metrics or CPU utilization.

Best Practices

  1. Always set resource requests on Pods targeted by HPA.
  2. Use custom metrics for business-aware scaling (requests/sec, queue depth).
  3. Configure scale-down stabilization (300s minimum) to prevent flapping.
  4. Set sensible min/max replica bounds based on capacity planning.
  5. Monitor HPA events to understand scaling decisions.

Why Interviewers Ask This

Interviewers ask this to verify that you can configure auto-scaling for production workloads and understand how Kubernetes responds to changing load.

Common Follow-Up Questions

What is the scaling formula the HPA uses?
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue)). If current CPU is 80% and target is 50%, it scales up by a factor of 80/50.
What metrics server does HPA require?
The Metrics Server (metrics.k8s.io API) for CPU/memory. For custom metrics, a custom metrics adapter like Prometheus Adapter is needed.
How do you prevent flapping (rapid scale up/down)?
Configure stabilization windows and scaling policies. The default cooldown is 5 minutes for scale-down and 0 for scale-up.

Key Takeaways

  • HPA requires Metrics Server to be installed for CPU/memory scaling
  • Pods must have resource requests defined for CPU-based scaling to work
  • Use behavior policies to control scale-up and scale-down rates