How Does the Cluster Autoscaler Work?

intermediate | autoscaling, devops, sre, platform engineer, CKA
TL;DR

The Cluster Autoscaler automatically adjusts the number of nodes in a cluster. It adds nodes when Pods are pending due to insufficient resources and removes underutilized nodes when their Pods can be rescheduled elsewhere.

Detailed Answer

The Cluster Autoscaler dynamically adjusts the number of nodes in your cluster based on workload demand. It adds nodes when Pods cannot be scheduled and removes nodes when they are underutilized.

Scale-Up Process

1. HPA (or a manual change) scales a Deployment from 3 to 10 replicas
2. The Deployment controller creates 7 new Pods; the scheduler places 3, and 4 go Pending (no room)
3. Cluster Autoscaler detects Pending Pods
4. Simulates scheduling on each node group's template
5. Chooses a node group via the configured expander (e.g., least-waste or price)
6. Requests new nodes from the cloud provider
7. New nodes register with the cluster (1-5 minutes)
8. Scheduler places Pending Pods on new nodes
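The fit check in steps 4-5 can be sketched as plain arithmetic. This is an illustrative simulation only, not real autoscaler code: the group names, allocatable values, and prices below are made up, and the real autoscaler runs the full set of scheduler predicates, not just a CPU comparison.

```shell
# Illustrative sketch of steps 4-5: find node-group templates whose
# allocatable CPU fits a Pending Pod's request, then pick the cheapest.
# Prices are fictional integers (tenths of a cent per hour).
pending_cpu_m=1500   # Pending Pod requests 1.5 CPU (1500 millicores)

best_group="" ; best_price=0
while read -r group alloc_m price; do
  if [ "$alloc_m" -ge "$pending_cpu_m" ]; then                # Pod fits this template
    if [ -z "$best_group" ] || [ "$price" -lt "$best_price" ]; then
      best_group=$group ; best_price=$price                   # cheaper fitting group wins
    fi
  fi
done <<'EOF'
compute-nodes 3500 192
memory-nodes  7500 504
gpu-nodes     7500 3060
EOF

echo "scale up: $best_group"   # -> scale up: compute-nodes
```

With a price-aware expander this yields the cheapest fitting group; the default expander (random) would pick any fitting group.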

Scale-Down Process

1. Autoscaler checks utilization every 10 seconds
2. Node-3 utilization: 15% CPU, 20% memory (below 50% threshold)
3. Autoscaler simulates rescheduling Node-3's Pods elsewhere
4. All Pods can fit on other nodes → Node-3 is a candidate
5. After scale-down-unneeded-time (default 10 min), drain begins
6. Pods are evicted (respecting PDBs)
7. Node is removed from the cloud provider
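The utilization test in step 2 is simple arithmetic: summed Pod requests on the node divided by the node's allocatable, compared against the threshold. A sketch with made-up numbers (the real autoscaler uses the higher of the CPU and memory ratios):

```shell
# Illustrative arithmetic behind the scale-down utilization check.
alloc_cpu_m=4000       # node allocatable: 4 CPU
requested_cpu_m=600    # sum of CPU requests from Pods on the node
threshold_pct=50       # --scale-down-utilization-threshold=0.5

util_pct=$(( requested_cpu_m * 100 / alloc_cpu_m ))
if [ "$util_pct" -lt "$threshold_pct" ]; then
  echo "scale-down candidate: ${util_pct}% < ${threshold_pct}%"
fi
```

Note that the calculation uses requests, not actual usage: a node running Pods with large requests but low real consumption still counts as utilized.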

Installation and Configuration

# Helm installation
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-east-1

Key configuration parameters:

# Common flags
--scale-down-enabled=true
--scale-down-utilization-threshold=0.5    # Remove nodes below 50% utilization
--scale-down-delay-after-add=10m          # Wait after adding a node
--scale-down-unneeded-time=10m            # Node must be idle this long
--scale-down-delay-after-delete=0s        # Wait after a node deletion
--max-node-provision-time=15m             # Timeout for new nodes
--scan-interval=10s                       # How often to check
--max-nodes-total=100                     # Maximum cluster size
--expander=least-waste                    # How to choose node group
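In a manifest-based install these flags go on the autoscaler container's command. A minimal sketch of that fragment; the image tag and cluster name are placeholders you would adjust:

```yaml
# Fragment of a cluster-autoscaler Deployment spec (illustrative)
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --scale-down-utilization-threshold=0.5
      - --scale-down-unneeded-time=10m
      - --expander=least-waste
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
```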

Node Group Configuration

The autoscaler works with node groups (AWS Auto Scaling Groups, GCE Managed Instance Groups, Azure VMSS):

# AWS: Tag-based auto-discovery
# Tags on ASG:
#   k8s.io/cluster-autoscaler/enabled: "true"
#   k8s.io/cluster-autoscaler/my-cluster: "owned"

# Node groups with different instance types
# ASG: compute-nodes    (m5.xlarge,  min=2, max=20)
# ASG: memory-nodes     (r5.2xlarge, min=0, max=10)
# ASG: gpu-nodes        (p3.2xlarge, min=0, max=5)
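For node groups that can scale from zero (min=0 above), the autoscaler has no live node to inspect for labels, taints, or extended resources, so these are declared as additional ASG tags under the node-template prefix. The label and taint values below are examples, not required names:

```
# AWS: node-template tags for scale-from-zero node groups
# Extra tags on the gpu-nodes ASG:
#   k8s.io/cluster-autoscaler/node-template/label/node-type: "gpu"
#   k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: "1"
#   k8s.io/cluster-autoscaler/node-template/taint/dedicated: "gpu:NoSchedule"
```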

Expander Strategies

When multiple node groups can accommodate Pending Pods, the expander decides which to use:

| Expander | Strategy |
|----------|----------|
| random | Pick a random matching node group |
| most-pods | Choose the group that schedules the most Pending Pods |
| least-waste | Choose the group with the least resource waste after scheduling |
| price | Choose the cheapest node group (requires pricing info) |
| priority | Use a user-defined priority list |

# Priority-based expander
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - spot-nodes-.*
    50:
      - compute-nodes-.*
    100:
      - gpu-nodes-.*

Preventing Node Removal

Annotation on Pod

metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

Annotation on Node

metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"

Conditions That Block Scale-Down

  1. Pods with local storage (emptyDir backed by disk)
  2. Pods without a controller (bare Pods cannot be rescheduled)
  3. PDB would be violated by eviction
  4. Pod has safe-to-evict: false annotation
  5. System Pods (kube-system Pods without a PDB, and mirror Pods)
  6. Pods with restrictive node affinity that cannot run elsewhere

Scale-Up Optimization: Pod Priority and Preemption

The Cluster Autoscaler considers Pod priority. High-priority Pending Pods trigger scale-up before low-priority ones. Combine with preemption for responsive scaling:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority for customer-facing workloads"

Over-Provisioning

To reduce scale-up latency, use a low-priority "placeholder" Deployment that keeps spare capacity:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioner
spec:
  replicas: 3
  selector:
    matchLabels:
      app: overprovisioner
  template:
    metadata:
      labels:
        app: overprovisioner
    spec:
      priorityClassName: overprovisioner-priority  # Very low priority
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"

When a real workload needs resources, the scheduler preempts the overprovisioner Pods immediately (no cloud API wait). The evicted overprovisioner Pods go Pending, which triggers scale-up and restores the headroom.
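The Deployment above references an overprovisioner-priority class that must exist separately. A minimal sketch; the exact value only matters in that it sits far below every real workload's priority:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioner-priority
value: -10            # below any real workload, so these Pods are preempted first
globalDefault: false
description: "Placeholder Pods reserving headroom for the Cluster Autoscaler"
```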

Monitoring the Cluster Autoscaler

# Check autoscaler status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# View logs
kubectl logs -n kube-system -l app=cluster-autoscaler

# Check for scale-up/down events
kubectl get events -A --field-selector reason=TriggeredScaleUp
kubectl get events -A --field-selector reason=ScaleDown

Managed Kubernetes Autoscaling

| Provider | Implementation | Configuration |
|----------|---------------|---------------|
| EKS | Cluster Autoscaler or Karpenter | Auto-discovery via ASG tags |
| GKE | Built-in node auto-provisioning | Enable in GKE console |
| AKS | Built-in cluster autoscaler | Enable per node pool |

Karpenter (AWS Alternative)

Karpenter is a newer alternative to Cluster Autoscaler for AWS, providing faster provisioning and better bin packing:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        name: default
  limits:
    cpu: "1000"
  disruption:
    consolidationPolicy: WhenUnderutilized

Karpenter provisions nodes directly (no ASG), chooses optimal instance types per Pod, and consolidates more aggressively.
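The NodePool's nodeClassRef points at an EC2NodeClass that carries the AWS-specific settings. A sketch assuming the same v1beta1 API generation and tag-based discovery; the IAM role name and discovery tags are placeholders:

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-my-cluster"      # placeholder IAM role for nodes
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster  # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```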

Why Interviewers Ask This

Cluster autoscaling directly impacts infrastructure costs and workload availability. Understanding how it works, including its limitations and tuning parameters, is essential for production operations.

Common Follow-Up Questions

How does the Cluster Autoscaler decide to scale up?
When a Pod is in Pending state because no node has sufficient resources, the autoscaler simulates scheduling on a new node from each configured node group. If the simulation succeeds, it adds the node.
How does scale-down work?
The autoscaler monitors node utilization. If a node's requests total less than the utilization threshold (default 50%) for a configurable period, and all its Pods can be rescheduled elsewhere, the node is drained and removed.
What prevents the Cluster Autoscaler from removing a node?
Pods with local storage, Pods without a controller, Pods whose eviction would violate a PDB, system Pods, Pods annotated safe-to-evict: "false", and nodes annotated scale-down-disabled: "true" all prevent node removal.

Key Takeaways

  • Scale-up is triggered by Pending Pods that cannot be scheduled on existing nodes.
  • Scale-down removes underutilized nodes after a configurable cool-down period.
  • PodDisruptionBudgets and annotations give you fine-grained control over which nodes can be removed.
