How Does the Cluster Autoscaler Work?
The Cluster Autoscaler automatically adjusts the number of nodes in a cluster. It adds nodes when Pods are pending due to insufficient resources and removes underutilized nodes when their Pods can be rescheduled elsewhere.
Detailed Answer
The Cluster Autoscaler dynamically adjusts the number of nodes in your cluster based on workload demand. It adds nodes when Pods cannot be scheduled and removes nodes when they are underutilized.
Scale-Up Process
1. HPA (or manual) scales Deployment from 3 to 10 replicas
2. The Deployment controller creates 7 new Pods → 4 go Pending (no room)
3. Cluster Autoscaler detects Pending Pods
4. Simulates scheduling on each node group's template
5. Selects a node group according to the configured expander strategy (e.g., least-waste)
6. Requests new nodes from the cloud provider
7. New nodes register with the cluster (1-5 minutes)
8. Scheduler places Pending Pods on new nodes
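Note that the simulation in step 4 works from Pod resource *requests*, not live usage; a Pod with no requests never triggers a scale-up. A minimal sketch (workload name, image, and values are illustrative):

```yaml
# Containers must declare requests so the autoscaler's simulation
# knows how much capacity each Pending Pod needs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web            # hypothetical workload
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: nginx:1.25
        resources:
          requests:    # drives both scheduling and scale-up decisions
            cpu: "500m"
            memory: "512Mi"
```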
Scale-Down Process
1. Autoscaler checks utilization every 10 seconds
2. Node-3 utilization: 15% CPU, 20% memory (below 50% threshold)
3. Autoscaler simulates rescheduling Node-3's Pods elsewhere
4. All Pods can fit on other nodes → Node-3 is a candidate
5. After scale-down-unneeded-time (default 10 min), drain begins
6. Pods are evicted (respecting PDBs)
7. Node is removed from the cloud provider
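Step 6 is where PodDisruptionBudgets matter: if evicting a Pod would drop a workload below its budget, the drain stalls and the node stays. A sketch (selector and count are illustrative):

```yaml
# Evictions that would leave fewer than 2 ready Pods are refused,
# which can block the autoscaler from draining a node.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```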
Installation and Configuration
```shell
# Helm installation (AWS example)
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-east-1
```
Key configuration parameters:
```shell
# Common flags
--scale-down-enabled=true
--scale-down-utilization-threshold=0.5  # Remove nodes below 50% utilization
--scale-down-delay-after-add=10m        # Wait after adding a node
--scale-down-unneeded-time=10m          # Node must be idle this long
--scale-down-delay-after-delete=0s
--max-node-provision-time=15m           # Timeout for new nodes
--scan-interval=10s                     # How often to check
--max-nodes-total=100                   # Maximum cluster size
--expander=least-waste                  # How to choose node group
```
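In practice these flags go on the cluster-autoscaler container's command line. A fragment of the Deployment spec, assuming AWS with the tag-based auto-discovery shown below (image tag and cluster name are assumptions; match them to your environment):

```yaml
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.2  # pick a tag matching your k8s minor version
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --expander=least-waste
  - --scale-down-utilization-threshold=0.5
  - --scale-down-unneeded-time=10m
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
```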
Node Group Configuration
The autoscaler works with node groups (AWS Auto Scaling Groups, GCE Managed Instance Groups, Azure VMSS):
```shell
# AWS: Tag-based auto-discovery
# Tags on ASG:
#   k8s.io/cluster-autoscaler/enabled: "true"
#   k8s.io/cluster-autoscaler/my-cluster: "owned"

# Node groups with different instance types
# ASG: compute-nodes (m5.xlarge,  min=2, max=20)
# ASG: memory-nodes  (r5.2xlarge, min=0, max=10)
# ASG: gpu-nodes     (p3.2xlarge, min=0, max=5)
```
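Pods steer themselves toward a specific group with a nodeSelector and, for tainted pools, a toleration; the autoscaler honors these constraints in its simulation and scales up only the matching group. A sketch (the label and taint keys are assumptions — they must match what your ASG actually applies to its nodes):

```yaml
# Pod spec fragment targeting the hypothetical gpu-nodes group
spec:
  nodeSelector:
    node-group: gpu-nodes        # assumed node label set by the ASG
  tolerations:
  - key: nvidia.com/gpu          # assumed taint on the GPU nodes
    operator: Exists
    effect: NoSchedule
  containers:
  - name: trainer
    image: my-training-image:latest   # hypothetical
    resources:
      limits:
        nvidia.com/gpu: 1
```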
Expander Strategies
When multiple node groups can accommodate Pending Pods, the expander decides which to use:
| Expander | Strategy |
|----------|----------|
| random | Pick a random matching node group |
| most-pods | Choose the group that schedules the most Pending Pods |
| least-waste | Choose the group with least resource waste after scheduling |
| price | Choose the cheapest node group (requires pricing info) |
| priority | Use a user-defined priority list |
```yaml
# Priority-based expander (requires --expander=priority);
# higher numbers mean higher priority
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - spot-nodes-.*
    50:
      - compute-nodes-.*
    100:
      - gpu-nodes-.*
```
Preventing Node Removal
Annotation on Pod
```yaml
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```
Annotation on Node
```yaml
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"
```
Conditions That Block Scale-Down
- Pods with local storage (emptyDir backed by disk)
- Pods without a controller (bare Pods cannot be rescheduled)
- PDB would be violated by eviction
- Pod has the `safe-to-evict: "false"` annotation
- System Pods (kube-system namespace, mirror Pods)
- Pods with restrictive node affinity that cannot run elsewhere
Scale-Up Optimization: Pod Priority and Preemption
The Cluster Autoscaler considers Pod priority. High-priority Pending Pods trigger scale-up before low-priority ones. Combine with preemption for responsive scaling:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority for customer-facing workloads"
```
Over-Provisioning
To reduce scale-up latency, use a low-priority "placeholder" Deployment that keeps spare capacity:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioner
spec:
  replicas: 3
  selector:
    matchLabels:
      app: overprovisioner
  template:
    metadata:
      labels:
        app: overprovisioner
    spec:
      priorityClassName: overprovisioner-priority  # Very low priority
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
```
When a real workload needs resources, it preempts the overprovisioner Pods instantly (no cloud API wait). The preempted overprovisioner Pods go Pending, triggering scale-up for future headroom.
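The `overprovisioner-priority` class referenced above must exist and sit below every real workload. A sketch (the value -10 is an assumption; any value lower than your workloads' priorities works):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioner-priority
value: -10               # below the default of 0, so any normal Pod preempts these
globalDefault: false
description: "Placeholder Pods that reserve headroom and yield to real workloads"
```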
Monitoring the Cluster Autoscaler
```shell
# Check autoscaler status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# View logs
kubectl logs -n kube-system -l app=cluster-autoscaler

# Check for scale-up/down events (the autoscaler's event reasons;
# TriggeredScaleUp is recorded on the Pending Pods, so search all namespaces)
kubectl get events -A --field-selector reason=TriggeredScaleUp
kubectl get events -A --field-selector reason=ScaleDown
```
Managed Kubernetes Autoscaling
| Provider | Implementation | Configuration |
|----------|---------------|---------------|
| EKS | Cluster Autoscaler or Karpenter | Auto-discovery via ASG tags |
| GKE | Built-in node auto-provisioning | Enable in GKE console |
| AKS | Built-in cluster autoscaler | Enable per node pool |
Karpenter (AWS Alternative)
Karpenter is a newer alternative to Cluster Autoscaler for AWS, providing faster provisioning and better bin packing:
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      nodeClassRef:
        name: default
  limits:
    cpu: "1000"
  disruption:
    consolidationPolicy: WhenUnderutilized
```
Karpenter provisions nodes directly (no ASG), chooses optimal instance types per Pod, and consolidates more aggressively.
Why Interviewers Ask This
Cluster autoscaling directly impacts infrastructure costs and workload availability. Understanding how it works, including its limitations and tuning parameters, is essential for production operations.
Key Takeaways
- Scale-up is triggered by Pending Pods that cannot be scheduled on existing nodes.
- Scale-down removes underutilized nodes after a configurable cool-down period.
- PodDisruptionBudgets and annotations give you fine-grained control over which nodes can be removed.