How does the kube-scheduler work and how does it decide where to place pods?
The kube-scheduler watches for unscheduled pods and assigns them to nodes through a two-phase process: filtering (eliminating nodes that cannot run the pod) and scoring (ranking remaining nodes by preference). It considers resource requests, affinity rules, taints, tolerations, and topology constraints.
Detailed Answer
The kube-scheduler is the control plane component responsible for assigning pods to nodes. It watches the API server for newly created pods that have no spec.nodeName set and determines the best node for each pod to run on.
The Scheduling Cycle
The scheduler follows a two-phase approach for every pod:
Phase 1: Filtering (Predicates) The scheduler eliminates nodes that cannot satisfy the pod's requirements. Common filters include:
- Resource availability -- Does the node have enough CPU and memory to satisfy the pod's resource requests?
- NodeSelector -- Does the node match the labels specified in the pod's nodeSelector?
- Taints and tolerations -- Does the pod tolerate the node's taints?
- Node affinity -- Does the node satisfy the pod's nodeAffinity rules?
- Pod anti-affinity -- Would placing this pod violate any anti-affinity rules?
- Volume constraints -- Can the requested persistent volumes be mounted on this node?
If no nodes pass filtering, the pod remains in a Pending state.
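Taints and tolerations work as a pair: a tainted node repels all pods except those that declare a matching toleration. A minimal sketch, assuming a node has been tainted with a hypothetical dedicated=gpu:NoSchedule taint (the pod name is a placeholder):

```yaml
# Pod that tolerates a dedicated=gpu:NoSchedule taint,
# so the filtering phase will not eliminate the tainted node
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx:1.27
```

Note that a toleration only permits scheduling onto the tainted node; it does not require it. Combine with node affinity to actually steer the pod there.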
Phase 2: Scoring (Priorities) Each remaining node is scored on a 0-100 scale by multiple scoring plugins, and the node with the highest weighted aggregate score wins. Scoring plugins include:
- NodeResourcesFit (least-allocated strategy) -- Prefers nodes with more available resources
- NodeResourcesBalancedAllocation -- Prefers nodes where CPU and memory usage is balanced
- ImageLocality -- Prefers nodes that already have the container image cached
- InterPodAffinity -- Prefers nodes that satisfy pod affinity preferences
- PodTopologySpread -- Prefers nodes that improve workload distribution
Practical Examples
Using nodeSelector for simple placement:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  nodeSelector:
    accelerator: nvidia-a100
  containers:
  - name: training
    image: pytorch/pytorch:latest
    resources:
      requests:
        cpu: "4"
        memory: "16Gi"
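For the selector above to match, at least one node must carry the accelerator label. A typical way to apply and verify it (the node name gpu-node-1 is a placeholder):

```shell
# Label the node so the nodeSelector above can match it
kubectl label nodes gpu-node-1 accelerator=nvidia-a100

# Verify which nodes carry the label
kubectl get nodes -l accelerator=nvidia-a100
```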
Using node affinity for more expressive rules:
apiVersion: v1
kind: Pod
metadata:
  name: web-frontend
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
            - us-east-1b
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: node-type
            operator: In
            values:
            - spot
  containers:
  - name: nginx
    image: nginx:1.27
Using topology spread constraints for even distribution:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: nginx
        image: nginx:1.27
Troubleshooting Scheduling
When a pod is stuck in Pending, the scheduler adds events explaining why:
# Check pod events for scheduling information
kubectl describe pod my-pending-pod
# Look for events like:
# "0/5 nodes are available: 2 Insufficient cpu, 3 node(s) had taint
# {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate."
# View allocatable resources across all nodes
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU:.status.allocatable.cpu,\
MEM:.status.allocatable.memory
# Check resource usage per node
kubectl top nodes
# View scheduler logs for detailed decision info
kubectl logs -n kube-system kube-scheduler-controlplane
Scheduler Profiles and Plugins
Since Kubernetes 1.19, the scheduler has used a plugin-based architecture called the Scheduling Framework. You can configure scheduler profiles to customize which plugins are enabled at each extension point (the v1 configuration API shown below is available from Kubernetes 1.25; earlier releases use the v1beta versions):
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: ImageLocality
      enabled:
      - name: MyCustomPlugin
        weight: 5
You can also run multiple schedulers simultaneously. Pods specify which scheduler to use:
spec:
  schedulerName: my-custom-scheduler
Priority and Preemption
When a high-priority pod cannot be scheduled, the scheduler can preempt lower-priority pods to make room. This is controlled through PriorityClasses:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-workload
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For mission-critical workloads"
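A pod opts into a PriorityClass by name via priorityClassName. A minimal sketch, assuming a PriorityClass named critical-workload exists in the cluster (the pod name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
spec:
  priorityClassName: critical-workload
  containers:
  - name: api
    image: nginx:1.27
```

If no node can fit this pod, the scheduler may evict lower-priority pods to free capacity, subject to PodDisruptionBudgets.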
Why Interviewers Ask This
This question tests whether a candidate understands pod placement mechanics, which is essential for optimizing resource utilization, ensuring high availability, and troubleshooting scheduling failures in production clusters.
Key Takeaways
- Scheduling is a two-phase process: filtering eliminates ineligible nodes, scoring ranks the rest
- Resource requests (not limits) drive scheduling decisions
- Taints, tolerations, affinity rules, and topology spread constraints provide fine-grained control
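The second takeaway is worth illustrating: the scheduler reserves node capacity based on requests only, while limits are enforced at runtime by the kubelet and container runtime. A container resources fragment sketching the distinction:

```yaml
# The scheduler filters and scores nodes against the requests (1 CPU, 2Gi);
# the limits (2 CPU, 4Gi) only cap runtime usage and play no role in placement.
resources:
  requests:
    cpu: "1"
    memory: "2Gi"
  limits:
    cpu: "2"
    memory: "4Gi"
```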