How does the kube-scheduler work and how does it decide where to place pods?

intermediate | architecture, devops, sre, cloud architect, CKA
TL;DR

The kube-scheduler watches for unscheduled pods and assigns them to nodes through a two-phase process: filtering (eliminating nodes that cannot run the pod) and scoring (ranking remaining nodes by preference). It considers resource requests, affinity rules, taints, tolerations, and topology constraints.

Detailed Answer

The kube-scheduler is the control plane component responsible for assigning pods to nodes. It watches the API server for newly created pods that have no spec.nodeName set and determines the best node for each pod to run on.

The Scheduling Cycle

The scheduler follows a two-phase approach for every pod:

Phase 1: Filtering (Predicates)

The scheduler eliminates nodes that cannot satisfy the pod's requirements. Common filters include:

  • Resource availability -- Does the node have enough CPU and memory to satisfy the pod's resource requests?
  • NodeSelector -- Does the node match the labels specified in the pod's nodeSelector?
  • Taints and tolerations -- Does the pod tolerate the node's taints?
  • Node affinity -- Does the node satisfy the pod's nodeAffinity rules?
  • Pod anti-affinity -- Would placing this pod violate any anti-affinity rules?
  • Volume constraints -- Can the requested persistent volumes be mounted on this node?

If no nodes pass filtering, the pod remains in a Pending state.
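Taints and tolerations are worth a concrete example, since they are the filter most often behind a Pending pod. The sketch below taints a node and then declares a matching toleration on a pod; the node name and the dedicated=gpu key/value are illustrative, not standard labels:

```yaml
# Taint a node so that only tolerating pods may be scheduled there:
#   kubectl taint nodes node-1 dedicated=gpu:NoSchedule
# (node-1 and dedicated=gpu are example names)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-tolerant
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx:1.27
```

Note that a toleration only allows the pod onto the tainted node; it does not require it. Pinning the pod to those nodes additionally needs a nodeSelector or node affinity.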

Phase 2: Scoring (Priorities)

Each remaining node is scored by multiple scoring plugins, each producing a value on a 0-100 scale. The scheduler sums the weighted scores and binds the pod to the node with the highest total. Scoring factors include:

  • NodeResourcesFit (formerly LeastRequestedPriority) -- Prefers nodes with more available resources
  • NodeResourcesBalancedAllocation -- Prefers nodes where CPU and memory utilization are balanced
  • ImageLocality -- Prefers nodes that already have the container image cached
  • InterPodAffinity -- Prefers nodes that satisfy preferred pod-affinity rules
  • PodTopologySpread -- Prefers nodes that improve distribution across topology domains
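Conceptually, the two phases reduce to "drop infeasible nodes, then pick the best of the rest". The following toy Python sketch illustrates that shape only; the real scheduler runs Go plugins through the Scheduling Framework, and the filter and scorer functions here are invented for the example:

```python
# Toy model of the scheduler's two-phase cycle (illustrative only).

def schedule(pod, nodes, filters, scorers):
    # Phase 1: filtering -- drop nodes that fail any predicate
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # no feasible node: the pod stays Pending

    # Phase 2: scoring -- each scorer returns 0-100; highest total wins
    def total_score(node):
        return sum(s(pod, node) for s in scorers)
    return max(feasible, key=total_score)

# Example: two nodes, one filter (enough free CPU),
# one scorer (prefer the least-requested node)
nodes = [
    {"name": "node-a", "cpu_free": 2.0},
    {"name": "node-b", "cpu_free": 8.0},
]
pod = {"cpu_request": 4.0}

filters = [lambda p, n: n["cpu_free"] >= p["cpu_request"]]
scorers = [lambda p, n: int(100 * (n["cpu_free"] - p["cpu_request"]) / n["cpu_free"])]

print(schedule(pod, nodes, filters, scorers)["name"])  # node-b
```

node-a is filtered out (2 CPUs free, 4 requested), so node-b wins without scoring even mattering; with more feasible nodes, the scoring loop breaks the tie.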

Practical Examples

Using nodeSelector for simple placement:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  nodeSelector:
    accelerator: nvidia-a100
  containers:
  - name: training
    image: pytorch/pytorch:latest
    resources:
      requests:
        cpu: "4"
        memory: "16Gi"

Using node affinity for more expressive rules:

apiVersion: v1
kind: Pod
metadata:
  name: web-frontend
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
            - us-east-1b
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: node-type
            operator: In
            values:
            - spot
  containers:
  - name: nginx
    image: nginx:1.27

Using topology spread constraints for even distribution:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: nginx
        image: nginx:1.27

Troubleshooting Scheduling

When a pod is stuck in Pending, the scheduler adds events explaining why:

# Check pod events for scheduling information
kubectl describe pod my-pending-pod

# Look for events like:
# "0/5 nodes are available: 2 Insufficient cpu, 3 node(s) had taint
#  {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate."

# View allocatable resources across all nodes
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU:.status.allocatable.cpu,\
MEM:.status.allocatable.memory

# Check resource usage per node
kubectl top nodes

# View scheduler logs for detailed decision info
# (the pod name varies by cluster; list pods in kube-system to find it)
kubectl logs -n kube-system kube-scheduler-controlplane

Scheduler Profiles and Plugins

The scheduler is built on a plugin-based architecture called the Scheduling Framework, which became stable in Kubernetes 1.19. You can configure scheduler profiles to customize which plugins are enabled at each extension point (the v1 configuration API shown below requires Kubernetes 1.25 or later):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: ImageLocality
      enabled:
      - name: MyCustomPlugin
        weight: 5

You can also run multiple schedulers simultaneously. Pods specify which scheduler to use:

spec:
  schedulerName: my-custom-scheduler

Priority and Preemption

When a high-priority pod cannot be scheduled, the scheduler can preempt lower-priority pods to make room. This is controlled through PriorityClasses:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-workload
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For mission-critical workloads"
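A pod opts into a PriorityClass via priorityClassName; without it, the pod gets the cluster's default priority. A minimal sketch referencing the class above (the pod name is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-api   # illustrative name
spec:
  priorityClassName: critical-workload
  containers:
  - name: app
    image: nginx:1.27
```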

Why Interviewers Ask This

This question tests whether a candidate understands pod placement mechanics, which is essential for optimizing resource utilization, ensuring high availability, and troubleshooting scheduling failures in production clusters.

Common Follow-Up Questions

What is the difference between node affinity and pod affinity?
Node affinity constrains which nodes a pod can be scheduled on based on node labels. Pod affinity/anti-affinity constrains scheduling based on what other pods are already running on a node.
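For instance, a common anti-affinity sketch keeps replicas of the same app off the same node (the app: web label is illustrative and must match the pods' own labels):

```yaml
# Pod-spec fragment: refuse any node already running a pod labeled app=web
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: web
      topologyKey: kubernetes.io/hostname
```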
How do you troubleshoot a pod stuck in Pending state?
Check kubectl describe pod for scheduling events, verify node resources with kubectl describe node, check for taints, and review resource requests against available capacity.
Can you write a custom scheduler?
Yes. Kubernetes supports multiple schedulers. You can deploy a custom scheduler and reference it via the schedulerName field in a pod spec.

Key Takeaways

  • Scheduling is a two-phase process: filtering eliminates ineligible nodes, scoring ranks the rest
  • Resource requests (not limits) drive scheduling decisions
  • Taints, tolerations, affinity rules, and topology spread constraints provide fine-grained control