How Do Pod Topology Spread Constraints Affect Scheduling?
Pod topology spread constraints control how Pods are distributed across topology domains during scheduling. They interact with other scheduling rules like node affinity and taints, and can be configured as cluster-wide defaults to enforce even distribution without per-Deployment configuration.
Detailed Answer
While the pods topic covers the basics of topology spread constraints, this answer focuses on how they interact with the scheduler, advanced parameters, cluster defaults, and real-world scheduling scenarios.
Scheduling Pipeline Interaction
Topology spread constraints are evaluated across the PreFilter, Filter, and Score phases of scheduling:
1. PreFilter: Calculate existing Pod distribution
2. Filter: Eliminate nodes where maxSkew would be violated (DoNotSchedule)
3. Score: Prefer nodes that minimize skew (ScheduleAnyway)
The constraint works after node affinity and taint filtering. If node affinity limits eligible nodes to zone-a, a zone-level spread constraint has no nodes in other zones to spread to.
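The Filter-phase check can be sketched as follows. This is an illustrative simplification, not the scheduler's actual code; the domain names and Pod counts are made up:

```python
# Sketch of the DoNotSchedule filter check: a candidate node's domain is
# rejected if placing the Pod there would push the skew above maxSkew.
from collections import Counter

def violates_max_skew(counts, candidate_domain, max_skew):
    """Return True if scheduling into candidate_domain would exceed max_skew."""
    after = Counter(counts)
    after[candidate_domain] += 1  # simulate placing the new Pod
    skew = max(after.values()) - min(after.values())
    return skew > max_skew

counts = {"zone-a": 3, "zone-b": 2, "zone-c": 2}  # existing matching Pods
print(violates_max_skew(counts, "zone-a", 1))  # True: 4/2/2 has skew 2
print(violates_max_skew(counts, "zone-b", 1))  # False: 3/3/2 has skew 1
```

With `DoNotSchedule`, nodes in zone-a would be filtered out entirely; with `ScheduleAnyway`, they would merely score lower.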
Interaction with Node Affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-type
operator: In
values: ["compute"]
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: web
If "compute" nodes exist in zones a, b, and c, Pods spread across all three. If "compute" nodes exist only in zone-a, the spread constraint effectively does nothing — there is only one domain.
minDomains Parameter
minDomains (beta since 1.25, stable since 1.30) prevents the constraint from being vacuously satisfied when there are too few topology domains:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
minDomains: 3
labelSelector:
matchLabels:
app: web
Without minDomains, if the cluster has only 1 zone, all Pods land there and maxSkew is trivially satisfied (0 skew). With minDomains: 3, the scheduler treats missing domains as having 0 Pods, potentially making the skew exceed maxSkew and blocking scheduling until 3 zones exist.
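The effect of minDomains on the skew computation can be sketched like this. The simplified rule (global minimum becomes 0 when the cluster has fewer eligible domains than minDomains) is an illustrative assumption; the data is made up:

```python
# Sketch of how minDomains changes the skew computation: when the number of
# domains with eligible nodes is below minDomains, the global minimum Pod
# count is treated as 0, as if the missing domains existed with 0 Pods.
def skew_with_min_domains(counts, min_domains):
    global_min = 0 if len(counts) < min_domains else min(counts.values())
    return max(counts.values()) - global_min

# One-zone cluster running 4 matching Pods:
print(skew_with_min_domains({"zone-a": 4}, min_domains=3))  # 4 — violates maxSkew: 1
print(skew_with_min_domains({"zone-a": 4}, min_domains=1))  # 0 — trivially satisfied
```

This is why a `DoNotSchedule` constraint with `minDomains: 3` keeps Pods pending in a single-zone cluster instead of silently piling them into one zone.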
matchLabelKeys for Rolling Updates
During a rolling update, the Deployment creates a new ReplicaSet. Old and new Pods have different pod-template-hash labels:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: web
matchLabelKeys:
- pod-template-hash
matchLabelKeys (beta since 1.27) tells the constraint to only count Pods with the same pod-template-hash as the Pod being scheduled. This means:
- New Pods are spread evenly across zones independently of old Pods
- Old Pods being terminated do not affect new Pod placement
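The filtering step can be sketched as follows. The Pod records and hash values are illustrative, not real cluster data:

```python
# Sketch of the matchLabelKeys effect: only Pods whose value for each listed
# key matches the incoming Pod's value are counted toward skew.
from collections import Counter

def counted_pods(pods, incoming_labels, match_label_keys):
    return [
        p for p in pods
        if all(p["labels"].get(k) == incoming_labels.get(k) for k in match_label_keys)
    ]

pods = [
    {"zone": "a", "labels": {"app": "web", "pod-template-hash": "old111"}},
    {"zone": "b", "labels": {"app": "web", "pod-template-hash": "old111"}},
    {"zone": "a", "labels": {"app": "web", "pod-template-hash": "new222"}},
]
incoming = {"app": "web", "pod-template-hash": "new222"}
counted = counted_pods(pods, incoming, ["pod-template-hash"])
print(Counter(p["zone"] for p in counted))  # only the new-ReplicaSet Pod counts
```

Without matchLabelKeys, the two old-ReplicaSet Pods would also be counted, and their gradual termination during the rollout would distort placement of the new Pods.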
Cluster-Wide Default Constraints
Configure default topology spread constraints in the scheduler config to enforce zone balance across all workloads:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
pluginConfig:
- name: PodTopologySpread
args:
defaultConstraints:
- maxSkew: 3
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
- maxSkew: 5
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
defaultingType: List
These apply to all Pods that do not define their own topology spread constraints. Use ScheduleAnyway for defaults to avoid blocking Pod scheduling unexpectedly.
nodeAffinityPolicy and nodeTaintsPolicy
These fields (beta since 1.26, enabled by default) control how node filtering interacts with topology spread:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: web
nodeAffinityPolicy: Honor # Only count Pods on nodes matching Pod's nodeAffinity
nodeTaintsPolicy: Honor # Only count Pods on nodes the Pod tolerates
| Policy | Behavior |
|--------|----------|
| Honor | Exclude nodes that don't match the Pod's affinity/tolerations from the skew calculation (default for nodeAffinityPolicy) |
| Ignore | Include all nodes in the skew calculation (default for nodeTaintsPolicy) |
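The difference can be sketched with a toy taint check. The node records and the simplified "taints must be a subset of tolerations" rule are illustrative assumptions:

```python
# Sketch of nodeTaintsPolicy Honor vs Ignore: with Honor, nodes whose taints
# the Pod does not tolerate are excluded from the domain counts entirely.
def domain_counts(nodes, tolerated_taints, policy):
    counts = {}
    for node in nodes:
        if policy == "Honor" and not set(node["taints"]) <= set(tolerated_taints):
            continue  # Pod cannot land here: node does not define a domain
        counts[node["zone"]] = counts.get(node["zone"], 0) + node["pods"]
    return counts

nodes = [
    {"zone": "a", "taints": [], "pods": 3},
    {"zone": "b", "taints": ["dedicated=gpu:NoSchedule"], "pods": 0},
]
print(domain_counts(nodes, tolerated_taints=[], policy="Ignore"))  # {'a': 3, 'b': 0}
print(domain_counts(nodes, tolerated_taints=[], policy="Honor"))   # {'a': 3}
```

Under Ignore, zone-b counts as an empty domain (skew 3, potentially blocking zone-a); under Honor, zone-b's tainted node drops out, leaving a single domain with skew 0.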
Scheduling Performance Impact
Topology spread adds computational cost to scheduling:
- Filter phase: The scheduler must evaluate Pod distribution across all topology domains
- Score phase: The scheduler ranks nodes by how much they improve balance
For large clusters (10,000+ Pods), heavy use of topology spread can slow scheduling. Mitigate by:
- Using ScheduleAnyway instead of DoNotSchedule where possible
- Limiting constraints to 1-2 topology keys
- Scoping labelSelector narrowly
Debugging Topology Spread Scheduling Failures
# Pod stuck pending — check events
kubectl describe pod web-abc -n production
# Events: "2 node(s) didn't match pod topology spread constraints"
# Check current distribution
kubectl get pods -l app=web -o wide --sort-by='.spec.nodeName'
# Check node topology labels
kubectl get nodes --show-labels | grep topology.kubernetes.io/zone
# Verify the constraint configuration
kubectl get deployment web -o jsonpath='{.spec.template.spec.topologySpreadConstraints}' | jq .
Real-World Configuration Example
A production Deployment with comprehensive scheduling constraints:
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-server
spec:
replicas: 9
selector:
matchLabels:
app: api
template:
metadata:
labels:
app: api
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: api
matchLabelKeys:
- pod-template-hash
- maxSkew: 2
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: api
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-type
operator: In
values: ["compute"]
containers:
- name: api
image: api-server:3.0
resources:
requests:
cpu: "500m"
memory: "512Mi"
This ensures:
- 3 Pods per zone (hard constraint, per ReplicaSet)
- At most 2-Pod difference between nodes (soft constraint)
- Only runs on "compute" nodes
Why Interviewers Ask This
This question explores the scheduling implications of topology spread — how it interacts with other constraints, impacts scheduling performance, and can be set as cluster defaults.
Key Takeaways
- Topology spread constraints work after node affinity filtering — they only spread across nodes that pass all other filters.
- Cluster-wide default constraints provide a safety net without requiring per-Deployment configuration.
- Use matchLabelKeys to ensure correct behavior during rolling updates.