How Do Pod Affinity and Anti-Affinity Work?

advanced | scheduling | devops | sre | CKA
TL;DR

Pod affinity schedules Pods near other Pods that match a label selector, while pod anti-affinity ensures Pods are spread apart. Both operate within a topology domain (node, zone, rack) and support required (hard) and preferred (soft) rules. Anti-affinity is commonly used to spread replicas across failure domains.

Detailed Answer

Pod Affinity: Co-locating Pods

Pod affinity attracts a Pod to nodes that already run Pods matching a specific label selector, within a defined topology domain. This is useful for placing related services close together to reduce network latency.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - cache
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: web-app:latest

This ensures every web-frontend Pod runs on a node that also hosts a Pod labeled app=cache. The topologyKey kubernetes.io/hostname scopes the rule to individual nodes. Because the rule is required, a web-frontend Pod stays Pending if no node runs a matching cache Pod.

Pod Anti-Affinity: Spreading Pods Apart

Pod anti-affinity ensures Pods are not co-located with other Pods matching a selector. This is the standard pattern for high availability.

Spread Replicas Across Nodes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - redis
              topologyKey: kubernetes.io/hostname
      containers:
        - name: redis
          image: redis:7

This guarantees that no two Redis Pods run on the same node. If there are only 2 nodes and 3 replicas, the third replica stays Pending.

Spread Replicas Across Zones

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - api-gateway
        topologyKey: topology.kubernetes.io/zone

This ensures each api-gateway replica lands in a different availability zone, so the service survives a single-zone failure. Note that with a required rule, the replica count cannot exceed the number of zones; any extra replicas stay Pending.

Soft Anti-Affinity (Preferred)

When strict spreading is not possible (e.g., more replicas than zones), use preferred anti-affinity:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - worker
                topologyKey: kubernetes.io/hostname
      containers:
        - name: worker
          image: worker:latest

The scheduler will try to spread workers across nodes but will allow multiple Pods per node if necessary.

Understanding topologyKey

The topologyKey is a node label that defines the scope of the affinity/anti-affinity rule:

| topologyKey | Scope | Use Case |
|---|---|---|
| kubernetes.io/hostname | Per node | Spread across individual nodes |
| topology.kubernetes.io/zone | Per AZ | Survive AZ failure |
| topology.kubernetes.io/region | Per region | Survive regional failure |
| kubernetes.io/os | Per OS | Separate Linux/Windows |
| Custom label (e.g., rack) | Per rack | Spread across racks |

# View topology labels on nodes
kubectl get nodes -L topology.kubernetes.io/zone,kubernetes.io/hostname
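
For a custom domain such as racks, any node label can serve as a topologyKey once it is applied consistently. A minimal sketch, assuming a hypothetical rack node label and an app=storage Pod label:

```yaml
# Hypothetical: nodes are labeled rack=rack-a, rack=rack-b, ...
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: storage        # hypothetical app label
        topologyKey: rack       # custom node label used as the domain
```

Nodes that lack the rack label have no topology value, so scheduling around unlabeled nodes can be surprising; label every node before relying on a custom key.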

Combining Affinity and Anti-Affinity

A common pattern: co-locate frontend with cache (affinity) while spreading frontend replicas across zones (anti-affinity):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - memcached
                topologyKey: kubernetes.io/hostname
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - frontend
              topologyKey: topology.kubernetes.io/zone
      containers:
        - name: frontend
          image: frontend:v2

Namespace Considerations

By default, pod affinity/anti-affinity only considers Pods in the same namespace as the Pod being scheduled. To match Pods in other namespaces:

podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
          - key: app
            operator: In
            values:
              - database
      topologyKey: kubernetes.io/hostname
      namespaces:
        - database-namespace
      # Or use namespaceSelector to match by namespace labels:
      # namespaceSelector:
      #   matchLabels:
      #     team: backend

Performance Considerations

Pod affinity and anti-affinity rules require the scheduler to evaluate all existing Pods that match the label selector in the relevant namespaces. In clusters with thousands of Pods, this can significantly slow scheduling. Best practices:

  1. Keep label selectors narrow to reduce the number of Pods evaluated.
  2. Prefer preferredDuringScheduling over requiredDuringScheduling when possible.
  3. Consider using topology spread constraints instead of anti-affinity for even distribution, as they are more efficient for the scheduler.
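
For the third point, a topology spread constraint expressing the same even-spread goal might look like this (a sketch using the stock Pod spec fields; maxSkew: 1 allows zone counts to differ by at most one Pod):

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway   # soft; DoNotSchedule makes it hard
      labelSelector:
        matchLabels:
          app: worker
```

Unlike binary anti-affinity, this aims for an even distribution rather than forbidding co-location outright.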

Debugging Scheduling Failures

# Check why a Pod is not being scheduled
kubectl describe pod frontend-abc123 | grep -A 10 Events

# Common messages:
# "didn't match pod affinity rules"
# "didn't match pod anti-affinity rules"
# "node(s) didn't match pod topology spread constraints"

Why Interviewers Ask This

Interviewers test whether you can design highly available deployments that spread replicas across zones and co-locate related services for performance.

Common Follow-Up Questions

What is a topologyKey?
A node label that defines the topology domain. kubernetes.io/hostname means per-node, topology.kubernetes.io/zone means per-zone.

What is the performance impact of pod affinity rules?
Pod affinity requires the scheduler to examine all existing Pods matching the selector in the relevant namespaces, which can slow scheduling in large clusters.

How is pod anti-affinity different from topology spread constraints?
Anti-affinity is binary (allow/deny). Topology spread constraints aim for an even distribution with a configurable maxSkew.

Key Takeaways

  • Pod affinity co-locates related Pods; anti-affinity separates them
  • The topologyKey defines the failure domain scope (node, zone, region)
  • Required anti-affinity with zone topologyKey is the standard HA pattern