How Do Pod Priority and Preemption Work?

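Pod Priority assigns a numerical priority value to Pods via PriorityClasses. Preemption allows the scheduler to evict lower-priority Pods to make room for higher-priority Pods when no node has sufficient resources. This makes it far more likely that critical workloads get scheduled when the cluster is full, although it is not an absolute guarantee: preemption only helps if evicting lower-priority Pods would actually free enough room.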

Detailed Answer

Pod Priority and Preemption is a Kubernetes scheduling feature that lets you assign relative importance to Pods. When the cluster runs out of resources, the scheduler can evict (preempt) lower-priority Pods to free up space for higher-priority ones.

PriorityClasses

A PriorityClass is a cluster-scoped resource that defines a priority value (an integer) and a preemption policy.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Used for production-critical services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
preemptionPolicy: Never
description: "Used for batch jobs that should not preempt other workloads"

Key fields:

value: An integer from -2,147,483,648 to 1,000,000,000. Higher values mean higher priority.
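globalDefault: If true, this PriorityClass is assigned to Pods that do not specify one. Only one global default can exist. If no PriorityClass is marked as the global default, Pods without a priorityClassName get priority 0.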
preemptionPolicy: PreemptLowerPriority (default) or Never.
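
As a sketch of the globalDefault field, the manifest below defines a cluster-wide default class. The name default-priority and the value 1000 are assumptions chosen for illustration; Kubernetes does not prescribe either.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority        # hypothetical name
value: 1000                     # illustrative; sits above low-priority (100)
globalDefault: true             # applied to Pods that set no priorityClassName
preemptionPolicy: PreemptLowerPriority
description: "Default for Pods that do not set priorityClassName"
```

A global default only affects Pods created after the class exists; the priority of Pods that are already running is not changed.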

Assigning Priority to Pods

apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical-api
  template:
    metadata:
      labels:
        app: critical-api
    spec:
      priorityClassName: high-priority
      containers:
        - name: api
          image: myapp/api:3.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"

How Preemption Works

When the scheduler cannot find a node with enough resources for a pending Pod:

The scheduler evaluates each node to determine if evicting lower-priority Pods would free enough resources.
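It selects the node where preemption causes the least disruption (for example, the fewest PodDisruptionBudget violations, the lowest-priority victims, and the fewest evictions).
Lower-priority Pods are evicted -- they receive a graceful termination signal and their terminationGracePeriodSeconds is respected.
The higher-priority Pod is scheduled once the resources are freed. The scheduler records the chosen node in the Pod's status.nominatedNodeName, but the Pod is not guaranteed to land there: if another node becomes available first, it may be scheduled on that node instead.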

High-priority Pod pending (no resources available)
    |
    v
Scheduler identifies node with low-priority Pods
    |
    v
Low-priority Pods are gracefully evicted
    |
    v
High-priority Pod is scheduled, usually on the freed node
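
Because victims are terminated gracefully, the pending high-priority Pod has to wait until they exit. A minimal sketch of a preemptible workload that keeps that wait short is shown below; the Pod name, image, and the 15-second grace period are assumptions for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                  # hypothetical name
spec:
  priorityClassName: low-priority     # class defined earlier in this answer
  terminationGracePeriodSeconds: 15   # illustrative; bounds how long a preemptor waits for this Pod
  containers:
    - name: worker
      image: myapp/batch:1.0          # hypothetical image
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
```

A long grace period on low-priority Pods directly delays the Pods that preempt them, so it is worth choosing the smallest value the workload can shut down within.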

Built-in System Priority Classes

Kubernetes includes two built-in PriorityClasses for system components:

| PriorityClass | Value | Use |
|---------------|-------|-----|
| system-node-critical | 2,000,001,000 | Node-level critical Pods (kube-proxy, CNI) |
| system-cluster-critical | 2,000,000,000 | Cluster-level critical Pods (CoreDNS, metrics-server) |

These values exceed the maximum user-configurable value (1,000,000,000), ensuring system components always take precedence.

# View all PriorityClasses
kubectl get priorityclasses

# See which priority a specific Pod has
kubectl get pod my-pod -o jsonpath='{.spec.priority}'

Non-Preempting Priority Classes

Setting preemptionPolicy: Never creates a priority class that influences scheduling queue order but does not evict other Pods. This is useful for workloads that are more important than others in the queue but should not displace running workloads.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-no-preempt
value: 500000
preemptionPolicy: Never
description: "Higher scheduling priority but no preemption"

With this class, Pods are scheduled ahead of lower-priority pending Pods but will wait if no resources are available rather than evicting running Pods. They can still be preempted themselves by higher-priority Pods.
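
A workload opts in through priorityClassName, exactly as with a preempting class. The Job below is a minimal sketch; its name and image are hypothetical.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report                # hypothetical name
spec:
  template:
    spec:
      priorityClassName: medium-no-preempt   # queues ahead of lower priorities, never evicts
      restartPolicy: OnFailure
      containers:
        - name: report
          image: myapp/report:1.0     # hypothetical image
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
```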

Preemption and PodDisruptionBudgets

Preemption respects PodDisruptionBudgets (PDBs) on a best-effort basis. The scheduler tries to avoid violating PDBs during preemption, but if no other option exists to schedule the higher-priority Pod, it may still evict Pods protected by a PDB.
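
For illustration, a PDB for the critical-api Deployment from earlier might look like the sketch below. The PDB name and the minAvailable value of 2 are assumptions; choose a value that matches your replica count.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-api-pdb     # hypothetical name
spec:
  minAvailable: 2            # illustrative; the Deployment above runs 3 replicas
  selector:
    matchLabels:
      app: critical-api      # must match the Pod labels of the protected workload
```

The scheduler prefers victims whose eviction keeps this budget intact, but as noted above, the budget does not block preemption outright.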

Priority vs. QoS Eviction

Priority and QoS serve different purposes at different layers:

| Mechanism | Layer | Trigger | Action |
|-----------|-------|---------|--------|
| Priority/Preemption | Scheduler | No resources for a high-priority Pod | Evict lower-priority Pods |
| QoS Eviction | Kubelet | Node under resource pressure | Evict Pods whose usage exceeds their requests first (typically BestEffort, then Burstable, with Guaranteed last) |

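During kubelet eviction (node pressure), the kubelet ranks Pods first by whether their usage of the starved resource exceeds their requests, then by Pod priority, and then by how far usage exceeds requests. QoS class is not consulted directly, but the outcome usually tracks it: BestEffort Pods and Burstable Pods running over their requests go first, and among those, lower-priority Pods are evicted first.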
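
The two mechanisms can be combined on one Pod. The sketch below sets requests equal to limits for every resource, which yields the Guaranteed QoS class, and also assigns a high priority; the Pod name, image, and resource sizes are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-db                 # hypothetical name
spec:
  priorityClassName: high-priority  # scheduler-level protection against preemption
  containers:
    - name: db
      image: myapp/db:1.0           # hypothetical image
      resources:
        requests:                   # requests == limits for CPU and memory
          cpu: "1"                  # gives the Pod the Guaranteed QoS class
          memory: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
```

Such a Pod is among the last candidates in both paths: few Pods outrank it for preemption, and under node pressure its usage cannot exceed its requests.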

Common Priority Architecture

A typical production cluster defines three to five priority tiers:

# Tier 1: Infrastructure (value: 1000000)
# CoreDNS, ingress controllers, monitoring

# Tier 2: Production services (value: 500000)
# Customer-facing APIs, databases

# Tier 3: Internal services (value: 250000)
# Internal tools, staging environments

# Tier 4: Batch jobs (value: 100, preemptionPolicy: Never)
# Data pipelines, ML training, CI/CD
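
Expressed as manifests, the tiers above could look like the following sketch. The values come from the outline; the class names are assumptions, so substitute your own naming scheme.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier1-infrastructure   # hypothetical name
value: 1000000
description: "CoreDNS, ingress controllers, monitoring"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier2-production       # hypothetical name
value: 500000
description: "Customer-facing APIs, databases"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier3-internal         # hypothetical name
value: 250000
description: "Internal tools, staging environments"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier4-batch            # hypothetical name
value: 100
preemptionPolicy: Never        # batch work waits instead of evicting others
description: "Data pipelines, ML training, CI/CD"
```

Leaving wide gaps between the values makes it easy to slot in a new tier later without renumbering the existing ones.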

Best Practices

Define a clear priority hierarchy across your organization with 3-5 tiers.
Use preemptionPolicy: Never for batch workloads so they do not disrupt running services.
Always set resource requests on Pods using priority classes -- the scheduler needs them to evaluate preemption.
Combine with PodDisruptionBudgets to reduce the impact of preemption on critical workloads, keeping in mind that the scheduler honors PDBs only on a best-effort basis.
Monitor preemption events across all namespaces using kubectl get events -A --field-selector reason=Preempted to detect unexpected evictions.
Do not set all Pods to high priority -- if everything is high priority, nothing is. Priority is relative.
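
One way to keep high priority scarce is a ResourceQuota scoped to a PriorityClass, which caps how much a namespace may run at that priority. The sketch below is an assumption-laden example: the namespace team-a, the quota name, and the limits are all hypothetical.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: high-priority-quota    # hypothetical name
  namespace: team-a            # hypothetical namespace
spec:
  hard:
    pods: "10"                 # illustrative caps for Pods using the class
    requests.cpu: "8"
    requests.memory: 16Gi
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["high-priority"]   # quota applies only to Pods with this class
```

With this in place, Pods in team-a that request high-priority are rejected at admission once the quota is used up, while Pods using other classes are unaffected by it.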