What Is the Kubernetes Descheduler?

advanced | scheduling | SRE | platform engineer | CKA
TL;DR

The Descheduler is a tool that evicts Pods that violate scheduling policies after they have been placed. It rebalances workloads when new nodes are added, removes Pods from underutilized nodes for consolidation, and enforces topology spread after cluster changes.

Detailed Answer

The Kubernetes Descheduler solves a problem the scheduler cannot: rebalancing Pods after initial placement. The scheduler makes decisions at Pod creation time, but the cluster state changes constantly — nodes are added and removed, workload patterns shift, and topology constraints may no longer be satisfied.

Why Descheduling Is Needed

Scenario: 3-node cluster, Deployment with 6 replicas (2 per node)

Node-1: [Pod1, Pod2]  Node-2: [Pod3, Pod4]  Node-3: [Pod5, Pod6]

→ New Node-4 added to the cluster
→ All Pods remain on Nodes 1-3 — Node-4 is empty
→ Descheduler evicts Pod6 from Node-3
→ Scheduler places Pod6 on Node-4
→ Result: Balanced distribution

Installation

helm repo add descheduler https://kubernetes-sigs.github.io/descheduler
helm install descheduler descheduler/descheduler \
  --namespace kube-system \
  --set schedule="*/5 * * * *"  # Run every 5 minutes
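In practice you usually pass a full policy through Helm values rather than individual `--set` flags. A minimal sketch, assuming the chart exposes a `deschedulerPolicy` value (verify key names against `helm show values descheduler/descheduler` for your chart version):

```yaml
# values.yaml — sketch; key names assume the descheduler Helm chart's
# `deschedulerPolicy` value; confirm with `helm show values`
kind: CronJob
schedule: "*/5 * * * *"        # run every 5 minutes
deschedulerPolicy:
  profiles:
    - name: default
      pluginConfig:
        - name: LowNodeUtilization
          args:
            thresholds:        # below these = underutilized
              cpu: 20
              memory: 20
            targetThresholds:  # above these = overutilized
              cpu: 50
              memory: 50
      plugins:
        balance:
          enabled:
            - LowNodeUtilization
```

Apply it with `helm upgrade --install descheduler descheduler/descheduler --namespace kube-system -f values.yaml`.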

Descheduler Strategies

LowNodeUtilization

Evicts Pods from overutilized nodes so they can be rescheduled on underutilized nodes:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig:
      - name: LowNodeUtilization
        args:
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50
          numberOfNodes: 3
    plugins:
      balance:
        enabled:
          - LowNodeUtilization

Nodes below thresholds are underutilized; nodes above targetThresholds are overutilized. Pods are evicted from overutilized nodes.

RemoveDuplicates

Evicts duplicate Pods so that at most one Pod owned by the same controller (ReplicaSet, StatefulSet, or Job) runs on each node:

pluginConfig:
  - name: RemoveDuplicates
    args:
      excludeOwnerKinds:
        - DaemonSet
plugins:
  balance:
    enabled:
      - RemoveDuplicates

RemovePodsViolatingTopologySpreadConstraint

Evicts Pods that no longer satisfy their topology spread constraints (e.g., after a node failure causes imbalance):

pluginConfig:
  - name: RemovePodsViolatingTopologySpreadConstraint
    args:
      constraints:
        - DoNotSchedule
        - ScheduleAnyway
plugins:
  balance:
    enabled:
      - RemovePodsViolatingTopologySpreadConstraint
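For reference, this plugin acts on Pods that declare constraints like the following. An illustrative Deployment fragment (the `app: web` labels and name are hypothetical):

```yaml
# If a node failure or scale-up skews the zone spread beyond maxSkew,
# the descheduler evicts Pods so the scheduler can restore balance
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                # hypothetical name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: nginx:1.25
```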

RemovePodsViolatingNodeAffinity

Evicts Pods whose node affinity rules are no longer satisfied (e.g., a node label was changed):

pluginConfig:
  - name: RemovePodsViolatingNodeAffinity
    args:
      nodeAffinityType:
        - requiredDuringSchedulingIgnoredDuringExecution
plugins:
  deschedule:
    enabled:
      - RemovePodsViolatingNodeAffinity

RemovePodsViolatingInterPodAntiAffinity

Evicts Pods whose inter-pod anti-affinity rules are violated (e.g., due to manual Pod scheduling):

plugins:
  deschedule:
    enabled:
      - RemovePodsViolatingInterPodAntiAffinity
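As an illustration, this is the kind of rule the plugin re-enforces (the `app: cache` label is hypothetical):

```yaml
# Illustrative podAntiAffinity in a Pod spec — if two app=cache Pods end up
# on the same node (e.g., via manual scheduling), the descheduler evicts one
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache                     # hypothetical label
        topologyKey: kubernetes.io/hostname
```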

HighNodeUtilization (Bin-Packing)

The opposite of LowNodeUtilization — evicts Pods from underutilized nodes to consolidate onto fewer nodes (useful for cost optimization with cluster autoscaler):

pluginConfig:
  - name: HighNodeUtilization
    args:
      thresholds:
        cpu: 20
        memory: 20
plugins:
  balance:
    enabled:
      - HighNodeUtilization
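Bin-packing only sticks if the scheduler also prefers packed nodes; otherwise evicted Pods are spread out again. The descheduler documentation recommends pairing HighNodeUtilization with the scheduler's MostAllocated scoring strategy — a sketch of the corresponding kube-scheduler configuration:

```yaml
# kube-scheduler configuration sketch — scores nodes so that evicted Pods
# land on already-busy nodes instead of being spread back out
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```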

Full Configuration Example

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: production
    pluginConfig:
      - name: DefaultEvictor
        args:
          evictLocalStoragePods: false
          evictSystemCriticalPods: false
          evictFailedBarePods: true
          nodeFit: true
      - name: LowNodeUtilization
        args:
          thresholds:
            cpu: 20
            memory: 20
          targetThresholds:
            cpu: 60
            memory: 60
      - name: RemovePodsViolatingTopologySpreadConstraint
        args:
          constraints:
            - DoNotSchedule
      - name: RemovePodsViolatingNodeAffinity
        args:
          nodeAffinityType:
            - requiredDuringSchedulingIgnoredDuringExecution
      - name: RemoveDuplicates
    plugins:
      balance:
        enabled:
          - LowNodeUtilization
          - RemovePodsViolatingTopologySpreadConstraint
          - RemoveDuplicates
      deschedule:
        enabled:
          - RemovePodsViolatingNodeAffinity
          - RemovePodsViolatingInterPodAntiAffinity
      filter:
        enabled:
          - DefaultEvictor

Safety Mechanisms

The Descheduler respects several safety boundaries:

  1. PodDisruptionBudgets: Will not evict Pods if it would violate a PDB
  2. System-critical Pods: Skips Pods with system-critical priority classes (system-cluster-critical, system-node-critical) by default
  3. Local storage: Skips Pods with local storage by default
  4. DaemonSet Pods: Never evicts DaemonSet-managed Pods
  5. Static Pods: Never evicts static Pods
  6. Pods without controllers: Skips bare Pods (no owner reference)
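A minimal PodDisruptionBudget that caps descheduler-driven disruption (the `app: web` label and name are hypothetical):

```yaml
# With this PDB in place, the descheduler will not evict a Pod if doing so
# would drop the number of available app=web Pods below 2
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb            # hypothetical name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```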

Deployment Modes

| Mode | How | When |
|------|-----|------|
| CronJob | Runs periodically (every 2-5 min) | Continuous rebalancing |
| Job | Runs once | After cluster scaling events |
| Deployment | Runs continuously | Real-time descheduling (newer approach) |

Monitoring the Descheduler

# Check descheduler logs
kubectl logs -n kube-system -l app=descheduler

# Monitor eviction events
kubectl get events -A --field-selector reason=Descheduled

# Check Pod disruption budget status
kubectl get pdb -A

Best Practices

  1. Always use PDBs on production workloads before enabling the Descheduler
  2. Start conservative — use a long interval and moderate thresholds
  3. Exclude stateful workloads initially until you understand the impact
  4. Monitor eviction rates — too many evictions indicate the thresholds are wrong
  5. Combine with Cluster Autoscaler — HighNodeUtilization + autoscaler scale-down can reduce costs

Why Interviewers Ask This

The default scheduler only makes placement decisions at scheduling time. The Descheduler addresses drift that occurs after initial placement — new nodes, changed topology, or violated policies.

Common Follow-Up Questions

Does the Descheduler reschedule Pods?
No — it only evicts Pods. The default scheduler then reschedules them. This means Pods must be managed by a controller (Deployment, StatefulSet) to be recreated.
How do you prevent the Descheduler from evicting critical Pods?
Use PodDisruptionBudgets to protect minimum availability. The Descheduler respects PDBs and will not evict Pods that would violate them.
What strategies does the Descheduler support?
LowNodeUtilization, RemoveDuplicates, RemovePodsViolatingTopologySpreadConstraint, RemovePodsViolatingNodeAffinity, RemovePodsViolatingInterPodAntiAffinity, and more.

Key Takeaways

  • The Descheduler evicts Pods that no longer satisfy scheduling policies — the scheduler then reschedules them.
  • It is essential for maintaining balance after cluster scaling events.
  • Always use PodDisruptionBudgets to protect workload availability during descheduling.
