What Is the Kubernetes Descheduler?

advanced | scheduling | SRE | platform engineer | CKA
TL;DR

The Descheduler is a tool that evicts Pods that violate scheduling policies after they have been placed. It rebalances workloads when new nodes are added, removes Pods from underutilized nodes for consolidation, and enforces topology spread after cluster changes.

Detailed Answer

The Kubernetes Descheduler solves a problem the scheduler cannot: rebalancing Pods after initial placement. The scheduler makes decisions at Pod creation time, but the cluster state changes constantly — nodes are added and removed, workload patterns shift, and topology constraints may no longer be satisfied.

Why Descheduling Is Needed

Scenario: 3-node cluster, Deployment with 6 replicas (2 per node)

Node-1: [Pod1, Pod2]  Node-2: [Pod3, Pod4]  Node-3: [Pod5, Pod6]

→ New Node-4 added to the cluster
→ All Pods remain on Nodes 1-3 — Node-4 is empty
→ Descheduler evicts Pod6 from Node-3
→ Scheduler places Pod6 on Node-4
→ Result: Balanced distribution

Installation

helm repo add descheduler https://kubernetes-sigs.github.io/descheduler
helm install descheduler descheduler/descheduler \
  --namespace kube-system \
  --set schedule="*/5 * * * *"  # Run every 5 minutes
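In practice you usually pass a full policy through Helm values rather than individual `--set` flags. A minimal sketch, assuming the chart exposes a `deschedulerPolicy` value (verify key names against `helm show values descheduler/descheduler` for your chart version):

```yaml
# values.yaml — sketch; key names assume the descheduler Helm chart's
# `deschedulerPolicy` value; confirm with `helm show values`
kind: CronJob
schedule: "*/5 * * * *"        # run every 5 minutes
deschedulerPolicy:
  profiles:
    - name: default
      pluginConfig:
        - name: LowNodeUtilization
          args:
            thresholds:        # below these = underutilized
              cpu: 20
              memory: 20
            targetThresholds:  # above these = overutilized
              cpu: 50
              memory: 50
      plugins:
        balance:
          enabled:
            - LowNodeUtilization
```

Apply it with `helm upgrade --install descheduler descheduler/descheduler --namespace kube-system -f values.yaml`.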

Descheduler Strategies

LowNodeUtilization

Evicts Pods from overutilized nodes so they can be rescheduled on underutilized nodes:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig:
      - name: LowNodeUtilization
        args:
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50
          numberOfNodes: 3
    plugins:
      balance:
        enabled:
          - LowNodeUtilization

Nodes below thresholds are underutilized; nodes above targetThresholds are overutilized. Pods are evicted from overutilized nodes.

RemoveDuplicates

Evicts duplicate Pods so that at most one Pod owned by the same controller (ReplicaSet, StatefulSet, or Job) runs on each node:

pluginConfig:
  - name: RemoveDuplicates
    args:
      excludeOwnerKinds:
        - DaemonSet
plugins:
  balance:
    enabled:
      - RemoveDuplicates

RemovePodsViolatingTopologySpreadConstraint

Evicts Pods that no longer satisfy their topology spread constraints (e.g., after a node failure causes imbalance):

pluginConfig:
  - name: RemovePodsViolatingTopologySpreadConstraint
    args:
      constraints:
        - DoNotSchedule
        - ScheduleAnyway
plugins:
  balance:
    enabled:
      - RemovePodsViolatingTopologySpreadConstraint
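For reference, this plugin acts on Pods that declare constraints like the following. An illustrative Deployment fragment (the `app: web` labels and name are hypothetical):

```yaml
# If a node failure or scale-up skews the zone spread beyond maxSkew,
# the descheduler evicts Pods so the scheduler can restore balance
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                # hypothetical name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: nginx:1.25
```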

RemovePodsViolatingNodeAffinity

Evicts Pods whose node affinity rules are no longer satisfied (e.g., a node label was changed):

pluginConfig:
  - name: RemovePodsViolatingNodeAffinity
    args:
      nodeAffinityType:
        - requiredDuringSchedulingIgnoredDuringExecution
plugins:
  deschedule:
    enabled:
      - RemovePodsViolatingNodeAffinity

RemovePodsViolatingInterPodAntiAffinity

Evicts Pods whose inter-pod anti-affinity rules are violated (e.g., due to manual Pod scheduling):

plugins:
  deschedule:
    enabled:
      - RemovePodsViolatingInterPodAntiAffinity
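As an illustration, this is the kind of rule the plugin re-enforces (the `app: cache` label is hypothetical):

```yaml
# Illustrative podAntiAffinity in a Pod spec — if two app=cache Pods end up
# on the same node (e.g., via manual scheduling), the descheduler evicts one
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache                     # hypothetical label
        topologyKey: kubernetes.io/hostname
```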

HighNodeUtilization (Bin-Packing)

The opposite of LowNodeUtilization — evicts Pods from underutilized nodes to consolidate onto fewer nodes (useful for cost optimization with cluster autoscaler):

pluginConfig:
  - name: HighNodeUtilization
    args:
      thresholds:
        cpu: 20
        memory: 20
plugins:
  balance:
    enabled:
      - HighNodeUtilization
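Bin-packing only sticks if the scheduler also prefers packed nodes; otherwise evicted Pods are spread out again. The descheduler documentation recommends pairing HighNodeUtilization with the scheduler's MostAllocated scoring strategy — a sketch of the corresponding kube-scheduler configuration:

```yaml
# kube-scheduler configuration sketch — scores nodes so that evicted Pods
# land on already-busy nodes instead of being spread back out
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```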

Full Configuration Example

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: production
    pluginConfig:
      - name: DefaultEvictor
        args:
          evictLocalStoragePods: false
          evictSystemCriticalPods: false
          evictFailedBarePods: true
          nodeFit: true
      - name: LowNodeUtilization
        args:
          thresholds:
            cpu: 20
            memory: 20
          targetThresholds:
            cpu: 60
            memory: 60
      - name: RemovePodsViolatingTopologySpreadConstraint
        args:
          constraints:
            - DoNotSchedule
      - name: RemovePodsViolatingNodeAffinity
        args:
          nodeAffinityType:
            - requiredDuringSchedulingIgnoredDuringExecution
      - name: RemoveDuplicates
    plugins:
      balance:
        enabled:
          - LowNodeUtilization
          - RemovePodsViolatingTopologySpreadConstraint
          - RemoveDuplicates
      deschedule:
        enabled:
          - RemovePodsViolatingNodeAffinity
          - RemovePodsViolatingInterPodAntiAffinity
      filter:
        enabled:
          - DefaultEvictor

Safety Mechanisms

The Descheduler respects several safety boundaries:

  1. PodDisruptionBudgets: Will not evict Pods if it would violate a PDB
  2. System-critical Pods: Skips Pods with system-critical priority classes (system-cluster-critical, system-node-critical) by default
  3. Local storage: Skips Pods with local storage by default
  4. DaemonSet Pods: Never evicts DaemonSet-managed Pods
  5. Static Pods: Never evicts static Pods
  6. Pods without controllers: Skips bare Pods (no owner reference)
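A minimal PodDisruptionBudget that caps descheduler-driven disruption (the `app: web` label and name are hypothetical):

```yaml
# With this PDB in place, the descheduler will not evict a Pod if doing so
# would drop the number of available app=web Pods below 2
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb            # hypothetical name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```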

Deployment Modes

| Mode | How | When |
|------|-----|------|
| CronJob | Runs periodically (every 2-5 min) | Continuous rebalancing |
| Job | Runs once | After cluster scaling events |
| Deployment | Runs continuously | Real-time descheduling (newer approach) |

Monitoring the Descheduler

# Check descheduler logs
kubectl logs -n kube-system -l app=descheduler

# Monitor eviction events
kubectl get events -A --field-selector reason=Descheduled

# Check Pod disruption budget status
kubectl get pdb -A

Best Practices

  1. Always use PDBs on production workloads before enabling the Descheduler
  2. Start conservative — use a long interval and moderate thresholds
  3. Exclude stateful workloads initially until you understand the impact
  4. Monitor eviction rates — too many evictions indicate the thresholds are wrong
  5. Combine with Cluster Autoscaler — HighNodeUtilization + autoscaler scale-down can reduce costs

Why Interviewers Ask This

The default scheduler only makes placement decisions at scheduling time. The Descheduler addresses drift that occurs after initial placement — new nodes, changed topology, or violated policies.

Common Follow-Up Questions

Does the Descheduler reschedule Pods?
No — it only evicts Pods. The default scheduler then reschedules them. This means Pods must be managed by a controller (Deployment, StatefulSet) to be recreated.
How do you prevent the Descheduler from evicting critical Pods?
Use PodDisruptionBudgets to protect minimum availability. The Descheduler respects PDBs and will not evict Pods that would violate them.
What strategies does the Descheduler support?
LowNodeUtilization, RemoveDuplicates, RemovePodsViolatingTopologySpreadConstraint, RemovePodsViolatingNodeAffinity, RemovePodsViolatingInterPodAntiAffinity, and more.

Key Takeaways

  • The Descheduler evicts Pods that no longer satisfy scheduling policies — the scheduler then reschedules them.
  • It is essential for maintaining balance after cluster scaling events.
  • Always use PodDisruptionBudgets to protect workload availability during descheduling.
