What Is the Kubernetes Descheduler?
The Descheduler is a tool that evicts Pods that violate scheduling policies after they have been placed. It rebalances workloads when new nodes are added, removes Pods from underutilized nodes for consolidation, and enforces topology spread after cluster changes.
Detailed Answer
The Kubernetes Descheduler solves a problem the scheduler cannot: rebalancing Pods after initial placement. The scheduler makes decisions at Pod creation time, but the cluster state changes constantly — nodes are added and removed, workload patterns shift, and topology constraints may no longer be satisfied.
Why Descheduling Is Needed
Scenario: 3-node cluster, Deployment with 6 replicas (2 per node)
```
Node-1: [Pod1, Pod2]   Node-2: [Pod3, Pod4]   Node-3: [Pod5, Pod6]

→ New Node-4 added to the cluster
→ All Pods remain on Nodes 1-3; Node-4 stays empty
→ Descheduler evicts Pod6 from Node-3
→ Scheduler places Pod6 on Node-4
→ Result: balanced distribution
```
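The evict-and-reschedule cycle above can be sketched as a small simulation. This is purely illustrative, not descheduler code; the function and variable names are invented, and "scheduling" is reduced to placing the evicted Pod on the least-loaded node:

```python
def rebalance_step(nodes: dict) -> dict:
    """Evict one Pod from the most-loaded node; the 'scheduler' then
    places it on the least-loaded node. Stops once spread is even."""
    most = max(nodes, key=lambda n: len(nodes[n]))
    least = min(nodes, key=lambda n: len(nodes[n]))
    if len(nodes[most]) - len(nodes[least]) <= 1:
        return nodes  # already balanced
    pod = nodes[most].pop()   # descheduler evicts
    nodes[least].append(pod)  # scheduler re-places
    return nodes

cluster = {
    "node-1": ["pod1", "pod2"],
    "node-2": ["pod3", "pod4"],
    "node-3": ["pod5", "pod6"],
    "node-4": [],              # newly added, empty
}
for _ in range(3):
    cluster = rebalance_step(cluster)
print({n: len(p) for n, p in cluster.items()})
```

After the loop, no node differs from another by more than one Pod, which is the balanced distribution the scenario describes.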
Installation
```shell
helm repo add descheduler https://kubernetes-sigs.github.io/descheduler
helm install descheduler descheduler/descheduler \
  --namespace kube-system \
  --set schedule="*/5 * * * *"   # Run every 5 minutes
```
Descheduler Strategies
LowNodeUtilization
Evicts Pods from overutilized nodes so they can be rescheduled on underutilized nodes:
```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig:
      - name: LowNodeUtilization
        args:
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50
          numberOfNodes: 3
    plugins:
      balance:
        enabled:
          - LowNodeUtilization
```
Nodes below thresholds are underutilized; nodes above targetThresholds are overutilized. Pods are evicted from overutilized nodes.
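The threshold logic can be expressed as a short classification sketch. This is a simplification with invented names (the real plugin computes percentages from Pod resource requests against node allocatable): a node is underutilized only if it is below the thresholds for all resources, and overutilized if it exceeds a target threshold for any resource.

```python
# Values mirror the policy above (percentages of allocatable resources).
THRESHOLDS = {"cpu": 20, "memory": 20, "pods": 20}
TARGET_THRESHOLDS = {"cpu": 50, "memory": 50, "pods": 50}

def classify(usage: dict) -> str:
    """usage maps resource name -> percentage currently in use."""
    if all(usage[r] < THRESHOLDS[r] for r in THRESHOLDS):
        return "underutilized"        # candidate target for rescheduled Pods
    if any(usage[r] > TARGET_THRESHOLDS[r] for r in TARGET_THRESHOLDS):
        return "overutilized"         # Pods will be evicted from here
    return "appropriately utilized"   # left alone

print(classify({"cpu": 10, "memory": 15, "pods": 5}))   # underutilized
print(classify({"cpu": 70, "memory": 40, "pods": 30}))  # overutilized
print(classify({"cpu": 30, "memory": 30, "pods": 30}))  # appropriately utilized
```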
RemoveDuplicates
Ensures that at most one Pod with the same owner (e.g., the same ReplicaSet) runs on each node:
```yaml
pluginConfig:
  - name: RemoveDuplicates
    args:
      excludeOwnerKinds:
        - DaemonSet
plugins:
  balance:
    enabled:
      - RemoveDuplicates
```
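The idea behind RemoveDuplicates can be sketched as follows: for each (node, owner) pair, keep one Pod and mark the rest for eviction, skipping excluded owner kinds. All names here are invented for illustration:

```python
def duplicates_to_evict(pods, exclude_owner_kinds=("DaemonSet",)):
    """pods: list of (pod_name, node, owner_kind, owner_name) tuples."""
    seen = set()
    evict = []
    for name, node, kind, owner in pods:
        if kind in exclude_owner_kinds:
            continue  # DaemonSet Pods are expected on every node
        key = (node, kind, owner)
        if key in seen:
            evict.append(name)  # second Pod of the same owner on this node
        else:
            seen.add(key)
    return evict

pods = [
    ("web-1", "node-1", "ReplicaSet", "web"),
    ("web-2", "node-1", "ReplicaSet", "web"),   # duplicate on node-1
    ("web-3", "node-2", "ReplicaSet", "web"),
    ("logd-1", "node-1", "DaemonSet", "logd"),  # excluded owner kind
]
print(duplicates_to_evict(pods))  # → ['web-2']
```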
RemovePodsViolatingTopologySpreadConstraint
Evicts Pods that no longer satisfy their topology spread constraints (e.g., after a node failure causes imbalance):
```yaml
pluginConfig:
  - name: RemovePodsViolatingTopologySpreadConstraint
    args:
      constraints:
        - DoNotSchedule
        - ScheduleAnyway
plugins:
  balance:
    enabled:
      - RemovePodsViolatingTopologySpreadConstraint
```
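The underlying check can be sketched as a skew calculation. This is simplified (the real check considers each constraint's `maxSkew` and all eligible topology domains, including empty ones); names are invented:

```python
from collections import Counter

def violates_spread(pod_zones, max_skew=1):
    """pod_zones: the zone label of each replica. Skew is the difference
    between the most- and least-populated zones."""
    counts = Counter(pod_zones)
    skew = max(counts.values()) - min(counts.values())
    return skew > max_skew

# Evenly spread across three zones: skew 0, no violation
print(violates_spread(["a", "a", "b", "b", "c", "c"]))   # False
# After a zone outage, replacements piled into zone "a": skew 3, violation
print(violates_spread(["a", "a", "a", "a", "b", "b", "c"]))  # True
```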
RemovePodsViolatingNodeAffinity
Evicts Pods whose node affinity rules are no longer satisfied (e.g., a node label was changed):
```yaml
plugins:
  deschedule:
    enabled:
      - RemovePodsViolatingNodeAffinity
```
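The core idea is re-checking a Pod's required affinity terms against the node's current labels. A minimal sketch, assuming simple equality matching (real node affinity supports operators like `In`, `NotIn`, and `Exists`); names are invented:

```python
def still_satisfied(required_labels: dict, node_labels: dict) -> bool:
    """True if every required label still matches the node."""
    return all(node_labels.get(k) == v for k, v in required_labels.items())

pod_affinity = {"disktype": "ssd"}
node = {"disktype": "ssd", "zone": "a"}
print(still_satisfied(pod_affinity, node))  # True: Pod stays

node["disktype"] = "hdd"                    # label changed after placement
print(still_satisfied(pod_affinity, node))  # False: Pod becomes evictable
```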
RemovePodsViolatingInterPodAntiAffinity
Evicts Pods whose inter-pod anti-affinity rules are violated (e.g., due to manual Pod scheduling):
```yaml
plugins:
  deschedule:
    enabled:
      - RemovePodsViolatingInterPodAntiAffinity
```
HighNodeUtilization (Bin-Packing)
The opposite of LowNodeUtilization — evicts Pods from underutilized nodes to consolidate onto fewer nodes (useful for cost optimization with cluster autoscaler):
```yaml
pluginConfig:
  - name: HighNodeUtilization
    args:
      thresholds:
        cpu: 20
        memory: 20
plugins:
  balance:
    enabled:
      - HighNodeUtilization
```
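The bin-packing intent can be sketched as identifying nearly idle nodes whose Pods should be evicted so the autoscaler can remove the node. This is an invented illustration, not the plugin's actual implementation (which works on Pod resource requests):

```python
THRESHOLDS = {"cpu": 20, "memory": 20}  # below all of these → underutilized

def drain_candidates(nodes: dict) -> list:
    """nodes maps node name -> usage percentages. Underutilized nodes
    become candidates: their Pods are evicted and packed elsewhere."""
    return [
        name for name, usage in nodes.items()
        if all(usage[r] < THRESHOLDS[r] for r in THRESHOLDS)
    ]

nodes = {
    "node-1": {"cpu": 60, "memory": 55},
    "node-2": {"cpu": 10, "memory": 15},   # nearly idle
    "node-3": {"cpu": 45, "memory": 50},
}
print(drain_candidates(nodes))  # → ['node-2']
```

Once node-2's Pods are evicted and rescheduled onto nodes 1 and 3, the cluster autoscaler can scale the empty node away.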
Full Configuration Example
```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: production
    pluginConfig:
      - name: LowNodeUtilization
        args:
          thresholds:
            cpu: 20
            memory: 20
          targetThresholds:
            cpu: 60
            memory: 60
      - name: RemovePodsViolatingTopologySpreadConstraint
        args:
          constraints:
            - DoNotSchedule
      - name: RemoveDuplicates
      - name: DefaultEvictor
        args:
          evictLocalStoragePods: false
          evictSystemCriticalPods: false
          evictFailedBarePods: true
          nodeFit: true
    plugins:
      balance:
        enabled:
          - LowNodeUtilization
          - RemovePodsViolatingTopologySpreadConstraint
          - RemoveDuplicates
      deschedule:
        enabled:
          - RemovePodsViolatingNodeAffinity
          - RemovePodsViolatingInterPodAntiAffinity
      filter:
        enabled:
          - DefaultEvictor
```
Safety Mechanisms
The Descheduler respects several safety boundaries:
- PodDisruptionBudgets: will not evict a Pod if doing so would violate its PDB
- System-critical Pods: skips Pods with system-critical priority classes by default (`evictSystemCriticalPods: false`)
- Local storage: skips Pods using local storage by default (`evictLocalStoragePods: false`)
- DaemonSet Pods: never evicts DaemonSet-managed Pods
- Static Pods: never evicts static Pods
- Pods without controllers: skips bare Pods with no owner reference, since nothing would recreate them
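The checks above can be collapsed into a single predicate, in the spirit of the DefaultEvictor filter. The `Pod` stand-in and its fields are invented, and the namespace check is a crude stand-in for the real priority-class-based system-critical check:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Pod:
    name: str
    namespace: str
    owner_kind: Optional[str]  # None → bare Pod with no controller
    is_static: bool = False

def is_evictable(pod: Pod) -> bool:
    if pod.owner_kind is None:          # bare Pod: nothing would recreate it
        return False
    if pod.owner_kind == "DaemonSet":   # DaemonSet Pods belong on their node
        return False
    if pod.is_static:                   # static Pods are managed by the kubelet
        return False
    if pod.namespace == "kube-system":  # crude proxy for system-critical
        return False
    return True

print(is_evictable(Pod("web-1", "default", "ReplicaSet")))   # True
print(is_evictable(Pod("etcd", "kube-system", None, True)))  # False
```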
Deployment Modes
| Mode | How | When |
|------|-----|------|
| CronJob | Runs periodically (every 2-5 min) | Continuous rebalancing |
| Job | Runs once | After cluster scaling events |
| Deployment | Runs continuously | Real-time descheduling (newer approach) |
Monitoring the Descheduler
```shell
# Check descheduler logs
kubectl logs -n kube-system -l app=descheduler

# Monitor eviction events
kubectl get events -A --field-selector reason=Descheduled

# Check PodDisruptionBudget status
kubectl get pdb -A
```
Best Practices
- Always use PDBs on production workloads before enabling the Descheduler
- Start conservative — use a long interval and moderate thresholds
- Exclude stateful workloads initially until you understand the impact
- Monitor eviction rates — too many evictions indicate the thresholds are wrong
- Combine with Cluster Autoscaler — HighNodeUtilization + autoscaler scale-down can reduce costs
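The first practice can be made concrete with a minimal PodDisruptionBudget. The name and label here are illustrative; `minAvailable: 2` guarantees that voluntary disruptions, including Descheduler evictions, never drop the workload below two available replicas:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb           # illustrative name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web            # must match the workload's Pod labels
```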
Why Interviewers Ask This
The default scheduler only makes placement decisions at scheduling time. The Descheduler addresses drift that occurs after initial placement — new nodes, changed topology, or violated policies.
Key Takeaways
- The Descheduler evicts Pods that no longer satisfy scheduling policies — the scheduler then reschedules them.
- It is essential for maintaining balance after cluster scaling events.
- Always use PodDisruptionBudgets to protect workload availability during descheduling.