How Do Custom Schedulers Work in Kubernetes?
Kubernetes supports running multiple schedulers simultaneously. You can deploy a custom scheduler that implements specialized placement logic and direct specific Pods to use it via the schedulerName field.
Detailed Answer
The default Kubernetes scheduler handles most workloads well, but some scenarios require custom scheduling logic — GPU-aware placement, gang scheduling for batch jobs, or cost-optimized node selection. Kubernetes supports multiple schedulers running simultaneously.
Directing Pods to a Custom Scheduler
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  schedulerName: gpu-scheduler
  containers:
  - name: trainer
    image: training-job:1.0
    resources:
      requests:
        cpu: "4"
        memory: "16Gi"
        nvidia.com/gpu: "2"
      limits:
        nvidia.com/gpu: "2"  # extended resources must also be set in limits
```
If schedulerName is not set, the Pod is handled by default-scheduler. If the named scheduler does not exist, the Pod remains Pending indefinitely, because no scheduler ever attempts to bind it.
Approaches to Custom Scheduling
There are three levels of customization, from simplest to most complex:
1. Scheduler Extenders (Deprecated Path)
Extenders are HTTP webhooks that the default scheduler calls at filter and score phases:
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
- urlPrefix: "http://gpu-extender.kube-system:8080"
  filterVerb: "filter"
  prioritizeVerb: "prioritize"
  weight: 5
  enableHTTPS: false
  managedResources:
  - name: "nvidia.com/gpu"
    ignoredByScheduler: true
```
Extenders are simple but add HTTP round-trip latency to every scheduling decision.
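For illustration, the filter webhook can be sketched in Go. The structs below mirror a subset of the real wire types in k8s.io/kube-scheduler/extender/v1 (so the sketch stays self-contained), and hasFreeGPU is a hypothetical predicate standing in for real capacity logic:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Simplified mirrors of the extender wire types in
// k8s.io/kube-scheduler/extender/v1 (only the fields this sketch needs).
type ExtenderArgs struct {
	NodeNames *[]string `json:"nodenames,omitempty"`
}

type ExtenderFilterResult struct {
	NodeNames *[]string `json:"nodenames,omitempty"`
	Error     string    `json:"error,omitempty"`
}

// filterNodes keeps only the candidate nodes that pass the custom
// check; hasFreeGPU is a hypothetical stand-in for real logic.
func filterNodes(args ExtenderArgs, hasFreeGPU func(string) bool) ExtenderFilterResult {
	kept := []string{}
	if args.NodeNames != nil {
		for _, name := range *args.NodeNames {
			if hasFreeGPU(name) {
				kept = append(kept, name)
			}
		}
	}
	return ExtenderFilterResult{NodeNames: &kept}
}

// filterHandler decodes ExtenderArgs from the scheduler's POST,
// applies filterNodes, and writes the result back as JSON.
func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	result := filterNodes(args, func(string) bool { return true }) // placeholder predicate
	json.NewEncoder(w).Encode(result)
}

func main() {
	nodes := []string{"node-a", "node-b"}
	res := filterNodes(ExtenderArgs{NodeNames: &nodes}, func(n string) bool { return n == "node-b" })
	fmt.Println(*res.NodeNames) // [node-b]

	// To serve it at the urlPrefix + filterVerb configured above:
	// http.HandleFunc("/filter", filterHandler)
	// http.ListenAndServe(":8080", nil)
}
```

The handler path must match the configured urlPrefix plus filterVerb, and the scheduler blocks on the HTTP response, which is exactly where the extra latency comes from.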
2. Scheduling Framework Plugins (Recommended)
The scheduling framework (alpha in Kubernetes 1.15, stable since 1.19) defines extension points throughout the scheduling cycle:
Scheduling Cycle:
PreFilter → Filter → PostFilter → PreScore → Score → Reserve → Permit
Binding Cycle:
PreBind → Bind → PostBind
You implement a Go interface for the desired extension point:
```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// GPUAwarePlugin prefers nodes whose GPU type matches the type the
// Pod requests via the "preferred-gpu-type" annotation.
type GPUAwarePlugin struct {
	// handle is injected by the plugin's factory function when the
	// scheduler starts; it provides access to the node snapshot.
	handle framework.Handle
}

var _ framework.ScorePlugin = &GPUAwarePlugin{}

func (p *GPUAwarePlugin) Name() string {
	return "GPUAware"
}

func (p *GPUAwarePlugin) Score(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	nodeName string,
) (int64, *framework.Status) {
	// Look up the candidate node in the scheduling snapshot.
	nodeInfo, err := p.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.AsStatus(err)
	}

	// Prefer nodes whose GPU type label matches the Pod's request.
	// ("gpu-type" is an example label; use whatever your nodes carry.)
	gpuType := nodeInfo.Node().Labels["gpu-type"]
	requestedGPU := pod.Annotations["preferred-gpu-type"]
	if gpuType == requestedGPU {
		return 100, nil
	}
	return 50, nil
}

func (p *GPUAwarePlugin) ScoreExtensions() framework.ScoreExtensions {
	return nil
}
```
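Returning nil from ScoreExtensions skips the normalize pass. When a plugin does implement it, NormalizeScore rescales each node's raw score into the framework's 0–100 range before weights are applied. A self-contained sketch of that idea (simplified types, not the real framework interfaces):

```go
package main

import "fmt"

// normalizeScores rescales raw per-node scores into 0-100,
// mirroring what a NormalizeScore extension typically does.
func normalizeScores(scores []int64) []int64 {
	var max int64
	for _, s := range scores {
		if s > max {
			max = s
		}
	}
	if max == 0 {
		return scores // nothing to scale
	}
	out := make([]int64, len(scores))
	for i, s := range scores {
		out[i] = s * 100 / max
	}
	return out
}

func main() {
	fmt.Println(normalizeScores([]int64{20, 50, 200})) // [10 25 100]
}
```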
3. Scheduler Profiles (Multiple Schedulers in One Binary)
Instead of deploying multiple scheduler binaries, you can configure multiple profiles in a single scheduler:
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
      - name: InterPodAffinity
- schedulerName: gpu-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
      - name: GPUAware
      disabled:
      - name: InterPodAffinity
- schedulerName: batch-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
    preFilter:
      enabled:
      - name: GangScheduling
```
Pods set schedulerName to default-scheduler, gpu-scheduler, or batch-scheduler, and the single scheduler binary routes them to the appropriate profile.
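The routing itself is just a lookup keyed by schedulerName. A simplified sketch of the idea (illustrative types, not the real kube-scheduler internals):

```go
package main

import "fmt"

// Profile stands in for one entry of the profiles list above:
// a schedulerName plus the plugin set enabled for it.
type Profile struct {
	Name         string
	ScorePlugins []string
}

// PodSpec carries only the field the router cares about.
type PodSpec struct {
	SchedulerName string
}

// routeToProfile picks the profile whose name matches the Pod's
// spec.schedulerName. A Pod that matches no profile is simply
// ignored by this scheduler instance and stays Pending.
func routeToProfile(profiles map[string]Profile, pod PodSpec) (Profile, bool) {
	p, ok := profiles[pod.SchedulerName]
	return p, ok
}

func main() {
	profiles := map[string]Profile{
		"default-scheduler": {Name: "default-scheduler", ScorePlugins: []string{"NodeResourcesFit", "InterPodAffinity"}},
		"gpu-scheduler":     {Name: "gpu-scheduler", ScorePlugins: []string{"NodeResourcesFit", "GPUAware"}},
	}
	p, ok := routeToProfile(profiles, PodSpec{SchedulerName: "gpu-scheduler"})
	fmt.Println(ok, p.ScorePlugins) // true [NodeResourcesFit GPUAware]
}
```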
Deploying a Custom Scheduler
When running a fully custom scheduler as a separate Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-scheduler
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      component: custom-scheduler
  template:
    metadata:
      labels:
        component: custom-scheduler
    spec:
      serviceAccountName: custom-scheduler
      containers:
      - name: scheduler
        image: my-custom-scheduler:1.0
        command:
        - /usr/local/bin/kube-scheduler
        - --config=/etc/scheduler/config.yaml
        - --leader-elect=true
        - --leader-elect-resource-name=custom-scheduler
        volumeMounts:
        - name: config
          mountPath: /etc/scheduler
        resources:
          requests:
            cpu: "100m"
            memory: "256Mi"
      volumes:
      - name: config
        configMap:
          name: custom-scheduler-config
```
RBAC for Custom Schedulers
The scheduler needs permissions to read Pods, Nodes, and PersistentVolumes, and to create Bindings and Events:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: custom-scheduler
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: [""]
  resources: ["pods/binding"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["pods/status"]
  verbs: ["update", "patch"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["persistentvolumes", "persistentvolumeclaims"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses", "csinodes"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "create", "update"]
```
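To tie these permissions to the Deployment's ServiceAccount (named custom-scheduler above), a matching ServiceAccount and ClusterRoleBinding would look like this:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: custom-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-scheduler
subjects:
- kind: ServiceAccount
  name: custom-scheduler
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: custom-scheduler
  apiGroup: rbac.authorization.k8s.io
```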
When to Use Custom Scheduling
| Use Case | Approach |
|----------|----------|
| Simple priority adjustments | Pod priority classes (no custom scheduler needed) |
| GPU or hardware-aware placement | Scheduler plugin or extender |
| Gang scheduling (all-or-nothing) | Custom scheduler with coscheduling plugin |
| Cost-optimized spot instance placement | Score plugin preferring cheaper nodes |
| Multi-tenant fairness | Custom queue-based scheduler |
Debugging Custom Schedulers
```shell
# Check which scheduler is assigned to a Pod
kubectl get pod gpu-training -o jsonpath='{.spec.schedulerName}'

# Check scheduler logs
kubectl logs -n kube-system -l component=custom-scheduler

# Verify the scheduler is running
kubectl get pods -n kube-system -l component=custom-scheduler

# Check for scheduling events
kubectl describe pod gpu-training | grep -A 5 Events
```
Why Interviewers Ask This
This question evaluates your understanding of the scheduler's extensibility model, which is relevant for specialized workloads like GPU scheduling, batch processing, and multi-tenant clusters.
Key Takeaways
- Use the schedulerName field on a Pod to direct it to a specific scheduler.
- The scheduling framework with plugins is the modern approach — prefer it over writing a scheduler from scratch.
- Multiple scheduler profiles can run in a single scheduler binary, reducing operational complexity.