How Do Custom Schedulers Work in Kubernetes?

advanced | architecture | sre | platform engineer | CKA
TL;DR

Kubernetes supports running multiple schedulers simultaneously. You can deploy a custom scheduler that implements specialized placement logic and direct specific Pods to use it via the schedulerName field.

Detailed Answer

The default Kubernetes scheduler handles most workloads well, but some scenarios require custom scheduling logic — GPU-aware placement, gang scheduling for batch jobs, or cost-optimized node selection. Kubernetes supports multiple schedulers running simultaneously.

Directing Pods to a Custom Scheduler

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  schedulerName: gpu-scheduler
  containers:
    - name: trainer
      image: training-job:1.0
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"
        limits:
          nvidia.com/gpu: "2"

If schedulerName is not set, the Pod uses default-scheduler. If the named scheduler does not exist, the Pod remains Pending indefinitely — no error is raised, so a typo in schedulerName is a common cause of stuck Pods.

Approaches to Custom Scheduling

There are three levels of customization, from simplest to most complex:

1. Scheduler Extenders (Deprecated Path)

Extenders are HTTP webhooks that the default scheduler calls at filter and score phases:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
  - urlPrefix: "http://gpu-extender.kube-system:8080"
    filterVerb: "filter"
    prioritizeVerb: "prioritize"
    weight: 5
    enableHTTPS: false
    managedResources:
      - name: "nvidia.com/gpu"
        ignoredByScheduler: true

Extenders are simple but add HTTP round-trip latency to every scheduling decision.
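To make the protocol concrete, here is a minimal sketch of a filter extender in Go. It uses simplified local stand-ins for the wire types (the real definitions live in k8s.io/kube-scheduler/extender/v1 as ExtenderArgs and ExtenderFilterResult); the struct fields, label key, and handler here are illustrative, not the exact upstream schema.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Simplified stand-ins for the extender wire types
// (real versions: k8s.io/kube-scheduler/extender/v1).
type Node struct {
	Name   string            `json:"name"`
	Labels map[string]string `json:"labels"`
}

type ExtenderArgs struct {
	Nodes []Node `json:"nodes"`
}

type ExtenderFilterResult struct {
	NodeNames   []string          `json:"nodeNames"`
	FailedNodes map[string]string `json:"failedNodes"`
}

// filterGPUNodes keeps nodes that advertise a gpu-type label and
// records a reason for every node it rejects.
func filterGPUNodes(nodes []Node) ExtenderFilterResult {
	result := ExtenderFilterResult{FailedNodes: map[string]string{}}
	for _, n := range nodes {
		if _, ok := n.Labels["gpu-type"]; ok {
			result.NodeNames = append(result.NodeNames, n.Name)
		} else {
			result.FailedNodes[n.Name] = "node has no gpu-type label"
		}
	}
	return result
}

// filterHandler is what the scheduler's filterVerb URL would hit.
func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	json.NewEncoder(w).Encode(filterGPUNodes(args.Nodes))
}

func main() {
	// Exercise the handler in-process instead of binding a real port.
	srv := httptest.NewServer(http.HandlerFunc(filterHandler))
	defer srv.Close()

	body, _ := json.Marshal(ExtenderArgs{Nodes: []Node{
		{Name: "gpu-node-1", Labels: map[string]string{"gpu-type": "a100"}},
		{Name: "cpu-node-1"},
	}})
	resp, _ := http.Post(srv.URL, "application/json", bytes.NewReader(body))
	var result ExtenderFilterResult
	json.NewDecoder(resp.Body).Decode(&result)
	fmt.Println(result.NodeNames) // [gpu-node-1]
}
```

In production the extender would be a long-running HTTP server at the urlPrefix configured above, and each request/response round trip adds to scheduling latency — which is exactly the drawback noted here.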

2. Scheduling Framework Plugins (Recommended)

The scheduling framework (stable since Kubernetes 1.19) defines extension points throughout the scheduling cycle:

Scheduling Cycle:
  PreFilter → Filter → PostFilter → PreScore → Score → Reserve → Permit

Binding Cycle:
  PreBind → Bind → PostBind

You implement a Go interface for the desired extension point:

package main

import (
    "context"

    v1 "k8s.io/api/core/v1"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

// GPUAwarePlugin scores nodes higher when their GPU type matches
// the Pod's preference. The framework.Handle provides access to
// the scheduler's node snapshot.
type GPUAwarePlugin struct {
    handle framework.Handle
}

func (p *GPUAwarePlugin) Name() string {
    return "GPUAware"
}

func (p *GPUAwarePlugin) Score(
    ctx context.Context,
    state *framework.CycleState,
    pod *v1.Pod,
    nodeName string,
) (int64, *framework.Status) {
    // Look up the node in the scheduler's snapshot rather than
    // reading it from CycleState.
    nodeInfo, err := p.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    if err != nil {
        return 0, framework.AsStatus(err)
    }

    // Prefer nodes whose gpu-type label matches the Pod's annotation.
    gpuType := nodeInfo.Node().Labels["gpu-type"]
    requestedGPU := pod.Annotations["preferred-gpu-type"]
    if gpuType == requestedGPU {
        return 100, nil
    }
    return 50, nil
}

func (p *GPUAwarePlugin) ScoreExtensions() framework.ScoreExtensions {
    return nil
}

Out-of-tree plugins like this are compiled into a scheduler binary and registered by name at startup, typically via app.NewSchedulerCommand with app.WithPlugin from k8s.io/kubernetes/cmd/kube-scheduler/app.
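To see why plugin weights matter, here is a toy, dependency-free simulation of how the framework combines scores from multiple Score plugins — weighted scores are summed per node and the highest total wins. This is not the real framework API (which also runs NormalizeScore before weighting); the types and node names below are illustrative.

```go
package main

import "fmt"

// scorePlugin is a toy stand-in for a Score plugin: a name,
// a configured weight, and a scoring function per node.
type scorePlugin struct {
	name   string
	weight int64
	score  func(node string) int64
}

// pickNode mimics the framework's score aggregation: sum each
// plugin's weighted score per node and select the best node.
func pickNode(nodes []string, plugins []scorePlugin) string {
	best, bestScore := "", int64(-1)
	for _, n := range nodes {
		var total int64
		for _, p := range plugins {
			total += p.weight * p.score(n)
		}
		if total > bestScore {
			best, bestScore = n, total
		}
	}
	return best
}

func main() {
	gpuAware := scorePlugin{"GPUAware", 2, func(n string) int64 {
		if n == "gpu-node-1" {
			return 100 // matching GPU type
		}
		return 50
	}}
	resourcesFit := scorePlugin{"NodeResourcesFit", 1, func(n string) int64 {
		return 60 // pretend all nodes fit equally well
	}}

	nodes := []string{"cpu-node-1", "gpu-node-1"}
	fmt.Println(pickNode(nodes, []scorePlugin{gpuAware, resourcesFit}))
	// gpu-node-1 wins: 2*100 + 1*60 = 260 vs 2*50 + 1*60 = 160
}
```

Raising a plugin's weight in KubeSchedulerConfiguration amplifies its influence in exactly this way.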

3. Scheduler Profiles (Multiple Schedulers in One Binary)

Instead of deploying multiple scheduler binaries, you can configure multiple profiles in a single scheduler:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesFit
          - name: InterPodAffinity
  - schedulerName: gpu-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesFit
          - name: GPUAware
        disabled:
          - name: InterPodAffinity
  - schedulerName: batch-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesFit
      preFilter:
        enabled:
          - name: GangScheduling

Pods set schedulerName to default-scheduler, gpu-scheduler, or batch-scheduler, and the single scheduler binary routes them to the appropriate profile.
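For example, a batch Pod opts into the batch-scheduler profile purely through its schedulerName (the image name here is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-job
spec:
  schedulerName: batch-scheduler
  containers:
    - name: worker
      image: batch-worker:1.0
```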

Deploying a Custom Scheduler

When running a fully custom scheduler as a separate Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-scheduler
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      component: custom-scheduler
  template:
    metadata:
      labels:
        component: custom-scheduler
    spec:
      serviceAccountName: custom-scheduler
      containers:
        - name: scheduler
          image: my-custom-scheduler:1.0
          command:
            - /usr/local/bin/kube-scheduler
            - --config=/etc/scheduler/config.yaml
            - --leader-elect=true
            - --leader-elect-resource-name=custom-scheduler
          volumeMounts:
            - name: config
              mountPath: /etc/scheduler
          resources:
            requests:
              cpu: "100m"
              memory: "256Mi"
      volumes:
        - name: config
          configMap:
            name: custom-scheduler-config

RBAC for Custom Schedulers

The scheduler needs permissions to read Pods, Nodes, and PersistentVolumes, and to create Bindings and Events:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: custom-scheduler
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods/binding"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["persistentvolumes", "persistentvolumeclaims"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "create", "update"]

When to Use Custom Scheduling

| Use Case | Approach |
|----------|----------|
| Simple priority adjustments | Pod priority classes (no custom scheduler needed) |
| GPU or hardware-aware placement | Scheduler plugin or extender |
| Gang scheduling (all-or-nothing) | Custom scheduler with coscheduling plugin |
| Cost-optimized spot instance placement | Score plugin preferring cheaper nodes |
| Multi-tenant fairness | Custom queue-based scheduler |
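For the first row — when plain priority is enough — a PriorityClass avoids custom scheduling entirely. A sketch, with illustrative names and value:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "High-priority workloads that may preempt lower-priority Pods."
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: high-priority
  containers:
    - name: app
      image: critical-app:1.0
```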

Debugging Custom Schedulers

# Check which scheduler is assigned to a Pod
kubectl get pod gpu-training -o jsonpath='{.spec.schedulerName}'

# Check scheduler logs
kubectl logs -n kube-system -l component=custom-scheduler

# Verify the scheduler is running
kubectl get pods -n kube-system -l component=custom-scheduler

# Check for scheduling events
kubectl describe pod gpu-training | grep -A 5 Events

Why Interviewers Ask This

This question evaluates your understanding of the scheduler's extensibility model, which is relevant for specialized workloads like GPU scheduling, batch processing, and multi-tenant clusters.

Common Follow-Up Questions

How do you tell a Pod to use a custom scheduler?
Set spec.schedulerName to the name of your custom scheduler. If omitted, the default scheduler (default-scheduler) is used.
What is the difference between a custom scheduler and a scheduler extender?
A custom scheduler replaces the scheduling pipeline entirely. A scheduler extender adds filter/score steps to the default scheduler via HTTP callbacks — simpler but less flexible.
What are scheduler plugins and profiles?
Stable since 1.19, the scheduling framework lets you write plugins that hook into specific phases (Filter, Score, Bind). Profiles let a single scheduler binary run multiple scheduling configurations.

Key Takeaways

  • Use the schedulerName field on a Pod to direct it to a specific scheduler.
  • The scheduling framework with plugins is the modern approach — prefer it over writing a scheduler from scratch.
  • Multiple scheduler profiles can run in a single scheduler binary, reducing operational complexity.
