Preemption

Workload-aware preemption for gang-scheduled JobSets

When the WorkloadAwarePreemption feature gate is enabled, the scheduler can preempt entire pod groups to make room for higher-priority gang-scheduled workloads.

The Problem

Without workload-aware preemption, the scheduler preempts individual pods. This causes two problems for gang-scheduled workloads: preempting some pods from a running gang breaks the entire workload, and freeing only a subset of the required resources may not be enough to admit the pending gang.

How It Works

With workload-aware preemption, the scheduler understands that pods belong to a group and makes preemption decisions at the group level. This means:

  • All-or-nothing preemption: The scheduler will only preempt a pod group if doing so frees enough resources to admit the pending gang in full.
  • No partial disruption: Rather than evicting individual pods and breaking a running workload, the scheduler preempts the entire lower-priority pod group.

Two key fields on the PodGroup enable this:

  • priorityClassName: Associates the PodGroup with a Kubernetes PriorityClass, giving the scheduler a priority value to compare when deciding what to preempt.
  • disruptionMode: When set to PodGroup, ensures the entire gang is preempted together rather than as individual pods.
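
The group-level rule described above can be illustrated with a small shell sketch. This is a simplified model for intuition only, not the scheduler's actual algorithm: the function name should_preempt_group, its arguments, and the reduction of "resources" to a single whole-core CPU count are all assumptions made for this illustration.

```shell
# Simplified model of a group-level preemption decision.
# All names here are illustrative; this is not scheduler code.
#
# Usage: should_preempt_group PENDING_PRIO VICTIM_PRIO FREE_CPU VICTIM_CPU NEEDED_CPU
# Returns 0 (true) when evicting the whole victim group is allowed and sufficient.
should_preempt_group() {
  pending_prio=$1 victim_prio=$2 free_cpu=$3 victim_cpu=$4 needed_cpu=$5

  # Only a strictly lower-priority group is a candidate victim.
  [ "$victim_prio" -lt "$pending_prio" ] || return 1

  # All-or-nothing: evicting the whole group must free enough for the full gang.
  [ $((free_cpu + victim_cpu)) -ge "$needed_cpu" ]
}

# Assumed values: both gangs request 12 CPU and the cluster has no spare capacity.
if should_preempt_group 100000 1 0 12 12; then
  echo "preempt the whole low-priority group"
else
  echo "leave the low-priority group running"
fi
```

In this model a victim group that holds only part of the needed capacity (for example 6 of 12 CPU) is left untouched, which mirrors the "no partial disruption" property.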

Example: Gang Preemption

This example demonstrates PodGroup-level gang preemption using two JobSets: a low-priority JobSet that occupies the cluster, and a high-priority JobSet that preempts it.

Each JobSet creates 4 pods (2 replicated Job replicas × 2 parallel pods per Job), which matches the gang's minCount of 4. The resource requests are sized so that only one JobSet can fit on the cluster at a time.
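
The sizing can be sanity-checked with simple arithmetic before applying anything. The sketch below uses the replica, parallelism, and CPU values from the manifests in this example; ALLOCATABLE_CPU is a hypothetical placeholder (defaulting to 16 here) that should be replaced with the total allocatable CPU of your cluster in whole cores. It compares aggregate capacity only, so it is a necessary check rather than a sufficient one: each 3-CPU pod must still fit on an individual node.

```shell
# Sizing check for the example. ALLOCATABLE_CPU is a hypothetical placeholder:
# set it to the total allocatable CPU of your cluster, in whole cores.
ALLOCATABLE_CPU=${ALLOCATABLE_CPU:-16}

replicas=2       # replicatedJobs[].replicas
parallelism=2    # pods each Job runs at once
cpu_per_pod=3    # resources.requests.cpu

pods_per_jobset=$((replicas * parallelism))
cpu_per_jobset=$((pods_per_jobset * cpu_per_pod))

echo "pods per JobSet: $pods_per_jobset"
echo "CPU per JobSet:  $cpu_per_jobset"

# Exactly one JobSet fits when capacity covers one gang but not two.
if [ "$ALLOCATABLE_CPU" -ge "$cpu_per_jobset" ] &&
   [ "$ALLOCATABLE_CPU" -lt $((2 * cpu_per_jobset)) ]; then
  echo "cluster fits exactly one JobSet"
else
  echo "adjust cpu_per_pod so that only one JobSet fits"
fi
```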

Step 1: Create the PriorityClasses

Create a low-priority and a high-priority PriorityClass:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1
globalDefault: false
description: "Low priority class with value 1"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "High priority class with value 100000"

Step 2: Apply the low-priority JobSet

The low-priority JobSet includes its Workload and PodGroup. The PodGroup sets priorityClassName: low-priority and disruptionMode: PodGroup:

---
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: lp-abc
spec:
  controllerRef:
    apiGroup: jobset.x-k8s.io
    kind: JobSet
    name: lp-js
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        minCount: 4
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: lp-abc-workers-def
  namespace: default
spec:
  podGroupTemplateRef:
    workload:
      workloadName: lp-abc
      podGroupTemplateName: workers
  priorityClassName: low-priority
  disruptionMode: PodGroup
  schedulingPolicy:
    gang:
      minCount: 4
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: lp-js
spec:
  replicatedJobs:
  - name: rj
    replicas: 2
    template:
      spec:
        completions: 2
        parallelism: 2
        backoffLimit: 10
        template:
          spec:
            terminationGracePeriodSeconds: 0
            priorityClassName: low-priority
            schedulingGroup:
              podGroupName: lp-abc-workers-def
            containers:
            - name: worker
              image: busybox
              command: ["sleep", "infinity"]
              # Pick resource requests appropriate for your cluster and workload.
              # These values are just an example.
              resources:
                requests:
                  cpu: "3"

Wait for all 4 pods to be running:

kubectl get pods -l jobset.sigs.k8s.io/jobset-name=lp-js
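
Instead of polling by hand, one option is to block until the pods are Ready with kubectl wait. The helper below is a minimal sketch: the function name wait_for_jobset_pods and the 120s default timeout are assumptions for this illustration, not part of JobSet. Note that kubectl wait returns an error immediately if no pods match the selector yet, so it may need a retry right after the JobSet is created.

```shell
# wait_for_jobset_pods JOBSET_NAME [TIMEOUT]
# Hypothetical helper: blocks until every pod of the JobSet reports Ready.
wait_for_jobset_pods() {
  kubectl wait pods \
    -l "jobset.sigs.k8s.io/jobset-name=$1" \
    --for=condition=Ready \
    --timeout="${2:-120s}"
}
```

For this step it would be called as wait_for_jobset_pods lp-js.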

Step 3: Apply the high-priority JobSet

The high-priority JobSet uses the same structure but with priorityClassName: high-priority:

---
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: hp-abc
spec:
  controllerRef:
    apiGroup: jobset.x-k8s.io
    kind: JobSet
    name: hp-js
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        minCount: 4
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: hp-abc-workers-def
  namespace: default
spec:
  podGroupTemplateRef:
    workload:
      workloadName: hp-abc
      podGroupTemplateName: workers
  priorityClassName: high-priority
  disruptionMode: PodGroup
  schedulingPolicy:
    gang:
      minCount: 4
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: hp-js
spec:
  replicatedJobs:
  - name: rj
    replicas: 2
    template:
      spec:
        completions: 2
        parallelism: 2
        backoffLimit: 10
        template:
          spec:
            terminationGracePeriodSeconds: 0
            priorityClassName: high-priority
            schedulingGroup:
              podGroupName: hp-abc-workers-def
            containers:
            - name: worker
              image: busybox
              command: ["sleep", "infinity"]
              # Pick resource requests appropriate for your cluster and workload.
              # These values are just an example.
              resources:
                requests:
                  cpu: "3"

Step 4: Observe preemption

The scheduler preempts the entire low-priority PodGroup as a gang to make room for the high-priority JobSet:

kubectl get pods -l 'jobset.sigs.k8s.io/jobset-name in (lp-js,hp-js)'
kubectl get events --field-selector reason=Preempted

All 4 low-priority pods are preempted together because of disruptionMode: PodGroup, and all 4 high-priority pods schedule and run.
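
To confirm the outcome numerically, the Running pods of each JobSet can be counted. The helper name running_pods is hypothetical and introduced only for this sketch; it relies on the status.phase field selector that kubectl supports for pods.

```shell
# running_pods JOBSET_NAME
# Hypothetical helper: prints how many pods of the JobSet are in phase Running.
running_pods() {
  kubectl get pods \
    -l "jobset.sigs.k8s.io/jobset-name=$1" \
    --field-selector=status.phase=Running \
    --no-headers 2>/dev/null | wc -l | tr -d ' '
}
```

Based on the behavior described above, running_pods hp-js should report 4 and running_pods lp-js should report 0 once preemption has completed. Any replacement pods created for lp-js would be expected to remain Pending while the high-priority gang holds the capacity.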