Preemption

Workload-aware preemption for gang-scheduled JobSets

When the WorkloadAwarePreemption feature gate is enabled, the scheduler can preempt entire pod groups to make room for higher-priority gang-scheduled workloads.

The Problem

Without workload-aware preemption, the scheduler preempts individual pods. This causes two problems for gang-scheduled workloads: preempting some pods from a running gang breaks the entire workload, and freeing only a subset of the required resources may not be enough to admit the pending gang.

How It Works

With workload-aware preemption, the scheduler understands that pods belong to a group and makes preemption decisions at the group level. This means:

All-or-nothing preemption: The scheduler will only preempt a pod group if doing so frees enough resources to admit the pending gang in full.
No partial disruption: Rather than evicting individual pods and breaking a running workload, the scheduler preempts the entire lower-priority pod group.
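
The two rules above can be illustrated with a small sketch. This is not the kube-scheduler's algorithm: it is a simplified model that assumes a single pool of CPU and ignores nodes, topology, and disruption budgets. The Group type and the pick_victims function are hypothetical names introduced only for this illustration. The priorities and the 12-CPU group size mirror the example later on this page (4 pods at 3 CPUs each).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Group:
    """A gang of pods treated as one unit (hypothetical model, not a Kubernetes type)."""
    name: str
    priority: int
    cpu: int  # total CPU requested by all pods in the group


def pick_victims(pending: Group, running: List[Group], free_cpu: int) -> Optional[List[Group]]:
    """Return the whole groups to preempt so that `pending` fits, or None.

    Only groups with strictly lower priority are candidates, and a group is
    either evicted in full or left untouched.
    """
    if pending.cpu <= free_cpu:
        return []  # the gang already fits, nothing to preempt

    # Simplification: consider the lowest-priority groups first.
    candidates = sorted(
        (g for g in running if g.priority < pending.priority),
        key=lambda g: g.priority,
    )
    victims: List[Group] = []
    freed = free_cpu
    for group in candidates:
        victims.append(group)
        freed += group.cpu
        if freed >= pending.cpu:
            return victims

    # Evicting every candidate would still not admit the gang: preempt nothing.
    return None


low = Group("lp", priority=1, cpu=12)
high = Group("hp", priority=100000, cpu=12)
print([g.name for g in pick_victims(high, [low], free_cpu=0)])  # ['lp']
```

Returning None instead of a partial victim list is what models the all-or-nothing rule: if the pending gang cannot be admitted in full, no running group is disturbed.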

Two key fields on the PodGroup enable this:

priorityClassName: Associates the PodGroup with a Kubernetes PriorityClass, giving the scheduler a priority value to compare when deciding what to preempt.
disruptionMode: When set to PodGroup, ensures the entire gang is preempted together rather than as individual pods.

Example: Gang Preemption

This example demonstrates PodGroup-level gang preemption using two JobSets: a low-priority JobSet that occupies the cluster, and a high-priority JobSet that preempts it.

Each JobSet creates 4 pods (2 replicas × 2 completions). The resource requests are sized so that only one JobSet can fit on the cluster at a time.
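
As a rough sizing check, the arithmetic behind "only one JobSet can fit" can be sketched as follows. The replica, completion, and CPU values come from the manifests in this example. The 16-CPU cluster capacity is an assumption chosen purely for illustration, so substitute your own cluster's allocatable CPU. The check also looks only at aggregate CPU and ignores how pods pack onto individual nodes.

```python
replicas = 2      # replicatedJobs[].replicas in the example manifests
pods_per_job = 2  # completions and parallelism are both 2 in the example
cpu_per_pod = 3   # resources.requests.cpu in the example manifests

pods_per_jobset = replicas * pods_per_job
cpu_per_jobset = pods_per_jobset * cpu_per_pod

# ASSUMPTION: a hypothetical cluster with 16 allocatable CPUs in total.
allocatable_cpu = 16


def jobsets_that_fit(allocatable: int, per_jobset: int) -> int:
    """How many complete JobSets fit into the given aggregate CPU."""
    return allocatable // per_jobset


print(pods_per_jobset)                                    # 4
print(cpu_per_jobset)                                     # 12
print(jobsets_that_fit(allocatable_cpu, cpu_per_jobset))  # 1
```

If the last value is 2 or more on your cluster, the high-priority JobSet schedules alongside the low-priority one and no preemption takes place, so raise the CPU request until only one JobSet fits.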

Step 1: Create the PriorityClasses

Create a low-priority and high-priority PriorityClass:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1
globalDefault: false
description: "Low priority class with value 1"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "High priority class with value 100000"

Step 2: Apply the low-priority JobSet

The manifest for the low-priority JobSet also defines its Workload and PodGroup. The PodGroup sets priorityClassName: low-priority and disruptionMode: PodGroup:

---
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: lp-abc
spec:
  controllerRef:
    apiGroup: jobset.x-k8s.io
    kind: JobSet
    name: lp-js
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        minCount: 4
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: lp-abc-workers-def
  namespace: default
spec:
  podGroupTemplateRef:
    workload:
      workloadName: lp-abc
      podGroupTemplateName: workers
  priorityClassName: low-priority
  disruptionMode: PodGroup
  schedulingPolicy:
    gang:
      minCount: 4
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: lp-js
spec:
  replicatedJobs:
  - name: rj
    replicas: 2
    template:
      spec:
        completions: 2
        parallelism: 2
        backoffLimit: 10
        template:
          spec:
            terminationGracePeriodSeconds: 0
            priorityClassName: low-priority
            schedulingGroup:
              podGroupName: lp-abc-workers-def
            containers:
            - name: worker
              image: busybox
              command: ["sleep", "infinity"]
              # Pick resource requests appropriate for your cluster and workload.
              # These values are just an example.
              resources:
                requests:
                  cpu: "3"

Wait for all 4 pods to be running:

kubectl get pods -l jobset.sigs.k8s.io/jobset-name=lp-js
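
If you prefer to script this check rather than read the output by eye, a minimal sketch is shown below. It assumes kubectl is on your PATH and pointed at the right cluster and namespace. The helper names fetch_pods and all_running are hypothetical. It relies only on the items[].status.phase field of the JSON that kubectl get pods -o json returns.

```python
import json
import subprocess
from typing import Any, Dict


def all_running(pod_list: Dict[str, Any], expected: int) -> bool:
    """True when the list holds exactly `expected` pods and all are in phase Running."""
    phases = [
        item.get("status", {}).get("phase")
        for item in pod_list.get("items", [])
    ]
    return len(phases) == expected and all(p == "Running" for p in phases)


def fetch_pods(selector: str) -> Dict[str, Any]:
    """Run `kubectl get pods -l <selector> -o json` and parse the result."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-l", selector, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling all_running(fetch_pods("jobset.sigs.k8s.io/jobset-name=lp-js"), expected=4) returns True once all four low-priority pods report the Running phase. Poll it in a loop with a short sleep if you need to block until then.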

Step 3: Apply the high-priority JobSet

The high-priority JobSet uses the same structure but with priorityClassName: high-priority:

---
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: hp-abc
spec:
  controllerRef:
    apiGroup: jobset.x-k8s.io
    kind: JobSet
    name: hp-js
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        minCount: 4
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: hp-abc-workers-def
  namespace: default
spec:
  podGroupTemplateRef:
    workload:
      workloadName: hp-abc
      podGroupTemplateName: workers
  priorityClassName: high-priority
  disruptionMode: PodGroup
  schedulingPolicy:
    gang:
      minCount: 4
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: hp-js
spec:
  replicatedJobs:
  - name: rj
    replicas: 2
    template:
      spec:
        completions: 2
        parallelism: 2
        backoffLimit: 10
        template:
          spec:
            terminationGracePeriodSeconds: 0
            priorityClassName: high-priority
            schedulingGroup:
              podGroupName: hp-abc-workers-def
            containers:
            - name: worker
              image: busybox
              command: ["sleep", "infinity"]
              # Pick resource requests appropriate for your cluster and workload.
              # These values are just an example.
              resources:
                requests:
                  cpu: "3"

Step 4: Observe preemption

The scheduler preempts the entire low-priority PodGroup as a gang to make room for the high-priority JobSet:

kubectl get pods -l 'jobset.sigs.k8s.io/jobset-name in (lp-js,hp-js)'
kubectl get events --field-selector reason=Preempted

All 4 low-priority pods are preempted together because of disruptionMode: PodGroup, and all 4 high-priority pods schedule and run.
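
To see the outcome per JobSet at a glance, the pod list can be summarized by the jobset.sigs.k8s.io/jobset-name label used in the commands above. The sketch below is only an illustration: phases_by_jobset is a hypothetical helper, and it expects the parsed JSON from kubectl get pods -o json as input. Pods with a deletionTimestamp are reported as Terminating, which matches how kubectl displays them.

```python
from collections import Counter
from typing import Any, Dict

JOBSET_LABEL = "jobset.sigs.k8s.io/jobset-name"


def phases_by_jobset(pod_list: Dict[str, Any]) -> Dict[str, Counter]:
    """Count pods per JobSet and per state from a parsed `kubectl get pods -o json` result."""
    summary: Dict[str, Counter] = {}
    for item in pod_list.get("items", []):
        meta = item.get("metadata", {})
        owner = meta.get("labels", {}).get(JOBSET_LABEL, "<unlabeled>")
        if meta.get("deletionTimestamp"):
            # The pod is being deleted, for example because it was preempted.
            state = "Terminating"
        else:
            state = item.get("status", {}).get("phase", "Unknown")
        summary.setdefault(owner, Counter())[state] += 1
    return summary
```

If preemption happened as described, the hp-js entry should show four Running pods. What the lp-js entry shows depends on timing, since the Job controller may already have recreated the preempted pods, in which case they typically remain Pending.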


Last modified June 5, 2026: site: Add Workload Aware Scheduling documentation (#1241) (c6b198a)