Preemption
When the WorkloadAwarePreemption feature gate is enabled, the scheduler can preempt entire pod groups to make room for higher-priority gang-scheduled workloads.
The Problem
Without workload-aware preemption, the scheduler preempts individual pods. This causes two problems for gang-scheduled workloads: preempting some pods from a running gang breaks the entire workload, and freeing only a subset of the required resources may not be enough to admit the pending gang.
How It Works
With workload-aware preemption, the scheduler understands that pods belong to a group and makes preemption decisions at the group level. This means:
- All-or-nothing preemption: The scheduler will only preempt a pod group if doing so frees enough resources to admit the pending gang in full.
- No partial disruption: Rather than evicting individual pods and breaking a running workload, the scheduler preempts the entire lower-priority pod group.
Two key fields on the PodGroup enable this:
- priorityClassName: Associates the PodGroup with a Kubernetes PriorityClass, giving the scheduler a priority value to compare when deciding what to preempt.
- disruptionMode: PodGroup: Ensures the entire gang is preempted together rather than as individual pods.
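The group-level decision described above can be sketched as a toy check. This is only an illustration under simplifying assumptions, not the scheduler's actual algorithm: it treats CPU as a single cluster-wide pool, considers one victim group at a time, and ignores per-node placement. The function name `should_preempt_group` and its arguments are hypothetical.

```shell
# Toy model (not the scheduler's real logic): decide whether evicting one
# whole lower-priority pod group would let the pending gang be admitted in full.
# Arguments (all hypothetical, integer CPU units):
#   $1 pending gang priority   $2 victim group priority
#   $3 currently free CPU      $4 CPU held by the victim group
#   $5 CPU the pending gang needs in total
should_preempt_group() {
  # Only lower-priority groups are candidates for preemption.
  [ "$1" -gt "$2" ] || return 1
  # All-or-nothing: evicting the whole group must free enough for the whole gang.
  [ $(( $3 + $4 )) -ge "$5" ]
}

# The victim holds 12 CPUs and the pending gang needs 12: preempting helps.
if should_preempt_group 100000 1 0 12 12; then
  echo "preempt the whole group"
fi

# The victim holds only 6 CPUs: evicting it would disrupt it without admitting
# the gang, so nothing is preempted.
if ! should_preempt_group 100000 1 0 6 12; then
  echo "leave the group running"
fi
```

The second case is the one that pod-level preemption gets wrong: it would evict pods even though the pending gang still could not start.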
Example: Gang Preemption
This example demonstrates PodGroup-level gang preemption using two JobSets — a low-priority JobSet that occupies the cluster, and a high-priority JobSet that preempts it.
Each JobSet creates 4 pods (2 replicas × 2 completions). The resource requests are sized so that only one JobSet can fit on the cluster at a time.
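To check whether the sizing works for your cluster, the arithmetic can be spelled out using the values from the manifests in this example. This is a rough sketch that assumes CPU is the only constraining resource and that pods pack onto nodes without fragmentation.

```shell
# Per-JobSet sizing, taken from the manifests in this example.
replicas=2      # replicatedJobs[].replicas
parallelism=2   # pods that run in parallel per Job
cpu_per_pod=3   # resources.requests.cpu

pods_per_jobset=$((replicas * parallelism))
cpu_per_jobset=$((pods_per_jobset * cpu_per_pod))
cpu_both_jobsets=$((2 * cpu_per_jobset))

echo "pods per JobSet: ${pods_per_jobset}"
echo "CPU requested per JobSet: ${cpu_per_jobset}"
echo "CPU requested by both JobSets: ${cpu_both_jobsets}"
```

Under those assumptions, the cluster needs at least 12 allocatable CPUs for one JobSet to run and fewer than 24 for the second JobSet to trigger preemption. Adjust the `cpu` request if your cluster falls outside that range.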
Step 1: Create the PriorityClasses
Create a low-priority and high-priority PriorityClass:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1
globalDefault: false
description: "Low priority class with value 1"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "High priority class with value 100000"
Step 2: Apply the low-priority JobSet
The following manifest defines the low-priority JobSet together with its Workload and PodGroup. The PodGroup sets priorityClassName: low-priority and disruptionMode: PodGroup:
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: lp-abc
spec:
  controllerRef:
    apiGroup: jobset.x-k8s.io
    kind: JobSet
    name: lp-js
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        minCount: 4
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: lp-abc-workers-def
  namespace: default
spec:
  podGroupTemplateRef:
    workload:
      workloadName: lp-abc
      podGroupTemplateName: workers
  priorityClassName: low-priority
  disruptionMode: PodGroup
  schedulingPolicy:
    gang:
      minCount: 4
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: lp-js
spec:
  replicatedJobs:
  - name: rj
    replicas: 2
    template:
      spec:
        completions: 2
        parallelism: 2
        backoffLimit: 10
        template:
          spec:
            terminationGracePeriodSeconds: 0
            priorityClassName: low-priority
            schedulingGroup:
              podGroupName: lp-abc-workers-def
            containers:
            - name: worker
              image: busybox
              command: ["sleep", "infinity"]
              # Pick resource requests appropriate for your cluster and workload.
              # These values are just an example.
              resources:
                requests:
                  cpu: "3"
Wait for all 4 pods to be running:
kubectl get pods -l jobset.sigs.k8s.io/jobset-name=lp-js
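Rather than polling by hand, you can block until the pods report Ready. The sketch below uses `kubectl wait`; the helper name `wait_for_jobset_pods` and the 120s default timeout are assumptions chosen for illustration. Note that `kubectl wait` returns an error if no pods match the selector yet, so run it after the JobSet's pods have been created.

```shell
# Hypothetical helper: block until every pod of the given JobSet is Ready.
#   $1 JobSet name   $2 optional timeout (defaults to 120s, an arbitrary choice)
wait_for_jobset_pods() {
  kubectl wait pod \
    --for=condition=Ready \
    -l "jobset.sigs.k8s.io/jobset-name=$1" \
    --timeout="${2:-120s}"
}

# Usage, once the JobSet's pods exist:
#   wait_for_jobset_pods lp-js
```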
Step 3: Apply the high-priority JobSet
The high-priority JobSet uses the same structure but with priorityClassName: high-priority:
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: hp-abc
spec:
  controllerRef:
    apiGroup: jobset.x-k8s.io
    kind: JobSet
    name: hp-js
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        minCount: 4
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: hp-abc-workers-def
  namespace: default
spec:
  podGroupTemplateRef:
    workload:
      workloadName: hp-abc
      podGroupTemplateName: workers
  priorityClassName: high-priority
  disruptionMode: PodGroup
  schedulingPolicy:
    gang:
      minCount: 4
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: hp-js
spec:
  replicatedJobs:
  - name: rj
    replicas: 2
    template:
      spec:
        completions: 2
        parallelism: 2
        backoffLimit: 10
        template:
          spec:
            terminationGracePeriodSeconds: 0
            priorityClassName: high-priority
            schedulingGroup:
              podGroupName: hp-abc-workers-def
            containers:
            - name: worker
              image: busybox
              command: ["sleep", "infinity"]
              # Pick resource requests appropriate for your cluster and workload.
              # These values are just an example.
              resources:
                requests:
                  cpu: "3"
Step 4: Observe preemption
The scheduler preempts the entire low-priority PodGroup as a gang to make room for the high-priority JobSet:
kubectl get pods -l 'jobset.sigs.k8s.io/jobset-name in (lp-js,hp-js)'
kubectl get events --field-selector reason=Preempted
All 4 low-priority pods are preempted together because of disruptionMode: PodGroup, and all 4 high-priority pods schedule and run.
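To confirm the outcome numerically, you can count the Running pods of each JobSet. This is a minimal sketch; the helper name `count_running_pods` is hypothetical, and it relies on the `status.phase` field selector that `kubectl get pods` supports.

```shell
# Hypothetical helper: count pods of a JobSet that are in the Running phase.
#   $1 JobSet name
count_running_pods() {
  kubectl get pods \
    -l "jobset.sigs.k8s.io/jobset-name=$1" \
    --field-selector=status.phase=Running \
    --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# Usage:
#   count_running_pods lp-js
#   count_running_pods hp-js
```

If preemption happened as described, you would expect 0 for lp-js and 4 for hp-js once the cluster has settled. A pod that is still terminating may briefly be counted as Running.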