Volume Claim Policies

Managing stateful JobSets with shared persistent volumes

JobSet provides the VolumeClaimPolicies API to automatically create and manage shared PersistentVolumeClaims (PVCs) across multiple ReplicatedJobs within a JobSet. This enables stateful JobSets that require persistent storage for datasets, models, checkpoints, or intermediate results.

Basic Usage

To use VolumeClaimPolicies, define them in the volumeClaimPolicies field of your JobSet spec. Each policy can contain one or more PVC templates.

The example in this section creates two shared PVCs with different retention policies.

In this example:

  1. An initializer ReplicatedJob downloads a model to the initializer volume
  2. A node ReplicatedJob reads the model and writes checkpoints whose file names contain the Pod's completion index (JOB_COMPLETION_INDEX)
  3. The PVCs are automatically created with the naming convention: <claim-name>-<jobset-name>
    • initializer-volume-claim-trainjob (deleted when the JobSet is deleted)
    • checkpoints-volume-claim-trainjob (retained after the JobSet is deleted)

# This example creates two shared PVCs for two ReplicatedJobs.
# The first PVC (initializer) is deleted when the JobSet is deleted.
# The second PVC (checkpoints) is retained after the JobSet is deleted.
# The second ReplicatedJob starts after the first one completes. After the JobSet completes
# or is deleted, you can create a simple busybox Pod that mounts the PVC
# checkpoints-volume-claim-trainjob. Listing the mounted directory should show:
#   /workspace/checkpoints # ls
#   model_0.txt  model_1.txt  model_2.txt
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: volume-claim-trainjob
spec:
  volumeClaimPolicies:
    - templates:
        - metadata:
            name: initializer
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 1Gi
    - templates:
        - metadata:
            name: checkpoints
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 1Gi
      retentionPolicy:
        whenDeleted: Retain
  replicatedJobs:
    - name: initializer
      template:
        spec:
          template:
            spec:
              containers:
                - name: initializer
                  image: busybox
                  command:
                    - /bin/sh
                    - -c
                    - |
                      echo "Download pre-trained model into /workspace/model/qwen3.txt"
                      echo 'Qwen3-30b' > /workspace/model/qwen3.txt
                  volumeMounts:
                    - mountPath: /workspace/model
                      name: initializer
    - name: node
      dependsOn:
        - name: initializer
          status: Complete
      template:
        spec:
          parallelism: 3
          completions: 3
          template:
            spec:
              containers:
                - name: node
                  image: busybox
                  command:
                    - /bin/sh
                    - -c
                    - |
                      echo "Read pre-trained model" && cat /workspace/model/qwen3.txt
                      echo "Write model checkpoint to /workspace/checkpoints/model_${JOB_COMPLETION_INDEX}.txt"
                      echo "Checkpoint: model_${JOB_COMPLETION_INDEX}" > /workspace/checkpoints/model_${JOB_COMPLETION_INDEX}.txt
                  volumeMounts:
                    - mountPath: /workspace/model
                      name: initializer
                    - mountPath: /workspace/checkpoints
                      name: checkpoints

How Volumes Are Mounted

To mount a shared PVC in your pods:

  1. Define a volumeClaimPolicies template with a specific name (e.g., model-data)
  2. Add a volumeMount in your container with the same name
  3. JobSet automatically injects the PVC volume into your pod spec and creates the appropriate PVC
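
The three steps above can be sketched with a minimal manifest. The JobSet name example, the ReplicatedJob name worker, and the mount path /data are placeholders chosen for illustration. The commented-out volumes entry is an assumption about what the injected volume is roughly equivalent to, derived from the <claim-name>-<jobset-name> naming convention; it is not something you write yourself.

```yaml
# Minimal sketch; all names below are illustrative placeholders.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: example
spec:
  volumeClaimPolicies:
    - templates:
        - metadata:
            name: model-data            # step 1: template name
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 1Gi
  replicatedJobs:
    - name: worker
      template:
        spec:
          template:
            spec:
              containers:
                - name: worker
                  image: busybox
                  command: ["/bin/sh", "-c", "ls /data"]
                  volumeMounts:
                    - name: model-data  # step 2: same name as the template
                      mountPath: /data
              # step 3: no "volumes" entry is defined here. Assuming the naming
              # convention above, the injected volume should be roughly:
              #   volumes:
              #     - name: model-data
              #       persistentVolumeClaim:
              #         claimName: model-data-example
```

Note that defining your own volume named model-data in the pod spec would conflict with the template (see Limitations).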

Retention Policies

VolumeClaimPolicies support retention policies to control what happens to PVCs when the JobSet is deleted.

Delete (Default)

The PVC is automatically deleted when the JobSet is deleted. This is the default behavior when no retention policy is specified.

volumeClaimPolicies:
  - templates:
      - metadata:
          name: temporary-data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    retentionPolicy:
      whenDeleted: Delete

Retain

The PVC is kept after the JobSet is deleted, allowing you to access the data later or use it in subsequent JobSets.

volumeClaimPolicies:
  - templates:
      - metadata:
          name: checkpoints
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    retentionPolicy:
      whenDeleted: Retain # PVC survives JobSet deletion

When using Retain, you can access the persisted data by:

  • Creating a new JobSet with a volumeMount referencing the existing PVC name
  • Mounting the PVC directly in a debug pod
  • Using the PVC in other workloads
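
As one illustration of the debug-pod option, the following sketch mounts the retained checkpoints PVC from the earlier trainjob example and lists its contents. The Pod name checkpoint-inspector and the container name shell are hypothetical, and the Pod must be created in the same namespace as the PVC.

```yaml
# Sketch of a debug Pod; the Pod and container names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: checkpoint-inspector
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: busybox
      # List the checkpoints, then stay alive so the Pod can be exec'd into.
      command: ["/bin/sh", "-c", "ls /workspace/checkpoints && sleep 3600"]
      volumeMounts:
        - name: checkpoints
          mountPath: /workspace/checkpoints
  volumes:
    - name: checkpoints
      persistentVolumeClaim:
        # <claim-name>-<jobset-name> from the trainjob example
        claimName: checkpoints-volume-claim-trainjob
```

Because the claim in that example uses ReadWriteOnce, the Pod may stay Pending if the volume is still attached to a different node.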

Custom Labels and Annotations

You can add custom labels and annotations to PVC templates for organization, monitoring, or integration with other tools. These labels and annotations are preserved on the created PVCs along with the automatically added jobset.sigs.k8s.io/jobset-name label.

volumeClaimPolicies:
  - templates:
      - metadata:
          name: my-volume
          labels:
            team: ml-platform
            environment: production
            content-type: model
          annotations:
            backup-policy: "daily"
            retention-days: "30"
        spec:
          accessModes: ["ReadWriteMany"]
          resources:
            requests:
              storage: 100Gi
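
As a rough illustration, a PVC created from the template above for a JobSet hypothetically named example might carry metadata like the following. The PVC name follows the <claim-name>-<jobset-name> convention. The value of the jobset.sigs.k8s.io/jobset-name label is assumed to be the JobSet name, and any other fields the controller may set are omitted.

```yaml
# Sketch of the expected result, assuming a JobSet named "example".
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-volume-example
  labels:
    team: ml-platform
    environment: production
    content-type: model
    jobset.sigs.k8s.io/jobset-name: example   # assumed value: the JobSet name
  annotations:
    backup-policy: "daily"
    retention-days: "30"
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 100Gi
```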

Limitations

  • Maximum of 50 volume claim templates per JobSet
  • PVC templates cannot specify the namespace field (namespace is inherited from the JobSet)
  • ReplicatedJob templates must not define volumes with the same name as VolumeClaimPolicy templates
  • At least one container or initContainer in the ReplicatedJobs must have a volumeMount matching each volume claim template name
  • When a volume claim template refers to an existing (pre-created) PVC, the template spec must be identical to the spec of that PVC