Failure Policy

Configuring JobSet failure policies

JobSet provides a failure policy API to control how your workload behaves in response to child Job failures.

The failurePolicy is defined by a set of rules. When a child Job fails, the rules are evaluated in order and the first matching rule’s action is executed. If no rule matches, the default action is RestartJobSet, and the restart counts towards the maxRestarts limit.
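
For illustration, here is a minimal sketch of a failurePolicy showing the first-match semantics (the replicated job names driver and workers are placeholders, not part of the examples below):

failurePolicy:
  maxRestarts: 5
  rules:
  # Evaluated first: a failure in the driver job fails the whole JobSet.
  - action: FailJobSet
    targetReplicatedJobs:
    - driver
  # Evaluated only if the rule above did not match: worker failures restart
  # the JobSet without consuming maxRestarts.
  - action: RestartJobSetAndIgnoreMaxRestarts
    targetReplicatedJobs:
    - workers
  # A failure matching neither rule falls through to the default
  # RestartJobSet action, which counts towards maxRestarts.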

Failure Policy Actions

FailJobSet

This action immediately marks the entire JobSet as failed.

In this example, the JobSet is configured to fail immediately if the leader job fails.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: failjobset-action-example
spec:
  failurePolicy:
    maxRestarts: 3
    rules:
      # The JobSet will fail immediately when the leader job fails.
      - action: FailJobSet
        targetReplicatedJobs:
        - leader
  replicatedJobs:
  - name: leader
    replicas: 1
    template:
      spec:
        # Set backoff limit to 0 so job will immediately fail if any pod fails.
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: leader
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                echo "JOB_COMPLETION_INDEX=$JOB_COMPLETION_INDEX"
                if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
                  for i in $(seq 10 -1 1)
                  do
                    echo "Sleeping in $i"
                    sleep 1
                  done
                  exit 1
                fi
                for i in $(seq 1 1000)
                do
                  echo "$i"
                  sleep 1
                done
  - name: workers
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: worker
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 1000

RestartJobSet

This action restarts the entire JobSet by recreating its child jobs. The number of restarts before failure is limited by failurePolicy.maxRestarts. This is the default action if no other rules match a failure.

In this example, the JobSet will restart up to 3 times if the leader job fails.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: restartjobset-action-example
spec:
  failurePolicy:
    maxRestarts: 3
    rules:
      # The JobSet will restart when the leader job fails.
      - action: RestartJobSet
        targetReplicatedJobs:
        - leader
  replicatedJobs:
  - name: leader
    replicas: 1
    template:
      spec:
        # Set backoff limit to 0 so job will immediately fail if any pod fails.
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: leader
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                echo "JOB_COMPLETION_INDEX=$JOB_COMPLETION_INDEX"
                if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
                  for i in $(seq 10 -1 1)
                  do
                    echo "Sleeping in $i"
                    sleep 1
                  done
                  exit 1
                fi
                for i in $(seq 1 1000)
                do
                  echo "$i"
                  sleep 1
                done
  - name: workers
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: worker
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 1000

RestartJobSetAndIgnoreMaxRestarts

This action is similar to RestartJobSet, but the restart does not count towards the maxRestarts limit.

In this example, the JobSet will restart an unlimited number of times if the leader job fails. Failures in the workers job, however, fall through to the default RestartJobSet action, so they trigger restarts that count towards the maxRestarts limit.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  # rjimr stands for "restartjobsetandignoremaxrestarts"
  name: rjimr-action-example
spec:
  failurePolicy:
    maxRestarts: 3
    rules:
      # The JobSet will restart an unlimited number of times
      # when the leader job fails.
      - action: RestartJobSetAndIgnoreMaxRestarts
        targetReplicatedJobs:
        - leader
  replicatedJobs:
  - name: leader
    replicas: 1
    template:
      spec:
        # Set backoff limit to 0 so job will immediately fail if any pod fails.
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: leader
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                echo "JOB_COMPLETION_INDEX=$JOB_COMPLETION_INDEX"
                if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
                  for i in $(seq 10 -1 1)
                  do
                    echo "Sleeping in $i"
                    sleep 1
                  done
                  exit 1
                fi
                for i in $(seq 1 1000)
                do
                  echo "$i"
                  sleep 1
                done
  - name: workers
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: worker
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 1000

RecreateJob

This action recreates only the specific Job that failed, leaving the rest of the JobSet running. This is useful when individual jobs are independent and can be restarted without affecting the others. Note that individual Job recreations count towards the maxRestarts limit.

In this example, if a job from the workers-a group fails, only that specific job is recreated. Failures in workers-b still cause a full JobSet restart (the default behavior).

# To manually fail a pod in this JobSet, execute:
#   kubectl exec <POD_NAME> -- touch /tmp/fail
# Similarly, to manually complete a pod in this JobSet, execute:
#   kubectl exec <POD_NAME> -- touch /tmp/succeed
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: test
spec:
  failurePolicy:
    maxRestarts: 3
    rules:
    # If any Job within workers-a fails, only the failed Job will be recreated without restarting
    # the entire JobSet.
    - action: RecreateJob
      targetReplicatedJobs:
      - workers-a
  replicatedJobs:
  - name: workers-a
    replicas: 2
    template:
      spec:
        completions: 2
        parallelism: 2
        backoffLimit: 0
        template:
          spec:
            terminationGracePeriodSeconds: 2
            containers:
            - name: worker
              image: busybox
              command: ['/bin/sh', '-c']
              args:
              - |
                echo "Start"

                while true; do
                  if [ -f /tmp/fail ]; then
                    echo "Exiting 1"
                    exit 1;
                  fi;
                  if [ -f /tmp/succeed ]; then
                    echo "Exiting 0"
                    exit 0;
                  fi;
                  sleep 1;
                done

                echo "End"
  - name: workers-b 
    replicas: 3
    template:
      spec:
        completions: 2
        parallelism: 2
        backoffLimit: 0
        template:
          spec:
            terminationGracePeriodSeconds: 2
            containers:
            - name: worker
              image: busybox
              command: ['/bin/sh', '-c']
              args:
              - |
                echo "Start"

                while true; do
                  if [ -f /tmp/fail ]; then
                    echo "Exiting 1"
                    exit 1;
                  fi;
                  if [ -f /tmp/succeed ]; then
                    echo "Exiting 0"
                    exit 0;
                  fi;
                  sleep 1;
                done

                echo "End"

Targeting Specific Failures

You can make your failure policy rules more granular by using the targetReplicatedJobs and onJobFailureReasons fields.

Targeting Replicated Jobs

The targetReplicatedJobs field allows you to apply a rule only to failures originating from specific replicated jobs. All the examples above use this field.
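
Because targetReplicatedJobs is a list, a single rule can cover several replicated jobs at once. A minimal sketch (the job names are placeholders):

rules:
# This rule matches a failure originating from any of the listed replicated jobs.
- action: RestartJobSet
  targetReplicatedJobs:
  - leader
  - workers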

Targeting Job Failure Reasons

The onJobFailureReasons field allows you to trigger a rule based on the reason for the Job failure, which is useful for distinguishing between different kinds of errors. Valid reasons include BackoffLimitExceeded, DeadlineExceeded, and PodFailurePolicy.
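
Like targetReplicatedJobs, onJobFailureReasons is a list, so a single rule can match more than one reason. A hypothetical sketch:

rules:
# This rule matches when a Job fails for either of the listed reasons.
- action: FailJobSet
  onJobFailureReasons:
  - BackoffLimitExceeded
  - DeadlineExceeded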

Example: Handling BackoffLimitExceeded

This example configures the JobSet to perform unlimited restarts only when the leader job fails because its backoffLimit was exceeded.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: onjobfailurereasons-present-example
spec:
  failurePolicy:
    maxRestarts: 3
    rules:
      # The JobSet will restart an unlimited number of times when the
      # leader job fails with the failure reason BackoffLimitExceeded.
      - action: RestartJobSetAndIgnoreMaxRestarts 
        targetReplicatedJobs:
        - leader
        onJobFailureReasons:
        - BackoffLimitExceeded
  replicatedJobs:
  - name: leader
    replicas: 1
    template:
      spec:
        # Set backoff limit to 0 so job will immediately fail if any pod fails.
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: leader
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                echo "JOB_COMPLETION_INDEX=$JOB_COMPLETION_INDEX"
                if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
                  for i in $(seq 10 -1 1)
                  do
                    echo "Sleeping in $i"
                    sleep 1
                  done
                  exit 1
                fi
                for i in $(seq 1 1000)
                do
                  echo "$i"
                  sleep 1
                done
  - name: workers
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: worker
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 1000

Example: Handling PodFailurePolicy

This example shows how to use Kubernetes’ podFailurePolicy in conjunction with JobSet’s failurePolicy. The JobSet will restart an unlimited number of times only if the leader job fails due to its pod failure policy being triggered (in this case, a container exiting with code 1).

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: onjobfailurereasons-podfailurepolicy-example
spec:
  failurePolicy:
    maxRestarts: 3
    rules:
      # The JobSet will restart an unlimited number of times
      # when the leader job fails with a failure reason matching
      # the pod failure policy.
      - action: RestartJobSetAndIgnoreMaxRestarts 
        targetReplicatedJobs:
        - leader
        onJobFailureReasons:
        - PodFailurePolicy
  replicatedJobs:
  - name: leader
    replicas: 1
    template:
      spec:
        # Set backoff limit to 0 so job will immediately fail if any pod fails.
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: leader
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                echo "JOB_COMPLETION_INDEX=$JOB_COMPLETION_INDEX"
                if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
                  for i in $(seq 10 -1 1)
                  do
                    echo "Sleeping in $i"
                    sleep 1
                  done
                  exit 1
                fi
                for i in $(seq 1 1000)
                do
                  echo "$i"
                  sleep 1
                done
        podFailurePolicy:
          rules:
            - action: FailJob
              onPodConditions: []
              onExitCodes:
                containerName: leader
                operator: In
                values: [1] 
  - name: workers
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: worker
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 1000

Example: Handling Host Maintenance Events

This example models how to handle a host maintenance event. A pod evicted due to maintenance receives a DisruptionTarget pod condition, which in turn triggers the job’s podFailurePolicy. The JobSet’s failure policy rule matches on the PodFailurePolicy reason and restarts the JobSet without counting the restart against maxRestarts. Any other failure in the leader job triggers a normal restart that does count against maxRestarts (which is 0 in this case, so any other failure causes the JobSet to fail immediately).

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: host-maintenance-event-model
spec:
  failurePolicy:
    maxRestarts: 0
    rules:
      # The JobSet will restart an unlimited number of times when failure matches the pod failure policy.
      - action: RestartJobSetAndIgnoreMaxRestarts
        onJobFailureReasons:
        - PodFailurePolicy
      # The JobSet is restarted as normal when the leader job fails and the above rule is not matched.
      - action: RestartJobSet
        targetReplicatedJobs:
        - leader
  replicatedJobs:
  - name: leader
    replicas: 1
    template:
      spec:
        # Set backoff limit to 0 so job will immediately fail if any pod fails.
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: leader
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                echo "JOB_COMPLETION_INDEX=$JOB_COMPLETION_INDEX"
                if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
                  for i in $(seq 120 -1 1)
                  do
                    echo "Sleeping in $i"
                    sleep 1
                  done
                  exit 1
                fi
                for i in $(seq 1 1000)
                do
                  echo "$i"
                  sleep 1
                done
        # This pod failure policy is triggered when a node undergoes host maintenance.
        # In such a case, the pods are evicted and the job fails with a condition
        # of type DisruptionTarget.
        podFailurePolicy:
          rules:
            - action: FailJob
              onPodConditions: 
              - type: DisruptionTarget
  - name: workers
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: worker
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 1000