Welcome to JobSet


A unified API for deploying HPC and AI/ML training workloads on Kubernetes

JobSet is a Kubernetes-native API for distributed ML training and HPC workloads.

Use JobSet to orchestrate large-scale, distributed ML training and HPC workloads with out-of-the-box support for fault tolerance, fast failure recovery, configurable success and failure policies, and more.

JobSet manages a group of Kubernetes Jobs as a unit, providing a simple way to orchestrate multiple types of Jobs as a single workload. For example, a parameter server and a group of workers can be managed as a single workload.
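
As a rough illustration of the parameter server and workers example, the sketch below builds a JobSet manifest as a plain Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The field names reflect the v1alpha2 JobSet API as we understand it and should be checked against the reference docs for your installed version. The JobSet name, the container images, the replica counts, and the policy values are hypothetical placeholders, as are the helper functions `replicated_job` and `build_jobset`.

```python
import json


def replicated_job(name, replicas, image):
    """Return one spec.replicatedJobs entry: a Job template created `replicas` times."""
    return {
        "name": name,
        "replicas": replicas,
        "template": {  # an ordinary Kubernetes Job template
            "spec": {
                "parallelism": 1,
                "completions": 1,
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        # Image names are hypothetical placeholders.
                        "containers": [{"name": name, "image": image}],
                    }
                },
            }
        },
    }


def build_jobset(name):
    """Assemble a JobSet that groups one parameter server Job and four worker Jobs."""
    return {
        # Assumed API version; verify against the installed JobSet release.
        "apiVersion": "jobset.x-k8s.io/v1alpha2",
        "kind": "JobSet",
        "metadata": {"name": name},
        "spec": {
            # Illustrative policies: restart the whole JobSet up to 3 times,
            # and mark it successful once all worker Jobs complete.
            "failurePolicy": {"maxRestarts": 3},
            "successPolicy": {
                "operator": "All",
                "targetReplicatedJobs": ["worker"],
            },
            "replicatedJobs": [
                replicated_job("ps", 1, "example.com/ps:latest"),
                replicated_job("worker", 4, "example.com/worker:latest"),
            ],
        },
    }


manifest = build_jobset("ps-training")
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it with `kubectl apply -f` would, assuming the JobSet controller is installed in the cluster, create both groups of Jobs as one workload.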

Contributions welcome!

We use a pull request workflow for contributions on GitHub. New users are always welcome!


Connect with us

Talk to contributors on the #wg-batch channel.


Join the mailing group

Join the conversation on the mailing group
