JobSet is a Kubernetes-native API for distributed ML training and HPC workloads.
Use JobSet to orchestrate large-scale, distributed ML training and HPC workloads with out-of-the-box support for fault tolerance, fast failure recovery, configurable success and failure policies, and more.
JobSet manages a group of Kubernetes Jobs as a unit, providing a simple way to orchestrate multiple types of Jobs as a single workload. For example, a parameter server and a group of workers can be managed as a single workload.
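As a rough illustration of the parameter server and workers example above, the following Python sketch builds a JobSet manifest as a plain dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The field names (`replicatedJobs`, `successPolicy`, `failurePolicy`) and the `jobset.x-k8s.io/v1alpha2` API version reflect the JobSet API as we understand it and should be checked against the CRD installed in your cluster. The helper `build_jobset`, the image name, the commands, and the replica counts are all hypothetical placeholders.

```python
import json


def build_jobset(name, worker_parallelism=4, image="example.com/trainer:latest"):
    """Return a JobSet manifest (as a dict) with one parameter server and a group of workers.

    Assumption: field names follow the jobset.x-k8s.io/v1alpha2 API; verify
    them against the CRD version installed in your cluster.
    """

    def replicated_job(job_name, parallelism, command):
        # Each replicated job wraps an ordinary batch/v1 Job template.
        return {
            "name": job_name,
            "replicas": 1,
            "template": {
                "spec": {
                    "parallelism": parallelism,
                    "completions": parallelism,
                    "backoffLimit": 0,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {"name": job_name, "image": image, "command": command}
                            ],
                        }
                    },
                }
            },
        }

    return {
        "apiVersion": "jobset.x-k8s.io/v1alpha2",  # assumed API version
        "kind": "JobSet",
        "metadata": {"name": name},
        "spec": {
            # Restart the whole group a limited number of times on failure.
            "failurePolicy": {"maxRestarts": 3},
            # Treat the JobSet as successful once all worker Jobs complete.
            "successPolicy": {"operator": "All", "targetReplicatedJobs": ["workers"]},
            "replicatedJobs": [
                replicated_job("ps", 1, ["python", "ps.py"]),  # hypothetical command
                replicated_job("workers", worker_parallelism, ["python", "worker.py"]),
            ],
        },
    }


manifest = build_jobset("training-run")
print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file and applying it with `kubectl apply -f` would create both Jobs under a single JobSet object, so the parameter server and workers are created, restarted, and deleted together rather than tracked as separate resources.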
Contributions welcome!
We accept contributions through the pull request workflow on GitHub. New contributors are always welcome!