Creating a Virtual Pool of GPUs

Abstract data science workloads from infrastructure to simplify workflows

Run:AI pools heterogeneous resources so they can be used within two logical environments, build and train. Each environment is matched to the compute characteristics of a different stage of a data scientist's work, which increases utilization. The virtual pool exists inside a Kubernetes cluster, and both environments submit their build and training workloads through the Run:AI scheduler.

  • Build environment – dedicated to building models interactively, typically using Jupyter notebooks or PyCharm, or simply by SSH-ing into a container. Performance is typically less critical in build environments, so build workloads can run on workstations or low-end servers.

     

  • Training environment – dedicated to long training workloads. Because performance matters in training, these workloads should run on high-end GPU servers. Training containers can be supplemented with a checkpointing mechanism that allows automatic preemption and resumption without losing the training state. Run:AI creates a virtual pool of GPUs that can easily be shared among all users, and users can exceed their guaranteed quota and use more GPUs than they are assigned.
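The checkpointing mechanism mentioned for training containers can be illustrated with a minimal Python sketch. It is not Run:AI code. The file format, the function names (save_checkpoint, load_checkpoint, train), and the stop_after parameter that simulates a preemption are all hypothetical, and the "training" is a toy counter standing in for a real model update.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so an interruption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weight": 0.0}


def train(path, total_steps, stop_after=None):
    """Run or resume a toy training loop.

    stop_after is a hypothetical knob that simulates a preemption after
    that many steps in the current run.
    """
    state = load_checkpoint(path)
    steps_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return state  # "preempted": progress so far is already on disk
        state["weight"] += 0.5  # stand-in for a real optimizer update
        state["step"] += 1
        steps_this_run += 1
        save_checkpoint(path, state)
    return state
```

In this sketch, calling train(path, 10, stop_after=4) and then train(path, 10) with the same path finishes the remaining six steps instead of starting over, which is the property a scheduler relies on when it pauses and later resumes a training workload.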

By pooling the resources and managing them with the Run:AI scheduler, administrators gain control over the cluster. They can easily onboard new users, maintain existing hardware and add new hardware to the pool, and get a holistic view of GPU usage and utilization.

In addition, data scientists can automatically provision resources without depending on IT admins.

Run:AI Scheduling Mechanism

Simplify machine scheduling

Run:AI’s dedicated batch scheduler, which runs on Kubernetes, provides features that are crucial for managing deep learning (DL) workloads, such as advanced queuing and quotas, priority and policy management, automatic pause/resume, and multi-node training. It simplifies otherwise complex scheduling processes.
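To make the idea of guaranteed quotas with over-quota use more concrete, here is a small Python sketch. It is an illustration under stated assumptions, not Run:AI's actual algorithm: the allocate function, the project names, and the round-robin rule for handing out idle GPUs are all hypothetical.

```python
def allocate(total_gpus, quotas, requests):
    """Split a GPU pool among projects (hypothetical policy, for illustration).

    quotas:   project -> guaranteed number of GPUs
    requests: project -> number of GPUs the project currently wants
    """
    if sum(quotas.values()) > total_gpus:
        raise ValueError("guaranteed quotas exceed the size of the pool")

    # Pass 1: every project gets what it asks for, up to its guaranteed quota.
    alloc = {p: min(requests.get(p, 0), quotas[p]) for p in quotas}
    idle = total_gpus - sum(alloc.values())

    # Pass 2: idle GPUs go, one at a time and round-robin (an assumed rule),
    # to projects that still want more than they have been given.
    while idle > 0:
        wanting = [p for p in quotas if requests.get(p, 0) > alloc[p]]
        if not wanting:
            break
        for p in wanting:
            if idle == 0:
                break
            alloc[p] += 1
            idle -= 1
    return alloc


quotas = {"team-a": 4, "team-b": 4}

# team-b leaves two GPUs idle, so team-a runs over its quota of 4.
print(allocate(8, quotas, {"team-a": 7, "team-b": 2}))

# team-b now claims its full guarantee, so team-a falls back to its quota.
print(allocate(8, quotas, {"team-a": 7, "team-b": 4}))
```

Recomputing the allocation when team-b asks for its full quota takes the over-quota GPUs away from team-a. In a real cluster that step corresponds to preempting running workloads, which is why over-quota training jobs benefit from checkpointing.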