Bridging the efficiency of High-Performance Computing and the simplicity of Kubernetes – the Run:AI Scheduler allows users to easily make use of fractional GPUs, integer GPUs, and multiple nodes of GPUs for distributed training on Kubernetes. In this way, AI workloads run based on needs, not capacity. See our guide “Kubernetes vs Slurm” for why we built the Run:AI Scheduler as a simple plug-in to Kubernetes.
Built as a plug-in to K8s, Run:AI requires no advanced setup and works with any Kubernetes “flavor,” such as vanilla Kubernetes, Red Hat OpenShift, and HPE Container Platform.
Batch processing refers to grouping or “batching” together many processing jobs that run to completion in parallel without user intervention. Batch processing and scheduling are commonly used in High Performance Computing and the concepts are applicable to deep learning as well.
With batch scheduling, programs run to completion and free up resources when they finish, making the system much more efficient. Training jobs can be queued and then launched when resources become available. Workloads can also be stopped and restarted later if resources need to be reclaimed and allocated to more urgent jobs or to under-served users.
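The queuing behavior described above can be sketched in a few lines. This is a simplified illustration, not Run:AI's actual scheduling logic: jobs wait in a FIFO queue and launch only when enough GPUs are free, and GPUs return to the pool as each job runs to completion.

```python
from collections import deque

def run_batch(jobs, total_gpus):
    """Toy batch-scheduling simulation (illustrative only).

    jobs: list of (name, gpus_needed) tuples in submission order.
    Returns the order in which jobs were launched.
    """
    queue = deque(jobs)
    free = total_gpus
    launched = []
    running = []
    while queue or running:
        # Launch queued jobs in FIFO order while free GPUs allow.
        while queue and queue[0][1] <= free:
            name, need = queue.popleft()
            free -= need
            running.append((name, need))
            launched.append(name)
        # The oldest running job completes and frees its GPUs,
        # letting the next queued job start.
        if running:
            _, need = running.pop(0)
            free += need
    return launched
```

On an 8-GPU pool, two 4-GPU jobs start immediately, while an 8-GPU job queues until both complete and their GPUs are freed.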
Often, when using distributed training to run compute-intensive jobs on multiple GPU machines, all of the GPUs need to be synchronized to communicate and share information. Gang scheduling is used when containers need to be launched together, start together, recover from failures together, and end together. Networking and communication between machines can be automated by the cluster orchestrator.
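The core of gang scheduling is all-or-nothing placement: either every worker in the gang gets a GPU slot, or none are launched, which prevents a half-started job from holding GPUs while waiting forever for missing peers. A minimal sketch of that placement rule (hypothetical node names, not Run:AI internals):

```python
def gang_schedule(workers_needed, free_gpus_per_node):
    """All-or-nothing gang placement (illustrative only).

    workers_needed: number of workers that must all start together.
    free_gpus_per_node: {node_name: free_gpu_count}.
    Returns a list of node assignments, or None if the whole gang
    cannot be placed (the job stays queued rather than partially running).
    """
    placement = []
    remaining = dict(free_gpus_per_node)  # don't mutate caller's view
    for _ in range(workers_needed):
        node = next((n for n, g in remaining.items() if g > 0), None)
        if node is None:
            return None  # not enough capacity: launch nothing
        remaining[node] -= 1
        placement.append(node)
    return placement
```

A 3-worker gang fits on two nodes with 2 free GPUs each, but a 5-worker gang is rejected outright instead of launching 4 workers that would deadlock waiting for the fifth.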
Unfortunately, without topology awareness, a researcher can run a container once and get excellent performance, and then get poor performance the next time on the same server. The problem comes from the topology of GPUs, CPUs, and the links between them. The same problem can occur for distributed workloads due to the topology of Network Interface Cards and the links between GPU servers.
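Why placement on the same server can vary so much: two GPUs connected by NVLink exchange data far faster than two GPUs that must cross PCIe or a CPU interconnect. A topology-aware placer scores candidate GPU pairs by link speed. The bandwidth figures below are illustrative assumptions (real values come from tools such as `nvidia-smi topo -m`), and the function is a sketch, not Run:AI's placement algorithm:

```python
# Illustrative link bandwidths in GB/s -- assumed values, not measured.
LINK_BANDWIDTH = {"NVLINK": 600, "PCIe": 64, "CPU-interconnect": 20}

def pick_gpu_pair(links):
    """links: {(gpu_a, gpu_b): link_type} describing a server's topology.
    Choose the pair joined by the fastest interconnect, so a 2-GPU job
    gets consistent performance instead of a random placement.
    """
    return max(links, key=lambda pair: LINK_BANDWIDTH[links[pair]])
```

With GPUs 0 and 1 joined by NVLink, the placer always co-locates a 2-GPU job there rather than splitting it across the slower CPU interconnect.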
The Run:AI Scheduler ensures that the physical properties of AI infrastructure are taken into account when running AI workloads, for ideal and consistent performance.
The Run:AI Scheduler manages tasks in batches using multiple queues on top of Kubernetes, allowing system admins to define different rules, policies, and requirements for each queue based on business priorities. Combined with an over-quota system and configurable fairness policies, the allocation of resources can be automated and optimized to allow maximum utilization of cluster resources.
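The over-quota idea can be sketched as a two-pass allocation: each queue first receives GPUs up to its guaranteed quota, then any idle capacity is distributed to queues that still have demand, so no GPU sits unused while a queue has waiting work. This is a simplified illustration of the concept, not Run:AI's actual fairness policy:

```python
def allocate(total_gpus, queues):
    """Toy quota + over-quota allocation (illustrative only).

    queues: {name: (quota, demand)} where quota is the guaranteed share
    and demand is how many GPUs the queue's pending jobs want.
    """
    # Pass 1: every queue gets up to its guaranteed quota.
    alloc = {name: min(quota, demand) for name, (quota, demand) in queues.items()}
    spare = total_gpus - sum(alloc.values())
    # Pass 2: hand out idle GPUs round-robin to queues with unmet demand,
    # so guaranteed-but-unused capacity doesn't sit idle.
    while spare > 0:
        hungry = [n for n, (q, d) in queues.items() if alloc[n] < d]
        if not hungry:
            break
        for n in hungry:
            if spare == 0:
                break
            alloc[n] += 1
            spare -= 1
    return alloc
```

For example, on 8 GPUs, a team with a quota of 2 but demand for 6 can borrow the 3 GPUs another team's quota guarantees but its jobs do not currently need.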