Guaranteed Quotas Significantly Increase GPU Cluster Utilization
Workloads are tagged by project. When a workload enters the Run:AI system, it is pushed to the queue assigned to its project. Each project has its own priority and resource quota.
Projects with a guaranteed quota of GPUs, unlike projects with a fixed quota, can use more GPUs than their quota allows, which minimizes idle resource time. To do so, the system allocates available resources to a job even if the job belongs to a project that is over quota. When a job is submitted and there are not enough available resources to launch it, the scheduler pauses a job from a queue that is over quota, taking priorities and fairness parameters into account.
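The behavior described above can be illustrated with a small Python sketch. It is not Run:AI's implementation: the class names, the "started"/"queued" return values, the rule that a lower priority value is paused first, and the choice to pause the most recently started job are all assumptions made for illustration. The sketch also assumes that the guaranteed quotas add up to no more than the cluster size.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    project: str
    gpus: int


@dataclass
class Project:
    name: str
    quota: int              # guaranteed number of GPUs
    priority: int = 0       # assumption: lower value is paused first
    running: list = field(default_factory=list)

    @property
    def used(self) -> int:
        return sum(job.gpus for job in self.running)


class Scheduler:
    """Hypothetical scheduler: over-quota use of idle GPUs, with reclaim."""

    def __init__(self, total_gpus: int, projects: list):
        # Guarantees can only hold if all quotas fit in the cluster.
        assert sum(p.quota for p in projects) <= total_gpus
        self.total_gpus = total_gpus
        self.projects = {p.name: p for p in projects}
        self.paused = []

    def free(self) -> int:
        return self.total_gpus - sum(p.used for p in self.projects.values())

    def submit(self, job: Job) -> str:
        owner = self.projects[job.project]
        # Idle GPUs go to any job, even one whose project is over quota.
        if job.gpus <= self.free():
            owner.running.append(job)
            return "started"
        # No idle capacity: reclaim only for work inside the guaranteed quota.
        if owner.used + job.gpus > owner.quota:
            return "queued"
        while self.free() < job.gpus:
            over = [p for p in self.projects.values() if p.used > p.quota]
            # Lowest priority first, then the project furthest over quota.
            victim = min(over, key=lambda p: (p.priority, p.quota - p.used))
            # Pause the victim's most recently started job.
            self.paused.append(victim.running.pop())
        owner.running.append(job)
        return "started"


sched = Scheduler(8, [Project("team-a", quota=4), Project("team-b", quota=4)])
results = [
    sched.submit(Job("b1", "team-b", 4)),  # within quota
    sched.submit(Job("b2", "team-b", 4)),  # over quota, uses idle GPUs
    sched.submit(Job("a1", "team-a", 4)),  # within quota, pauses b2
    sched.submit(Job("a2", "team-a", 2)),  # over quota, cluster is full
]
```

In this example, team-b borrows the four idle GPUs for b2, and loses them again as soon as team-a claims its guaranteed share. A real scheduler would also resume paused jobs when capacity frees up and apply fairness rules among over-quota projects, which this sketch leaves out.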
Guaranteed quotas ensure that resources are available to data scientists when needed. By enabling researchers to run experiments without thinking about the underlying infrastructure, Run:AI removes hassles and helps increase productivity.
Data scientists’ dynamic workflow profile includes the following varied tasks:
Guaranteed quotas help data scientists run more experiments. In addition, Run:AI removes the limits on the number of concurrent experiments they can run and on the number of GPUs they can use for multi-GPU workloads. This greatly increases utilization of the overall GPU cluster.