From Static GPU Allocations... To Guaranteed Quotas
Guaranteed Quotas Significantly Increase GPU Cluster Utilization
Workloads are tagged with their project when they enter the Run:AI system, which pushes each workload to the queue assigned to that project. Each project has its own priority and resource quota.
Projects with a guaranteed quota of GPUs, as opposed to projects with a fixed quota, can use more GPUs than their quota allows, which minimizes idle resource time. To do so, the system allocates available resources to a job even if the job belongs to a project that is over quota. When a job is submitted and there are not enough available resources to launch it, the scheduler pauses a job from a queue that is over quota, taking priorities and fairness parameters into account.
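The behavior described above can be illustrated with a minimal Python sketch. It is not Run:AI's implementation. The names `Project`, `Scheduler`, `submit`, and `_pick_victim` are hypothetical, as is the victim-selection rule (lowest priority first, then the project furthest over quota), which only stands in for the priority and fairness parameters mentioned above. The sketch also assumes that the guaranteed quotas add up to no more than the cluster size and that the most recently started job of the chosen project is the one paused.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    name: str
    quota: int                  # guaranteed number of GPUs
    priority: int = 0           # higher value = more important (assumption)
    running: list = field(default_factory=list)  # GPU count of each running job

    @property
    def in_use(self) -> int:
        return sum(self.running)

    @property
    def over_quota(self) -> int:
        return max(0, self.in_use - self.quota)


class Scheduler:
    """Toy scheduler: over-quota use is allowed, guaranteed quota can be reclaimed."""

    def __init__(self, total_gpus, projects):
        self.total_gpus = total_gpus
        self.projects = {p.name: p for p in projects}
        self.paused = []        # (project name, GPUs) for every paused job

    def free_gpus(self) -> int:
        return self.total_gpus - sum(p.in_use for p in self.projects.values())

    def submit(self, project_name: str, gpus: int) -> bool:
        project = self.projects[project_name]
        # Only a job that fits inside its project's guaranteed quota
        # may cause other projects' over-quota jobs to be paused.
        within_quota = project.in_use + gpus <= project.quota
        while within_quota and self.free_gpus() < gpus:
            victim = self._pick_victim(exclude=project)
            if victim is None:
                break
            # Pause the victim project's most recently started job.
            self.paused.append((victim.name, victim.running.pop()))
        if self.free_gpus() >= gpus:
            # Idle GPUs are handed out even if the project ends up over quota.
            project.running.append(gpus)
            return True
        return False

    def _pick_victim(self, exclude):
        candidates = [
            p for p in self.projects.values()
            if p is not exclude and p.over_quota > 0
        ]
        if not candidates:
            return None
        # Hypothetical rule: lowest priority first, then furthest over quota.
        return min(candidates, key=lambda p: (p.priority, -p.over_quota))
```

In this sketch, a project with a quota of 2 GPUs on an otherwise idle 4-GPU cluster can start four single-GPU jobs. When a second project then submits a job that fits within its own quota, one of the first project's over-quota jobs is paused to make room. A further over-quota request from the first project is refused because no GPUs are idle.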
Why Use Guaranteed Quotas?
Run More Experiments and Increase Productivity
Guaranteed Quotas ensure that resources will be available to data scientists when needed. By enabling researchers to run experiments without thinking about the underlying infrastructure at all, Run:AI removes hassles and helps increase productivity.
Data scientists’ workflows are dynamic, and their workload varies from week to week:
- Weeks during which they run a small number of experiments (for example, when experimenting with new models or new data),
- Weeks during which they run a large number of experiments (for example, when using hyperparameter optimization to squeeze a few more percentage points of accuracy out of a specific model),
- And weeks when they are relatively idle (writing documents, gathering data, etc.).
Guaranteed quotas help data scientists run more experiments. In addition, Run:AI removes the limits on the number of concurrent experiments they can run and on the number of GPUs they can use for multi-GPU workloads. This greatly increases utilization of the overall GPU cluster.
Data scientists are currently required to pre-define the exact GPU requirements of their jobs; however, there is no real way for them to know their needs accurately until they begin training.
With Run:AI, these subjective estimates are replaced by system-wide, dynamic, and automated allocation of resources.