Unique Data Science Workflows Require a New Approach

Solving for the different compute needs of model building, training, and inference

Dynamic resource allocation is essential for AI development.
To understand how static allocation creates compute bottlenecks, it helps to look at how data scientists use accelerators today. At each stage of the deep learning (DL) process, data scientists have specific needs for compute resources, which typically align to the following tasks:

  • Building models: users build DL models while consuming GPU power in interactive sessions. Physical GPUs are statically allocated to these sessions.

  • Training models: DL models are generally trained in long sessions. Training is highly compute-intensive, can run across multiple GPUs, and typically requires very high GPU utilization. Performance, in terms of time-to-train, is highly important. Within a project lifecycle, there are long periods during which many concurrent training workloads are running (e.g. while optimizing hyperparameters), but also long idle periods in which only a small number of experiments are utilizing GPUs.

  • Inference: in this phase, trained DL models serve requests from real-time applications or from periodic systems that run offline batch inference. Inference typically induces low GPU utilization and a small memory footprint compared to training sessions.

The Run:AI platform pools GPUs into two logical environments, one for build workloads and one for training workloads. A scheduler manages the requests for compute that come from data scientists.

  • For the build environment, fixed quotas of GPUs are assigned to users.
  • For the training environment, the greater share of the pool’s resources is assigned, so that resources can be easily shared among all users.

When a job is submitted to the queue and there are not enough available resources to launch it, the scheduler can pause a job belonging to a queue that is over its quota, taking priority and fairness parameters into account.
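As an illustration, the following is a minimal Python sketch of this kind of quota-aware scheduling with preemption. The class and field names (Project, guaranteed_quota, priority, and so on) are hypothetical and only approximate the behavior described above; they are not Run:AI’s actual implementation or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Project:
    name: str
    guaranteed_quota: int   # GPUs guaranteed to this project's workloads
    gpus_in_use: int = 0

    def over_quota(self) -> bool:
        return self.gpus_in_use > self.guaranteed_quota


@dataclass
class Job:
    name: str
    project: Project
    gpus: int        # GPUs requested by the job
    priority: int    # higher value = more important


class FairShareScheduler:
    """Hypothetical sketch: launch queued jobs, preempting over-quota work if needed."""

    def __init__(self, total_gpus: int):
        self.total_gpus = total_gpus
        self.running: list[Job] = []

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - sum(j.gpus for j in self.running)

    def submit(self, job: Job) -> bool:
        # Pause over-quota jobs until the incoming job fits or no victim remains.
        while self.free_gpus < job.gpus:
            victim = self._pick_victim(job)
            if victim is None:
                return False        # job stays queued; nothing can be reclaimed fairly
            self._pause(victim)
        self._start(job)
        return True

    def _pick_victim(self, incoming: Job) -> Optional[Job]:
        # Only jobs whose project exceeds its guaranteed quota are preemptible,
        # and the lowest-priority one among them is paused first.
        candidates = [j for j in self.running
                      if j.project.over_quota() and j.priority < incoming.priority]
        return min(candidates, key=lambda j: j.priority, default=None)

    def _start(self, job: Job) -> None:
        self.running.append(job)
        job.project.gpus_in_use += job.gpus

    def _pause(self, job: Job) -> None:
        self.running.remove(job)
        job.project.gpus_in_use -= job.gpus
```

In this sketch, a job is only paused if its project is consuming more than its guaranteed quota and it has lower priority than the incoming job, so guaranteed allocations stay intact while opportunistic capacity is reclaimed.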

In this way, the Run:AI platform greatly simplifies workflows for data scientists in the build and training phases of their DL initiatives.

Could Kubernetes Alone Solve the Challenge?

Data scientists use containers to support their need for agility and portability. As Kubernetes is the de facto standard for orchestrating containerized applications in enterprise IT environments, it seems like a suitable solution for data science as well. However, Kubernetes uses a scheduler that was built for services, not for data science experiments, and it therefore lacks important capabilities for AI development. For example, it does not support guaranteed quotas or efficient orchestration of distributed computing workloads.

The Run:AI software was built as a Kubernetes plugin that enhances the inherent scheduling capabilities of K8s specifically to support AI workloads.
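To make this concrete, the snippet below sketches how a containerized training workload might be submitted to a Kubernetes cluster so that a custom scheduler, rather than the default one, places it. It uses the standard Kubernetes Python client; the scheduler name "runai-scheduler", the namespace, and the container image are illustrative assumptions, not values confirmed by this document.

```python
from kubernetes import client, config

# Load cluster credentials from the local kubeconfig (in-cluster config also works).
config.load_kube_config()

# A training pod that requests two whole GPUs via the NVIDIA device plugin resource.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="resnet-training", labels={"project": "team-a"}),
    spec=client.V1PodSpec(
        # Assumed scheduler name: routes the pod to a custom, non-default scheduler.
        scheduler_name="runai-scheduler",
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="registry.example.com/team-a/train:latest",  # illustrative image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "2"}  # whole-GPU allocation in vanilla K8s
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="team-a", body=pod)
```

Without a quota- and fairness-aware scheduler behind that scheduler name, the default Kubernetes scheduler would simply leave the pod Pending whenever fewer than two GPUs are free, which is exactly the static-allocation bottleneck described above.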