In this post, we’ll look at how fractionalizing GPUs for deep learning inference workloads, which have lower computational needs than training, can save 50-75% of the GPU cost of running those workloads.
How does Deep Learning inference differ from training?
Before we get into the value of fractionalizing GPUs, it’s important to explain that at each stage of the deep learning process, data scientists complete different tasks that determine how they interact with neural networks and GPUs. The process can be divided into four phases: data preparation, building, training and inference.
Data Preparation – this phase includes cleaning, manipulating and understanding the data so that the models we build have the best chance of success. For deep learning, this stage is typically done without GPUs.
Building models – this is where researchers create Deep Learning (DL) models, which involves model design, coding and debugging. In this phase researchers consume GPU power interactively, in short bursts, and often leave the GPU idle between bursts.
Training models – in this phase DL models are assigned weights that best map inputs to outputs. This phase is highly compute intensive and can run for days as researchers optimize their models on huge data sets. Training speed is therefore highly important.
Inference – in this phase of deep learning, trained DL models make predictions on (that is, infer results from) new data. Inference workloads fall into two categories, online and offline.
What’s the difference between offline and online inference?
- Offline Inference – In an offline use case, a model that has already been trained runs on new data that has arrived since training was completed. For example, take Facebook photos. Millions of pictures are uploaded to Facebook every day, and Facebook tags and organizes those pictures for you. This is a classic inference job: it takes new data, in this case a picture, and applies what an already trained model has learned to place a tag on the picture. In this case metrics like latency are less important, and inference can run offline, at scheduled times or when compute resources are available.
- Online Inference – In an online scenario, the inference is running on data that needs to be used now, in real time. Examples of this are found in time-sensitive use cases like fraud detection, where an online transaction needs to be approved or rejected.
To summarize, training is highly compute-intensive, runs on multiple GPUs, and typically requires very high GPU utilization, whereas inference workloads are “light” and consume significantly less GPU memory than training. In a project lifecycle, there are long periods during which many concurrent training workloads are running (e.g. while optimizing hyperparameters), but also long periods of idle time.
Run two, four or more workloads on the same GPU – on premises or in the cloud
For inference workloads – both online and offline – only small amounts of compute power and memory are required, and yet a full GPU is typically allocated to each inference job, leaving as much as 80% of the GPU idle. Until now, there was no way to dynamically allocate a fraction of a GPU to a smaller inference workload. With fractional GPU from Run:AI, multiple inference workloads are able to run on the same GPU. Run:AI orchestration dynamically autoscales your inference workloads to run efficiently at scale across multiple GPU nodes and clusters.
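To make the idea of sharing a GPU concrete, here is a minimal sketch of how fractional GPU requests could be packed onto whole GPUs. It is an illustration only, not Run:AI’s scheduler: the function name `pack_workloads`, the first-fit-decreasing strategy, and the example fractions are all assumptions made for this example, and it ignores real-world concerns such as memory isolation and autoscaling.

```python
def pack_workloads(fractions, capacity=1.0):
    """Pack fractional GPU requests onto as few GPUs as a simple heuristic allows.

    Hypothetical helper for illustration: uses first-fit decreasing,
    a generic bin-packing heuristic, not any vendor's actual algorithm.
    Returns a list of GPUs, each a list of the fractions placed on it.
    """
    tolerance = 1e-9  # absorb floating-point rounding when summing fractions
    gpus = []
    for frac in sorted(fractions, reverse=True):
        if not 0 < frac <= capacity:
            raise ValueError(f"request {frac} does not fit on a single GPU")
        for gpu in gpus:
            # Place the request on the first GPU that still has room.
            if sum(gpu) + frac <= capacity + tolerance:
                gpu.append(frac)
                break
        else:
            # No existing GPU has room, so allocate a new one.
            gpus.append([frac])
    return gpus


# Four inference services that each need a quarter of a GPU share one GPU.
quarter_services = pack_workloads([0.25, 0.25, 0.25, 0.25])
print(len(quarter_services))

# A mixed set of (assumed) requests needs two GPUs instead of five.
mixed_services = pack_workloads([0.5, 0.2, 0.2, 0.5, 0.3])
print(len(mixed_services))
```

Without fractional allocation, each of these requests would occupy a full GPU, so the first example would hold four GPUs and the second would hold five.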
Cut GPU costs significantly
Using Run:AI, multiple inference tasks can run on the same GPU, and the cost savings for on-premises GPU infrastructure become clear. The savings are even more visible on cloud infrastructure. For example, let’s assume four inference services run concurrently, each on a different GPU. Paying by the minute on cloud infrastructure, you’ll pay for the cost of four GPUs, multiplied by the length of time that these services are up and running. With Run:AI software you can allocate these four services to the same GPU without compromising inference time and performance, so you pay for one GPU instead of four: 75% less than what you were spending previously.
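The arithmetic behind the 50-75% range can be sketched in a few lines. This is a simplified model under stated assumptions: every service fits in an equal share of a GPU, billing is per GPU-minute, and the function names and the price used below are hypothetical, not taken from any cloud provider’s price list.

```python
import math


def gpus_needed(num_services, services_per_gpu):
    """Number of whole GPUs required when several services share each GPU."""
    return math.ceil(num_services / services_per_gpu)


def sharing_savings(num_services, services_per_gpu):
    """Fraction of GPU cost saved versus one dedicated GPU per service."""
    dedicated = num_services  # one full GPU per service
    shared = gpus_needed(num_services, services_per_gpu)
    return 1 - shared / dedicated


def gpu_cost(num_gpus, price_per_gpu_minute, minutes):
    """Pay-by-the-minute cost: GPUs x price x time the services are up."""
    return num_gpus * price_per_gpu_minute * minutes


# Four services on one GPU: pay for 1 GPU instead of 4.
print(sharing_savings(4, 4))  # 0.75

# Two services per GPU: pay for 2 GPUs instead of 4.
print(sharing_savings(4, 2))  # 0.5

# Hypothetical price of 0.05 per GPU-minute over a 60-minute window.
price = 0.05  # assumed value for illustration only
dedicated_cost = gpu_cost(4, price, 60)
shared_cost = gpu_cost(gpus_needed(4, 4), price, 60)
print(dedicated_cost, shared_cost)
```

Because the duration and the per-minute price are the same in both cases, they cancel out, and the saving depends only on how many services share each GPU: two per GPU gives 50%, four per GPU gives 75%.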
Share resources efficiently
Traditionally, researchers can be left without GPU access while they wait for other teams’ inference workloads to complete, even though those workloads use only a fraction of the GPU. With Run:AI fractionalization capabilities, this is no longer a limitation. Researchers can share GPU access and run multiple workloads on a single GPU. Data science jobs are completed faster, models get to market sooner, and researchers can share costly resources more effectively.
Whether on premises or in the cloud, Run:AI can help reduce costs for inference workloads. To get a free trial of Run:AI’s fractional GPU, contact us at firstname.lastname@example.org