Many organizations today leverage data-driven artificial intelligence (AI) to reduce costs and increase revenues across a wide range of business use cases, from product development and manufacturing to marketing, sales, and customer service. Statista predicts that by 2025, global revenues from AI services, software, and hardware will be $126 billion, with the key drivers being ongoing improvements in machine learning (ML) and deep learning (DL) algorithms as well as in hardware accelerators that provide computational power.
Organizations have also learned that an AI engineering strategy is essential as AI initiatives proliferate and become more complex. Gartner includes AI engineering in its top strategic technology trends for 2021, noting its importance for scaling, maintaining, governing, and operationalizing AI-based projects. An engineering strategy ensures that AI becomes closely integrated into DevOps processes, allowing data scientists and operations to collaborate closely so that data scientists can focus on their work, free from infrastructure concerns.
This post is the first of a two-part series about a core infrastructure issue familiar to any organization undertaking sophisticated AI-based projects: how to dynamically share GPUs simply and effectively. In this article, we explore the challenges of using GPUs efficiently, while in the second article, we’ll describe how Run:AI’s GPU virtualization and orchestration solution helps meet specific GPU sharing challenges.
GPUs (graphics processing units) are the hardware accelerators that currently dominate the AI infrastructure landscape. Although originally designed to handle heavy graphical processing tasks, GPUs today are also used for accelerating parallel calculations carried out on very large quantities of data.
GPUs, with their SIMD (single instruction, multiple data) architecture, are well-suited to deep learning processes, which require the same process to be performed for numerous data items. With their high-bandwidth memory designed specifically for accelerating deep learning computations as well as their inherent scalability, GPUs support distributed training processes and, in general, can significantly speed up ML/DL operations.
Consumer-grade GPUs can be used to cost-effectively supplement existing computational resources for model building or low-level testing. Data-center GPUs, on the other hand, are designed for large-scale projects and deliver enterprise-grade performance across all environments, from modeling to production.
Whether purchased for on-premises machines or rented in the cloud, GPU resources are expensive, so organizations seek to optimize their utilization. Deep learning frameworks such as PyTorch and TensorFlow may make GPU programming and processing more accessible to modern data science implementations, but they are not designed to address the widespread infrastructure issue of GPUs being underutilized. In fact, when Run:AI starts work with a new customer, we typically see a GPU utilization rate of 25-30%—often to the surprise of our customer’s IT team.
These far-from-optimal utilization rates stem from GPUs sitting idle, for two seemingly contradictory reasons:
On the one hand, GPUs are too big. Their memory capacity and compute power keep growing; the popular NVIDIA V100 GPU, for example, delivers 32GB of onboard memory. But during model development, hyperparameters such as batch size or the number of layers tend to be small, and workloads consume nowhere near 32GB of GPU memory. There are also plenty of production models that do not require large GPU memory resources.
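To make this concrete, here is a back-of-the-envelope estimate of how little of a 32GB GPU a small development model actually needs. All numbers (parameter count, batch size, activation footprint) are illustrative assumptions of our own, not measurements of any particular model:

```python
# Rough GPU memory estimate for a small training job.
# All numbers below are illustrative assumptions, not measurements.
def training_memory_gb(params, batch_size, activations_per_sample,
                       bytes_per_value=4):
    """Estimate memory for weights, gradients, Adam optimizer state,
    and activations, in GB (float32 throughout)."""
    weights = params * bytes_per_value
    gradients = params * bytes_per_value
    optimizer_state = 2 * params * bytes_per_value  # Adam keeps two moments
    activations = batch_size * activations_per_sample * bytes_per_value
    return (weights + gradients + optimizer_state + activations) / 1024**3

# A hypothetical 10M-parameter model trained with batch size 32:
needed = training_memory_gb(params=10_000_000, batch_size=32,
                            activations_per_sample=500_000)
print(f"{needed:.2f} GB of a 32 GB V100")  # well under 1 GB
```

Even with generous assumptions, such a job leaves the vast majority of a 32GB card idle.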
On the other hand, GPUs are too small. While it is easy to add RAM to a CPU, GPU memory can only be increased by purchasing or provisioning GPUs with more memory. And as of today, the upper limit of memory for even the most advanced GPUs stands at a few dozen GBs. Although GPU memory capacity will continue to get bigger, in the foreseeable future, it will not reach the hundreds of GBs of memory that can be made available on a CPU.
Also, GPU applications consume a lot more memory than CPU applications. Compared to the MBs or even hundreds of MBs of memory consumed by CPU applications, just creating the context for an application to use a GPU can take several hundreds of MBs of memory. This means that only a small number of applications can run simultaneously on a single GPU. Teams that want to develop and run more GPU applications or increase the hyperparameters of existing applications will need to spend more money purchasing or provisioning GPUs.
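A quick calculation, again with assumed, illustrative numbers, shows how this fixed per-application context overhead caps the number of applications a single GPU can host:

```python
# How many applications fit on one GPU when each pays a fixed
# context overhead on top of its working set? Numbers are illustrative.
def max_apps_per_gpu(gpu_memory_mb, context_mb, app_working_set_mb):
    return gpu_memory_mb // (context_mb + app_working_set_mb)

# 32 GB GPU, ~300 MB of context per app, 700 MB working set each:
print(max_apps_per_gpu(32 * 1024, 300, 700))  # 32
```

Doubling each application's working set roughly halves the number of applications that fit, which is exactly why teams end up buying more GPUs.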
Applications running on the same GPU share its memory in a zero-sum model—every byte allocated by one app is one less byte available to the other apps. The only way for multiple applications to run simultaneously is to cooperate with each other. Each application running on the same GPU must know exactly how much memory it is allowed to allocate. And if it exceeds this memory allocation, it does so at the expense of the requirements of another application sharing the same GPU, which could cause Out of Memory failures.
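The zero-sum dynamic can be sketched as a toy model (our own illustration, not Run:AI code): each allocation shrinks the pool available to everyone else, and overshooting an agreed share surfaces as an Out of Memory failure.

```python
# Toy model of zero-sum GPU memory sharing: every MB one app takes
# is a MB unavailable to the others; overshooting causes an OOM failure.
class SharedGPU:
    def __init__(self, total_mb):
        self.total_mb = total_mb
        self.allocations = {}

    def allocate(self, app, mb):
        in_use = sum(self.allocations.values())
        if in_use + mb > self.total_mb:
            raise MemoryError(f"OOM: {app} asked for {mb} MB, "
                              f"only {self.total_mb - in_use} MB free")
        self.allocations[app] = self.allocations.get(app, 0) + mb

gpu = SharedGPU(total_mb=16_000)
gpu.allocate("app-a", 10_000)   # fine
gpu.allocate("app-b", 4_000)    # fine: the apps cooperated on shares
# gpu.allocate("app-b", 4_000)  # would raise MemoryError
```

Real GPUs behave the same way, except no component enforces the agreed shares: the first application to exceed its budget breaks its neighbors.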
This issue is exacerbated by the fact that many frameworks, such as TensorFlow, allocate the entire GPU memory upfront by default. Although this may seem greedy, it sidesteps memory allocation issues and is the recommended best practice. Modifying an application to allocate only a portion of the GPU memory requires code changes, such as enabling allow_growth in TensorFlow. Aside from the deep knowledge of application internals required to safely make such changes, it is sometimes just not possible. For example, if an application is running inside a container, you do not always have access to its source code. In general, cooperation is extremely difficult to implement when running multiple containers on the same GPU, since the containers should not be aware of each other.
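In TensorFlow 2.x, for instance, opting out of the default allocate-everything behavior looks roughly like this. This is a configuration sketch, and it must run before any GPU is first used by the process:

```python
import tensorflow as tf

# Ask TensorFlow to grow GPU memory usage on demand instead of
# grabbing all of it upfront. Must be set before GPUs are initialized.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```

Note that this only changes one application's behavior; it does nothing to coordinate shares between applications.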
And when multiple users want to share a GPU, the logistics become very challenging. The users have to coordinate and decide how much memory each one gets so they know how to properly allocate it to their application(s).
A team can adopt a strict policy in which each member gets an equal and static share of the GPU. For example, in a team of three members, each one would receive one-third of the GPU memory. However, in such a static approach, any time a team member is not using their share of GPU memory, it lies idle instead of being allocated dynamically to another team member. Figure 2 clearly shows how even GPU-intensive training workloads alternate between intense concurrent GPU usage and idle periods.
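The waste from a static, equal split is easy to quantify. With illustrative numbers, at a moment when only one of three team members is actually training:

```python
# Idle memory under a static three-way split of a 32 GB GPU,
# at a moment when only one of the three users is training.
# Numbers are illustrative.
total_gb = 32
team_size = 3
active_users = 1

share_gb = total_gb / team_size
idle_gb = share_gb * (team_size - active_users)
print(f"{idle_gb:.1f} GB idle out of {total_gb} GB")  # 21.3 GB idle
```

Two-thirds of the card sits idle while the active user is capped at one-third, even though the hardware could give them all of it.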
Last but not least, it should be noted that GPUs see only processes and applications. This makes it even harder to share GPU memory equitably along dimensions the GPU is indifferent to, such as users, projects, teams, and business priorities.
The challenges of sharing grow exponentially when there are multiple GPUs involved. Containers, processes, and applications are assigned to a single GPU for their entire execution lifetime. However, a node or cluster is very dynamic, with jobs starting and ending at different times.
With static applications running in dynamic environments, suboptimal decisions can cause fragmentation in free GPU memory. Applications running across different GPUs in a node or cluster might leave small fragments of memory free on each GPU. But the next executing application that requires GPU memory may not be able to find a single GPU with sufficient available memory by itself, even though there is enough free GPU memory in total in the environment.
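A minimal sketch of the fragmentation problem, with hypothetical free-memory figures: the cluster has enough free GPU memory in total, but no single GPU can host the job.

```python
# Cluster-wide there is enough free GPU memory for the job,
# but no single GPU can host it. All figures are hypothetical.
free_gb_per_gpu = [10, 12, 8]   # free memory left on each GPU
job_needs_gb = 24

total_free = sum(free_gb_per_gpu)
fits_somewhere = any(free >= job_needs_gb for free in free_gb_per_gpu)

print(f"total free: {total_free} GB, job needs: {job_needs_gb} GB")
print("schedulable on a single GPU?", fits_somewhere)  # False
```

The 30GB of aggregate free memory is useless to a 24GB job because it is scattered across three cards in fragments no larger than 12GB.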
Data science teams may also try to use Excel sheets to manage GPU allocations statically. In addition to assigning a certain portion of GPU memory per team member (as described above), each member is also statically assigned a set of GPU identifiers. These methods, however, are no match for the real-life erratic GPU consumption pattern shown in Figure 3, with six different users using the same GPUs for build, train, and inference workloads.
Wouldn’t it be nice if data science teams had a platform that could automatically and dynamically orchestrate the allocation of GPU compute and memory resources across a cluster based on predefined user/project/workload priority rules? And wouldn’t it be even nicer if the same platform could logically divide a physical GPU into smaller virtual GPUs so that usage could be more granular?
Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute-intensive experiments as needed.
Here are some of the capabilities you gain when using Run:AI:
Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:AI GPU virtualization platform.