By creating multiple logical GPUs on a single resource, Run:AI has built another key part of AI virtualization technology
Tel Aviv — Wednesday, May 6, 2020 — Run:AI, a company virtualizing AI infrastructure, today released the first fractional GPU sharing system for deep learning workloads on Kubernetes. Especially suited for lightweight AI tasks at scale such as inference, the fractional GPU system transparently gives data science and AI engineering teams the ability to run multiple workloads simultaneously on a single GPU, enabling companies to run more workloads such as computer vision, voice recognition and natural language processing on the same hardware, lowering costs.
Today’s de facto standard for deep learning workloads is to run them in containers orchestrated by Kubernetes. However, Kubernetes is only able to allocate whole physical GPUs to containers, lacking the isolation and virtualization capabilities needed to allow GPU resources to be shared without memory overflows or processing clashes.
Run:AI’s fractional GPU system effectively creates virtualized logical GPUs, with their own memory and computing space that containers can use and access as if they were self-contained processors. This enables several deep learning workloads to run in containers side-by-side on the same GPU without interfering with each other. The solution is transparent, simple and portable; it requires no changes to the containers themselves.
To create the fractional GPUs, Run:AI had to modify how Kubernetes handles GPU resources. “In Kubernetes, a GPU is handled as an integer,” said Dr. Ronen Dar, co-founder and CTO of Run:AI. “You either have one or you don’t. We had to turn GPUs into floats, allowing for fractions of GPUs to be assigned to containers.” Run:AI also solved the problem of memory isolation, so each virtual GPU can run securely without memory clashes.
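As a rough illustration of what treating GPUs as fractions rather than integers means for placement, the Python sketch below assigns fractional requests to physical GPUs using a simple first-fit rule. It is not Run:AI's implementation, and it does not model memory isolation; the names `PhysicalGPU` and `allocate`, the job names and the requested fractions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalGPU:
    """One physical device, tracked as a fraction in [0, 1] (hypothetical model)."""
    name: str
    free: float = 1.0                      # share of the device not yet assigned
    jobs: list = field(default_factory=list)


def allocate(gpus, job, fraction):
    """First-fit placement: use the first GPU with enough free capacity.

    Returns the GPU name, or None if no device can host the request.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    for gpu in gpus:
        # Small tolerance guards against floating-point rounding.
        if gpu.free + 1e-9 >= fraction:
            gpu.free = max(0.0, gpu.free - fraction)
            gpu.jobs.append((job, fraction))
            return gpu.name
    return None


# Illustrative workload mix: three lightweight jobs and one full-GPU job.
gpus = [PhysicalGPU("gpu-0"), PhysicalGPU("gpu-1")]
requests = [
    ("inference-a", 0.25),
    ("inference-b", 0.25),
    ("training-c", 1.0),
    ("inference-d", 0.5),
]
placements = {job: allocate(gpus, job, frac) for job, frac in requests}

for job, gpu_name in placements.items():
    print(f"{job} -> {gpu_name}")
```

With whole-GPU allocation, these four jobs would need four devices; in this sketch the three inference jobs share one GPU and the training job takes the other.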
A typical use case could see two to four jobs running on the same GPU, meaning companies could do up to four times the work with the same hardware. For some lightweight workloads, such as inference, more than eight jobs running in containers can comfortably share the same physical chip.
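The consolidation arithmetic behind these figures can be sketched in a few lines of Python. The helper `gpus_needed` and the job count of 16 are hypothetical, chosen only to show how the number of devices shrinks as more jobs share each GPU; real workloads will not always pack this evenly.

```python
import math


def gpus_needed(num_jobs, jobs_per_gpu=1):
    """Number of physical GPUs required if each GPU hosts `jobs_per_gpu` jobs."""
    if num_jobs < 0 or jobs_per_gpu < 1:
        raise ValueError("num_jobs must be >= 0 and jobs_per_gpu must be >= 1")
    return math.ceil(num_jobs / jobs_per_gpu)


# Hypothetical fleet of 16 lightweight jobs at different sharing levels.
for jobs_per_gpu in (1, 2, 4, 8):
    print(f"{jobs_per_gpu} job(s) per GPU -> {gpus_needed(16, jobs_per_gpu)} GPU(s)")
```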
The addition of fractional GPU sharing is a key component in Run:AI’s mission to create a true virtualized AI infrastructure, combining with Run:AI’s existing technology that elastically stretches workloads over multiple GPUs and enables resource pooling and sharing.
“Some tasks, such as inference tasks, often don’t need a whole GPU, but all those unused processor cycles and RAM go to waste because containers don’t know how to take only part of a resource,” said Run:AI co-founder and CEO Omri Geller. “Run:AI’s fractional GPU system lets companies unleash the full capacity of their hardware so they can scale up their deep learning more quickly and efficiently.”
Run:AI has built the world’s first virtualization layer for AI workloads. By abstracting workloads from underlying infrastructure, Run:AI creates a shared pool of resources that can be dynamically provisioned, enabling full utilization of expensive GPU compute. IT teams retain control and gain real-time visibility – including seeing and provisioning run-time, queueing and GPU utilization – from a single web-based UI. This virtual pool of resources enables IT leaders to view and allocate compute resources across multiple sites – whether on premises or in the cloud. The Run:AI platform is built on top of Kubernetes, enabling simple integration with existing IT and data science workflows.