Question 1

What Is GPU Scheduling?

Accepted Answer

A graphics processing unit (GPU) is an electronic chip that renders graphics by quickly performing mathematical calculations. GPUs use parallel processing to enable several processors to handle different parts of one task.

GPUs are used to accelerate a wide range of workloads, including artificial intelligence (AI) and machine learning (ML). These workloads run complex computations and are often supported by high performance computing (HPC) infrastructure.

To provide results quickly, AI and ML workloads consume massive amounts of resources, including multi-GPU clusters. GPU scheduling helps distribute AI and ML workloads across a large number of GPUs and use those resources effectively. It is typically achieved with schedulers: workload managers that automatically provision GPUs as needed.

Schedulers vs orchestrators

Traditionally, job scheduling was done by dedicated schedulers like Slurm or IBM LSF. Due to the complexity of these tools, many organizations are transitioning to container orchestrators like Kubernetes or Nomad.

However, the complexity doesn’t end there. Container orchestrators do not support GPU scheduling by default. You can add GPU scheduling to your orchestrator using plugins and libraries provided by some GPU and software vendors. AMD and NVIDIA provide device plugins that you can install on Kubernetes. HashiCorp Nomad has provided its own device plugin since version 0.9, and Microsoft offers the DirectX API for Windows 10.

Question 2

What Are the Challenges of GPU Scheduling for AI and HPC?

Accepted Answer

Most AI and high performance computing (HPC) applications offer GPU support. The NVIDIA CUDA environment makes it easier to program GPUs: parallel code is implemented as blocks of threads, and memory can be unified between GPUs and CPUs. Developers can use GPU-accelerated libraries like cuFFT, cuDNN and cuBLAS to avoid programming at a low level.

However, there are several important challenges organizations face when trying to deploy AI and HPC applications on multiple GPU systems.

Host code vs device code

GPU applications can pose challenges for managing an HPC data center because they comprise both host code, which runs on the CPU, and device code, which runs on the GPU. In CUDA, the GPU-optimized parallel device code is called a kernel, and it runs as blocks of multiple parallel threads. Depending on the architecture used, factors such as the number of blocks or threads per block can impact performance.
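To make the relationship between blocks and threads per block concrete, here is a minimal sketch of the arithmetic a host program typically does before launching a kernel. It is plain Python rather than CUDA, the function names are hypothetical, and the default of 256 threads per block is only an illustrative assumption, not a recommendation for any particular GPU.

```python
def launch_config(n_elements: int, threads_per_block: int = 256) -> tuple:
    """Return (blocks, threads_per_block) so that every element gets one thread."""
    if n_elements <= 0 or threads_per_block <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division: round up so the last partial block is included.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


def global_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Mirror the common CUDA idiom blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


n = 1_000_000
blocks, tpb = launch_config(n, 256)
# Threads in the last block that fall past the end of the data and must do nothing.
idle_threads = blocks * tpb - n
print(blocks, tpb, idle_threads)  # 3907 256 192
```

Changing threads_per_block changes both the block count and the number of idle threads in the last block, which is one reason the best setting can differ between architectures.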

Heterogeneous systems

It is relatively simple to run your GPU apps on one system, but it is much more complex in a large HPC or AI environment:

Compute environments are often heterogeneous with multiple servers running multiple generations of GPUs
Multiple users, departments and projects frequently compete for resources with different business, performance and technology requirements
Application needs can vary widely – in some cases, multiple GPU kernels may share a GPU, while in other cases applications may be distributed across multiple hosts and GPUs

Workload managers add complexity

You can use a workload manager (e.g. Slurm or IBM Spectrum LSF) to optimize resource utilization and performance in your HPC center. Scheduling is especially important for GPU-powered applications, given the cost of resources and the sensitivity of application performance to workload placement. GPU resources for complex applications can easily be underutilized, due to complex application and resource dependencies.

Workload schedulers need detailed information, such as the compute mode, GPU model, memory and device status, to place workloads optimally on GPU hosts and devices. For instance, knowing a GPU’s operating temperature might help the scheduler avoid placing a workload on an overheated device. However, traditional schedulers were not designed with an AI environment in mind and are not tightly integrated with GPU resources, making it difficult to collect this information.
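As a rough illustration of how a scheduler could use this kind of information, the sketch below filters a GPU inventory by health, free memory, temperature and model, then picks the tightest fit. Everything here is hypothetical: the GpuStatus fields, the pick_gpu function, the host names and the 85 °C threshold are assumptions for the example, not the behavior of Slurm, LSF or any other real workload manager.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GpuStatus:
    """Hypothetical snapshot of one device as a scheduler might see it."""
    host: str
    model: str
    free_memory_gb: float
    temperature_c: float
    healthy: bool


def pick_gpu(gpus: List[GpuStatus], required_memory_gb: float,
             max_temperature_c: float = 85.0,
             model: Optional[str] = None) -> Optional[GpuStatus]:
    """Return the best-fitting eligible GPU, or None if nothing qualifies."""
    candidates = [
        g for g in gpus
        if g.healthy
        and g.free_memory_gb >= required_memory_gb
        and g.temperature_c <= max_temperature_c  # skip overheated devices
        and (model is None or g.model == model)
    ]
    if not candidates:
        return None
    # Best fit: least leftover memory first, cooler device as the tie-breaker.
    return min(candidates,
               key=lambda g: (g.free_memory_gb - required_memory_gb, g.temperature_c))


inventory = [
    GpuStatus("node-a", "V100", 16.0, 62.0, True),
    GpuStatus("node-b", "A100", 40.0, 91.0, True),   # too hot
    GpuStatus("node-c", "A100", 24.0, 70.0, True),
    GpuStatus("node-d", "A100", 80.0, 55.0, False),  # reported unhealthy
]

chosen = pick_gpu(inventory, required_memory_gb=20.0)
print(chosen.host if chosen else "no eligible GPU")  # node-c
```

Best fit is only one possible policy; a real scheduler would also weigh queue priorities, fairness between users and data locality.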

In addition, workload managers are difficult to maintain and use, and were not designed for cloud-native environments. A growing alternative to traditional workload managers is GPU scheduling on container orchestrators like Kubernetes and Nomad.

Question 3

GPU Scheduling on Kubernetes

Accepted Answer

Kubernetes allows you to manage GPUs across multiple nodes. However, Kubernetes only supports GPU scheduling for NVIDIA and AMD. Device plugins enable GPU scheduling in Kubernetes, but they are not installed by default, so you have to deploy and configure them yourself.

Once you have selected your GPU vendor (NVIDIA or AMD), you can install the relevant GPU drivers on your nodes. The vendor will provide a device plugin that you can run.

Once the driver and device plugin are running, Kubernetes exposes the vendor-specific GPU resource (nvidia.com/gpu or amd.com/gpu) as a schedulable resource. To consume GPUs from a container, you request that resource by name, in the same way you request CPU or memory.
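A minimal sketch of such a request follows, written as a Python dictionary that is serialized to JSON (kubectl accepts JSON manifests as well as YAML). The pod name and image are hypothetical placeholders, and gpu_pod_manifest is a helper invented for this example; only the resource names come from the text above.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpu_count: int,
                     resource_name: str = "nvidia.com/gpu") -> dict:
    """Build a single-container Pod manifest that asks for whole GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": name,
                    "image": image,  # hypothetical image, replace with your own
                    # The GPU is requested like CPU or memory, under "limits".
                    "resources": {"limits": {resource_name: gpu_count}},
                }
            ],
        },
    }


manifest = gpu_pod_manifest("gpu-demo", "registry.example/cuda-app:latest", 1)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to kubectl apply -f would submit the pod, assuming the cluster has the NVIDIA device plugin installed. For AMD nodes, pass resource_name="amd.com/gpu" instead.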

There are some limitations to how a GPU’s resource requirements can be specified:

GPUs cannot be overcommitted: you cannot share a GPU across containers or pods.
Containers cannot request parts of a GPU: a container can only be granted access to an entire GPU, or to several GPUs.
Limits must be the same as requests: requests guarantee what a container receives, whereas limits ensure that the container does not receive more than the specified value. GPUs are less flexible than CPU or memory, so you specify them in the limits section. You can set a limit without a request, in which case Kubernetes uses the limit as the request. If you set both, they must be equal, and you cannot set a request without a limit.
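The rules above can be sketched as a small pre-submission check. This is an illustrative validator written for this answer, not the validation code Kubernetes itself runs; check_gpu_resources and its error messages are hypothetical, and it only looks at the two resource names mentioned earlier.

```python
GPU_RESOURCES = ("nvidia.com/gpu", "amd.com/gpu")


def check_gpu_resources(resources: dict) -> list:
    """Return a list of rule violations for a container's resources section."""
    errors = []
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})
    for name in GPU_RESOURCES:
        limit = limits.get(name)
        request = requests.get(name)
        # Rule: only whole GPUs can be granted, never a fraction of one.
        for label, value in (("limit", limit), ("request", request)):
            if value is None:
                continue
            if isinstance(value, bool) or not isinstance(value, int) or value < 0:
                errors.append(f"{name}: {label} must be a whole number of GPUs")
        # Rule: a request on its own is not allowed.
        if request is not None and limit is None:
            errors.append(f"{name}: request given without a limit")
        # Rule: when both are present they must match.
        if request is not None and limit is not None and request != limit:
            errors.append(f"{name}: request must equal limit")
    return errors


print(check_gpu_resources({"limits": {"nvidia.com/gpu": 2}}))
print(check_gpu_resources({"limits": {"nvidia.com/gpu": 0.5}}))
print(check_gpu_resources({"requests": {"nvidia.com/gpu": 1}}))
```

The first call returns an empty list because a limit on its own is valid. The other two each report one violation.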
