GPU Scheduling

What are the Options?

What Is GPU Scheduling?

A graphics processing unit (GPU) is an electronic chip that renders graphics by quickly performing mathematical calculations. GPUs use parallel processing to enable several processors to handle different parts of one task.

GPUs are used to accelerate a wide range of workloads, including artificial intelligence (AI) and machine learning (ML). These workloads run complex computations and are often supported by high performance computing (HPC) infrastructure.

To provide results quickly, AI and ML workloads consume massive amounts of resources, including multi-GPU clusters. GPU scheduling helps distribute AI and ML workloads across a large number of GPUs and use those resources effectively. It is typically achieved with schedulers: workload managers that automatically provision GPUs as needed.

Schedulers vs orchestrators

Traditionally, job scheduling was done by dedicated schedulers like Slurm or IBM LSF. Due to the complexity of these tools, many organizations are transitioning to container orchestrators like Kubernetes or Nomad.

However, the complexity doesn’t end there. Container orchestrators do not support GPU scheduling by default. You can add GPU scheduling to your orchestrator using plugins and libraries provided by some GPU and software vendors. AMD and NVIDIA provide device plugins that you can install on Kubernetes, HashiCorp Nomad 0.9 provides its own device plugin, and Microsoft offers the DirectX API for Windows 10.


What Are the Challenges of GPU Scheduling for AI and HPC?

Most AI and high performance computing (HPC) applications offer GPU support. The NVIDIA CUDA environment makes it easier to program GPUs: parallel code is implemented as blocks of threads, and memory is unified between GPUs and CPUs. Developers can use GPU-compatible libraries like cuFFT, cuDNN and cuBLAS to avoid programming at a low level.

However, there are several important challenges organizations face when trying to deploy AI and HPC applications on multiple GPU systems.

Host code vs device code

GPU applications can pose challenges for managing an HPC data center because they comprise both host code, which runs on the CPU, and device code, which runs on the GPU. In CUDA, the GPU-optimized parallel device code is called a kernel, and it is executed as blocks of multiple parallel threads. Depending on the architecture used, factors such as the number of blocks or the number of threads per block can affect performance.
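As a rough illustration of how the number of blocks follows from the threads-per-block choice, the sketch below computes a one-dimensional launch configuration in Python. The function name `launch_config` and the default of 256 threads per block are assumptions made for this example, not values prescribed by CUDA; the best block size depends on the GPU architecture and on the kernel itself.

```python
from __future__ import annotations


def launch_config(num_elements: int, threads_per_block: int = 256) -> tuple[int, int]:
    """Return (blocks, threads_per_block) so that every element gets one thread."""
    if num_elements <= 0 or threads_per_block <= 0:
        raise ValueError("num_elements and threads_per_block must be positive")
    # Ceiling division: round up so the last, partially filled block is included.
    blocks = (num_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


if __name__ == "__main__":
    n = 1_000_000
    for tpb in (128, 256, 1024):
        blocks, threads = launch_config(n, tpb)
        # Threads that are launched but have no element to process.
        idle = blocks * threads - n
        print(f"threads/block={threads}: blocks={blocks}, idle threads={idle}")
```

Different block sizes cover the same data with a different number of blocks and a different number of idle threads, which is one reason the same kernel can perform differently across GPU generations.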

Heterogeneous systems

It is relatively simple to run your GPU apps on one system, but it is much more complex in a large HPC or AI environment:

  • Compute environments are often heterogeneous with multiple servers running multiple generations of GPUs
  • Multiple users, departments and projects frequently compete for resources with different business, performance and technology requirements
  • Application needs can vary widely – in some cases, multiple GPU kernels may share a GPU, while in other cases applications may be distributed across multiple hosts and GPUs

Workload managers add complexity

You can use a workload manager (e.g. Slurm or IBM Spectrum LSF) to optimize resource utilization and performance in your HPC center. Scheduling is especially important for GPU-powered applications, given the cost of resources and the sensitivity of application performance to workload placement. GPU resources for complex applications can easily be underutilized, due to complex application and resource dependencies.

Workload schedulers need detailed information (such as the GPU mode, GPU model, memory and device status) to place workloads optimally on GPU hosts and devices. For instance, knowing a GPU’s operating temperature might help the scheduler avoid placing a workload on an overheated device. However, traditional schedulers were not designed with an AI environment in mind and are not tightly integrated with GPU resources, making it difficult to collect this information.
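To make the placement problem concrete, here is a minimal Python sketch of a filter-then-rank decision based on device status. Everything in it is hypothetical: the `GpuStatus` fields, the `pick_gpu` function and the 85 °C cutoff are assumptions for illustration and do not reflect how Slurm, LSF or any other scheduler is implemented.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuStatus:
    """Snapshot of one device; all fields are hypothetical for this sketch."""
    host: str
    model: str
    free_memory_mib: int
    temperature_c: float
    healthy: bool = True


def pick_gpu(devices: list[GpuStatus], needed_mib: int,
             max_temp_c: float = 85.0) -> Optional[GpuStatus]:
    """Return the coolest healthy device with enough free memory, or None."""
    candidates = [
        d for d in devices
        if d.healthy and d.free_memory_mib >= needed_mib and d.temperature_c < max_temp_c
    ]
    # Prefer the coolest device; break ties by the most free memory.
    return min(candidates, key=lambda d: (d.temperature_c, -d.free_memory_mib), default=None)


if __name__ == "__main__":
    fleet = [
        GpuStatus("node-a", "V100", 16000, 88.0),  # enough memory, but too hot
        GpuStatus("node-b", "V100", 8000, 60.0),
        GpuStatus("node-c", "A100", 40000, 60.0),
    ]
    chosen = pick_gpu(fleet, needed_mib=10000)
    print(chosen.host if chosen else "no suitable GPU")
```

A scheduler can only make this kind of decision if the status fields are actually available to it, which is the integration gap described above.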

In addition, workload managers are difficult to maintain and use, and were not designed for cloud-native environments. A growing alternative to traditional workload managers is GPU scheduling on container orchestrators like Kubernetes and Nomad.


GPU Scheduling on Kubernetes

Kubernetes allows you to manage GPUs across multiple nodes. However, Kubernetes only supports GPU scheduling for NVIDIA and AMD GPUs. Device plugins enable GPU scheduling in Kubernetes, but they are not installed by default, so you have to deploy and configure them yourself.

Once you have selected your GPU vendor (NVIDIA or AMD), you can install the relevant GPU drivers on your nodes. The vendor will provide a device plugin that you can run.

Once the GPU driver and the device plugin are running, Kubernetes exposes the GPU as a vendor-specific schedulable resource. To consume GPUs from containers, you request that vendor-specific resource in the same way you would request CPU or memory.

There are some limitations to how a GPU’s resource requirements can be specified:

  • GPUs cannot be overcommitted—you cannot share GPUs across containers and pods.
  • Containers cannot request parts of a GPU—a container can only be granted access to an entire GPU, or to several GPUs.
  • Limits must be the same as requests—requests guarantee what a container can receive, whereas limits ensure that the container does not receive resources that exceed specified values. GPUs are relatively inflexible compared to other resources, so you can only set up limits. Any requests you specify have to be equal to the limits.
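The rules above can be sketched as a small Python helper that builds a Pod manifest and checks its GPU settings. This is an illustration under stated assumptions: `nvidia.com/gpu` is the resource name advertised by NVIDIA's device plugin (AMD's plugin advertises `amd.com/gpu`), while the helper names, the Pod name and the image `example.com/cuda-app:latest` are hypothetical.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"  # swap for "amd.com/gpu" on AMD nodes


def gpu_pod(name: str, image: str, gpus: int) -> dict:
    """Build a Pod manifest that asks for whole GPUs through limits only."""
    if isinstance(gpus, bool) or not isinstance(gpus, int) or gpus < 1:
        raise ValueError("gpus must be a positive whole number")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,  # hypothetical image name
                "resources": {"limits": {GPU_RESOURCE: gpus}},
            }],
        },
    }


def validate_gpu_resources(resources: dict) -> None:
    """Raise ValueError if GPU limits/requests break the rules listed above."""
    limit = resources.get("limits", {}).get(GPU_RESOURCE)
    request = resources.get("requests", {}).get(GPU_RESOURCE)
    if limit is None:
        if request is not None:
            raise ValueError("GPU request given without a GPU limit")
        return
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 1:
        raise ValueError("GPU limit must be a positive whole number")
    if request is not None and request != limit:
        raise ValueError("GPU request must equal the GPU limit")


if __name__ == "__main__":
    pod = gpu_pod("train", "example.com/cuda-app:latest", 2)
    validate_gpu_resources(pod["spec"]["containers"][0]["resources"])
    print(json.dumps(pod, indent=2))  # JSON manifests can be passed to kubectl
```

Specifying only limits keeps the manifest short and avoids a mismatch between requests and limits.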

Deploying AMD Device Plugin on Kubernetes Nodes

You will need to install the designated Linux driver before you can deploy AMD GPUs on a node. You can run the AMD device plugin with the following command, once you’ve installed the driver:

kubectl create -f

Deploying NVIDIA Device Plugin on Kubernetes Nodes

You can use NVIDIA’s official GPU device plugin to configure GPUs on Kubernetes. The plugin has a number of prerequisites:

  • The relevant NVIDIA drivers must be installed on your nodes.
  • You must install nvidia-docker 2.0 on your nodes.
  • The kubelet uses Docker as the container runtime.
  • The default Docker runtime is nvidia-container-runtime (not runc).
  • The version of the NVIDIA driver must match the ~= 384.81 constraint.

When you’ve met all the prerequisites, you can use the following command to deploy NVIDIA’s device plugin:

kubectl create -f

Unfortunately, the standard Kubernetes scheduler is quite challenging to use when working with AI and HPC workloads. See “Why the Kubernetes Scheduler is Not Enough for Your AI Workloads” on the CNCF blog, as well as our article on Kubernetes Scheduling for AI Workloads.

Related content: Read our guide to Kubernetes GPU

GPU Scheduling on Nomad

HashiCorp Nomad 0.9 introduces device plugins, which support a flexible set of devices for deploying and scheduling workloads. Device plugins allow the Nomad job scheduler to detect, fingerprint and access physical hardware devices. Version 0.9 includes an NVIDIA GPU device plugin. Examples of use cases include:

  • Compute-intensive workloads that employ an accelerator such as a GPU or TPU
  • Hardware security modules for securing applications
  • Other programmable devices

Device plugins are a new feature in Nomad 0.9 that allows the client to discover available hardware resources in addition to built-in resources like memory, CPU and disk. A device plugin detects devices and fingerprints their attributes. When a task that uses the plugin's resources is scheduled, the plugin also helps the Nomad client make the allocated device available to the task.

When a device plugin fingerprints a set of devices, it generates a report including the number of devices detected; device-specific information such as vendor, type and model; and attributes of the device such as hardware features and available memory. The plugin returns the information to the client, which passes it to the server, where it can be used to schedule jobs. Jobs request devices through the device stanza within the resources stanza:

resources {
  device "vendor/type/model" {
    count = 2
    constraint { ... }
    affinity { ... }
  }
}

The device stanza allows users to select custom devices, stating requirements at varying levels of specificity. For example, to get any NVIDIA GPU, request nvidia/gpu; to get a specific model, add the model name (e.g. nvidia/gpu/1080ti). The stanza specifies the required devices as well as the constraints and affinities for device resources. You can use any of the fingerprinted device attributes to express your preferences or constraints.
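The idea of matching a request at different levels of specificity can be sketched in a few lines of Python. This is not Nomad's implementation: the `Device` class and the `matches` and `select` functions are hypothetical, and the sketch handles only the two request forms mentioned above (vendor/type and vendor/type/model).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Device:
    """One fingerprinted device; a hypothetical model of the plugin's report."""
    vendor: str
    type: str
    model: str
    attributes: dict = field(default_factory=dict)


def matches(request: str, device: Device) -> bool:
    """True if a "vendor/type" or "vendor/type/model" request selects the device."""
    wanted = request.split("/")
    if len(wanted) not in (2, 3) or not all(wanted):
        raise ValueError(f"unsupported device request: {request!r}")
    actual = [device.vendor, device.type, device.model]
    # A shorter request is less specific: compare only the parts it names.
    return wanted == actual[:len(wanted)]


def select(request: str, devices: list[Device], count: int) -> list[Device]:
    """Return `count` matching devices, or an empty list if too few match."""
    found = [d for d in devices if matches(request, d)]
    return found[:count] if len(found) >= count else []


if __name__ == "__main__":
    fingerprinted = [
        Device("nvidia", "gpu", "1080ti", {"memory_mib": 11264}),
        Device("nvidia", "gpu", "1080ti", {"memory_mib": 11264}),
        Device("nvidia", "gpu", "titanx", {"memory_mib": 12288}),
    ]
    print(len(select("nvidia/gpu", fingerprinted, 2)))         # any NVIDIA GPU
    print(len(select("nvidia/gpu/1080ti", fingerprinted, 2)))  # a specific model
```

In a real job file, constraints and affinities on the fingerprinted attributes would narrow or rank the matching devices further.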

How to Enable Hardware Accelerated GPU Scheduling

Another option for scheduling GPUs is using the native scheduling features in Windows 10. The DirectX API from Microsoft’s Windows 10 now offers an optional Hardware Accelerated GPU Scheduling feature (since the May 2020 update, ver. 2004). This feature is designed to reduce the latency resulting from buffering between CPUs and GPUs.

With DirectX 12, most GPU scheduling can be offloaded to a dedicated scheduling processor with the appropriate drivers and hardware. Microsoft says this is intended to prioritize GPU tasks and provide the user with a responsive experience.

You can reach the settings page via Settings > System > Display > Graphics Settings. The hardware-accelerated GPU scheduling option appears on that page only if both your GPU and its driver support the GPU scheduler.

(Image of the Graphics Settings page. Source: Microsoft)

Any recent GPU with the relevant hardware should support Microsoft’s new GPU scheduler when combined with a WDDM 2.7 driver. You will not see this setting if you haven’t installed the new driver.

Windows controls prioritization and decides which applications take precedence in various contexts. Windows can offload frequently used tasks to the scheduling processor, which handles data management and context switching for the different GPU engines. A high-priority thread continues to run on the CPU, prioritizing and scheduling the jobs received from applications.

The advent of hardware-accelerated GPU scheduling can have a major impact on drivers. Even if a GPU has the necessary hardware, the driver associated with it, which exposes this support, is only released after being tested by Microsoft’s Insider population.

GPU Scheduling Simplified with Run:ai

Run:ai automates resource management and orchestration for AI workloads that use distributed GPU infrastructure in AI and HPC data centers. With Run:ai, you can automatically run as many compute-intensive workloads as needed on the GPUs in your AI and HPC infrastructure.

Here are some of the capabilities you gain when using Run:ai:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of resources, to avoid bottlenecks and optimize billing in cloud environments.
  • A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:ai accelerates deep learning and other compute-intensive workloads by helping teams optimize expensive compute resources.

Learn more about the Run:ai Kubernetes Scheduler.