Kubernetes is a highly popular container orchestrator, which can be deployed on-premises, in the cloud, and in hybrid environments.
To support compute-intensive workloads like machine learning (ML), Kubernetes can be used with graphics processing units (GPUs). GPUs provide hardware acceleration that is especially beneficial for deep learning and other machine learning algorithms. Kubernetes can also be used to scale multi-GPU setups for large-scale ML projects.
GPU scheduling on Kubernetes is currently supported for NVIDIA and AMD GPUs, and requires the use of vendor-provided drivers and device plugins.
You can run Kubernetes on GPU machines in your local data center, or leverage GPU-powered compute instances on managed Kubernetes services, including Google Kubernetes Engine (GKE), Amazon Elastic Kubernetes Service (EKS), and Azure Kubernetes Service (AKS).
Kubernetes lets you manage graphics processing units (GPUs) across multiple nodes. GPU scheduling on Kubernetes is available primarily for AMD and NVIDIA accelerators.
To enable GPU scheduling, Kubernetes uses device plugins, which let pods access specialized hardware such as GPUs. This is not set up by default; you need to configure GPU scheduling before you can use it.
First, you need to choose a GPU vendor—AMD or NVIDIA—and install your chosen GPU drivers on the nodes. You can then run the device plugin provided by the GPU vendor.
After you install the GPU driver and run the device plugin, Kubernetes exposes either nvidia.com/gpu or amd.com/gpu as a schedulable resource.
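To consume GPUs from containers, you request <vendor>.com/gpu in the same way you request memory or CPU resources. Note that there are certain limitations in how you can specify resource requirements for GPUs: you can specify GPU limits without requests, because Kubernetes uses the limit as the request value by default; if you specify both limits and requests, the two values must be equal; and you cannot specify GPU requests without limits.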
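As a rough illustration of the request described above, the following shell snippet writes a minimal Pod manifest that asks for one NVIDIA GPU. The file name gpu-pod.yaml, the pod and container names, and the container image are assumptions chosen for this sketch rather than values from a vendor guide; on AMD nodes the resource name would be amd.com/gpu instead.

```shell
# Write a minimal Pod manifest that requests one GPU.
# Assumptions: file name, pod name, container name, and image are placeholders.
cat > gpu-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: registry.k8s.io/cuda-vector-add:v0.1
      resources:
        limits:
          nvidia.com/gpu: 1   # GPU count goes under limits; use amd.com/gpu on AMD nodes
EOF

# To submit it to a cluster that already runs the vendor device plugin:
#   kubectl apply -f gpu-pod.yaml
```

If no node has a free GPU, the pod stays in the Pending state until one becomes available.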
To run AMD GPUs on a node, you need to first install an AMD GPU Linux driver. Once your nodes have the driver, you can deploy the relevant AMD device plugin by using the below command:
kubectl create -f
You can configure NVIDIA GPUs on Kubernetes using the official NVIDIA GPU device plugin. Here are several prerequisites of the plugin:
Once all the prerequisites are met, you can deploy the NVIDIA device plugin using this command:
kubectl create -f
Google Kubernetes Engine lets you run Kubernetes nodes with several types of GPUs, including NVIDIA Tesla K80, P4, V100, P100, A100, and T4.
To reduce costs, you can use preemptible virtual machines (also known as spot instances)—as long as your workloads can tolerate frequent node disruptions.
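One way this could look in practice is a small helper that creates a preemptible GPU node pool with the gcloud CLI. This is only a sketch: the function name create_gpu_pool, the GCLOUD override variable, the n1-standard-4 machine type, and the choice of a single T4 GPU per node are all assumptions for illustration, and the pool, cluster, and zone names are supplied by the caller.

```shell
# Create a preemptible GKE node pool with one NVIDIA T4 GPU per node.
# Usage: create_gpu_pool POOL_NAME CLUSTER_NAME ZONE
# Set GCLOUD="echo gcloud" to print the command instead of running it.
create_gpu_pool() {
  ${GCLOUD:-gcloud} container node-pools create "$1" \
    --cluster "$2" \
    --zone "$3" \
    --machine-type n1-standard-4 \
    --accelerator "type=nvidia-tesla-t4,count=1" \
    --preemptible \
    --num-nodes 1
}

# Example dry run (hypothetical names):
#   GCLOUD="echo gcloud" create_gpu_pool gpu-pool my-cluster us-central1-a
```

The dry-run override is there so the flags can be reviewed before anything billable is created.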
There are several prerequisites to using GPUs on GKE:
Here are several limitations of GPUs on GKE:
How to install NVIDIA drivers
After you add GPU nodes to a cluster, install the relevant NVIDIA drivers. You can do this using the DaemonSet provided by Google.
Here is a command you can run to deploy the installation DaemonSet:
kubectl apply -f
You can now run GPU workloads on your GKE cluster.
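A GPU workload on GKE might look like the sketch below, which writes a Pod manifest that targets a specific GPU model through the cloud.google.com/gke-accelerator node label. It assumes the cluster has a node pool with NVIDIA Tesla T4 GPUs; the file name, pod name, and container image are placeholders for illustration.

```shell
# Write a Pod manifest that requests one GPU on a GKE node with a T4 attached.
# Assumptions: file name, pod name, and image are placeholders;
# the cluster is assumed to have a T4 node pool.
cat > gke-gpu-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gke-gpu-example
spec:
  restartPolicy: OnFailure
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4   # pick nodes by GPU model
  containers:
    - name: cuda-vector-add
      image: registry.k8s.io/cuda-vector-add:v0.1
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# To run it on the cluster:
#   kubectl apply -f gke-gpu-pod.yaml
```

The node selector is optional; without it, the scheduler places the pod on any node that can satisfy the nvidia.com/gpu limit.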
AKS also supports creating Kubernetes nodes that are GPU-enabled. Currently you can only use GPUs for Linux node pools.
How to install NVIDIA device plugin
Before using GPUs on nodes, you need to deploy the DaemonSet for the NVIDIA device plugin. This DaemonSet runs pods on each node and provides the necessary drivers for the GPU.
To install the device plugin on Azure nodes:
1. Create a namespace using this command:
kubectl create namespace gpu-resources
2. Create a file named nvidia-device-plugin-ds.yaml and paste in the YAML manifest provided by Azure.
3. Run kubectl apply -f nvidia-device-plugin-ds.yaml to create the DaemonSet.
You can now run GPU-enabled workloads on your AKS cluster. The Azure documentation includes an example showing how to run TensorFlow on AKS nodes.
AWS offers an EKS-optimized AMI that comes with built-in GPU support. EKS-optimized AMIs are configured to be used as base images for Amazon P2 and P3 instances.
The GPU-accelerated AMI is an optional image you can use to run GPU workloads on EKS nodes. In addition to the standard EKS-optimized AMI configuration, GPU AMIs include NVIDIA drivers, a default runtime set to nvidia-container-runtime, and the nvidia-docker2 package.
To enable GPU workloads with the EKS-optimized AMI and test that GPU nodes are configured correctly:
1. After the GPU node has joined the cluster, apply the NVIDIA device plugin for Kubernetes as a DaemonSet in the cluster using the following command:
kubectl apply -f
2. Verify that a node has an allocation of GPUs by using the following command:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
3. Create a file called nvidia-smi.yaml and use the YAML configuration provided by Amazon. The manifest starts a CUDA container running nvidia-smi on the node.
4. Apply the manifest as follows:
kubectl apply -f nvidia-smi.yaml
5. Once the pod is running, check the logs using the following command:
kubectl logs nvidia-smi
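In a larger cluster, the table printed in step 2 can be hard to scan. The sketch below filters that output down to the nodes that report no allocatable GPUs. It assumes kubectl prints <none> in the GPU column when a node does not advertise nvidia.com/gpu; the function name nodes_without_gpu is hypothetical.

```shell
# Read the NAME/GPU table from step 2 on stdin and print the names of
# nodes whose GPU column is empty ("<none>") or zero.
nodes_without_gpu() {
  awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}

# Usage against a live cluster:
#   kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" | nodes_without_gpu
```

A GPU node that shows up in this list usually means the device plugin DaemonSet from step 1 is not yet running on it.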
Run:AI’s Scheduler is a simple plug-in for Kubernetes clusters that enables optimized orchestration of high-performance containerized AI workloads. The Run:AI platform includes:
Run:AI simplifies Kubernetes scheduling for AI and HPC workloads, helping researchers accelerate their productivity and the quality of their work.