A GPU cluster is a group of computers in which every node is equipped with one or more graphics processing units (GPUs). Multiple GPUs provide accelerated computing power for specific computational tasks, such as image and video processing and training neural networks and other machine learning algorithms.
There are three main types of GPU clusters, each offering specific advantages:
For more background on the use of GPUs for machine learning projects, read our multi-part guides about:
In this article:
GPU clusters are commonly used for:
Scaling up deep learning
GPU clusters provide the required computational power to train large models and datasets across multiple GPU nodes.
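To make this concrete, the following is a minimal sketch of multi-node, data-parallel training, assuming PyTorch with its DistributedDataParallel wrapper and a launcher such as torchrun; the model, data, and hyperparameters are placeholders.

```python
# Minimal multi-node, data-parallel training sketch (assumes PyTorch).
# A launcher such as torchrun sets RANK, WORLD_SIZE and LOCAL_RANK on
# every node; each process drives one GPU.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Join the process group that spans all GPU nodes in the cluster.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    # Placeholder model and random data; a real job would use your own
    # network and a DataLoader with a DistributedSampler to shard the dataset.
    model = DDP(torch.nn.Linear(1024, 10).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(100):
        inputs = torch.randn(32, 1024, device=device)
        targets = torch.randint(0, 10, (32,), device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()   # gradients are all-reduced across every GPU in the cluster
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched on every node (for example with torchrun, setting the number of nodes and processes per node to match the cluster), each process trains on its own data shard and gradients are averaged across all GPUs after every backward pass.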
Here are two ways GPU clusters can be used to power deep learning tasks:
Edge AI
GPU clusters can also be distributed, with GPU nodes spread across devices deployed at the edge, rather than in a centralized data center. Joining GPUs from multiple, distributed nodes into one cluster makes it possible to run AI inference with very low latency. This is because each node can generate predictions locally, without having to contact the cloud or a remote data center.
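As a brief illustration, here is a minimal sketch of local inference on an edge GPU node, assuming ONNX Runtime with its CUDA execution provider; the model file and input shape are hypothetical placeholders.

```python
# Local inference on an edge GPU node (assumes onnxruntime-gpu is installed).
# The model file and input shape below are placeholders.
import numpy as np
import onnxruntime as ort

# Run on the node's local GPU, falling back to CPU if CUDA is unavailable.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
sample = np.random.rand(1, 3, 224, 224).astype(np.float32)  # e.g. one image

# The prediction is generated entirely on the edge device: no round trip
# to the cloud or a central data center is required.
outputs = session.run(None, {input_name: sample})
print(outputs[0].shape)
```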
Related content: Read our guide to edge AI
Use the following steps to build a GPU-accelerated cluster in your on-premises data center.
The basic component of a GPU cluster is a node—a physical machine running one or more GPUs, which can be used to run workloads. When selecting hardware for your node, consider the following parameters:
A GPU cluster may require its own facility, or can be housed as part of an existing data center. Plan for the following elements:
Now that you have the facility and equipment ready, you will need to physically deploy the nodes. You will have a head node, a dedicated node which controls the cluster, and multiple worker nodes, which run workloads.
Deploy networking as follows:
Once hardware and networking are deployed, install an operating system on your nodes. NVIDIA recommends the open source Rocks Linux distribution for computational clusters.
In addition, you will need management software. A common choice is to manage computational clusters using Kubernetes. In this case, you will need to deploy the Kubernetes control plane on the head node (preferably with another redundant head node for high availability), and deploy Kubernetes worker nodes on the remaining machines. You may need to deploy additional software such as the SLURM job scheduler.
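As a concrete illustration of how workloads land on GPU worker nodes, the sketch below submits a pod that requests a single GPU, assuming a Kubernetes cluster with the NVIDIA device plugin and the official Python client; the pod name and container image are placeholders.

```python
# Submit a pod that requests one GPU (assumes a Kubernetes cluster with the
# NVIDIA device plugin and the official `kubernetes` Python client installed).
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-test"),  # placeholder pod name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda-container",
                image="nvidia/cuda:12.2.0-base-ubuntu22.04",  # placeholder image
                command=["nvidia-smi"],
                resources=client.V1ResourceRequirements(
                    # The GPU resource limit tells the scheduler to place this
                    # pod on a worker node with a free GPU.
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```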
Learn more in our detailed guides about:
Let’s review the main hardware options at your disposal when building a GPU cluster.
The following are some of the world’s most powerful, data center grade GPUs, commonly used to build large-scale GPU infrastructure.
The A100 is built on the NVIDIA Ampere architecture with third-generation Tensor Cores, and leverages multi-instance GPU (MIG) technology. It is built for workloads such as high-performance computing (HPC), machine learning, and data analytics.
The A100 is intended for scalability (up to thousands of units) and can be partitioned into up to seven GPU instances to fit different workload sizes. The A100 offers performance reaching up to 624 teraflops (trillion floating-point operations per second) and has 40GB of memory, 1,555GB/s of memory bandwidth, and 600GB/s NVLink interconnect bandwidth.
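As a brief sketch of how MIG partitions appear to applications (assuming PyTorch), a process can be pinned to a single MIG instance through CUDA_VISIBLE_DEVICES; the UUID below is a hypothetical placeholder you would replace with a value listed by nvidia-smi -L.

```python
# Pin this process to a single MIG instance on an A100 (assumes PyTorch).
# The UUID below is a placeholder; list real MIG device UUIDs with `nvidia-smi -L`.
import os

# Must be set before CUDA is initialized by any library.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-00000000-0000-0000-0000-000000000000"

import torch

# Each MIG instance appears to the framework as an ordinary, isolated GPU.
print(torch.cuda.device_count())      # -> 1
print(torch.cuda.get_device_name(0))  # reports the A100 MIG slice
```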
The V100 GPU also features Tensor Cores and is designed for applications such as machine learning, deep learning, and HPC. It uses the NVIDIA Volta architecture to accelerate common tensor operations in deep learning workloads. The Tesla V100 offers performance reaching 149 teraflops as well as 32GB of memory and a 4,096-bit memory bus.
The Tesla P100 GPU is based on the NVIDIA Pascal architecture, designed specifically for HPC and machine learning. The P100 offers performance of up to 21 teraflops, with 16GB of memory and a 4,096-bit memory bus.
The K80 GPU uses the NVIDIA Kepler architecture, which accelerates data analytics and scientific computing. It incorporates GPU Boost™ technology and 4,992 NVIDIA CUDA cores. The Tesla K80 offers up to 8.73 teraflops of performance, with 480GB/s of memory bandwidth and 24GB of GDDR5 memory.
Google offers a somewhat different accelerator: the tensor processing unit (TPU), an application-specific integrated circuit (ASIC), available either as chips or in the cloud, that supports deep learning. These TPUs are designed specifically for use with TensorFlow and can only be found on the Google Cloud Platform.
Google TPUs offer performance of up to 420 teraflops with 128GB of high bandwidth memory (HBM). Pod versions are also available, offering performance of over 100 petaflops, 32TB of HBM, and a 2D toroidal mesh network.
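For context on how TPUs are consumed from TensorFlow, here is a minimal sketch assuming a Cloud TPU attached to a Google Cloud VM; the model and dataset are placeholders, and name resolution details may differ per setup.

```python
# Minimal sketch of training on a Cloud TPU from TensorFlow.
# Assumes the code runs on a Google Cloud VM with an attached TPU.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")  # auto-detect
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Any model built inside the strategy scope is replicated across TPU cores.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(train_dataset)  # placeholder: supply a tf.data.Dataset
```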
A GPU server, also known as a GPU workstation, is a system capable of running multiple GPUs in one physical chassis.
NVIDIA DGX-1 is the first-generation DGX server. It is an integrated workstation with powerful computing capacity suitable for deep learning. It provides one petaflop of GPU compute power and offers these hardware features:
The architecture of DGX-2, the second-generation DGX server, is similar to that of DGX-1, but with greater computing power, reaching up to 2 petaflops when equipped with 16 Tesla V100 GPUs. NVIDIA explains that to train a ResNet-50 at the same speed as a DGX-2 using a typical x86 architecture, you would need 300 servers powered by dual Intel Xeon Gold CPUs, at a cost of over $2.7 million.
DGX-2 offers these hardware features:
NVIDIA’s third generation AI system is DGX A100, which offers five petaflops of computing power in a single system.
The DGX A100 is available in two models, with either 320GB or 640GB of total GPU memory. It offers the following hardware features:
DGX Station A100 is a lighter-weight version of the DGX A100, intended for use by developers or small teams. Its A100 GPUs have a Tensor Core architecture that leverages mixed-precision, multiply-accumulate operations, which significantly accelerates training of large neural networks.
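To show what mixed-precision training looks like at the framework level, here is a minimal sketch using PyTorch's automatic mixed precision (AMP) utilities; the model and data are placeholders, and the code is illustrative rather than DGX-specific.

```python
# Minimal mixed-precision training sketch (assumes PyTorch AMP).
# The model and random data below are placeholders.
import torch
from torch.cuda.amp import autocast, GradScaler

device = "cuda"
model = torch.nn.Linear(1024, 10).to(device)   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = GradScaler()

for step in range(100):
    inputs = torch.randn(64, 1024, device=device)
    targets = torch.randint(0, 10, (64,), device=device)
    optimizer.zero_grad()

    # The forward pass runs in reduced precision where safe, using Tensor Cores.
    with autocast():
        loss = loss_fn(model(inputs), targets)

    # Loss scaling preserves small gradient values in reduced precision.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```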
The DGX Station comes in two models, with either 160GB or 320GB of GPU memory. It offers the following hardware features:
DGX SuperPOD is a multi-node computing platform for full-stack workloads. It offers networking, storage, compute and tools for data science pipelines. NVIDIA offers an implementation service to help you deploy and maintain SuperPOD on an ongoing basis.
SuperPOD supports the integration of up to 140 DGX A100 systems in a single AI infrastructure cluster. The cluster offers the following capabilities:
Lambda Labs offers mid-range GPU workstations with 2-4 GPUs. They are typically used by individual machine learning engineers or small teams training models in a local environment. The workstation offers the following hardware features:
Run:AI automates resource management and orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed.
Here are some of the capabilities you gain when using Run:AI:
Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:AI GPU virtualization platform.