How to Build Your GPU Cluster

Process and Hardware Options

What Is a GPU Cluster?

A GPU cluster is a group of computers that have a graphics processing unit (GPU) on every node. Multiple GPUs provide accelerated computing power for specific computational tasks, such as image and video processing and training neural networks and other machine learning algorithms.

There are three main types of GPU clusters, each offering specific advantages:

  • High availability—the GPU cluster reroutes requests to different nodes in the event of a failure
  • High performance—the GPU cluster uses multiple worker nodes running in parallel to increase compute power for more demanding tasks
  • Load balancing—the GPU cluster spreads compute workloads evenly across worker nodes to handle a large volume of jobs
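
The rerouting and load-spreading behaviors above can be illustrated with a toy dispatcher. This is only a minimal sketch, not how any particular cluster manager works: the `Node` class, the `dispatch` function and the node names are hypothetical, and round-robin is just one simple balancing policy.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical worker node; `healthy` stands in for a real health check."""
    name: str
    healthy: bool = True
    jobs: List[str] = field(default_factory=list)


def dispatch(jobs: List[str], nodes: List[Node]) -> Dict[str, List[str]]:
    """Assign jobs round-robin, skipping nodes that are marked unhealthy."""
    if not any(n.healthy for n in nodes):
        raise RuntimeError("no healthy nodes available")
    ring = cycle(nodes)
    for job in jobs:
        node = next(ring)
        while not node.healthy:  # reroute past failed nodes
            node = next(ring)
        node.jobs.append(job)
    return {n.name: n.jobs for n in nodes}


# Illustrative cluster: the second node is treated as failed.
nodes = [Node("gpu-node-1"), Node("gpu-node-2", healthy=False), Node("gpu-node-3")]
assignment = dispatch([f"job-{i}" for i in range(4)], nodes)
print(assignment)
```

Here the failed node receives no work, and the four jobs are split evenly between the two healthy nodes.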

GPU Cluster Uses

GPU clusters are commonly used for:

Scaling up deep learning
GPU clusters provide the computational power required to train large models on large datasets across multiple GPU nodes.

Here are two ways GPU clusters can be used to power deep learning tasks:

  • Computer vision – computer vision architectures such as ResNet and Inception often use hundreds or thousands of convolutional layers, and are computationally intensive to train. By using GPU clusters, researchers can accelerate training time and perform fast inference on massive datasets, including video data.
  • Natural Language Processing (NLP) – large-scale NLP models, such as conversational AI, require a large amount of computational power and continuous training. GPU clusters make it possible to ingest large amounts of training data, partition it into manageable units, and train the model in parallel.
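
The idea of partitioning training data into units that are processed in parallel can be sketched as follows. This is a simplified illustration under stated assumptions: the `shard` helper is hypothetical, the "dataset" is a list of integers, and a strided split is only one possible partitioning scheme.

```python
from typing import List, Sequence


def shard(samples: Sequence, num_workers: int) -> List[list]:
    """Split samples into num_workers near-equal shards using a strided partition."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return [list(samples[rank::num_workers]) for rank in range(num_workers)]


dataset = list(range(10))  # stand-in for 10 training samples
shards = shard(dataset, num_workers=4)
for rank, part in enumerate(shards):
    print(f"GPU node {rank}: {part}")
```

Every sample lands in exactly one shard, so each GPU node can train on its own slice. In practice, deep learning frameworks usually provide this partitioning for you (for example, PyTorch's `DistributedSampler`), along with the gradient synchronization that this sketch leaves out.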

Edge AI
GPU clusters can also be distributed, with GPU nodes spread across devices deployed at the edge, rather than in a centralized data center. Joining GPUs from multiple, distributed nodes into one cluster makes it possible to run AI inference with very low latency. This is because each node can generate predictions locally, without having to contact the cloud or a remote data center.

Related content: Read our guide to edge AI

Building a GPU-Powered Research Cluster

Use the following steps to build a GPU-accelerated cluster in your on-premises data center.

Step 1: Choose Hardware

The basic component of a GPU cluster is a node—a physical machine running one or more GPUs, which can be used to run workloads. When selecting hardware for your node, consider the following parameters:

  • CPU—the node requires a CPU as well as GPUs. For most GPU nodes, any modern CPU will do.
  • RAM—the more system RAM the better, but ensure you have a minimum of 24 GB DDR3 RAM on each node.
  • Networking—each node should have at least two available network ports. You will need to use InfiniBand for fast interconnection between GPUs.
  • Motherboard—the motherboard should have PCI Express (PCIe) connections for the GPUs you intend to use and for the InfiniBand card. Ensure you have a board with physically separated PCIe x16 and PCIe x8 slots. You will typically use the x16 slots for GPUs and the x8 slots for the network card.
  • Power supply unit—data center grade GPUs are especially power hungry. When computing the total power needed, take into account the CPU, all GPUs running on the node, and other components.
  • Storage—prefer SSD drives, but slower hard disk drives (HDDs) might be enough for some scenarios.
  • GPU form factor—consider the GPU form factor that matches your node hardware and the number of GPUs you want to run per node. Common form factors include compact (SFF), single slot, dual slot, actively cooled, passively cooled, and water cooled.
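
The power calculation described above (CPU, all GPUs on the node, and other components) can be sketched as simple arithmetic. All wattages below are hypothetical placeholders, and the 20% headroom is an assumption for illustration; use the figures from your own component datasheets.

```python
def node_power_watts(cpu_w: float, gpu_w: float, num_gpus: int,
                     other_w: float, headroom: float = 0.2) -> float:
    """Estimate the power supply capacity needed for one node.

    Sums the CPU, every GPU, and the remaining components (drives, fans,
    network cards), then adds a safety margin on top.
    """
    if num_gpus < 0 or headroom < 0:
        raise ValueError("num_gpus and headroom must be non-negative")
    base = cpu_w + gpu_w * num_gpus + other_w
    return base * (1 + headroom)


# Hypothetical node: one 200 W CPU, four 300 W GPUs, 150 W of other components.
budget = node_power_watts(cpu_w=200, gpu_w=300, num_gpus=4, other_w=150)
print(f"Plan for a power supply of roughly {budget:.0f} W")
```

Note how the GPUs dominate the total: in this made-up example they account for 1,200 W of the 1,550 W drawn before headroom.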

Step 2: Allocate Space, Power and Cooling

A GPU cluster may require its own facility, or can be housed as part of an existing data center. Plan for the following elements:

  • Physical space—ensure you have racks and physical space in your data center for the nodes you intend to deploy.
  • Cooling—GPUs require extensive cooling. Take into account whether your GPUs are actively cooled using on-board equipment or passively cooled, and plan for the overall cooling requirements of the entire cluster.
  • Networking—ensure you have a fast Ethernet switch to enable communication between the cluster main node and worker nodes.
  • Storage—depending on your data requirements, you may need a central storage solution in addition to local storage on the nodes.
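
For a rough first pass at space and cooling planning, you can convert the cluster's power draw into a heat load and count the racks required. This sketch assumes that essentially all electrical power ends up as heat and uses the approximate conversion 1 W ≈ 3.412 BTU/h; the node count, node height and usable rack units are hypothetical example values.

```python
import math

BTU_PER_HOUR_PER_WATT = 3.412  # approximate conversion factor


def cooling_load_btu_per_hour(total_watts: float) -> float:
    """Heat load to remove, assuming all electrical power becomes heat."""
    return total_watts * BTU_PER_HOUR_PER_WATT


def racks_needed(num_nodes: int, units_per_node: int,
                 usable_units_per_rack: int) -> int:
    """Number of racks required to hold the nodes, rounded up."""
    return math.ceil(num_nodes * units_per_node / usable_units_per_rack)


# Hypothetical cluster: 8 nodes drawing 1,860 W each, 4U per node,
# 40 usable rack units per rack.
total_watts = 8 * 1860
print(f"Cooling load: {cooling_load_btu_per_hour(total_watts):,.0f} BTU/h")
print(f"Racks needed: {racks_needed(8, 4, 40)}")
```

A real facility plan also has to account for per-rack power and cooling limits, which this sketch ignores.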

Step 3: Physical Deployment

Now that you have the facility and equipment ready, you will need to physically deploy the nodes. You will have a head node, a dedicated node which controls the cluster, and multiple worker nodes, which run workloads.

Deploy networking as follows:

  • The head node must receive network connections and requests from outside the cluster, and pass them on to worker nodes.
  • Worker nodes must be connected to the head node via fast Ethernet connection.

Deploying Software for Head and Worker Nodes

Once hardware and networking are deployed, install an operating system on your nodes. NVIDIA recommends using open source Rocks Linux for computational clusters.

In addition, you will need management software. A common choice is to manage computational clusters using Kubernetes. In this case, you will need to deploy the Kubernetes control plane on the head node (preferably with another redundant head node for high availability), and deploy Kubernetes worker nodes on the remaining machines. You may need to deploy additional software such as the SLURM job scheduler.
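
If you manage the cluster with Kubernetes, a workload typically asks for GPUs through its resource limits. The sketch below builds such a Pod manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. It assumes the NVIDIA device plugin is installed on the worker nodes so that the `nvidia.com/gpu` resource is available; the pod name, image and command are hypothetical placeholders.

```python
import json


def gpu_pod_manifest(name: str, image: str, num_gpus: int) -> dict:
    """Build a minimal Pod manifest that requests whole GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,  # hypothetical image name
                    "command": ["python", "train.py"],  # hypothetical entry point
                    # GPUs are requested under limits; this resource name is
                    # exposed by the NVIDIA device plugin (assumed installed).
                    "resources": {"limits": {"nvidia.com/gpu": num_gpus}},
                }
            ],
        },
    }


manifest = gpu_pod_manifest("train-job", "example.com/trainer:latest", num_gpus=2)
print(json.dumps(manifest, indent=2))
```

The scheduler then places the pod only on a worker node that has two unallocated GPUs.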

GPU Cluster Hardware Options

Let’s review the main hardware options at your disposal when building a GPU cluster.

Data Center GPU Options

The following are some of the world’s most powerful, data center grade GPUs, commonly used to build large-scale GPU infrastructure.

NVIDIA Tesla A100

The A100 is based on Tensor Cores and leverages multi-instance GPU (MIG) technology. It is built for workloads such as high-performance computing (HPC), machine learning and data analytics.

Tesla A100 is intended for scalability (up to thousands of units) and can be separated into seven GPU instances for different workload sizes. The A100 offers performance reaching up to 624 teraflops (trillion floating-point operations per second) and has 40GB of memory, 1,555 GB/s of memory bandwidth and 600GB/s interconnects.

NVIDIA Tesla V100

The V100 GPU is also based on Tensor Cores and is designed for applications such as machine learning, deep learning and HPC. It uses NVIDIA Volta technology to accelerate common tensor operations in deep learning workloads. The Tesla V100 offers performance reaching 149 teraflops as well as 32GB memory and a 4,096-bit memory bus.

NVIDIA Tesla P100

The Tesla P100 GPU is based on an NVIDIA Pascal architecture designed specifically for HPC and machine learning. P100 offers performance of up to 21 teraflops, with 16GB of memory and a 4,096-bit memory bus.

NVIDIA Tesla K80

The K80 GPU uses NVIDIA Kepler architecture, which accelerates data analytics and scientific computing. It incorporates GPU Boost™ technology and 4,992 NVIDIA CUDA cores. Tesla K80 offers up to 8.73 teraflops of performance, with 480 GB/s of memory bandwidth and 24GB of GDDR5 memory.

Google TPU

Google offers a slightly different option: tensor processing units (TPUs), which are application-specific integrated circuits (ASICs) built to support deep learning, delivered as chips or through the cloud. These TPUs are designed specifically to be used with TensorFlow and can only be found on the Google Cloud Platform.

Google TPUs offer performance of up to 420 teraflops with a high bandwidth memory (HBM) of 128 GB. You can also find pod versions that offer performance of over 100 petaflops with 32TB HBM and 2D toroidal mesh networks.

GPU Server Options

A GPU server, also known as a GPU workstation, is a system capable of running multiple GPUs in one physical chassis.

NVIDIA DGX-1—First Generation DGX Server

NVIDIA DGX-1 is the first-generation DGX server. It is an integrated workstation with powerful computing capacity suitable for deep learning. It provides one petaflop of GPU compute power and offers these hardware features:


NVIDIA DGX-2

The architecture of DGX-2, the second-generation DGX server, is similar to that of DGX-1, but with greater computing power, reaching up to 2 petaflops when equipped with 16 Tesla V100 GPUs. NVIDIA explains that to train a ResNet-50 using a typical x86 architecture, you would need 300 servers with dual Intel Xeon Gold CPUs to achieve the same processing speed as with DGX-2. Those servers would also cost over $2.7 million.
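
To put these figures in perspective, the short calculation below derives the per-GPU share of the DGX-2's peak throughput and the per-server cost implied by the comparison. It uses only the numbers quoted above and treats "over $2.7 million" as a lower bound, so the per-server cost is a minimum, not an exact price.

```python
PETA = 10**15
TERA = 10**12

dgx2_peak_flops = 2 * PETA   # 2 petaflops, as quoted above
num_gpus = 16                # Tesla V100 GPUs in one DGX-2

# Peak throughput attributable to each GPU, in teraflops.
per_gpu_tflops = dgx2_peak_flops / num_gpus / TERA

cpu_servers = 300            # x86 servers in NVIDIA's comparison
cpu_cluster_cost = 2_700_000 # lower bound in US dollars

# Minimum cost per CPU server implied by the comparison.
cost_per_server = cpu_cluster_cost / cpu_servers

print(f"Per-GPU peak: {per_gpu_tflops:.0f} teraflops")
print(f"Implied cost per CPU server: at least ${cost_per_server:,.0f}")
```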

DGX-2 offers these hardware features:


NVIDIA DGX A100

NVIDIA’s third-generation AI system is DGX A100, which offers five petaflops of computing power in a single system.

DGX A100 is available in two models, with either 320GB or 640GB of GPU memory. It offers the following hardware features:

NVIDIA DGX Station A100—Third Generation DGX Workstation

DGX Station is the lighter weight version of DGX A100, intended for use by developers or small teams. It has a Tensor Core architecture that allows A100 GPUs to leverage mixed-precision, multiply-accumulate operations, which helps accelerate training of large neural networks significantly.

The DGX Station comes in two models, with either 160GB or 320GB GPU RAM. It offers the following hardware features:


NVIDIA DGX SuperPOD

DGX SuperPOD is a multi-node computing platform for full-stack workloads. It offers networking, storage, compute and tools for data science pipelines. NVIDIA offers an implementation service to help you deploy and maintain SuperPOD on an ongoing basis.

SuperPOD supports the integration of up to 140 DGX A100 systems in a single AI infrastructure cluster. The cluster offers the following capabilities:

Lambda Labs GPU Workstations

Lambda Labs offers mid-range GPU workstations with 2-4 GPUs. They are typically used by individual machine learning engineers or small teams training models in a local environment. The workstation offers the following hardware features:

GPU Cluster Management with Run:AI

Run:AI automates resource management and orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed.

Here are some of the capabilities you gain when using Run:AI:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.

Learn more about the Run:AI GPU virtualization platform.