PyTorch Multi GPU: 4 Techniques Explained

PyTorch provides a Python-based library and a deep learning platform for scientific computing tasks. Learn four techniques you can use to accelerate tensor computations and model training across multiple GPUs in PyTorch: data parallelism, distributed data parallelism, model parallelism, and elastic training.

What Is PyTorch?

PyTorch is a package for Python that you can use to perform scientific computing tasks. With PyTorch, you can use Graphics Processing Units (GPUs) to accelerate tensor computations and deep learning operations.

Some of PyTorch’s most notable features are:

  • Simple interface—includes a user friendly API.
  • Pythonic in nature—integrates smoothly with Python data science stacks and enables use of all Python functionalities and services. 
  • Computational graphs—includes features for creating dynamic computational graphs. These features enable you to adjust model training processes in real-time. 

4 Ways to Use Multiple GPUs With PyTorch

There are four main ways to use PyTorch with multiple GPUs. These are:

  • Data parallelism—datasets are broken into subsets which are processed in batches on different GPUs using the same model. The results are then combined and averaged in one version of the model. This method relies on the DataParallel class.
  • Distributed data parallelism—enables you to perform data parallelism across GPUs and physical machines and can be combined with model parallelism. This method relies on the DistributedDataParallel class. 
  • Model parallelism—a single model is broken into segments, with each segment run on a different GPU. During the forward pass, each segment's output is passed to the GPU holding the next segment until the model produces its final output. This method relies on a custom nn.Module subclass that splits the model across devices.
  • Elastic training—dynamically scale training resources for deep learning models, running PyTorch jobs on multiple GPUs and/or machines.

Learn how to perform distributed training with Keras and with TensorFlow, in our articles about Keras multi GPU and TensorFlow multiple GPU.

Technique 1: Data Parallelism

To use data parallelism with PyTorch, you can use the DataParallel class. When using this class, you wrap your network (an nn.Module object) in a DataParallel object and pass the IDs of the GPUs it should use through the device_ids argument:

parallel_net = nn.DataParallel(myNet, device_ids=[0, 1, 2])

Once defined, you can perform the standard model training steps just as you would with a standard nn.Module object. For example:

# performs a forward pass
predictions = parallel_net(inputs)
# computes a loss function
loss = loss_function(predictions, labels)
# averages GPU-losses and performs a backward pass
loss.mean().backward()
When using this method, you need to ensure that your data is initially stored on one GPU (the “primary GPU”). You should also place your data parallel object on that same GPU. You can do this using code like the following:

input        = input.to('cuda:0')
parallel_net = parallel_net.to('cuda:0')

Then, when you call your object, it splits each input batch into chunks that are distributed across your defined GPUs. Once the operations are complete, the outputs are gathered on the primary GPU.
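Putting these steps together, here is a minimal end-to-end sketch. MyNet, the tensor shapes, and the SGD hyperparameters are illustrative assumptions, not part of any particular workload; on a machine without GPUs, DataParallel simply falls back to running the wrapped module on the CPU, so the same code still executes.

```python
import torch
import torch.nn as nn

# A small stand-in network; substitute your own model here.
class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

# The primary device: batches are scattered from (and gathered back to) it.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

net = MyNet()
# Wrap the model; by default all visible GPUs are used
# (pass device_ids=[0, 1, 2] to pick specific ones).
parallel_net = nn.DataParallel(net).to(device)

# Inputs and labels start on the primary device.
inputs = torch.randn(8, 10, device=device)
labels = torch.randint(0, 2, (8,), device=device)

loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(parallel_net.parameters(), lr=0.01)

optimizer.zero_grad()
predictions = parallel_net(inputs)         # forward pass, split across GPUs
loss = loss_function(predictions, labels)  # computed on the primary device
loss.mean().backward()                     # backward pass
optimizer.step()
```

The wrapped object behaves like the underlying module, so the training loop is unchanged apart from the initial wrapping.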

Technique 2: Distributed Data Parallelism

If you want to use distributed data parallelism with PyTorch, you can use the DistributedDataParallel class. With this class, you create one process for each model replica you need, and those replicas can span multiple devices and physical machines. This class can also be combined with model parallelism (explained below) if you need to split each replica across devices.

The main difference between DataParallel and DistributedDataParallel is that the former is single-process and multi-threaded, and only works on a single machine, while the latter is multi-process and works for both single-machine and multi-machine training. This means you can run your model across multiple machines with DistributedDataParallel. Additionally, DataParallel is usually slower than DistributedDataParallel even on a single machine, due to contention on Python's Global Interpreter Lock across threads and the overhead of scattering inputs and gathering outputs on every batch.
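The per-process setup can be sketched as follows. This is a minimal, hedged example: it defaults to a single CPU process with the gloo backend so it runs anywhere, while in a real job a launcher such as torchrun starts one copy of the script per GPU, sets the RANK and WORLD_SIZE environment variables, and you would use the nccl backend with the replica moved to its GPU. The model, shapes, and hyperparameters are illustrative.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# A launcher normally sets these; the defaults let the sketch run standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# gloo works on CPU; use "nccl" for GPU training.
dist.init_process_group("gloo", rank=rank, world_size=world_size)

model = nn.Linear(10, 2)
# One replica per process. On GPUs: DDP(model.to(rank), device_ids=[rank])
ddp_model = DDP(model)

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
inputs = torch.randn(4, 10)
labels = torch.randint(0, 2, (4,))

loss = nn.CrossEntropyLoss()(ddp_model(inputs), labels)
loss.backward()   # DDP averages gradients across all replicas here
optimizer.step()

dist.destroy_process_group()
```

Because gradient averaging happens inside backward(), every replica ends each step with identical weights without any explicit synchronization code.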

Technique 3: Model Parallelism

To implement model parallelism in PyTorch, you define your own nn.Module subclass that splits the model across GPUs (like the model_parallel class shown below). This approach is useful when your model is too large to fit on a single GPU. However, keep in mind that this method typically does not speed up your training processes like data parallelism can.

This is because model parallelism creates dependencies between the GPUs you are using, preventing the units from running fully in parallel. For example, you can see below that Subnet 2 waits for the output of Subnet 1 during the forward pass, while Subnet 1 waits for the gradients from Subnet 2 during the backward pass.

[Image: model parallelism with two subnets placed on separate GPUs. Source: Paperspace]

To implement model parallelism in PyTorch, you need to define a class like the following: 

class model_parallel(nn.Module):
   def __init__(self):
      super().__init__()          # required to initialize nn.Module
      self.sub_network1 = ...     # layers placed on GPU 0
      self.sub_network2 = ...     # layers placed on GPU 1

   def forward(self, x):
      x = x.cuda(0)               # move the input to the first GPU
      x = self.sub_network1(x)
      x = x.cuda(1)               # move the intermediate output to the second GPU
      x = self.sub_network2(x)
      return x

This class defines where your subnetworks are placed and how both perform a forward pass. When defining this class, keep in mind that each input and the subnetwork consuming it need to be located on the same device. You should also remember that the cuda() and to() functions support autograd, so gradients are copied back across the device boundary automatically during backpropagation.
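The pattern above can be filled in as a runnable sketch. ModelParallelNet, its layers, and the tensor shapes are illustrative assumptions; to keep the control flow runnable anywhere, this version falls back to the CPU when fewer than two GPUs are available.

```python
import torch
import torch.nn as nn

# Place the two halves on separate GPUs when two are available,
# otherwise fall back to CPU so the example still runs.
dev0 = torch.device("cuda:0" if torch.cuda.device_count() >= 2 else "cpu")
dev1 = torch.device("cuda:1" if torch.cuda.device_count() >= 2 else "cpu")

class ModelParallelNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.sub_network1 = nn.Linear(10, 16).to(dev0)  # first segment
        self.sub_network2 = nn.Linear(16, 2).to(dev1)   # second segment

    def forward(self, x):
        x = self.sub_network1(x.to(dev0))     # runs on the first device
        return self.sub_network2(x.to(dev1))  # runs on the second device

net = ModelParallelNet()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

inputs = torch.randn(4, 10)
labels = torch.randint(0, 2, (4,), device=dev1)  # labels live with the output

loss = nn.CrossEntropyLoss()(net(inputs), labels)
loss.backward()   # autograd moves gradients back across the device boundary
optimizer.step()
```

Note that the loss is computed on the device holding the final output, which is why the labels are created on dev1.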

Technique 4: Elastic Training

PyTorch Elastic (now part of PyTorch core as torch.distributed.elastic) is a library you can use to dynamically scale training resources for deep learning models. It includes built-in interfaces and primitives for running PyTorch jobs on multiple devices or machines with elastic scaling.

This scaling works by defining a minimum and a maximum number of workers. Your jobs start once the minimum available is reached and can dynamically scale up to the max allowed. If worker requirements drop during the job process, the number is scaled down to make resources available for other jobs. 
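For example, the torchrun launcher (the command-line entry point for torch.distributed.elastic) lets you declare the worker range directly. This is a sketch of a launch invocation; train_script.py, the node counts, and the rendezvous endpoint are hypothetical placeholders for your own job.

```shell
# --nnodes=MIN:MAX lets the job start once 1 node is available and grow to 4;
# --max_restarts enables fault-tolerant worker restarts; the c10d rendezvous
# endpoint is where workers agree on membership and roles.
torchrun \
    --nnodes=1:4 \
    --nproc_per_node=8 \
    --max_restarts=3 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=host:29400 \
    train_script.py
```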

Additionally, Elastic includes features for fault tolerance, enabling you to automatically detect and replace failed nodes mid-process. This is supported by the Rendezvous component, which ensures that all workers in use agree on the set of participants and their roles in the job.

PyTorch Multi GPU With Run:AI

Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed in PyTorch and other deep learning frameworks. 

Here are some of the capabilities you gain when using Run:AI: 

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models. 

Learn more about the GPU virtualization platform.