PyTorch provides a Python-based library and a deep learning platform for scientific computing. Learn four techniques you can use to accelerate tensor computations with PyTorch's multi-GPU support: data parallelism, distributed data parallelism, model parallelism, and elastic training.
In this article, you will learn:
PyTorch is a package for Python that you can use to perform scientific computing tasks. With PyTorch, you can use Graphics Processing Units (GPUs) to accelerate tensor computations and deep learning operations.
Some of PyTorch’s most notable features are:
There are four main ways to use PyTorch with multiple GPUs. These are:
- Data parallelism
- Distributed data parallelism
- Model parallelism
- Elastic training
To use data parallelism with PyTorch, you can use the DataParallel class. When using this class, you define your GPU IDs and initialize your network using a Module object with a DataParallel object.
parallel_net = nn.DataParallel(myNet, device_ids=[0, 1, 2])
Once defined, you can perform the standard model training steps just as you would with a standard nn.Module object. For example:
# performs a forward pass
predictions = parallel_net(inputs)
# computes a loss function
loss = loss_function(predictions, labels)
# averages GPU losses and performs a backward pass
loss.mean().backward()
When using this method, you need to ensure that your data is initially stored on one GPU (the “primary GPU”). You should also place your data parallel object on that same GPU. You can do this using code like the following:
input = input.to(0)
parallel_net = parallel_net.to(0)
Then, when you call your object, it splits each input batch into chunks that are distributed across your defined GPUs. Once the operations are complete, the outputs are gathered on the primary GPU.
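Putting the steps above together, a minimal end-to-end sketch might look like this. The model, loss function, and data are hypothetical placeholders; on a CPU-only machine, DataParallel simply runs the wrapped module directly, so the sketch stays runnable anywhere:

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for myNet
my_net = nn.Linear(10, 2)

# Use all visible GPUs; DataParallel falls back to plain execution when none exist
device_ids = list(range(torch.cuda.device_count()))
parallel_net = nn.DataParallel(my_net, device_ids=device_ids or None)

# Data and the wrapped module must start on the primary device
primary = torch.device("cuda:0" if device_ids else "cpu")
parallel_net = parallel_net.to(primary)
inputs = torch.randn(8, 10).to(primary)
labels = torch.randint(0, 2, (8,)).to(primary)

loss_function = nn.CrossEntropyLoss()

predictions = parallel_net(inputs)         # forward pass, split across GPUs
loss = loss_function(predictions, labels)  # computed on the primary device
loss.backward()                            # gradients accumulate on the primary GPU
```

Note that each call to `parallel_net` scatters the batch and replicates the model, which is part of the per-iteration overhead discussed below.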
If you want to use distributed data parallelism with PyTorch, you can use the DistributedDataParallel class. With this class, you create one process for each model replica you need. These replicas can then span multiple devices. This class can also be combined with model parallelism (explained below) if you need to split the model itself across devices.
The main difference between DataParallel and DistributedDataParallel is that the former is single-process and multithreaded, working only on a single machine, while the latter is multi-process and works for both single- and multi-machine training. This means you can run your model across multiple machines with DistributedDataParallel. Additionally, DataParallel is usually slower than DistributedDataParallel, even on a single machine, due to Python's Global Interpreter Lock contention across threads and the overhead of replicating the model and scattering inputs on every iteration.
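For illustration, here is a minimal single-machine sketch. The gloo backend and a one-process "group" are assumed so the example also runs without GPUs; in practice you would launch one process per GPU (for example with torchrun) and typically use the nccl backend, and the address, port, model, and data below are hypothetical placeholders:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Rendezvous settings for a single-node, single-process group (assumed values)
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# One model replica per process; DDP synchronizes gradients between replicas
model = nn.Linear(10, 2)
ddp_model = DDP(model)

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
inputs, labels = torch.randn(8, 10), torch.randint(0, 2, (8,))

predictions = ddp_model(inputs)
loss = nn.CrossEntropyLoss()(predictions, labels)
loss.backward()   # gradients are all-reduced across processes here
optimizer.step()

dist.destroy_process_group()
```

In a real multi-process job, each rank would run this same script with its own `rank` value and load a distinct shard of the data, typically via DistributedSampler.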
To implement model parallelism in PyTorch, you place different submodules of your model on different GPUs and move the intermediate activations between them. This approach is useful when your model is too large to fit on a single GPU. However, keep in mind this method typically does not speed up your training processes like data parallelism can.
This lack of speed is because model parallelism creates dependencies between the GPUs you are using, preventing the units from running in parallel. For example, you can see below how the forward pass on Subnet 2 must wait for the output of Subnet 1, while during the backward pass Subnet 1 must wait for gradients from Subnet 2.
Image source: Paperspace
To implement model parallelism in PyTorch, you define a module like the following:

class ModelParallelNet(nn.Module):
    def __init__(self):
        super().__init__()
        # place each subnetwork on its own GPU
        self.sub_network1 = ...
        self.sub_network2 = ...

    def forward(self, x):
        x = x.cuda(0)
        x = self.sub_network1(x)
        x = x.cuda(1)
        x = self.sub_network2(x)
        return x
This class defines where your subnetworks are placed and how the forward pass moves data between them. When defining this class, keep in mind that your input and each subnetwork need to be located on the same device. You should also remember that the cuda and to functions support autograd, so gradients are automatically copied back across devices during backpropagation.
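As a concrete, runnable sketch of the pattern above: the two linear stages are hypothetical stand-ins for real subnetworks, and the devices fall back to CPU when two GPUs are not available, so the data movement still mirrors the multi-GPU case:

```python
import torch
import torch.nn as nn

# Assumption for illustration: fall back to CPU so the sketch runs even
# without two GPUs; in practice you would require cuda:0 and cuda:1.
two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")

class ModelParallelNet(nn.Module):
    def __init__(self):
        super().__init__()
        # hypothetical subnetworks, each placed on its own device
        self.sub_network1 = nn.Linear(10, 20).to(dev0)
        self.sub_network2 = nn.Linear(20, 5).to(dev1)

    def forward(self, x):
        x = self.sub_network1(x.to(dev0))
        x = self.sub_network2(x.to(dev1))  # activations cross devices here
        return x

net = ModelParallelNet()
out = net(torch.randn(4, 10))
out.sum().backward()  # autograd routes gradients back across the devices
```

Because `sub_network2` cannot start until `sub_network1` finishes, the two devices work sequentially, which is exactly the dependency that keeps plain model parallelism from speeding up training.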
PyTorch Elastic is a library you can use to dynamically scale training resources for deep learning models. It includes built-in interfaces and primitives that you can use to run PyTorch jobs on multiple devices or machines with scaling.
This scaling works by defining a minimum and a maximum number of workers. Your jobs start once the minimum number of workers is available and can dynamically scale up to the defined maximum. If worker requirements drop mid-job, the worker count is scaled down to free resources for other jobs.
Additionally, Elastic includes features for fault tolerance, enabling you to automatically detect and replace failed nodes mid-process. This is supported by the Rendezvous component, which ensures that all workers in use agree on the participants and their roles in the job.
Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed in PyTorch and other deep learning frameworks.
Here are some of the capabilities you gain when using Run:AI:
Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run.ai GPU virtualization platform.