Training deep learning models takes time. Deep neural networks often consist of millions or billions of parameters that are trained over huge datasets. As deep learning models become more complex, computation time can become unwieldy. Training a model on a single GPU can take weeks. Distributed training can fix this problem. This guide explores how it does so and what you need to know about distributed training to start using it to your advantage. In this guide:
- Distributed Training: What is it?
- The two types of distributed training
- Deep learning synchronization methods
- Distributed training frameworks – Tensorflow, Keras, Pytorch and Horovod
- Supporting software libraries and their role in distributed training
- Optimizing your infrastructure
- Simplifying your distributed training
As its name suggests, distributed training distributes training workloads across multiple mini-processors. These mini-processors, referred to as worker nodes, work in parallel to accelerate the training process. Their parallelism can be achieved by data parallelism or model parallelism, both of which are described below.
In this type of distributed training, data is split up and processed in parallel. Each worker node trains a copy of the model on a different batch of training data, communicating its results after computation to keep the model parameters and gradients in sync across all nodes. These results can be shared synchronously (at the end of each batch computation) or asynchronously (in a system in which the models are more loosely coupled).
In most cases, data parallelism is relatively straightforward and efficient; however, there are times when the model is so large it cannot fit on a single worker node. This is where model parallelism comes in.
In model parallelism, the model itself is divided into parts that are trained simultaneously across different worker nodes. All workers use the same data set, and they only need to share global model parameters with other workers—typically just before forward or backward propagation. This type of distributed training is much more difficult to implement and only works well in models with naturally parallel architectures, such as those with multiple branches.
Since data parallelism is more common, this article will focus on data parallelism.
How Distributed Training Shortens Training Time
While it can be applied to any machine learning model, distributed training has the greatest impact on resource-intensive processes, such as deep learning.
Training deep learning models often takes a long time because the process typically requires substantial storage and compute capacity. During training, intermediate results must be calculated and held in memory. When a complex neural network performs logistic regression, for example, the model must calculate and store millions or billions of updated weight parameters until backpropagation is completed. In distributed training, storage and compute power are magnified with each added GPU, reducing training time.
Distributed training also addresses another major issue that slows training down: batch size. Every neural network has an optimal batch size which affects training time. When the batch size is too small, each individual sample has a lot of influence, creating extra noise and fluctuation and delaying the convergence of the algorithm. This problem intensifies as neural networks become increasingly complex, resulting in GPUs severely limiting batch size. With distributed training, training is no longer constrained by the memory of a single GPU, and batch size can be increased to shorten training time.
What to Consider Before Using Distributed Training
Before deciding to use distributed training, it’s worth considering how many modifications are needed to switch to a distributed approach. You should also evaluate how difficult distributed training will be to implement.
It’s important to ask yourself whether or not the move is worth it; in other words, how much will this shift speed up the training process? To answer this question, you will need to tackle the tricky issue of synchronization between workers.
One of the biggest challenges in distributed training is determining how the different workers will share and synchronize their results. This process can create considerable performance bottlenecks if not handled properly. In data parallelism, there are two main approaches to this issue: the parameter server approach and the all-reduce approach.
In a parameter server-based architecture, nodes are divided into workers that train the model and parameter servers which maintain the globally shared parameters. With each iteration, workers receive the latest global weights from a parameter server, calculate the gradients with their local data sets, and feed their results back to the parameter server.
This approach has some serious drawbacks. First, when parameters are shared synchronously, individual workers depend on all of the other workers to complete their iteration, which means the whole process is only as fast as the slowest machine. Second, when there’s one central parameter server, adding more machines doesn’t lower compute time because the server must recalculate the weights from each machine and so the overhead of communicating with all machines scales poorly.
In this approach, all machines share the load of storing and maintaining global parameters. In doing so, all-reduce overcomes the limitations of the parameter server method.
There are different all-reduce algorithms that dictate how these parameters are calculated and shared. In Ring AllReduce, for example, machines are set up in a ring. Each machine is in charge of a certain subset of parameters which are shared only with the next machine in line. In Tree AllReduce, on the other hand, parameters are shared via a tree structure. No matter what topology is used, all-reduce is a valuable tool that dramatically reduces synchronization overhead. In this approach, unlike in the parameter server approach, machines can be added without limiting bandwidth. This means computation time is only affected by the size of the model.
There are a number of deep learning tools that provide capabilities indispensable to distributed training. Four are profiled below.
The TensorFlow library has built-in functionality that supports distributed training. Its tf.distribute.Strategy API allows training to be split across multiple GPUs with minimal code changes. TensorFlow distributed training also supports different synchronization strategies, such as keeping variables in sync across workers using tf.distribute.MirroredStrategy and implementing both the parameter server and all-reduce approaches.
This open-source neural network library has an API called tf.distribute which allows models to be trained across multiple GPUs with very little additional code. In Keras distributed training, synchronous data parallelism has two typical setups: single-host, multi-device synchronous training, in which multiple GPUs are hosted on a single machine, and multi-worker distributed synchronous training, which involves a cluster of multiple machines.
The Pytorch open-source machine learning library is also built for distributed learning. Its distributed package, torch.distributed, allows data scientists to employ an elegant and intuitive interface to distribute computations across nodes using messaging passing interface (MPI).
Horovod is a distributed training framework developed by Uber. Its mission is to make distributed deep learning fast and it easy for researchers use. HorovodRunner simplifies the task of migrating TensorFlow, Keras, and PyTorch workloads from a single GPU to many GPU devices and nodes. Because it leverages the MPI library, it is perfectly suited for multi-node training.
MPI and NCCL are two communication libraries worth installing. They facilitate communication between nodes and simplify the distributed training process.
Message Passing Interface (MPI) is an open standard communication library that is widely used in high-performance computing. It specifies how nodes communicate with one another over a network, providing critical functionality for distributed training.
The NVIDIA Collective Communications Library (NCCL) is an incredibly fast collective communication library for GPU clusters which is optimized for high performance in distributed deep learning contexts. It is MPI compatible, and it enables data scientists to take full advantage of all available GPUs without needing to write and deploy complicated communication scripts.
Your underlying infrastructure also plays an important role in determining how long it will take to train your model.
Setting up fast interconnects enables GPUs to communicate quickly and directly with one another. Nvidia’s NVLink is a high-speed GPU interconnect within a single machine that is considerably faster than traditional solutions, maximizing performance and memory capacity in a multi-GPU environment.
InfiBand and RoCE are two types of fast interconnects between machines. InfiBand’s low latency and high bandwidth properties lower processing overhead and are ideal for running scientific models in parallel clusters over a single connection. RDMA over Converged Ethernet (RoCE) provides remote direct memory access over an Ethernet network. In addition to using fast interconnects, ensuring machines are in close proximity to one another will also help to optimize networking speeds.
GPU Direct RDMA (Remote Direct Memory Access) helps maintain high performance when moving from a single GPU to a multi-node distributed cluster. It does so by allowing nodes to transfer memory to one another with minimal memory copies.
Finally, both hardware and software aspects of storage should be sufficiently fast to stream data so that GPUs are not left idle or underutilized. This can be achieved by using scalable distributed storage that is designed to support massive parallel processing.
RUN:AI’s deep learning platform handles the heavy lifting behind the scenes of distributed training. Using automated distributed computing, RUN:AI pools compute resources and allocates them dynamically with elastic GPU clusters. In this way, data scientists enjoy virtually unlimited compute power with optimally executed data parallelism. In addition, the platform’s Kubernetes-based scheduler and gradient accumulation solution help overcome some of the biggest challenges discussed in this article. They ensure that training continues even when resources run out, prevent batch size limitations, and keep machines from being idle.
A Final Note
As deep learning models become more ambitious by the day, their supporting infrastructures struggle to keep up. Because a single GPU can no longer accommodate complex neural networks, it’s looking like the future of computing is distributed. RUN:AI is translating this future into reality. It allows data science teams to leverage the power of distributed training to get where they dream of going—faster and more skillfully than before.