Question 1

Distributed Training: What Is It? How Can It Be Valuable?

Accepted Answer

As its name suggests, distributed training distributes a training workload across multiple processors, which may be several GPUs in one machine or separate machines in a cluster. These processors, referred to as worker nodes, work in parallel to accelerate the training process. The parallelism can be achieved through data parallelism or model parallelism, both of which are described below.

Question 2

The Two Types of Distributed Training

Accepted Answer

Data Parallelism

In this type of distributed training, the data is split up and processed in parallel. Each worker node trains a copy of the model on a different batch of training data, then communicates its results so that the model parameters and gradients stay in sync across all nodes. These results can be shared synchronously (at the end of each batch computation) or asynchronously (in a system in which the model copies are more loosely coupled).

In most cases, data parallelism is relatively straightforward and efficient; however, there are times when the model is so large that it cannot fit on a single worker node. This is where model parallelism comes in.

Model Parallelism

In model parallelism, the model itself is divided into parts that are trained simultaneously across different worker nodes. All workers use the same data set, and they only need to exchange the values that cross the boundaries between parts, typically activations during forward propagation and gradients during backward propagation. This type of distributed training is much more difficult to implement and only works well in models with naturally parallel architectures, such as those with multiple branches.

Since data parallelism is more common, this article will focus on data parallelism.
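To make the synchronous form of data parallelism concrete, here is a minimal sketch that simulates several workers inside a single Python process. The one-weight linear model, the function names, and the learning rate are all assumptions chosen for illustration; a real setup would run each worker on its own device and use a framework's communication layer.

```python
# Simulated synchronous data parallelism (illustrative names, one process).
# Model: y_hat = w * x, loss = mean squared error.

def gradient(w, shard):
    """Gradient of the mean squared error with respect to w on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def split(batch, num_workers):
    """Give each worker an equal, contiguous slice of the batch.
    Assumes len(batch) is divisible by num_workers."""
    size = len(batch) // num_workers
    return [batch[i * size:(i + 1) * size] for i in range(num_workers)]

def train_step(w, batch, num_workers, lr):
    shards = split(batch, num_workers)
    # Each worker holds its own copy of w and computes a local gradient.
    local_grads = [gradient(w, shard) for shard in shards]
    # Synchronization point: average the gradients across workers.
    avg_grad = sum(local_grads) / num_workers
    # Every replica applies the same update, so the copies stay identical.
    return w - lr * avg_grad

batch = [(float(x), 3.0 * x) for x in range(1, 9)]  # data generated with w = 3.0
w = 0.0
for _ in range(200):
    w = train_step(w, batch, num_workers=4, lr=0.01)
```

Because the shards are the same size, the averaged gradient equals the gradient computed over the whole batch on one worker, which is why the replicas stay in sync after every step.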

Question 3

How Does Distributed Training Shorten Training Time?

Accepted Answer

While it can be applied to any machine learning model, distributed training has the greatest impact on resource-intensive processes, such as deep learning.

Training deep learning models often takes a long time because the process typically requires substantial memory and compute capacity. During training, intermediate results must be calculated and held in memory. When a large neural network is trained on a classification task, for example, it must calculate and store intermediate values and updates for millions or billions of weight parameters until backpropagation is completed. In distributed training, the available memory and compute power grow with each added GPU, reducing training time.

Distributed training also addresses another major issue that slows training down: batch size. Every neural network has an optimal batch size, and this choice affects training time. When the batch size is too small, each individual sample has a large influence on every update, which adds noise and fluctuation and delays the convergence of the algorithm. This problem intensifies as neural networks become more complex, because the memory of a single GPU then severely limits batch size. With distributed training, training is no longer constrained by the memory of a single GPU, and batch size can be increased to shorten training time.
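The effect of batch size on noise can be sketched with a toy experiment. The example below assumes that per-sample gradients behave like random draws around a true value (here a normal distribution with mean 1 and standard deviation 1, a made-up stand-in), and it treats the global batch as the per-worker batch multiplied by the number of workers. Both function names are hypothetical.

```python
import random
import statistics

def global_batch_size(per_worker_batch, num_workers):
    """In data parallelism, each step processes one mini-batch per worker."""
    return per_worker_batch * num_workers

def batch_mean_spread(batch_size, trials=500, seed=0):
    """Standard deviation of the mini-batch mean over many random mini-batches.
    Per-sample 'gradients' are modeled as draws from N(1, 1) (an assumption)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(1.0, 1.0) for _ in range(batch_size)])
        for _ in range(trials)
    ]
    return statistics.stdev(means)

# One GPU limited to 8 samples per step versus 8 workers with 8 samples each.
small = batch_mean_spread(global_batch_size(8, 1))
large = batch_mean_spread(global_batch_size(8, 8))
print(f"spread with batch 8:  {small:.3f}")
print(f"spread with batch 64: {large:.3f}")
```

Under this toy model the spread of the averaged gradient shrinks roughly with the square root of the batch size, so the 64-sample estimate is noticeably steadier than the 8-sample one.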

Question 4

What to Consider Before Using Distributed Training

Accepted Answer

Before deciding to use distributed training, consider how many modifications your code needs to switch to a distributed approach and how difficult those changes will be to implement. It’s also important to ask whether the move is worth it; in other words, how much will this shift speed up the training process? To answer this question, you will need to tackle the tricky issue of synchronization between workers.
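One rough way to reason about whether the move is worth it is a back-of-the-envelope cost model. The sketch below assumes that compute time per step divides evenly across workers and that synchronization adds a fixed cost per step; both are simplifying assumptions, and the function name and numbers are hypothetical rather than measurements.

```python
def estimated_speedup(compute_time, sync_time, num_workers):
    """Estimated speedup of one training step over a single worker.

    Assumes compute splits evenly across workers and that each step pays
    a fixed synchronization cost whenever more than one worker is used.
    """
    if num_workers == 1:
        return 1.0
    step_time = compute_time / num_workers + sync_time
    return compute_time / step_time

# Hypothetical step: 1.0 s of compute, 0.25 s to synchronize gradients.
for n in (1, 2, 4, 8):
    print(n, round(estimated_speedup(1.0, 0.25, n), 2))
```

In this model the synchronization term does not shrink as workers are added, so the speedup flattens out; that is why the cost of synchronization largely decides how much a distributed setup pays off.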

Question 5

Deep Learning Synchronization Methods

Accepted Answer

One of the biggest challenges in distributed training is determining how the different workers will share and synchronize their results. This process can create considerable performance bottlenecks if not handled properly. In data parallelism, there are two main approaches to this issue: the parameter server approach and the all-reduce approach.
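The two approaches can be contrasted with a small single-process simulation. In the sketch below, the parameter server version sends every gradient to one central place and broadcasts the average back, while the all-reduce version uses a ring, which is one common way to implement all-reduce (not the only one). The function names are hypothetical, and for simplicity each gradient vector has exactly one element per worker.

```python
def parameter_server_average(worker_grads):
    """A central server averages all gradients and sends the result to everyone."""
    n = len(worker_grads)
    mean = [sum(vals) / n for vals in zip(*worker_grads)]
    return [list(mean) for _ in range(n)]

def ring_all_reduce_average(worker_grads):
    """Workers pass chunks around a ring; no central server is involved.
    Simplification: one gradient element per chunk, so vector length == n."""
    n = len(worker_grads)
    assert all(len(g) == n for g in worker_grads)
    bufs = [list(g) for g in worker_grads]

    # Phase 1 (reduce-scatter): after n - 1 steps, worker r holds the
    # complete sum for chunk (r + 1) % n.
    for step in range(n - 1):
        sent = [(r, (r - step) % n) for r in range(n)]
        msgs = [(r, idx, bufs[r][idx]) for r, idx in sent]  # snapshot sends
        for r, idx, val in msgs:
            bufs[(r + 1) % n][idx] += val

    # Phase 2 (all-gather): completed chunks travel around the ring until
    # every worker has every summed chunk.
    for step in range(n - 1):
        sent = [(r, (r + 1 - step) % n) for r in range(n)]
        msgs = [(r, idx, bufs[r][idx]) for r, idx in sent]
        for r, idx, val in msgs:
            bufs[(r + 1) % n][idx] = val

    return [[v / n for v in buf] for buf in bufs]

grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
ps_result = parameter_server_average(grads)
ring_result = ring_all_reduce_average(grads)
```

Both functions leave every worker with the same averaged gradient. They differ in the communication pattern: the parameter server concentrates traffic on one node, while the ring spreads it evenly across neighbors.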
