TensorFlow provides strong support for distributing deep learning across multiple GPUs. TensorFlow is an open source platform that you can use to develop and train machine learning and deep learning models. TensorFlow operations can leverage both CPUs and GPUs. If you’re operating from Google Cloud Platform (GCP), you can also use TensorFlow Processing Units (TPUs), specially designed for TensorFlow operations.
In this article, you will learn:
The primary distributed training method in TensorFlow is tf.distribute.Strategy. This method enables you to distribute your model training across machines, GPUs or TPUs. It is designed to be easy to use, provide strong out-of-the-box performance and enable you to switch between strategies easily.
The distribute method also forms the base of several additional methods, including some experimental methods.
tf.distribute.MirroredStrategy is a method that you can use to perform synchronous distributed training across multiple GPUs. Using this method, you can create replicas of your model variables which are mirrored across your GPUs.
During operation, these mirrored variables are grouped into a MirroredVariable and kept in sync with all-reduce algorithms. The default algorithm used is the one implemented by NVIDIA NCCL; you can also specify another pre-built option or create a custom algorithm.
tf.distribute.experimental.TPUStrategy is a method you can use to distribute training across TPUs. This method works the same as MirroredStrategy. The difference is that it includes a different implementation of all-reduce that is customized to TPUs.
tf.distribute.experimental.MultiWorkerMirroredStrategy is a method that is similar to MirroredStrategy but enables you to spread your training across machines. This method uses a set of collectiveOps methods to sync variables across your workers. This set reduces your operations to a single unit in your TensorFlow graph, which then selects the appropriate all-reduce algorithm.
tf.distribute.experimental.CentralStorageStrategy is a method you can use to perform synchronous training from a central CPU. With this method, your variables are maintained centrally and operations are mirrored across your GPUs. This enables you to perform the same operations with different subsets of data.
tf.distribute.experimental.ParameterServerStrategy is a method that you can use to train parameter servers on multiple machines. Using this method, you separate your machines into parameter servers and workers. Your variables are distributed to the different parameter servers and your computations are replicated in the worker GPUs.
In the following tutorial, the Estimator class is combined with MirroredStrategy to enable you to distribute your operations across GPUs. This is adapted from a more detailed tutorial based on the TensorFlow documentation here.
Horovod is an open source framework created to support distributed training of deep learning models through Keras and TensorFlow. It also supports Apache MXNet and PyTorch. Horovod was created to enable you to easily scale your GPU training scripts for use across many GPUs running in parallel.
Here is a brief tutorial on how to enable the use of Horovod with TensorFlow. This requires modifying the standard Horovod program with the following additions. This tutorial is adapted from a more detailed tutorial found in the Horovod documentation here.
1. Start by importing Horovod and initializing it.
import horovod.tensorflow as hvd
2. Next, define the GPU order for your processes. You can set this as a custom order or using local ranks, which assigns processes sequentially.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
3. Update the learning rate of your model based on your workers.
4. Then, you need to wrap your optimizer in hvd.DistributedOptimizer. This directs your gradient computations to the optimizer, averages your gradients and applies the averages to the model update.
opt = hvd.DistributedOptimizer(opt)
Next, you need to ensure that your workers are consistently initialized. This is done by adding hvd.BroadcastGlobalVariablesHook(0), which broadcasts your initial variable states to all processes.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
Ensure that you are only saving your checkpoints on your 0 worker. This ensures that other workers do not corrupt checkpoints.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None
You can then run your training operation, which you have previously defined. This is done with the MonitoredTrainingSession method to automate session initialization, checkpoint operations, and error handling.
hooks=hooks) as mon_sess:
while not mon_sess.should_stop():
Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed in TensorFlow and other deep learning frameworks.
Here are some of the capabilities you gain when using Run:AI:
Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:AI GPU virtualization platform.