Tensorflow and Multiple GPU: 5 Strategies and 2 Tutorials

How Can I Use  TensorFlow with Multiple GPUs?

TensorFlow provides strong support for distributing deep learning across multiple GPUs. TensorFlow is an open source platform that you can use to develop and train machine learning and deep learning models. TensorFlow operations can leverage both CPUs and GPUs. If you’re operating from Google Cloud Platform (GCP), you can also use TensorFlow Processing Units (TPUs), specially designed for TensorFlow operations. 

In this article, you will learn:

Distributed Training Strategies with TensorFlow

The primary distributed training method in TensorFlow is tf.distribute.Strategy. This method enables you to distribute your model training across machines, GPUs or TPUs. It is designed to be easy to use, provide strong out-of-the-box performance and enable you to switch between strategies easily. 

The distribute method also forms the base of several additional methods, including some experimental methods. 

Mirrored Strategy

tf.distribute.MirroredStrategy is a method that you can use to perform synchronous distributed training across multiple GPUs. Using this method, you can create replicas of your model variables which are mirrored across your GPUs. 

During operation, these mirrored variables are grouped into a MirroredVariable and kept in sync with all-reduce algorithms. The default algorithm used is the one implemented by NVIDIA NCCL; you can also specify another pre-built option or create a custom algorithm.

TPU Strategy

tf.distribute.experimental.TPUStrategy is a method you can use to distribute training across TPUs. This method works the same as MirroredStrategy. The difference is that it includes a different implementation of all-reduce that is customized to TPUs.

Multi Worker Mirrored Strategy

tf.distribute.experimental.MultiWorkerMirroredStrategy is a method that is similar to MirroredStrategy but enables you to spread your training across machines. This method uses a set of collectiveOps methods to sync variables across your workers. This set reduces your operations to a single unit in your TensorFlow graph, which then selects the appropriate all-reduce algorithm.

Central Storage Strategy

tf.distribute.experimental.CentralStorageStrategy is a method you can use to perform synchronous training from a central CPU. With this method, your variables are maintained centrally and operations are mirrored across your GPUs. This enables you to perform the same operations with different subsets of data. 

Parameter Server Strategy

tf.distribute.experimental.ParameterServerStrategy is a method that you can use to train parameter servers on multiple machines. Using this method, you separate your machines into parameter servers and workers. Your variables are distributed to the different parameter servers and your computations are replicated in the worker GPUs. 

Quick Tutorial 1: Distribution Strategy API With TensorFlow Estimator

In the following tutorial, the Estimator class is combined with MirroredStrategy to enable you to distribute your operations across GPUs. This is adapted from a more detailed tutorial based on the TensorFlow documentation here

  1. Start by defining your model function. A definition like the following can get you started:
def model_fn(features, labels, mode):

  layer = tf.layers.Dense(1)

  logits = layer(features)

  if mode == tf.estimator.ModeKeys.PREDICT:

    predictions = {"logits": logits}

    return tf.estimator.EstimatorSpec(mode, predictions=predictions)

  loss = tf.losses.mean_squared_error(

      labels=labels, predictions=tf.reshape(logits, []))

  if mode == tf.estimator.ModeKeys.EVAL:

    return tf.estimator.EstimatorSpec(mode, loss=loss)

  if mode == tf.estimator.ModeKeys.TRAIN:

    train_op = tf.train.GradientDescentOptimizer(0.2).minimize(loss)

    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

2. Next, define your input function. You use this function to supply the data you want to train your model on.

def input_fn():

  features = tf.data.Dataset.from_tensors([[1.]]).repeat(100)

  labels = tf.data.Dataset.from_tensors(1.).repeat(100)

  return tf.data.Dataset.zip((features, labels))

3. With your input and model defined, you can then define your estimator. This involves defining MirroredStrategy as your distribution type and Estimator as your classifier.

distribution = tf.contrib.distribute.MirroredStrategy()

config = tf.estimator.RunConfig(train_distribute=distribution)

classifier = tf.estimator.Estimator(model_fn=model_fn, config=config)



Quick Tutorial 2: Use Horovod in TensorFlow 

Horovod is an open source framework created to support distributed training of deep learning models through Keras and TensorFlow. It also supports Apache MXNet and PyTorch. Horovod was created to enable you to easily scale your GPU training scripts for use across many GPUs running in parallel. 

Here is a brief tutorial on how to enable the use of Horovod with TensorFlow. This requires modifying the standard Horovod program with the following additions. This tutorial is adapted from a more detailed tutorial found in the Horovod documentation here

  1. Start by importing Horovod and initializing it.
import horovod.tensorflow as hvd


2. Next, define the GPU order for your processes. You can set this as a custom order or using local ranks, which assigns processes sequentially. 

config = tf.ConfigProto()

config.gpu_options.visible_device_list = str(hvd.local_rank())

3. Update the learning rate of your model based on your workers. 

4. Then, you need to wrap your optimizer in hvd.DistributedOptimizer. This directs your gradient computations to the optimizer, averages your gradients and applies the averages to the model update. 

opt = hvd.DistributedOptimizer(opt)

Next, you need to ensure that your workers are consistently initialized. This is done by adding hvd.BroadcastGlobalVariablesHook(0), which broadcasts your initial variable states to all processes.

hooks = [hvd.BroadcastGlobalVariablesHook(0)]

Ensure that you are only saving your checkpoints on your 0 worker. This ensures that other workers do not corrupt checkpoints. 

checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

You can then run your training operation, which you have previously defined. This is done with the MonitoredTrainingSession method to automate session initialization, checkpoint operations, and error handling. 

with tf.train.MonitoredTrainingSession(



hooks=hooks) as mon_sess:

while not mon_sess.should_stop():


TensorFlow Multi GPU With Run:AI

Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed in TensorFlow and other deep learning frameworks. 

Here are some of the capabilities you gain when using Run:AI:

  • Run:AI provides a simple way to launch training jobs on Kubernetes, removing all of the heavy lifting of setting up clusters and configuring networking etc., as well as gang scheduling which ensures all workers are launched and end together, recover from failure correctly, etc.
  • Run:AI’s also offers Python scripts – essentially wrappers that use APIs to provide advanced capabilities – like gradient accumulation and elastic training

Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models. 

Learn more about the Run.ai GPU virtualization platform.