Question 1

How Can I Use TensorFlow with Multiple GPUs?

Accepted Answer

TensorFlow is an open source platform for developing and training machine learning and deep learning models, and it provides strong support for distributing training across multiple GPUs. TensorFlow operations can run on both CPUs and GPUs. If you are working on Google Cloud Platform (GCP), you can also use Tensor Processing Units (TPUs), accelerators that Google designed specifically for TensorFlow workloads.
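Before choosing a distribution strategy, it can help to confirm which devices TensorFlow can actually see. The following is a minimal sketch, assuming TensorFlow 2.x is installed; the variable names are illustrative only.

```python
import tensorflow as tf

# Ask TensorFlow which physical devices it has detected on this machine.
gpus = tf.config.list_physical_devices("GPU")
cpus = tf.config.list_physical_devices("CPU")

print(f"CPUs visible to TensorFlow: {len(cpus)}")
print(f"GPUs visible to TensorFlow: {len(gpus)}")
for gpu in gpus:
    # Each entry is a PhysicalDevice with a name and a device type.
    print(gpu.name, gpu.device_type)
```

If the GPU count is lower than expected, the usual suspects are the driver and CUDA installation or a restrictive CUDA_VISIBLE_DEVICES setting, not the training code.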

Question 2

Distributed Training Strategies with TensorFlow

Accepted Answer

The primary distributed training API in TensorFlow is tf.distribute.Strategy. It lets you distribute model training across machines, GPUs, or TPUs. It is designed to be easy to use, to provide strong out-of-the-box performance, and to let you switch between strategies easily. tf.distribute.Strategy is also the base class for several concrete strategies, some of which are experimental.

Mirrored Strategy

tf.distribute.MirroredStrategy performs synchronous distributed training across multiple GPUs on one machine. It creates one replica of each model variable per GPU. Together these copies form a single conceptual variable called a MirroredVariable, and they are kept in sync by applying identical updates through an all-reduce algorithm. The default algorithm is the one implemented by NVIDIA NCCL; you can also specify another pre-built option or write a custom algorithm.

TPU Strategy

tf.distribute.experimental.TPUStrategy (exposed as tf.distribute.TPUStrategy in recent TensorFlow 2.x releases) distributes training across TPUs. It works the same way as MirroredStrategy, except that it uses its own implementation of all-reduce, customized for TPUs.

Multi Worker Mirrored Strategy

tf.distribute.experimental.MultiWorkerMirroredStrategy (exposed as tf.distribute.MultiWorkerMirroredStrategy in recent TensorFlow 2.x releases) is similar to MirroredStrategy but lets you spread training across multiple machines, each of which may have several GPUs. It uses collective ops to keep variables in sync across workers. A collective op is a single op in the TensorFlow graph that selects an appropriate all-reduce algorithm at runtime based on hardware, network topology, and tensor sizes.

Central Storage Strategy

tf.distribute.experimental.CentralStorageStrategy performs synchronous training with variables held in one central place. Variables are not mirrored; they are placed on the CPU, and operations are replicated across all local GPUs. Each GPU therefore performs the same operations on a different subset of the data.
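The mirrored strategies above rest on one idea: every replica computes gradients on its own shard of the batch, an all-reduce step averages those gradients, and each replica applies the identical update so the copies never drift apart. The sketch below simulates that loop in plain Python for a tiny linear model. It is a conceptual illustration only, not how TensorFlow or NCCL implement all-reduce, and all function names (all_reduce_mean, local_gradient, train_step) are hypothetical.

```python
def all_reduce_mean(per_replica_grads):
    """Average gradients element-wise; every replica receives the same result."""
    n = len(per_replica_grads)
    return [sum(vals) / n for vals in zip(*per_replica_grads)]


def local_gradient(weights, shard):
    """Gradient of mean squared error for y = w . x, computed on one data shard."""
    grad = [0.0] * len(weights)
    for x, y in shard:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / len(shard)
    return grad


def train_step(replica_weights, shards, lr=0.1):
    """One synchronous step: local gradients, all-reduce, identical update."""
    grads = [local_gradient(w, s) for w, s in zip(replica_weights, shards)]
    avg = all_reduce_mean(grads)
    return [[wi - lr * gi for wi, gi in zip(w, avg)] for w in replica_weights]


# Toy data generated from y = 2*x0 - 1*x1, split evenly across two "GPUs".
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
shards = [data[:2], data[2:]]

# Both replicas start from the same weights, as mirrored variables do.
replicas = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(200):
    replicas = train_step(replicas, shards)

print("replica 0 weights:", replicas[0])
print("replica 1 weights:", replicas[1])
```

Because both replicas start from the same values and apply the same averaged gradient at every step, their weights remain identical throughout training, which is the property a MirroredVariable provides.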

Parameter Server Strategy

tf.distribute.experimental.ParameterServerStrategy (exposed as tf.distribute.ParameterServerStrategy in recent TensorFlow 2.x releases) supports parameter server training across multiple machines. With this strategy, you divide your machines into parameter servers and workers. Variables are placed on the parameter servers, and computation is replicated across the GPUs of all workers.
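The strategies above share a common usage pattern: create the strategy, then build and compile the model inside its scope so that the variables are created under the strategy's control. The following is a minimal sketch with MirroredStrategy and Keras, assuming TensorFlow 2.x; the model architecture, random data, and per-replica batch size of 32 are placeholders chosen only for illustration.

```python
import numpy as np
import tensorflow as tf

# Uses every GPU visible to TensorFlow. With no GPU available,
# MirroredStrategy falls back to a single replica on the CPU.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables created inside the scope are mirrored across replicas.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Keras treats batch_size as the global batch and splits it across
# replicas, so scale it with the number of replicas.
global_batch_size = 32 * strategy.num_replicas_in_sync

# Random placeholder data standing in for a real dataset.
x = np.random.rand(256, 8).astype("float32")
y = np.random.rand(256, 1).astype("float32")

history = model.fit(x, y, batch_size=global_batch_size, epochs=1, verbose=0)
print("Final loss:", history.history["loss"][-1])
```

Switching to another strategy usually means changing only the line that constructs it, although the multi-machine strategies also need cluster configuration that is not shown here.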

TensorFlow and Multiple GPUs

Five Strategies and Two Tutorials

How Can I Use TensorFlow with Multiple GPUs?

Distributed Training Strategies with TensorFlow

Mirrored Strategy

TPU Strategy

Multi Worker Mirrored Strategy

Central Storage Strategy

Parameter Server Strategy

Quick Tutorial 1: Distribution Strategy API With TensorFlow Estimator

Quick Tutorial 2: Use Horovod in TensorFlow

TensorFlow Multi GPU With Run:AI