Keras Multi GPU: A Practical Guide

Keras is a deep learning API you can use to perform fast distributed training on multiple GPUs. Distributed training with GPUs enables you to perform training tasks in parallel, distributing your model training over multiple resources. You can do this via model parallelism or data parallelism. This article explains how Keras multi GPU works and examines tips for managing the limitations of multi GPU training with Keras.

If you are working with other deep learning frameworks, check out our articles about PyTorch multi GPU and TensorFlow multiple GPU.

What Is Keras?

Keras is a deep learning API that is based on the TensorFlow platform. It was designed to allow fast experimentation and easy model building, including training on multiple graphics processing units (GPUs). Keras is broadly supported and can be used with TensorFlow, CNTK, Theano, MXNet and PlaidML.

What is Distributed Training with GPUs?

Keras enables you to distribute your model training tasks over multiple resources, performing training tasks in parallel. Distributed training is an essential part of deep learning. It enables you to leverage multiple CPUs or GPUs and drastically reduces the amount of time needed to train models.

When using distributed training, there are two implementation methods you can choose from—model parallelism and data parallelism. These implementations can be used individually or in combination, depending on your model requirements.

Model parallelism

Model parallelism segments your model into parts that can then be run in parallel. Parts are trained individually and the results of each part are rejoined with the whole. 

This method enables you to run each segment on a different resource using the same data. This limits the amount of communication that is needed between workers to only that required for synchronization of shared parameters. You can also use this method with multiple GPUs in a single server.
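As a rough illustration of the idea, the sketch below splits the weights of a single dense layer into two parts in plain NumPy. Each part processes the same input, and the partial results are joined afterwards. The "worker" labels and variable names are hypothetical, and no real device placement happens here.

```python
import numpy as np

rng = np.random.default_rng(0)

# One dense layer with 8 output units, split column-wise into two segments.
# In a real setup each segment would live on a different device.
weights = rng.normal(size=(4, 8))
weights_a = weights[:, :4]   # segment for (hypothetical) worker 0
weights_b = weights[:, 4:]   # segment for (hypothetical) worker 1

x = rng.normal(size=(16, 4))  # the same batch is sent to both workers

# Each worker computes its part of the layer output independently.
part_a = x @ weights_a
part_b = x @ weights_b

# The partial results are rejoined to form the full layer output.
joined = np.concatenate([part_a, part_b], axis=1)

# Reference: the unsplit layer evaluated in one piece.
reference = x @ weights
print(joined.shape, np.allclose(joined, reference))
```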

Data parallelism

Data parallelism segments your training data into parts that can be run in parallel. Using copies of your model, you run each subset on a different resource. This is the most commonly used type of distributed training.

This method requires that you synchronize model parameters during subset training. If you do not, your prediction errors will not align between subsets. Because of this, data parallelism implementations require communications between workers so changes can be synced.
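To make the synchronization step concrete, here is a minimal NumPy sketch of data parallelism for a toy linear model. It assumes two replicas with equally sized shards and a mean squared error loss; the function and variable names are illustrative only. Averaging the per-shard gradients reproduces the gradient of the full batch, which is why every replica can apply the same update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression data: 8 samples, 3 features.
x = rng.normal(size=(8, 3))
y = rng.normal(size=(8,))
w = np.zeros(3)  # every replica starts from the same weights

def mse_gradient(x_part, y_part, weights):
    """Gradient of mean squared error for a linear model on one shard."""
    error = x_part @ weights - y_part
    return 2.0 * x_part.T @ error / len(y_part)

# Split the batch into equal shards, one per (hypothetical) replica.
num_replicas = 2
x_shards = np.split(x, num_replicas)
y_shards = np.split(y, num_replicas)

# Each replica computes a gradient on its own shard.
local_grads = [mse_gradient(xs, ys, w) for xs, ys in zip(x_shards, y_shards)]

# The synchronization step averages the gradients across replicas.
synced_grad = np.mean(local_grads, axis=0)

# With equal shard sizes this matches the gradient on the full batch.
full_grad = mse_gradient(x, y, w)
print(np.allclose(synced_grad, full_grad))
```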

Keras Multi GPU and Distributed Training

First we should note that distributed training, as it is called in the Keras framework, may refer to two types of scalability:

  • Single worker—distributed training across multiple GPUs in the same physical server
  • Multi worker—distributed training across multiple GPUs on multiple physical servers

The discussion below refers to distributed training with single worker scalability – distributing workloads across multiple servers is more complex and is beyond the scope of this article. Below you can learn about Run:AI, which can help you automatically distribute workloads on any number of physical machines.

When using multiple GPUs in Keras, there are a few aspects that are helpful to know to get you started. The following section covers the basics of Keras multi-GPU training and provides some tips you can apply to improve your performance.

How it works

Keras distributes workloads through the tf.distribute.Strategy API. Its implementations include tf.distribute.MirroredStrategy and tf.distribute.TPUStrategy (available as tf.distribute.experimental.TPUStrategy in older TensorFlow releases).

Below we describe how to work with MirroredStrategy, which lets you perform synchronous distributed training on multiple GPUs on a single machine.

When using multi-GPU training, you run your model through the same series of steps for each segment. Below is an overview of the steps that are performed when using data parallelism with MirroredStrategy:

  • A global batch is segmented into local batches with data being evenly distributed.
  • One model replica is created per device, and each replica processes its assigned local batch. This involves a forward pass and a backward pass that outputs the gradient of the loss with respect to the model weights on that local batch.
  • The weight updates from the local gradients are then merged across the replicas so that all replicas stay in sync.

How to use it

Performing this sort of multi-GPU training with Keras requires the tf.distribute.MirroredStrategy API. Using this API, you must:

  • Instantiate a MirroredStrategy. During this process you have the option of configuring specific devices or using the default, which uses all available GPUs.
  • With your strategy object, open a scope and create any Keras objects and variables needed. Generally, this requires you to create and compile your model within the distribution scope.
  • Train your model using fit() as usual. It is recommended to use tf.data.Dataset objects to load your data.
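The three steps above can be sketched as follows. This is a minimal example, assuming TensorFlow 2.x. The model architecture, the synthetic data, and the per-replica batch size of 32 are placeholders rather than recommendations. On a machine without GPUs, MirroredStrategy falls back to a single replica, so the sketch should still run.

```python
import numpy as np
import tensorflow as tf

# Step 1: instantiate the strategy. By default it uses all visible GPUs.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Scale the global batch with the number of replicas (a common convention).
per_replica_batch = 32
global_batch = per_replica_batch * strategy.num_replicas_in_sync

# Synthetic data standing in for a real training set.
features = np.random.rand(256, 10).astype("float32")
labels = np.random.randint(0, 2, size=(256, 1)).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(global_batch)

# Step 2: create and compile the model inside the strategy scope so that
# its variables are mirrored across replicas.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

# Step 3: call fit() as usual; each global batch is split across replicas.
history = model.fit(dataset, epochs=1, verbose=0)
```

To restrict training to specific devices, MirroredStrategy also accepts a devices argument, for example devices=["GPU:0", "GPU:1"].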

Using Keras callbacks to ensure fault tolerance

Fault tolerance is very important in distributed training since there are more operations that can experience errors. Having a strategy to recover in the event of failures can help ensure the accuracy of your model and prevent time spent redoing computations. 

With Keras, the easiest way to build in fault tolerance is to pass a ModelCheckpoint callback to fit(). This callback saves your model at regular intervals; you can then use these savepoints to restore your model if something goes wrong.
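A minimal sketch of this pattern is shown below, assuming TensorFlow 2.x. The temporary directory, the file name pattern, and the build_model helper are hypothetical choices for illustration. The example saves weights only, once per epoch, and then restores the most recent savepoint into a freshly built model.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

def build_model():
    """Small placeholder model; substitute your own architecture."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

features = np.random.rand(64, 10).astype("float32")
targets = np.random.rand(64, 1).astype("float32")

checkpoint_dir = tempfile.mkdtemp()  # hypothetical location for savepoints
checkpoint_path = os.path.join(checkpoint_dir, "ckpt-{epoch:02d}.weights.h5")

# Save the model weights at the end of every epoch.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    save_weights_only=True,
    save_freq="epoch",
)

model = build_model()
model.fit(features, targets, epochs=2, batch_size=16,
          callbacks=[checkpoint_cb], verbose=0)

# After a failure, rebuild the model and restore the latest savepoint.
saved = sorted(os.listdir(checkpoint_dir))
restored = build_model()
restored.load_weights(os.path.join(checkpoint_dir, saved[-1]))
```

When training with a distribution strategy, the model would be built inside the strategy scope; the callback itself is passed to fit() in the same way.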

tf.data Performance Tips

As previously mentioned, loading your data with a tf.data pipeline is the recommended method. When using this method, there are also a few tips that can help you increase your efficiency. 

Calling dataset.cache()

Calling .cache() on a dataset enables you to cache data after your first iteration. Each subsequent iteration can then use this cache, eliminating loading time. Caching can be a valuable time saver when your data remains the same between iterations. It is also useful if you are reading data from a remote filesystem or your workflow is IO-bound.

Calling dataset.prefetch(buffer_size)

Calling .prefetch(buffer_size) prepares upcoming batches in the background while your model trains on the current one. This makes your pipeline asynchronous: new samples are preprocessed while the GPU is busy, so the next batch is already buffered when the current step finishes. Prefetching reduces the amount of time your resources sit unused between iterations.
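The two calls are often combined in a single input pipeline. The sketch below assumes TensorFlow 2.4 or later (for tf.data.AUTOTUNE) and uses synthetic data; the preprocess function is a placeholder for a more expensive transformation.

```python
import numpy as np
import tensorflow as tf

features = np.random.rand(100, 4).astype("float32")
labels = np.random.randint(0, 2, size=(100,)).astype("int32")

def preprocess(x, y):
    """Placeholder for a more expensive transformation."""
    return x * 2.0, y

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                      # keep preprocessed samples after the first pass
    .shuffle(buffer_size=100)     # shuffle after cache so each epoch differs
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)   # prepare upcoming batches in the background
)

# Iterate once over the full dataset; this first pass fills the cache.
batch_shapes = [tuple(batch_x.shape) for batch_x, _ in dataset]
print(batch_shapes)
```

Placing cache() after the expensive map() step means the transformation runs only during the first epoch. Passing tf.data.AUTOTUNE lets the runtime pick the prefetch buffer size instead of hard-coding one.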

Tips for Managing the Limitations of Multi GPU Training with Keras 

When using Keras, there are advantages and limitations to your ability to perform multi-GPU training. Below are a few limitations to be aware of and how to handle these limitations.

Keras Multi GPU training is not automatic

Using single GPU configurations with Keras and TensorFlow is straightforward. Provided you are using an NVIDIA GPU and you have the CUDA libraries installed, use of the GPU is automatic. However, this isn’t the case for scenarios with multiple GPUs.

To use multiple GPUs with older Keras releases, you can use the multi_gpu_model utility function. This function copies your model across GPUs and automatically splits your input across them, aggregating the results afterwards. However, keep in mind that this approach does not scale linearly with the number of GPUs due to the synchronization required. Note that multi_gpu_model has been deprecated and removed from recent TensorFlow releases in favor of tf.distribute.MirroredStrategy.
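Because multi-GPU training has to be requested explicitly, it can help to check what TensorFlow actually sees before choosing a setup. The following sketch assumes TensorFlow 2.x; the rule of switching to MirroredStrategy only when more than one GPU is visible is an illustrative choice, not a requirement.

```python
import tensorflow as tf

# List the GPUs TensorFlow can see; an empty list means training runs on CPU.
gpus = tf.config.list_physical_devices("GPU")
print(f"Visible GPUs: {len(gpus)}")

if len(gpus) > 1:
    # More than one GPU: opt in to multi-GPU training explicitly.
    strategy = tf.distribute.MirroredStrategy()
else:
    # Zero or one GPU: the default strategy keeps single-device behavior.
    strategy = tf.distribute.get_strategy()

print("Replicas in sync:", strategy.num_replicas_in_sync)
```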

Saving your parallel models

Once your training is finished, you may want to persist your trained weights. Unfortunately, you can’t just call the save() method on a parallel model created with multi_gpu_model, because Keras does not support saving parallel models.

To get around this, you can either call save() on the original model reference or you can serialize your model. The former works because the original model shares its weights with the parallel copy, so its weights are already up to date. The latter requires some manual cleanup of synchronization connections.

GPU data bottlenecks

Often, preprocessing calculations are the most expensive aspect of training deep learning models. These calculations require data to be preprocessed in your CPUs and then fed to the GPUs. This goes smoothly as long as preprocessing is relatively simple and data isn’t bottlenecked in the CPU. If it is, your GPUs are left sitting idle while waiting for data to process. 

While Keras can perform your preprocessing calculations in parallel, this is bottlenecked by Python’s Global Interpreter Lock (GIL), which prevents true multithreading. The easiest way to manage this is to simplify your preprocessing as much as possible. 

You can typically do this using standard generators. However, if you need to use custom generators, try to offload some of the work to other libraries, like NumPy. These libraries can release the GIL and enable you to access a greater degree of parallelism.
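As an illustration, the custom generator below does its per-batch arithmetic with vectorized NumPy operations instead of a Python loop over individual samples. The batch_generator name, the normalization steps, and the synthetic image data are all assumptions made for this sketch.

```python
import numpy as np

def batch_generator(images, labels, batch_size):
    """Yield normalized batches using vectorized NumPy operations.

    The per-batch arithmetic runs inside NumPy's compiled code rather
    than in a Python-level loop over individual samples.
    """
    num_samples = len(images)
    for start in range(0, num_samples, batch_size):
        batch = images[start:start + batch_size].astype("float32")
        # Scale pixel values to [0, 1], then center each image.
        batch /= 255.0
        batch -= batch.mean(axis=(1, 2, 3), keepdims=True)
        yield batch, labels[start:start + batch_size]

# Synthetic stand-in for an image dataset: 10 RGB images of 8x8 pixels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 8, 8, 3), dtype=np.uint8)
labels = rng.integers(0, 2, size=(10,))

batches = list(batch_generator(images, labels, batch_size=4))
print(len(batches), batches[0][0].shape)
```

Keeping the Python-level work down to one loop iteration per batch leaves less time spent holding the GIL, which is the effect the paragraph above describes.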

Keras Multi GPU With Run:AI

Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed in Keras and other deep learning frameworks. 

Here are some of the capabilities you gain when using Run:AI: 

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models. 

Learn more about the Run:AI GPU virtualization platform.