Keras is a deep learning API that you can use to run fast distributed training across multiple GPUs. Distributed training with GPUs enables you to perform training tasks in parallel, distributing your model training over multiple resources. You can do that via model parallelism or data parallelism. This article explains how Keras multi-GPU training works and offers tips for managing its limitations.
Keras is a deep learning API that is based on the TensorFlow platform. It was designed to allow fast experimentation and easy model building, including training on multiple graphics processing units (GPUs). Keras is broadly supported and can be used with TensorFlow, CNTK, Theano, MXNet and PlaidML.
Keras enables you to distribute your model training tasks over multiple resources, performing training tasks in parallel. Distributed training is an essential part of deep learning. It enables you to leverage multiple CPUs or GPUs and drastically reduces the amount of time needed to train models.
When using distributed training, there are two implementation methods you can choose from: model parallelism and data parallelism. These methods can be used individually or in combination, depending on your model requirements.
Model parallelism segments your model into parts that can then be run in parallel. Parts are trained individually and the results of each part are rejoined with the whole.
This method enables you to run each segment on a different resource using the same data. This limits the amount of communication that is needed between workers to only that required for synchronization of shared parameters. You can also use this method with multiple GPUs in a single server.
Data parallelism segments your training data into parts that can be run in parallel. Using copies of your model, you run each subset on a different resource. This is the most commonly used type of distributed training.
This method requires that you synchronize model parameters during subset training. If you do not, your prediction errors will not align between subsets. Because of this, data parallelism implementations require communications between workers so changes can be synced.
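The synchronization requirement can be illustrated with a toy example. The sketch below is not Keras code: it simulates two "workers" with a plain Python loop and a hypothetical one-parameter linear model, and the helper names (`gradient`, `split`, `train_step`) are invented for illustration. Each worker computes a gradient on its own shard of the batch, and the gradients are averaged so that every model copy applies the same update.

```python
# Toy synchronous data parallelism for a 1-D linear model y = w * x.
# Pure Python; "workers" are simulated with a loop, not real devices.

def gradient(w, shard):
    """Mean gradient of the loss 0.5 * (w*x - y)^2 over one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def split(batch, num_workers):
    """Split a batch into equally sized shards, one per worker."""
    size = len(batch) // num_workers
    return [batch[i * size:(i + 1) * size] for i in range(num_workers)]

def train_step(w, batch, num_workers, lr=0.1):
    shards = split(batch, num_workers)
    # Each worker computes a gradient on its own shard using its model copy.
    local_grads = [gradient(w, shard) for shard in shards]
    # Synchronization: average the gradients so every copy gets the same update.
    synced_grad = sum(local_grads) / num_workers
    return w - lr * synced_grad

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
w = 0.0
for _ in range(50):
    w = train_step(w, batch, num_workers=2)
print(round(w, 3))
```

With equally sized shards, the averaged gradient matches the gradient of the full batch, which is why the synchronized copies behave like a single model trained on all of the data.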
First, note that distributed training, as the term is used in the Keras framework, may refer to two types of scalability: single-worker scalability, where training is spread over multiple GPUs on one machine, and multi-worker scalability, where training is spread over multiple machines.
The discussion below refers to distributed training with single-worker scalability. Distributing workloads across multiple servers is more complex and is beyond the scope of this article. Below you can learn about Run:AI, which can help you automatically distribute workloads on any number of physical machines.
When using multiple GPUs in Keras, there are a few aspects that are helpful to know to get you started. The following section covers the basics of Keras multi-GPU training and provides some tips you can apply to improve your performance.
How it works
Keras offers several workload distribution strategies through the tf.distribute.Strategy API, including tf.distribute.MirroredStrategy and tf.distribute.experimental.TPUStrategy (exposed as tf.distribute.TPUStrategy in newer TensorFlow releases).
Below we describe how to work with MirroredStrategy, which lets you perform synchronous distributed training on multiple GPUs on a single machine.
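When using multi-GPU training, you run your model through the same series of steps for each segment. With data parallelism and MirroredStrategy, each global batch of data is split into smaller local batches, one per GPU. Each GPU holds a replica of the model and runs a forward and backward pass on its local batch. The gradients from all replicas are then combined, and the same weight update is applied to every replica so that the copies stay in sync.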
How to use it
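Performing this sort of multi-GPU training with Keras requires the tf.distribute.MirroredStrategy API. Using this API, you must create a MirroredStrategy object, build and compile your model inside the strategy's scope, and then train the model with fit() as usual.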
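A minimal sketch of MirroredStrategy usage might look like the following. The model architecture, the random placeholder data, and the per-replica batch size of 32 are assumptions made for illustration and do not come from the article. When no GPU is visible, the strategy runs on a single device, so the same script can be tried on a CPU-only machine.

```python
import numpy as np
import tensorflow as tf

# Create the strategy. With no arguments it uses every GPU visible to
# TensorFlow and runs on a single device if none are available.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Everything that creates variables (model, optimizer) goes inside the scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Placeholder data: 256 random samples with 8 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(256, 8)).astype("float32")
targets = rng.normal(size=(256, 1)).astype("float32")

# Batch with the global batch size; each replica receives
# global_batch_size / num_replicas samples per step.
global_batch_size = 32 * strategy.num_replicas_in_sync
dataset = tf.data.Dataset.from_tensor_slices((features, targets))
dataset = dataset.batch(global_batch_size)

# Training is launched with fit() exactly as in the single-device case.
history = model.fit(dataset, epochs=2, verbose=0)
```

Scaling the global batch size with the number of replicas keeps the per-GPU batch size constant as GPUs are added, which is a common choice but not the only one.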
Using Keras callbacks to ensure fault tolerance
Fault tolerance is very important in distributed training since there are more operations that can experience errors. Having a strategy to recover in the event of failures can help ensure the accuracy of your model and prevent time spent redoing computations.
With Keras, the easiest way to build in fault tolerance is to pass a ModelCheckpoint callback to fit(). This callback saves your model at regular intervals; you can then use these checkpoints to restore your model if something goes wrong.
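As a rough illustration, the sketch below saves the model weights at the end of every epoch and then restores the latest checkpoint into a fresh model. The temporary directory, the file name pattern, the tiny model, and the random data are all hypothetical choices for the example.

```python
import glob
import os
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical checkpoint location; {epoch:02d} is filled in by the callback.
checkpoint_dir = tempfile.mkdtemp()
checkpoint_path = os.path.join(checkpoint_dir, "ckpt-{epoch:02d}.weights.h5")

def build_model():
    """Small placeholder model used for both training and restoring."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")
    return model

x = np.random.rand(64, 4).astype("float32")
y = np.random.rand(64, 1).astype("float32")

# Save only the weights; by default a checkpoint is written after each epoch.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    save_weights_only=True,
)

model = build_model()
model.fit(x, y, epochs=3, batch_size=16, callbacks=[checkpoint_cb], verbose=0)

# After a failure, rebuild the model and load the most recent checkpoint.
saved = sorted(glob.glob(os.path.join(checkpoint_dir, "*.weights.h5")))
restored = build_model()
restored.load_weights(saved[-1])
```

Saving weights only keeps checkpoints small; saving the full model instead would also preserve the optimizer state, at the cost of larger files.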
Loading your data with a tf.data pipeline is the recommended method. When using this method, there are also a few tips that can help you increase your efficiency.
Calling .cache() on a dataset enables you to cache data after your first iteration. Each subsequent iteration can then use this cache, eliminating loading time. Caching can be a valuable time saver when your data remains the same between iterations. It is also useful if you are reading data from a remote filesystem or your workflow is IO-bound.
Calling .prefetch(buffer_size) enables you to prepare upcoming batches ahead of your next iteration. This makes the pipeline asynchronous: new samples are processed while your model trains on the current batch. Prefetching reduces the time your resources sit idle and lets training move to the next iteration as soon as one finishes.
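Both calls can be chained onto a tf.data pipeline. The following sketch uses random in-memory arrays as stand-in data; the array shapes, the shuffle buffer, and the batch size of 64 are illustrative assumptions, and `tf.data.AUTOTUNE` lets TensorFlow pick the prefetch buffer size.

```python
import numpy as np
import tensorflow as tf

# Placeholder data standing in for a real training set.
features = np.random.rand(1000, 8).astype("float32")
labels = np.random.randint(0, 2, size=(1000,)).astype("int32")

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .cache()                     # keep elements in memory after the first pass
    .shuffle(buffer_size=1000)   # reshuffle on every epoch
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)  # prepare next batches while the model trains
)

# Iterating once fills the cache; later epochs read from it.
num_batches = sum(1 for _ in dataset)
print("Batches per epoch:", num_batches)
```

Placing cache() before shuffle() means the raw elements are cached while the order still changes between epochs.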
When using Keras, there are advantages and limitations to your ability to perform multi-GPU training. Below are a few limitations to be aware of and how to handle these limitations.
Keras Multi GPU training is not automatic
Using single-GPU configurations with Keras and TensorFlow is straightforward. Provided you are using NVIDIA GPUs and you have the CUDA libraries installed, use of the GPU is automatic. However, this isn’t the case for scenarios with multiple GPUs.
To use multiple GPUs with older versions of Keras, you can use the multi_gpu_model method. This method enables you to copy your model across GPUs. When used, it automatically splits your input across GPUs and aggregates the results afterward. However, keep in mind that this method does not scale linearly with the number of GPUs due to the synchronization required. Note also that multi_gpu_model has since been deprecated in favor of tf.distribute.MirroredStrategy.
Saving your parallel models
Once your training is finished, you may want to persist your trained weights. Unfortunately, you can’t just call the save() method on the parallel model, because Keras does not support saving parallel models.
To get around this, you can either call save() on the original model reference or you can serialize your model. The former automatically picks up your updated weights, while the latter requires some manual cleanup of synchronization connections.
GPU data bottlenecks
Often, preprocessing calculations are the most expensive aspect of training deep learning models. These calculations require data to be preprocessed in your CPUs and then fed to the GPUs. This goes smoothly as long as preprocessing is relatively simple and data isn’t bottlenecked in the CPU. If it is, your GPUs are left sitting idle while waiting for data to process.
While Keras can perform your preprocessing calculations in parallel, this is bottlenecked by Python’s Global Interpreter Lock (GIL), which prevents true multithreading. The easiest way to manage this is to simplify your preprocessing as much as possible.
You can typically do this using standard generators. However, if you need to use custom generators, try to offload some of the work to other libraries, like NumPy. These libraries can release the GIL and enable you to access a greater degree of parallelism.
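One way to apply this advice is to keep per-sample Python loops out of a custom generator and let NumPy process a whole batch at once. The sketch below is only an illustration: the generator name `batch_generator`, the image shapes, and the chosen preprocessing (scaling and per-image standardization) are assumptions, not steps prescribed by Keras.

```python
import numpy as np

def batch_generator(images, labels, batch_size):
    """Yield (batch_images, batch_labels) using vectorized preprocessing."""
    num_samples = images.shape[0]
    for start in range(0, num_samples, batch_size):
        # Slice a whole batch and convert it in a single NumPy call.
        batch = images[start:start + batch_size].astype("float32")
        batch /= 255.0  # scale pixel values to [0, 1]
        # Standardize each image with array operations, not Python loops.
        mean = batch.mean(axis=(1, 2, 3), keepdims=True)
        std = batch.std(axis=(1, 2, 3), keepdims=True) + 1e-7
        yield (batch - mean) / std, labels[start:start + batch_size]

# Placeholder data: 100 random 32x32 RGB images with integer labels.
images = np.random.randint(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)
labels = np.random.randint(0, 10, size=(100,))

batches = list(batch_generator(images, labels, batch_size=32))
print(len(batches), batches[0][0].shape)
```

The only Python-level loop here runs once per batch, so the heavy arithmetic happens inside NumPy rather than in interpreted code.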
Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute-intensive experiments as needed in Keras and other deep learning frameworks.
Here are some of the capabilities you gain when using Run:AI:
Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.