Today, Run:AI published our own gradient accumulation mechanism for Keras. It is a generic implementation that can wrap any Keras optimizer, whether built-in or custom, and it enables gradient accumulation by adding a single line to your code, with no other code modifications required.
- To get started, follow the docs in the GitHub repo.
We published three blog posts that help explain the concept and how to use the code:
- The problem of deep learning batch size and limited GPU memory
- What is Gradient Accumulation and how does it help?
- How-to guide to using the gradient accumulation mechanism – with just a single line of code
We hope the tool will help both veteran data science teams and beginners train on large batch sizes even when GPU memory is limited, improving both the performance and the accuracy of their models.
When building a deep learning model, one of the critical hyperparameters that data scientists consider is the batch size: how many training examples (e.g. images) the neural network should process in each training iteration. However, the batch size is sometimes limited by the available memory of the GPUs that run the model. Deep learning models themselves are becoming bigger and more complex, so they take up more GPU memory, which further reduces the maximum possible batch size and can limit the achievable accuracy.
One solution to this problem is gradient accumulation. The idea is to split the batch into smaller mini-batches that are run sequentially, accumulating the gradients from each one. The model parameters are updated only once, using the accumulated gradients, after the last mini-batch has been processed. Gradient accumulation is a particularly good option when only a single GPU is available, because the mini-batches run one after another on that single resource.
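To make the idea concrete, here is a minimal sketch in plain Python, not the Run:AI implementation. It assumes a toy one-parameter model `y_hat = w * x` with a mean-squared-error loss and plain SGD. The function names, data, and learning rate are illustrative only. It compares a single update computed on the whole batch against an update built from sequential mini-batches.

```python
# Toy model: y_hat = w * x, loss = mean squared error over the batch.
# All names and values here are illustrative assumptions.

def grad_sum(w, xs, ys):
    """Sum (not mean) of per-example gradients d/dw (w*x - y)^2."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))

def full_batch_step(w, xs, ys, lr):
    """One SGD update using the gradient averaged over the whole batch."""
    return w - lr * grad_sum(w, xs, ys) / len(xs)

def accumulated_step(w, xs, ys, lr, mini_batch_size):
    """The same update, computed one mini-batch at a time."""
    accumulated = 0.0
    for start in range(0, len(xs), mini_batch_size):
        mb_x = xs[start:start + mini_batch_size]
        mb_y = ys[start:start + mini_batch_size]
        # w is NOT updated here; the gradients are only accumulated.
        accumulated += grad_sum(w, mb_x, mb_y)
    # Single parameter update after the last mini-batch.
    return w - lr * accumulated / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

w_full = full_batch_step(0.5, xs, ys, lr=0.01)
w_accum = accumulated_step(0.5, xs, ys, lr=0.01, mini_batch_size=2)
print(abs(w_full - w_accum) < 1e-9)
```

Both paths arrive at the same weight because every mini-batch gradient is evaluated at the same, not yet updated, value of `w`. In a real network only one mini-batch of activations needs to be held in memory at a time, which is where the memory saving comes from.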
Although the concept is simple, the mathematics and code required to implement gradient accumulation can be complicated.
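As a rough illustration of the bookkeeping involved, the sketch below wraps an inner optimizer so that it only sees averaged gradients every `steps` calls. This is a hypothetical toy, not Run:AI's code or the Keras API: `PlainSGD`, `AccumulatingOptimizer`, and the `apply_gradients(params, grads)` signature are all assumptions made for the example. It also assumes equally sized mini-batches, so a simple mean is correct.

```python
class PlainSGD:
    """Stand-in for any inner optimizer (hypothetical, not a Keras class)."""

    def __init__(self, lr):
        self.lr = lr

    def apply_gradients(self, params, grads):
        return [p - self.lr * g for p, g in zip(params, grads)]


class AccumulatingOptimizer:
    """Hypothetical wrapper: forwards averaged gradients every `steps` calls."""

    def __init__(self, inner, steps):
        self.inner = inner
        self.steps = steps
        self._buffer = None  # running sum of gradients
        self._count = 0      # mini-batches seen since the last update

    def apply_gradients(self, params, grads):
        if self._buffer is None:
            self._buffer = [0.0] * len(grads)
        self._buffer = [b + g for b, g in zip(self._buffer, grads)]
        self._count += 1
        if self._count < self.steps:
            return list(params)  # still accumulating: parameters unchanged
        # Last mini-batch: average, reset state, delegate to the inner optimizer.
        averaged = [b / self.steps for b in self._buffer]
        self._buffer = None
        self._count = 0
        return self.inner.apply_gradients(params, averaged)


opt = AccumulatingOptimizer(PlainSGD(lr=0.1), steps=2)
params = [1.0]
params = opt.apply_gradients(params, [2.0])  # buffered, no update yet
after_first = list(params)
params = opt.apply_gradients(params, [4.0])  # updates with mean gradient 3.0
```

Because the inner optimizer is only called on the final step, any state it keeps (momentum, for example) advances once per full batch rather than once per mini-batch. Getting that right across every optimizer is part of what makes a generic implementation harder than this toy suggests.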
We hope you’ll try it out and let us know what you think!