Run:AI Open Sourced a Tool for Gradient Accumulation

Overcoming the problem of batch size and available GPU memory in training neural networks

Run:AI has developed and open sourced a feature known as “gradient accumulation”, which lets users run training jobs even when there are not enough available resources. Instead of computing an update from one large batch, the model processes a smaller set of samples and computes gradients, then processes another set and computes additional gradients. These gradients, the updates to the model, are accumulated across the smaller batches. Their average is then applied as a single update, so the model trains much as if it had processed the full batch at once, even with limited resources for the job.
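The mechanics can be illustrated with a short sketch. The snippet below uses PyTorch-style APIs and assumes `model`, `data_loader`, `loss_fn`, and `optimizer` are defined elsewhere; it shows the general gradient accumulation technique, not Run:AI’s own implementation.

```python
import torch

# Illustrative values; in practice these depend on memory and batch-size targets.
accumulation_steps = 4  # number of micro-batches accumulated per optimizer step

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data_loader):
    outputs = model(inputs)
    # Scale the loss so the accumulated gradients equal the average over the
    # accumulation window, matching a single large batch.
    loss = loss_fn(outputs, targets) / accumulation_steps
    loss.backward()  # gradients are added into the existing .grad buffers

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # apply the accumulated (averaged) gradients
        optimizer.zero_grad()  # reset for the next accumulation window
```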

Consider the following example. A single machine has two GPUs, and one of them is already in use by another job. When a new job arrives that requires both GPUs, it would typically fall into a pending state, which is inefficient since one GPU is idle and, in theory, could be used. With Run:AI, no job falls into a pending state. Instead, using Run:AI’s elasticity feature, the workload is essentially shrunk and still runs on a single GPU. It will run more slowly, but the data scientist can begin the job immediately. When the second GPU becomes available, the job dynamically expands back to use both GPUs.
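To see why shrinking the job does not change its training behavior, consider a hypothetical configuration: a job set up for a global batch of 128 across two GPUs (64 per GPU) can instead process two micro-batches of 64 on a single GPU and accumulate their gradients before each optimizer step. The bookkeeping below is purely illustrative; the variable names are assumptions, not part of Run:AI’s API.

```python
# Hypothetical numbers showing how accumulation preserves the effective batch size.
requested_gpus = 2   # GPUs the job was configured to use
available_gpus = 1   # GPUs the scheduler can currently provide
per_gpu_batch = 64   # micro-batch that fits on one GPU

# Accumulate one micro-batch for each "missing" GPU so the effective
# batch size, and therefore the optimization behavior, stays the same.
accumulation_steps = requested_gpus // available_gpus  # -> 2
effective_batch = per_gpu_batch * accumulation_steps   # -> 128
print(accumulation_steps, effective_batch)
```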