Gradient Accumulation: Overcoming Memory Constraints in Deep Learning

Mayukh Bhattacharyya
Towards Data Science
4 min read · May 25, 2020


Photo by Daniel von Appen on Unsplash

Let’s be honest. Deep Learning without GPUs is a big headache! Yes, Google Colab and Kaggle are there, but life and work aren’t always about training a neat and cool MNIST classifier.

Introduction

Training state-of-the-art (SOTA) models practically requires a GPU. And even if we manage to procure one, there comes the problem of memory constraints. We are more or less accustomed to seeing the OOM (Out of Memory) error whenever we throw a large batch at training. The problem is far more apparent with state-of-the-art computer vision algorithms. We have covered a lot of ground since the days of VGG or even ResNet-18. Modern, deeper architectures like UNet, ResNet-152, RCNN, and Mask-RCNN are extremely memory intensive, so there is a high probability of running out of memory while training them.

Here is an OOM error encountered while running a model in PyTorch.

RuntimeError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 10.76 GiB total capacity; 9.46 GiB already allocated; 30.94 MiB free; 9.87 GiB reserved in total by PyTorch)

There are usually two fixes practitioners reach for instantly when they encounter the OOM error.

  1. Reduce batch size
  2. Reduce image dimensions

In over 90% of cases, these two solutions are more than enough. So the question you want to ask is: why do the remaining cases need something else? To answer that, let’s check out the images below.

From Kaggle notebook of Dimitre Oliveira

It’s from the Kaggle competition Understanding Clouds from Satellite Images, where the task was to correctly segment the different types of clouds. Now, these images are of very high resolution, 1400 x 2100. As you can imagine, reducing the image dimensions too much will have a very negative impact in this scenario, since the minute patterns and textures are important features to learn here. Hence the only other option is to reduce the batch size.

Gradient Descent

As a refresher, if you happen to remember gradient descent, or specifically mini-batch gradient descent in our case, you’ll recall that instead of calculating the loss and the eventual gradients on the whole dataset, we do the operation on smaller batches. Besides helping us fit the data into memory, it also helps us converge faster, since the parameters are updated after each mini-batch. But what happens when the batch size becomes too small, as in the case above? Taking a rough estimate that maybe 4 such images fit into a single batch on an 11 GB GPU, the loss and the gradients calculated will not accurately represent the whole dataset. As a result, the model will converge a lot slower, or worse, not converge at all.
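
To contrast with the accumulation loop shown later, here is a minimal sketch of a plain mini-batch training step in PyTorch; model, optimizer, criterion, and data_loader are assumed to be defined elsewhere.

for inputs, targets in data_loader:
    optimizer.zero_grad()               # reset gradients from the previous batch
    outputs = model(inputs)             # forward pass
    loss = criterion(outputs, targets)  # loss on this mini-batch only
    loss.backward()                     # gradients for this mini-batch
    optimizer.step()                    # parameters updated after every batch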

Gradient Accumulation

The idea behind gradient accumulation is stupidly simple. It calculates the loss and gradients after each mini-batch, but instead of updating the model parameters, it waits and accumulates the gradients over consecutive batches. Then it updates the parameters based on the cumulative gradient after a specified number of batches. It serves the same purpose as a mini-batch with a higher number of images.

Example: If you run gradient accumulation with 5 steps and a batch size of 4 images, it serves almost the same purpose as running with a batch size of 20 images.

Implementation

Coding the gradient accumulation part is also ridiculously easy in PyTorch. All you need to do is compute and backpropagate the loss at each batch, but update the model parameters only after a set number of batches that you choose.
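
Below is a minimal sketch of such a loop, assuming model, optimizer, criterion, and data_loader are already defined and accumulation_steps is the number of batches to accumulate over. The loss is divided by accumulation_steps so that the accumulated gradient matches the average over the larger effective batch.

accumulation_steps = 5  # number of batches to accumulate over

model.zero_grad()                                   # reset gradients
for i, (inputs, targets) in enumerate(data_loader):
    outputs = model(inputs)                         # forward pass
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps                # normalize for the effective batch
    loss.backward()                                 # accumulate gradients
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                            # update parameters
        model.zero_grad()                           # reset the accumulated gradients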

We hold onto optimizer.step(), which updates the parameters, for accumulation_steps number of batches. Also, model.zero_grad() is called at the same time to reset the accumulated gradients.

Doing the same thing is a little trickier in Keras/TensorFlow. There are different versions written by people that you’ll find on the internet. Here’s one of those, written by @alexeydevederkin.

Caution: It is much longer and more complex than the PyTorch code, due to the lack of modularity in the Keras training process.

There are also pure TensorFlow versions available, which are smaller in size. You’ll find those easily.
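
As a rough illustration (not any particular author’s code), a pure TensorFlow 2.x custom training loop with accumulation can look like the sketch below; model, optimizer, loss_fn, and dataset are assumed to be defined.

import tensorflow as tf

accumulation_steps = 5
accumulated = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (inputs, targets) in enumerate(dataset):
    with tf.GradientTape() as tape:
        predictions = model(inputs, training=True)
        loss = loss_fn(targets, predictions) / accumulation_steps
    grads = tape.gradient(loss, model.trainable_variables)
    accumulated = [acc + g for acc, g in zip(accumulated, grads)]  # accumulate
    if (step + 1) % accumulation_steps == 0:
        optimizer.apply_gradients(zip(accumulated, model.trainable_variables))
        accumulated = [tf.zeros_like(v) for v in model.trainable_variables]  # reset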

Gradient accumulation is a great tool for hobbyists with limited compute, or even for practitioners intending to use images without scaling them down. Whichever one you are, it is always a handy trick to have in your armory.
