What is Gradient Accumulation in Deep Learning?

The backpropagation process of neural networks explained, and how gradient accumulation builds on it

Raz Rotenberg
Towards Data Science

In another article, we addressed the problem of batch size being limited by GPU memory and how gradient accumulation helps overcome it.

In this post, we will first examine the backpropagation process of a neural network and then go through the technical and algorithmic details of gradient accumulation. We will discuss how it works and walk through an example.

What is Gradient Accumulation?

Gradient accumulation is a mechanism to split the batch of samples used for training a neural network into several mini-batches that are run sequentially.


Before going further into gradient accumulation, it is worth examining the backpropagation process of a neural network.

Backpropagation of a neural network

A deep learning model consists of many layers connected to each other, and in every step the samples propagate through all of them in the forward pass. After propagating through all the layers, the network generates predictions for the samples and then calculates the loss value for each sample, which specifies “how wrong was the network for this sample?”. The neural network then computes the gradients of those loss values with respect to the model parameters. These gradients are then used to calculate the updates for the respective variables.
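
To make this concrete, here is a minimal sketch of one forward and backward pass in PyTorch; the layer sizes and random data are hypothetical, chosen only for illustration:

```python
import torch

# Hypothetical two-layer model and a batch of samples.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
)
x = torch.randn(32, 8)   # a batch of 32 samples
y = torch.randn(32, 1)   # their targets

predictions = model(x)                                # forward pass through all layers
loss = torch.nn.functional.mse_loss(predictions, y)  # "how wrong was the network?"
loss.backward()                                       # gradients of the loss w.r.t. every parameter

print(model[0].weight.grad.shape)  # each trainable parameter now holds its gradient
```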

When building the model, we choose an optimizer, which is responsible for the algorithm used to minimize the loss. The optimizer can be one of the common optimizers already implemented in the framework (SGD, Adam, etc.) or a custom optimizer implementing the desired algorithm. Along with the gradients, the optimizer may manage and use additional parameters when calculating the updates, such as the learning rate, the current step index (for adaptive learning rates), and momentum terms.

The optimizer represents a mathematical formula that computes the parameter updates. A simple example is the stochastic gradient descent (SGD) algorithm: V = V - (lr * grad), where V is any trainable model parameter (a weight or a bias), lr is the learning rate, and grad is the gradient of the loss with respect to that parameter:

The algorithm of SGD optimizer
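
As a quick sanity check of this formula, here is a tiny PyTorch snippet; the parameter V and the toy loss are made up for illustration:

```python
import torch

# A single hypothetical trainable parameter V.
V = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1
optimizer = torch.optim.SGD([V], lr=lr)

loss = (V ** 2).sum()  # toy loss whose gradient is 2 * V
loss.backward()
optimizer.step()       # applies V = V - (lr * grad)

print(V.detach())  # tensor([ 0.8000, -1.6000]), i.e. [1 - 0.1*2, -2 - 0.1*(-4)]
```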

So what is gradient accumulation, technically?

Gradient accumulation means running a configured number of steps without updating the model variables, accumulating the gradients over those steps, and then using the accumulated gradients to compute the variable updates.

Yes, it’s really that simple.

Running some steps without updating any of the model variables is how we logically split the batch of samples into a few mini-batches. The batch of samples used in every step is effectively a mini-batch, and all the samples of those steps combined are effectively the global batch.

By not updating the variables during those steps, we make all the mini-batches use the same model variables for calculating their gradients. This is mandatory in order to ensure that the same gradients and updates are calculated as if we were using the global batch size.

Accumulating the gradients in all of these steps results in the same sum of gradients as if we were using the global batch size.
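
We can verify this claim with a small experiment; the linear model and random data below are hypothetical, and a sum-reduced loss is used so the equivalence is exact (up to floating-point tolerance):

```python
import torch

# Hypothetical toy setup: a single linear layer and random data.
torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
x = torch.randn(20, 4)  # global batch of 20 samples
y = torch.randn(20, 1)
loss_fn = torch.nn.MSELoss(reduction="sum")

# Gradients computed over the full global batch.
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Gradients summed over 5 mini-batches of 4 samples each. PyTorch's
# .backward() adds new gradients into .grad, so not zeroing between
# mini-batches accumulates them.
model.zero_grad()
for xb, yb in zip(x.chunk(5), y.chunk(5)):
    loss_fn(model(xb), yb).backward()

print(torch.allclose(full_grad, model.weight.grad))  # True
```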

Iterating through an example

So, let’s say we are accumulating gradients over 5 steps. We want to accumulate the gradients of the first 4 steps, without updating any variable. At the fifth step, we want to use the accumulated gradients of the previous 4 steps combined with the gradients of the fifth step to compute and assign the variable updates. Let’s see it in action:

Starting at the first step, all the samples of the first mini-batch propagate through the forward and backward passes, resulting in computed gradients for each trainable model variable. We don’t want to actually update the variables, so there is no need to compute the updates at this point. What we do need is a place to store the gradients of the first step so that they are accessible in the following steps, and we will use an additional variable for each trainable model variable to hold the accumulated gradients. So, after computing the gradients of the first step, we will store them in the variables we created for the accumulated gradients.

The value of the accumulated gradients at the end of N steps: accumulated_gradients = grad_1 + grad_2 + … + grad_N

Now the second step starts, and again, all the samples of the second mini-batch propagate through all the layers of the model, computing the gradients of the second step. Just like in the previous step, we don’t want to update the variables yet, so there is no need to compute the variable updates. What’s different from the first step, though, is that instead of just storing the gradients of the second step in our variables, we are going to add them to the values already stored there, which currently hold the gradients of the first step.

Steps 3 and 4 are pretty much the same as the second step, as we are not yet updating the variables, and we are accumulating the gradients by adding them to our variables.

Then, in step 5, we do want to update the variables, as we intended to accumulate the gradients over 5 steps. After computing the gradients of the fifth step, we will add them to the accumulated gradients, resulting in the sum of all the gradients of those 5 steps.

We’ll then take this sum and pass it to the optimizer as the gradient, resulting in updates computed using all the gradients of those 5 steps, i.e., computed over all the samples in the global batch.

Taking the SGD optimizer as an example, let’s see the value of the variables after the update at the end of the fifth step, computed using the gradients of those 5 steps (N=5 in the following example):

The value of a trainable variable after N steps (using SGD): V = V - lr * (grad_1 + grad_2 + … + grad_N)
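
Putting the five steps together, here is a minimal sketch of the walkthrough in PyTorch, using explicit accumulation variables and a manual SGD update; the model, data, and loss are hypothetical stand-ins for a real training setup:

```python
import torch

# Hypothetical model, loss, and 5 mini-batches of (inputs, targets).
torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss(reduction="sum")
mini_batches = [(torch.randn(4, 4), torch.randn(4, 1)) for _ in range(5)]
lr, N = 0.01, 5

# One accumulation variable per trainable model variable: step 1 stores
# into these, steps 2..N add to them.
accumulated = [torch.zeros_like(p) for p in model.parameters()]

for inputs, targets in mini_batches:
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()  # gradients of this mini-batch
    for acc, p in zip(accumulated, model.parameters()):
        acc += p.grad                           # accumulate; no variable update yet

# Step N: apply the SGD update using the sum of the N steps' gradients.
with torch.no_grad():
    for acc, p in zip(accumulated, model.parameters()):
        p -= lr * acc  # V = V - lr * (grad_1 + grad_2 + ... + grad_N)
```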

Great! So let’s implement it!

It is possible to implement a gradient-accumulated version of any optimizer, but each optimizer has a different formula and would therefore require a different implementation. This is not optimal, as gradient accumulation is a general approach and should be optimizer-independent.
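
For reference, a common optimizer-independent pattern in PyTorch looks like the sketch below. This is a generic illustration, not the mechanism covered in this article series; it relies on the fact that .backward() adds new gradients into each parameter’s .grad field, and the model, data, and loss are hypothetical stand-ins:

```python
import torch

# Hypothetical model, loss, and data loader.
torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()  # mean reduction, the PyTorch default
dataloader = [(torch.randn(4, 4), torch.randn(4, 1)) for _ in range(10)]
accumulation_steps = 5
optimizer = torch.optim.Adam(model.parameters())  # any optimizer works here

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader, start=1):
    loss = loss_fn(model(inputs), targets)
    # With a mean-reduced loss, dividing by the number of accumulation steps
    # makes the accumulated gradient match one big global batch; with a
    # sum-reduced loss, no scaling is needed.
    (loss / accumulation_steps).backward()  # .backward() adds into .grad
    if step % accumulation_steps == 0:
        optimizer.step()       # update the variables with the accumulated gradients
        optimizer.zero_grad()  # reset the accumulators for the next N steps
```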

In another article, we cover the way in which we implemented a generic gradient accumulation mechanism and show how you can use it in your own models with any optimizer of your choice.
