AdaBelief Optimizer: fast as Adam, generalizes as well as SGD

Believe in AdaBelief

Kaustubh Mhaisekar
Towards Data Science



Introduction

Neural networks of all kinds, along with many other machine learning models, optimize their loss functions using gradient-based optimization algorithms. Several such algorithms, or optimizers, are used to train models - RMSprop, Stochastic Gradient Descent (SGD), Adaptive Moment Estimation (Adam), and many more.

There are two primary metrics to look at while determining the efficacy of an optimizer:

  1. The speed of convergence, that is, how quickly the minimum of the loss function is reached.
  2. Generalization of the model, that is, how well the model performs on new, unseen data.

Adaptive algorithms like Adam have a good convergence speed, while algorithms like SGD generalize better.

But recently, researchers from Yale introduced the novel AdaBelief optimizer (AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients), which combines many of the benefits of existing optimization methods:

We propose AdaBelief to simultaneously achieve three goals: fast convergence as in adaptive methods, good generalization as in SGD, and training stability. The intuition for AdaBelief is to adapt the stepsize according to the “belief” in the current gradient direction. Viewing the exponential moving average (EMA) of the noisy gradient as the prediction of the gradient at the next time step, if the observed gradient greatly deviates from the prediction, we distrust the current observation and take a small step; if the observed gradient is close to the prediction, we trust it and take a large step.

To understand what this means and how AdaBelief works, we first need to take a look at Adam (Adam: A Method for Stochastic Optimization), the optimizer AdaBelief is derived from.

Adam

The Adam optimizer is one of the most widely used optimizers for training all kinds of neural networks. It essentially combines the optimization techniques of momentum and RMSprop. Let me explain briefly how it works:

Adam Optimizer. Image Source: AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

Notations used here:

  1. f(θ): f is the loss function to be optimized, given the parameters (weights) θ.
  2. g_t: g is the gradient at step t.
  3. m_t: m is the exponential moving average (EMA) of g_t.
  4. v_t: v is the exponential moving average (EMA) of (g_t)².
  5. β1, β2: These are the hyperparameters used in the moving averages of g_t and (g_t)², most commonly set to 0.9 and 0.999 respectively.
  6. α: The learning rate.
  7. ε: A very small number, added to avoid division by zero in the denominator.

So what’s happening here is that we have a loss function f(θ) that is to be minimized by finding the values of θ at which it reaches its minimum. To do that, we use gradient descent: we compute the gradients of the function with respect to the weights and keep subtracting them (scaled by the learning rate) from the weights until we reach the minimum.

To make this descent faster, we combine two optimization techniques:

  1. We compute the EMA of the gradient, m_t, and use it in the numerator of the update direction. If m_t has a large value, the descent has been heading consistently in the same direction, so we take bigger steps. Similarly, if the value of m_t is small, the descent is probably not heading steadily towards the minimum, so we take smaller steps. This is the momentum part of the optimizer.
  2. We compute the EMA of the squared gradient, v_t, and use it in the denominator of the update direction. Since we are squaring the gradients here, suppose equally sized gradient updates keep alternating in opposite directions: m_t would be close to 0, as the positive and negative values cancel out, but v_t would still be large. Since we are not heading towards the minimum in this case, we do not want to take large steps in this direction. Keeping v_t in the denominator of the update direction achieves exactly that: dividing by a large value makes the update steps smaller, and when v_t is small the steps get bigger. This is the RMSprop part of the optimizer.

In Adam, we simply combine these two techniques to form an optimizer that captures the characteristics of both algorithms, giving this update direction:

Update direction in Adam

Note: m_t and v_t here are used after bias correction, which gives better estimates early in training.

The addition of ε is to prevent the denominator from being equal to 0.

Also, m_t is known as the first moment and v_t as the second moment, hence the name “Adaptive Moment Estimation”.
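
To make the update rule concrete, here is a minimal NumPy sketch of a single Adam step for one parameter vector, following the formulas above. The function name and the toy quadratic loss in the usage example are illustrative choices of mine, not from the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: EMAs of the gradient and the squared gradient, bias-corrected."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum part)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (RMSprop part)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([1.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # close to 0
```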

Now that you know how Adam works, let’s look at AdaBelief.

AdaBelief

AdaBelief Optimizer. Image Source: AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

As you can see, the AdaBelief optimizer is extremely similar to the Adam optimizer, with one slight difference: instead of using v_t, the EMA of the squared gradient, it uses a new quantity s_t:

s_t

And this s_t replaces v_t to form this update direction:

Update direction in AdaBelief

Let us now see what difference this one change makes and how it affects the performance of the optimizer.

s_t is defined as the EMA of (g_t - m_t)², that is, the squared difference between the gradient and the EMA of the gradient (m_t). This means that AdaBelief takes a large step when the observed gradient is close to its EMA, and a small step when the two deviate significantly.
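
The change relative to Adam is small enough to show directly in code. Below is a minimal sketch of one AdaBelief step that mirrors the Adam sketch above but tracks s_t, the EMA of (g_t - m_t)², instead of v_t. The function name is again just an illustrative choice, and the paper's full algorithm includes a few extra details not shown here.

```python
import numpy as np

def adabelief_step(theta, grad, m, s, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief update: only the second-moment term differs from Adam."""
    m = beta1 * m + (1 - beta1) * grad               # EMA of the gradient, as in Adam
    s = beta2 * s + (1 - beta2) * (grad - m) ** 2    # EMA of the squared deviation from the EMA
    m_hat = m / (1 - beta1 ** t)                     # bias correction
    s_hat = s / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s
```

When the observed gradient stays close to its EMA, (g_t - m_t)² is tiny, s_t shrinks, and the effective step grows; when the gradient deviates strongly from the EMA, s_t grows and the step shrinks.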

Let’s look at this graph here to better understand AdaBelief’s advantage over Adam -

Understanding AdaBelief. Image Source: AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

In the given graph, look at region 3:

In region 3, the value of g_t is large because the curve is very steep in that area, and because the curvature is small it stays large over consecutive steps. The value of v_t will therefore be large as well, so if we used Adam here, the step size in this region would be very small, since v_t sits in the denominator.

But in AdaBelief, we calculate s_t as the EMA of the squared difference between the gradient and its own EMA. Since g_t and m_t are very close to each other in this region, s_t will be very small. With s_t in the denominator, AdaBelief ends up taking large steps here, just as an ideal optimizer should.

We see that AdaBelief can handle the “large gradient, small curvature” case, while Adam cannot.
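
A quick numeric check makes this concrete. The gradient values below are made up to imitate a “large gradient, small curvature” region: the gradients are large but barely change from step to step, so v_t stays large while s_t stays small.

```python
import numpy as np

beta1, beta2, eps, alpha = 0.9, 0.999, 1e-8, 1e-3
rng = np.random.default_rng(0)

# Made-up gradient stream: values near 10 that barely change (large gradient, small curvature).
m = v = s = 0.0
n_steps = 2000
for t in range(1, n_steps + 1):
    g = 10.0 + 0.1 * rng.standard_normal()
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2         # Adam: EMA of g_t^2
    s = beta2 * s + (1 - beta2) * (g - m) ** 2   # AdaBelief: EMA of (g_t - m_t)^2

m_hat = m / (1 - beta1 ** n_steps)
v_hat = v / (1 - beta2 ** n_steps)
s_hat = s / (1 - beta2 ** n_steps)

print("Adam step size:     ", alpha * m_hat / (np.sqrt(v_hat) + eps))  # about alpha
print("AdaBelief step size:", alpha * m_hat / (np.sqrt(s_hat) + eps))  # many times larger
```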

Also note that regions 1 and 2 of the graph demonstrate the advantage of AdaBelief and Adam over optimizers such as momentum or plain SGD in the following way:

  1. Region 1: The curve is very flat and the gradient is almost 0, so we would ideally like to take large steps here. With momentum or SGD, where the update is the learning rate multiplied by the (moving average of the) gradient, the steps would be tiny. In Adam and AdaBelief, the small numerator is divided by an equally small denominator, so reasonably large steps are still taken.
  2. Region 2: The curve is very steep and the gradient is large, so we would ideally like to take small steps here. With momentum or SGD, multiplying by the large gradient gives large update steps. In Adam and AdaBelief, the large numerator is divided by an equally large denominator, so the steps stay small. A small numeric sketch of both regions follows this list.
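
Here is a tiny numeric comparison, with made-up steady-state gradient values for a flat and a steep region. The denominator uses Adam's v_t; the article's point is that AdaBelief behaves the same way as Adam in these two regions.

```python
import numpy as np

alpha, eps = 1e-3, 1e-8

# Made-up steady-state values: if the gradient is roughly constant at g,
# the EMAs settle at m ~ g and v ~ g^2.
for name, g in [("region 1 (flat,  g ~ 0.001)", 0.001),
                ("region 2 (steep, g ~ 100)  ", 100.0)]:
    m, v = g, g ** 2
    sgd_step = alpha * m                             # momentum/SGD: scales with the gradient
    adaptive_step = alpha * m / (np.sqrt(v) + eps)   # Adam-style: normalized by sqrt(v)
    print(f"{name}: SGD step = {sgd_step:.1e}, adaptive step = {adaptive_step:.1e}")
```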

Let us gain some more intuition from a 2D example. Consider the loss function f(x, y) = |x| + |y|:

f(x,y) = |x| + |y|. Image Source: AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

Here the blue arrows represent the gradients and the x on the right side is the optimal point. As you can see, the gradient in the x-direction is always 1, while in the y-direction it keeps oscillating between 1 and -1.

So in Adam, v_t for the x- and y-directions will always be equal to 1, since it depends only on the magnitude of the gradient and not its sign. Hence Adam takes steps of the same size in both the x- and y-directions.

But in AdaBelief, both the magnitude and the sign of the gradient are considered. In the y-direction, where the gradient keeps flipping sign, s_t will be close to 1, while in the x-direction, where the gradient is constant, s_t will be close to 0. AdaBelief therefore takes much larger steps in the x-direction than in the y-direction.
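
This is easy to verify numerically. The sketch below feeds the gradients of f(x, y) = |x| + |y| away from the axes (constant +1 in the x-direction, alternating ±1 in the y-direction) into the EMAs and compares v_t with s_t for each direction.

```python
import numpy as np

beta1, beta2 = 0.9, 0.999
m = np.zeros(2)   # EMA of the gradient, [x, y]
v = np.zeros(2)   # Adam: EMA of g^2
s = np.zeros(2)   # AdaBelief: EMA of (g - m)^2

for t in range(1, 5001):
    g = np.array([1.0, 1.0 if t % 2 == 0 else -1.0])  # gradient of |x| + |y| off the axes
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    s = beta2 * s + (1 - beta2) * (g - m) ** 2

print("v (Adam):     ", v)  # roughly [1, 1]  -> same step size in x and y
print("s (AdaBelief):", s)  # x entry close to 0, y entry close to 1 -> much larger steps along x
```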

Here are some video examples created by the authors of the original paper to demonstrate the performance of AdaBelief - AdaBelief Optimizer, Toy examples


Here are some experimental results comparing the performance of AdaBelief with other optimizers on different neural networks like CNNs, LSTMs, and GANs presented by the authors of the original paper:

1. Image Classification:

Image Classification. Image Source: AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

2. Time Series Modeling:

Time Series Modeling. Image Source: AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

3. Generative Adversarial Network:

Generative Adversarial Network. Image Source: AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

Conclusion

  1. AdaBelief is derived from Adam and introduces no extra hyperparameters; it only replaces the second-moment term v_t with s_t, the EMA of (g_t - m_t)².
  2. It delivers both fast convergence and good generalization.
  3. It adapts its step size according to its “belief” in the current gradient direction.
  4. It performs well in “large gradient, small curvature” scenarios because it considers both the magnitude and the sign of the gradients.

I hope you understood and enjoyed the concepts explained in this post. Please feel free to reach out with any questions or doubts.

Thanks for reading!
