Complete Guide to Adam Optimization
The Adam optimization algorithm from definition to implementation
In the 1940s, mathematical programming was synonymous with optimization. An optimization problem included an objective function that is to be maximized or minimized by choosing input values from an allowed set of values [1].
Nowadays, optimization is a familiar term in AI, especially in Deep Learning, and one of the most widely recommended optimization algorithms for Deep Learning problems is Adam.
Disclaimer: a basic understanding of neural network optimization, such as Gradient Descent and Stochastic Gradient Descent, is recommended before reading.
In this post, I will highlight the following points:
- Definition of Adam Optimization
- The Road to Adam
- The Adam Algorithm for Stochastic Optimization
- Visual Comparison Between Adam and Other Optimizers
- Implementation
- Advantages and Disadvantages of Adam
- Conclusion and Further Reading
- References
1. Definition of Adam Optimization
The Adam algorithm was first introduced in the paper Adam: A Method for Stochastic Optimization [2] by Diederik P. Kingma and Jimmy Ba. Adam is defined as “a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement” [2]. Okay, let’s break down this definition into two parts.
First, stochastic optimization is the process of optimizing an objective function in the presence of randomness. To understand this better, let’s think of Stochastic Gradient Descent (SGD). SGD is a great optimizer when we have a lot of data and parameters, because at each step it calculates an estimate of the gradient from a random subset of the data (a mini-batch), unlike Gradient Descent, which considers the entire dataset at each step.
Second, Adam only requires first-order gradients. Meaning, Adam only needs the first derivative of the objective function with respect to the parameters, not expensive second-order derivatives.
Now, the name Adam is derived from adaptive moment estimation. This will become apparent as we go through the algorithm.
2. The Road to Adam
Adam builds upon and combines the advantages of previous algorithms. To understand the Adam algorithm we need to have a quick background on those previous algorithms.
I. SGD with Momentum
Momentum in physics is the quantity of motion of a moving object, such as a ball accelerating down a slope. So, SGD with Momentum [3] incorporates the gradients from previous update steps to speed up gradient descent. This is done by taking small but consistent steps in the relevant direction.
SGD with Momentum is achieved by computing a moving average of the gradient (also known as an exponentially weighted average), and then using it to update the parameters “θ” (weights, biases).
- The term Beta (𝛽) controls the moving average. The value of Beta lies in [0,1); a common value is 𝛽 = 0.9, meaning we are averaging over roughly the last 10 iterations’ gradients (about 1/(1−𝛽)), while older gradients are gradually forgotten. So, a larger value of Beta (say 𝛽 = 0.98) means we are averaging over more gradients (about 50).
- Alpha (α) is the learning rate which determines the step size at each iteration.
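The update above can be sketched in a few lines of Python. This follows the exponentially weighted average form described here (some formulations omit the (1 − 𝛽) factor), and the toy quadratic objective is my own illustration:

```python
def momentum_step(theta, grad, velocity, alpha=0.1, beta=0.9):
    """One SGD-with-Momentum update: velocity is the
    exponentially weighted average of past gradients."""
    velocity = beta * velocity + (1 - beta) * grad
    return theta - alpha * velocity, velocity

# Toy example: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, velocity = 5.0, 0.0
for _ in range(200):
    theta, velocity = momentum_step(theta, 2 * theta, velocity)
# theta is now very close to the minimum at 0
```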
II. Related Work (AdaGrad and RMSProp)
Alright, there are two algorithms to know about before we get to Adam. AdaGrad (adaptive gradient algorithm)[4] and RMSProp (root mean square propagation)[5] are both extensions of SGD. The two algorithms share some similarities with Adam. In fact, Adam combines the advantages of the two algorithms.
III. Adaptive Learning Rate
Both AdaGrad and RMSProp are also adaptive gradient descent algorithms. Meaning, the learning rate (α) is adapted for each one of the parameters (w, b). In short, the learning rate is maintained per parameter.
To illustrate this better, here is an explanation of AdaGrad and RMSProp:
- AdaGrad
AdaGrad’s per-parameter learning rate helps increase the learning rate for sparser parameters. Thus, AdaGrad works well for sparse gradients, such as in natural language processing and image recognition applications [4].
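A minimal sketch of AdaGrad’s per-parameter scaling; the two-parameter dense/sparse toy setup is my own illustration:

```python
import numpy as np

def adagrad_step(theta, grad, accum, alpha=0.1, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, then scale
    each parameter's learning rate by 1 / sqrt(accumulated sum)."""
    accum = accum + grad ** 2
    return theta - alpha * grad / (np.sqrt(accum) + eps), accum

# Two parameters: theta[0] gets a gradient every step (dense),
# theta[1] only every 10th step (sparse).
theta, accum = np.array([1.0, 1.0]), np.zeros(2)
for t in range(100):
    grad = np.array([2 * theta[0], 2 * theta[1] if t % 10 == 0 else 0.0])
    theta, accum = adagrad_step(theta, grad, accum)
# The sparse parameter takes relatively larger steps on the
# iterations where it does receive a gradient.
```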
- RMSProp
RMSProp was introduced by Tieleman and Hinton to speed up mini-batch learning. In RMSProp, the learning rate adapts based on the moving average of the magnitudes of the recent gradients.
Meaning, RMSProp maintains a moving average of the squares of the recent gradients, denoted by (v). Thus, giving more weight to recent gradients.
Here, the term Beta (𝛽) is introduced as the forgetting factor (just like in SGD with Momentum).
In short, when updating θ (say w or b), divide the current gradient by the square root of the moving average of the squares of recent gradients for that parameter θ, multiply the result by α, and subtract it from the previous value of θ.
Also, RMSProp works well on big and redundant datasets (e.g. noisy data)[5].
* The term (𝜖) is used for numerical stability (avoid division by zero).
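In code, the RMSProp update described above looks roughly like this (the toy quadratic objective is my own illustration):

```python
import numpy as np

def rmsprop_step(theta, grad, v, alpha=0.05, beta=0.9, eps=1e-8):
    """One RMSProp update: v is the moving average of the squares
    of recent gradients (beta is the forgetting factor)."""
    v = beta * v + (1 - beta) * grad ** 2
    return theta - alpha * grad / (np.sqrt(v) + eps), v

# Toy example: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, v = 5.0, 0.0
for _ in range(500):
    theta, v = rmsprop_step(theta, 2 * theta, v)
# theta ends up near the minimum at 0
```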
Here’s a visual comparison of what we learned so far:
In the gif above, you can see Momentum exploring around before finding the correct path. SGD, AdaGrad, and RMSProp all take a similar path, but AdaGrad and RMSProp are clearly faster.
3. The Adam Algorithm for Stochastic Optimization
Okay, now we’ve got all the pieces we need to get to the algorithm.
As explained by Andrew Ng, Adam (adaptive moment estimation) is simply a combination of Momentum and RMSProp.
Here’s the algorithm to optimize an objective function f(θ), with parameters θ (weights and biases).
Adam includes the hyperparameters: α, 𝛽1 (from Momentum), and 𝛽2 (from RMSProp). The paper suggests the default values α = 0.001, 𝛽1 = 0.9, 𝛽2 = 0.999, and 𝜖 = 10⁻⁸ [2].
Initialize:
- m = 0, this is the first moment vector, treated as in Momentum
- v = 0, this is the second moment vector, treated as in RMSProp
- t = 0
On iteration t:
- Update t, t := t + 1
- Get the gradients / derivatives (g) of the objective with respect to the parameters θ (here g corresponds to dw and db respectively)
- Update the first moment mt
- Update the second moment vt
- Compute the bias-corrected mt (bias correction gives a better estimate of the moving averages, especially in the first iterations, since m and v are initialized at zero)
- Compute the bias-corrected vt
- Update the parameters θ
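Written out, the update rules from the paper [2] are:

```latex
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
\hat{m}_t = m_t / (1 - \beta_1^t)
\hat{v}_t = v_t / (1 - \beta_2^t)
\theta_t = \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
```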
And that’s it! The loop will continue until Adam converges to a solution.
4. Visual Comparison Between Optimizers
A better way to recognize the differences between the previously mentioned optimization algorithms is to see a visual comparison of their performance.
The figure above is from the Adam paper. It showcases the training cost over 45 epochs, and you can see Adam converging faster than AdaGrad for CNNs. Perhaps it’s good to mention that AdaGrad corresponds to a version of Adam with the hyperparameters (α, 𝛽1, 𝛽2) set to specific values [2]. I decided to remove AdaGrad’s math explanation from this post to avoid confusion, but here is a simple explanation by mxnet if you want to learn more on that.
In the gif above, you can see Adam and RMSProp converging at a similar speed, while AdaGrad seems to be struggling to converge.
Meanwhile, in this gif, you can see Adam and SGD with Momentum converging to a solution, while SGD, AdaGrad, and RMSProp seem to be stuck in a local minimum.
5. Implementation
Here I’ll show three different ways to incorporate Adam into your model, with TensorFlow, PyTorch, and NumPy implementations.
- Implementation with just NumPy:
This implementation may not be as practical, but it will give you a much better understanding of the Adam algorithm.
But as you can guess, the code is quite long, so for better viewing, here’s the gist.
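The gist itself isn’t reproduced here, but a minimal sketch of such a NumPy implementation, following Algorithm 1 of the paper [2] (the toy quadratic objective is my own illustration), looks like this:

```python
import numpy as np

class Adam:
    """Minimal Adam optimizer following Algorithm 1 of Kingma & Ba [2]."""

    def __init__(self, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.alpha, self.beta1, self.beta2, self.eps = alpha, beta1, beta2, eps
        self.m = 0.0  # first moment vector (as in Momentum)
        self.v = 0.0  # second moment vector (as in RMSProp)
        self.t = 0    # timestep

    def update(self, theta, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)  # bias correction
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return theta - self.alpha * m_hat / (np.sqrt(v_hat) + self.eps)

# Toy example: minimize f(theta) = theta^2, whose gradient is 2 * theta.
opt = Adam(alpha=0.1)
theta = np.array([5.0, -3.0])
for _ in range(500):
    theta = opt.update(theta, 2 * theta)
# both parameters end up near the minimum at 0
```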
6. Advantages and Disadvantages of Adam
Adam is one of the best optimizers compared to other algorithms, but it is not perfect either. So, here are some advantages and disadvantages of Adam.
Advantages:
- Can handle sparse gradients on noisy datasets.
- Default hyperparameter values do well on most problems.
- Computationally efficient.
- Requires little memory, thus memory efficient.
- Works well on large datasets.
Disadvantages:
- Adam does not converge to an optimal solution in some cases (this is the motivation for AMSGrad).
- Adam can suffer from a weight decay problem (which is addressed in AdamW).
- Recent optimization algorithms have been shown to be faster and better [6].
7. Conclusion and Further Reading
That is all for Adam: adaptive moment estimation!
Adam is an extension of SGD that combines the advantages of AdaGrad and RMSProp. Adam is also an adaptive gradient descent algorithm: it maintains a learning rate per parameter, and it keeps track of moving averages of the first and second moments of the gradient. Using these moments, Adam computes parameter updates whose magnitudes are invariant to rescaling of the gradient [2]. Finally, although newer optimization algorithms have emerged, Adam (and SGD) is still a stable optimizer to use.
Great resources for further reading (and watching):
- Definition of Gradients by Prof. Denis Auroux (MIT)
- Stochastic Gradient Descent with momentum by Vitaly Bushaev
- Gentle Introduction to the Adam Optimization Algorithm for Deep Learning by Jason Brownlee
- Deep Learning Specialization by Andrew Ng (deeplearning.ai), also available on YouTube
- Adam — latest trends in deep learning optimization by Vitaly Bushaev
8. References
[1] Stephen J. Wright, Optimization (2016), Encyclopædia Britannica
[2] Diederik P. Kingma and Jimmy Ba, Adam: A Method for Stochastic Optimization (2015), arXiv
[3] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, Learning Internal Representations by Error Propagation (1986), ACM
[4] John Duchi et al., Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (2011), Stanford
[5] Geoffrey Hinton with Nitish Srivastava and Kevin Swersky, Neural Networks for Machine Learning (Lecture 6) (2012), UToronto and Coursera
[6] John Chen, An Updated Overview of Recent Gradient Descent Algorithms (2020), GitHub
[7] kuroitu S, Comparison of Optimization Methods (2020), Qiita