How to implement an Adam Optimizer from Scratch

Enoch Kan
The ML Practitioner
3 min read · Nov 6, 2020


It’s not as hard as you think!

TL;DR: if you want to skip the tutorial, here is the notebook I created.

Adam is an algorithm that optimizes stochastic objective functions based on adaptive estimates of lower-order moments. Its update rule is a combination of momentum and the RMSProp optimizer.

The rules are simple: code Adam from scratch without the help of any external ML libraries such as PyTorch, Keras, Chainer or TensorFlow. The only libraries we are allowed to use are numpy and math.

(っ^‿^)っ(っ^‿^)っ(っ^‿^)っ(っ^‿^)っ(っ^‿^)っ(っ^‿^)っ

Step 1: Understand how Adam works

The easiest way to learn how Adam works is to watch Andrew Ng’s video. Alternatively, you can read the original Adam paper to get a better understanding of the motivation and intuition behind it.

Two values that Adam depends on are β₁ and β₂. β₁ is the exponential decay rate for the first-moment estimates, and its literature value is 0.9. β₂ is the exponential decay rate for the second-moment estimates, and its literature value is 0.999. Both literature values work well with most datasets.
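As defined in the original paper, these two exponential moving averages of the gradient and the squared gradient are:

mₜ = β₁ · mₜ₋₁ + (1 − β₁) · gₜ
vₜ = β₂ · vₜ₋₁ + (1 − β₂) · gₜ²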

Calculation of the moving averages using parameters β₁ and β₂

On a given iteration t, we can calculate the moving averages from the parameters β₁ and β₂ and the gradient gₜ. Because the moving averages are initialized at zero, their estimates are biased toward zero during the first iterations (a problem shared by other moving-average methods such as momentum SGD and RMSProp), so we need an extra step to correct the bias. This is known as the bias correction step:
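From the paper, with β₁ᵗ and β₂ᵗ denoting the decay rates raised to the power t:

m̂ₜ = mₜ / (1 − β₁ᵗ)
v̂ₜ = vₜ / (1 − β₂ᵗ)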

Bias correction of the moving averages

Finally, we can update the parameters (weights and biases) based on the calculated moving averages with a step size η:
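Written out for a single weight w (a bias b is updated in exactly the same way):

wₜ = wₜ₋₁ − η · m̂ₜ / (√v̂ₜ + ε)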

Step 2: Implement Adam in Python

To summarize, we need to define several variables: the 1st-order exponential decay rate β₁, the 2nd-order exponential decay rate β₂, the step size η and a small value ε to prevent division by zero. Additionally, we define m_dw, v_dw, m_db and v_db as the first-moment (mean) and second-moment (uncentered variance) estimates from the previous time step for the gradients of the weights and biases, dw and db.

Recall that Adam relies on two important moment estimates: a 1st-order estimate of the mean and a 2nd-order estimate of the uncentered variance of the gradients. Using these moment estimates we can update the weights and biases given an appropriate step size.
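The full implementation lives in the notebook linked above; as a minimal sketch (the class name AdamOptim, the method update and the default hyperparameters below are just one reasonable choice, not the only way to write it), the optimizer could look like this:

import numpy as np

class AdamOptim:
    def __init__(self, eta=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8):
        # moment estimates for the weight and bias gradients, initialized at zero
        self.m_dw, self.v_dw = 0, 0
        self.m_db, self.v_db = 0, 0
        self.beta1, self.beta2 = beta1, beta2
        self.epsilon = epsilon
        self.eta = eta

    def update(self, t, w, b, dw, db):
        # 1st-order moment estimates (mean of the gradients)
        self.m_dw = self.beta1 * self.m_dw + (1 - self.beta1) * dw
        self.m_db = self.beta1 * self.m_db + (1 - self.beta1) * db
        # 2nd-order moment estimates (uncentered variance of the gradients)
        self.v_dw = self.beta2 * self.v_dw + (1 - self.beta2) * dw**2
        self.v_db = self.beta2 * self.v_db + (1 - self.beta2) * db**2
        # bias correction (t starts at 1)
        m_dw_corr = self.m_dw / (1 - self.beta1**t)
        m_db_corr = self.m_db / (1 - self.beta1**t)
        v_dw_corr = self.v_dw / (1 - self.beta2**t)
        v_db_corr = self.v_db / (1 - self.beta2**t)
        # parameter update with step size eta
        w = w - self.eta * m_dw_corr / (np.sqrt(v_dw_corr) + self.epsilon)
        b = b - self.eta * m_db_corr / (np.sqrt(v_db_corr) + self.epsilon)
        return w, b

Keeping the weight and bias moments separate mirrors the m_dw/v_dw and m_db/v_db variables described above.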

Step 3: Testing the implementation

To test our implementation, we will first need to define a loss function and its respective gradient function. A gradient function can be obtained by simply taking the derivative of the loss function. For example:
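The exact loss from the notebook is not reproduced here, so as an illustrative stand-in take a simple one-dimensional quadratic whose minimum sits at 1:

def loss_function(m):
    # toy convex loss: (m - 1)^2, minimized at m = 1
    return m**2 - 2*m + 1

def grad_function(m):
    # derivative of the loss above
    return 2*m - 2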

Note that we also define an additional function to check for convergence, based on the fact that the weights stop changing once convergence is reached. Finally, we can iteratively update the weights and biases using our constructed Adam optimizer and see if they converge:
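A sketch of that loop, reusing the AdamOptim class and the toy gradient from above (the starting values, tolerance and iteration cap are illustrative choices):

def check_convergence(w_old, w_new, tol=1e-6):
    # the weights have effectively stopped changing
    return abs(w_old - w_new) < tol

w, b = 0.0, 0.0            # arbitrary starting point
adam = AdamOptim(eta=0.01)
t, converged = 1, False

while not converged and t <= 10_000:
    dw, db = grad_function(w), grad_function(b)
    w_old, b_old = w, b
    w, b = adam.update(t, w=w, b=b, dw=dw, db=db)
    if check_convergence(w_old, w) and check_convergence(b_old, b):
        converged = True
        print(f'Converged after {t} iterations: w = {w:.4f}, b = {b:.4f}')
    t += 1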

Looking at the results, convergence is reached in under 750 iterations. Great success!

Feel free to check out my other stories and GitHub projects. Have a good day!




Enoch Kan
The ML Practitioner

ML Lead @ Kognitiv, Founder @ Kortex Labs, The ML Practitioner 🇬🇧 🇺🇸 🇭🇰