
Flux.jl on MNIST — What about ADAM?

So far we’ve seen a performance analysis using the standard gradient descent optimizer. But what results do we get if we use a more sophisticated one like ADAM?

Roland Schätzle · Towards Data Science · Jul 11, 2022

In Flux.jl on MNIST — Variations of a theme, I presented three neural networks for recognizing handwritten digits as well as three variations of the gradient descent algorithm (GD) for training these networks.

The follow-up article Flux.jl on MNIST — A performance analysis showed how well each of these models performs using different parametrizations and how much effort it takes to train them, depending on the GD variant applied. In that analysis, the GD algorithm used the standard learning formula with a constant learning rate.

Now we will perform this analysis using an advanced learning formula (in Flux called an “optimizer”) and compare the results to the performance achieved with the standard formula. The optimizer we will use is called ADAM (ADAptive Moment estimation), published in 2015 by Diederik P. Kingma and Jimmy Ba. It adapts the learning rate on each iteration and is currently one of the most sophisticated approaches in this field.
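In Flux, switching the optimizer is essentially a one-line change. The following sketch illustrates the idea with a hypothetical small model standing in for the 4LS/3LS/2LR models defined in the previous articles (model layout, learning rates and dummy data are illustrative only; the API shown is the implicit-parameter style of Flux 0.13):

```julia
using Flux

# Hypothetical stand-in model; the actual 4LS/3LS/2LR models are defined
# in the previous articles of this series.
model = Chain(Dense(784, 32, sigmoid), Dense(32, 10), softmax)
loss(x, y) = Flux.crossentropy(model(x), y)

# Dummy batch of MNIST-shaped data, just to make the example runnable.
x_batch = rand(Float32, 784, 64)
y_batch = Flux.onehotbatch(rand(0:9, 64), 0:9)

opt_descent = Descent(0.1)   # standard GD: constant learning rate
opt_adam    = ADAM(0.001)    # ADAM: adaptive, per-parameter learning rates

# One training pass over the data with either optimizer:
ps = Flux.params(model)
Flux.train!(loss, ps, [(x_batch, y_batch)], opt_adam)
```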

As in the previous analysis, we will apply all three GD variants for training: batch GD, stochastic GD, and mini-batch GD.
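The three variants differ only in how much of the training data each parameter update sees. A minimal sketch of that distinction, using Flux’s DataLoader and dummy data standing in for the MNIST training set (which the original articles load via MLDatasets.jl):

```julia
using Flux

# Dummy stand-in for the MNIST training data (60,000 images of 28x28 = 784 pixels).
X = rand(Float32, 784, 60_000)
Y = Flux.onehotbatch(rand(0:9, 60_000), 0:9)

# Batch GD: every update uses the full training set.
batch_data      = [(X, Y)]
# Stochastic GD: every update uses a single (randomly chosen) example.
stochastic_data = Flux.DataLoader((X, Y); batchsize = 1, shuffle = true)
# Mini-batch GD: every update uses a small random subset, e.g. 128 examples.
minibatch_data  = Flux.DataLoader((X, Y); batchsize = 128, shuffle = true)
```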

Batch Gradient Descent

4LS-Model

Applied to the 4LS-model using batch GD, ADAM leads to a dramatic improvement: with only 500 iterations we get an accuracy of 74.16%, increasing to over 91% with more iterations. In contrast, the standard Descent optimizer got stuck at 11.35%.

Batch GD — 4LS [image by author]

The learning curves illustrate this behaviour: on the left we have the standard optimizer up to iteration 500, and on the right ADAM with 1,000 iterations. ADAM reduces the loss very fast to a value below 0.1, then stagnates a bit until about iteration 300, where it starts to decrease again. Descent is also quite fast until below 0.1 (please note the different scaling on the y-axis) and then stagnates. Moreover, the decrease in loss with Descent doesn’t translate into an increase in accuracy (on the test dataset).

iterations 1–500 (Descent) and 1–1,000 (ADAM) [image by author]
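For context, the loss and accuracy values behind such learning curves can be recorded with a simple training loop. A minimal sketch, building on the model, loss, data and optimizer from the sketches above (with dummy test data; the original articles use the actual MNIST test set):

```julia
using Flux, Statistics

# Accuracy on a labelled dataset: fraction of correctly predicted classes.
accuracy(m, x, y) = mean(Flux.onecold(m(x)) .== Flux.onecold(y))

# Dummy test data, standing in for the MNIST test set.
X_test = rand(Float32, 784, 10_000)
Y_test = Flux.onehotbatch(rand(0:9, 10_000), 0:9)

losses, accuracies = Float32[], Float32[]
ps = Flux.params(model)
for i in 1:1_000                                   # batch GD: full data per iteration
    gs = Flux.gradient(() -> loss(X, Y), ps)
    Flux.update!(opt_adam, ps, gs)
    push!(losses, loss(X, Y))                      # training loss per iteration
    push!(accuracies, accuracy(model, X_test, Y_test))  # accuracy on the test set
end
```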

3LS-Model

We get a similar result for the 3LS-model. Here, too, ADAM improves the results significantly: at 500 iterations we get an accuracy of 95.66%, which can be slightly improved to over 96% at 1,000 iterations. The standard optimizer, on the other hand, only produced accuracies below 20%.

Batch GD — 3LS [image by author]

The learning curves show this too: both optimizers decrease the loss quite fast to a value below 0.1 (note again the different scaling on the y-axis), with ADAM being even faster than Descent and then decreasing the loss significantly further.

Compared to the 4LS-model, ADAM stagnates here much sooner, but at that point it has already achieved a better accuracy than on 4LS.

iterations 1–500 (Descent) and 1–1,000 (ADAM) [image by author]

2LR-Model

On the 2LR-model we already got good results with the standard optimizer, and we reach the same level of accuracy (above 96%) with ADAM. But ADAM gets there already at iteration 500 (then decreasing slightly), whereas the standard optimizer needs 4,000 iterations to reach that level.

Batch GD — 2LR [image by author]

Apart from the faster decrease with ADAM at the beginning, the learning curves are quite similar in this case:

iterations 1–1,000 (Descent and ADAM) [image by author]

Stochastic Gradient Descent

With stochastic GD, the standard optimizer delivered quite good results on the 4LS- and 3LS-models. So let’s see if ADAM can do any better.

4LS-Model

The learning curves for the 4LS-model have the same basic shape for both optimizers, but ADAM shows stronger oscillations.

iterations 1–1,024,000 (Descent and ADAM) [image by author]

ADAM gives us higher accuracies with fewer iterations, but in the end we don’t quite get the quality we received from Descent (94.16% with Descent vs. 92.12% with ADAM at 2 million iterations).

Stochastic GD — 4LS [image by author]

3LS-Model

With the 3LS-model we get almost the same situation: basically similar curves, where ADAM converges faster to lower loss values at the beginning (and oscillates much more along the way):

iterations 1–1,024,000 (Descent and ADAM) [image by author]

This is reflected in the accuracies we can achieve. And again, ADAM doesn’t quite get to the level we reached with the standard optimizer (93.73% with ADAM vs. 97.64% with Descent at 2 million iterations):

Stochastic GD — 3LS [image by author]

2LR-Model

Training the 2LR-model with the standard optimizer was a failure: we never got an accuracy above 20%. ADAM is a real improvement in this case, achieving a bit more than 87% accuracy at about 0.5 million iterations (and then getting worse again). But we don’t get into the region above 95%, as we did with other configurations.

Stochastic GD — 2LR [image by author]

The learning curve using ADAM shows heavy oscillations, and it doesn’t really fit the numbers in the table above, since the loss seems to increase the more iterations we do.

iterations 1–1,024,000 (Descent and ADAM) [image by author]

A 100-iteration moving average over the first 128,000 iterations gives a better picture of what happens in that part: at least over the first 40,000 iterations the loss decreases significantly. But this still doesn’t explain why the accuracy improves in the range up to about 0.5 million iterations.

Moving 100-average — iterations 1–128,000 (ADAM) [image by author]
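Such a smoothed curve can be obtained with a simple moving average. A minimal sketch in Julia (the `losses` vector is a placeholder here, standing in for the recorded per-iteration loss values):

```julia
using Statistics

# Placeholder for the recorded per-iteration loss values.
losses = rand(Float32, 128_000)

# Simple moving average over a window of `w` values.
moving_average(v, w) = [mean(@view v[i-w+1:i]) for i in w:length(v)]

smoothed = moving_average(losses, 100)   # window of 100, as in the plot above
```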

Mini-batch Gradient Descent

The mini-batch GD achieved very good results on all three models, so there isn’t much room left for improvement with ADAM. Let’s have a look at the differences.

4LS-Model

As seen above with other configurations, ADAM converges faster at the beginning (92.46% accuracy at 500 iterations) but doesn’t quite reach the level we had with the standard optimizer (94.49% with Descent vs. 93.28% with ADAM at 2 million iterations):

Mini-batch GD — 4LS [image by author]

The learning curves reflect these findings (and again, ADAM leads to more oscillation):

iterations 4,001–512,000 (Descent) and 1–512,000 (ADAM) [image by author]

3LS-Model

The same holds for the 3LS-model. ADAM achieves an accuracy of 94.47% already at 500 iterations, but is slightly worse in the end at 2 million iterations (97.68% with Descent vs. 96.17% with ADAM).

Mini-batch GD — 3LS [image by author]

This is reflected analogously in the learning curves:

iterations 4,001–512,000 (Descent) and 1–512,000 (ADAM) [image by author]

2LR-Model

With the 2LR-model the situation is a bit different: here, ADAM doesn’t even converge faster at the beginning. And again, it doesn’t reach the same level of accuracy as the standard optimizer (97.0% with Descent vs. 91.33% with ADAM).

Mini-batch GD — 2LR [image by author]

The learning curve shows here (as with stochastic GD) a significant oscillation:

iterations 4,001–512,000 (Descent) and 1–512,000 (ADAM) [image by author]

Conclusions

Descent vs. ADAM

The following diagram gives an overview of the best results achieved with the standard optimizer (left) and ADAM (right). It shows accuracy in relation to training time. Here we can see that ADAM does generally well: almost all of its results are in the upper left corner, i.e. a high accuracy has been achieved with relatively little effort. The standard optimizer delivered a broader range of results, but in the end it produced some of the best ones.

Accuracy vs. training-time (Descent & ADAM) [image by author]

Top Performers

The next diagram shows the top performers (accuracy > 90%). The best results could be achieved by 3LS/Descent (bright blue), followed by 2LR/Descent (bright yellow). The 4LS-variants (greenish colors) didn’t do so well in general.

Accuracy vs. training-time — Top performers [image by author]

Effort

And finally, as in the previous analysis, I calculated a performance indicator called “effort”, which is the ratio between training time and accuracy (describing how much training time is necessary to achieve a certain level of accuracy). So you can see the top performers in that respect too:

Ranking by “effort” [image by author]
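For clarity, the calculation behind this indicator is just a ratio; a tiny sketch with made-up example values:

```julia
# "Effort" as defined above: training time divided by accuracy (lower is better).
effort(training_time, accuracy) = training_time / accuracy

# Hypothetical example: 120 seconds of training reaching 95.66% accuracy.
effort(120.0, 0.9566)   # ≈ 125.4
```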

Summary

Interestingly, we could get significantly better results with ADAM in all three cases where the standard optimizer failed. But in all other cases ADAM didn’t quite reach the levels we obtained with Descent. In these cases it mostly converged faster, needing fewer iterations, but this didn’t help in the end. So we can conclude that both optimizers complement each other well, but neither of them can be recommended for all purposes.
