Normalized Direction-preserving Adam, Switching from Adam to SGD, and Nesterov Momentum Adam with Interactive Code [ Manual Back Prop in TF, Regularization Study 1 ]

Jae Duk Seo
Towards Data Science
9 min read · Jun 11, 2018


GIF from this website

Recently I have been interested in different methods of regularizing a neural network, and as expected, many different researchers have worked on many different methods. Specifically, for today, I want to take a look at variants of the Adam optimizer. Below is the list of all of the different methods that we are going to look at in this post.

Case a: stochastic gradient descent
Case b: stochastic gradient descent momentum
Case c: stochastic gradient descent with nesterov momentum
Case d: Adam Optimizer
Case e: Nesterov Momentum Adam
Case f: Switching from Adam to SGD
Case g: Normalized Direction-preserving Adam

Please note that this post is for my future self to look back on and review the material in these papers without reading them all over again. Also, please note that some of the implementations are still works in progress.

Paper from this website
Paper from this website
Paper from this website

Base Network || Data Set || Simple Theory Behind Different Methods

Images from this website (left, right)

Red Rectangle → Input Image (32*32*3)
Black Rectangle → Convolution with ELU() with / without mean pooling
Orange Rectangle → Softmax for classification

For simplicity, I am going to use the base network from my previous post, “The All Convolutional Net”. We are also going to evaluate each of the optimization methods on the CIFAR-10 data set.
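To make the setup concrete, here is a minimal tf.keras sketch of an all-convolutional classifier in the spirit of the base network. The filter counts, layer arrangement, and use of strided convolutions in place of pooling are my assumptions for illustration; this is not the post's manual back-propagation TensorFlow code.

```python
# A rough sketch of an all-convolutional CIFAR-10 classifier in the spirit of
# the base network; filter counts and the exact layer arrangement are
# assumptions, not the original configuration.
import tensorflow as tf

def build_all_conv_net(num_classes=10):
    inputs = tf.keras.Input(shape=(32, 32, 3))                                           # red rectangle: input image
    x = tf.keras.layers.Conv2D(96, 3, padding="same", activation="elu")(inputs)
    x = tf.keras.layers.Conv2D(96, 3, strides=2, padding="same", activation="elu")(x)    # strided conv instead of pooling
    x = tf.keras.layers.Conv2D(192, 3, padding="same", activation="elu")(x)
    x = tf.keras.layers.Conv2D(192, 3, strides=2, padding="same", activation="elu")(x)
    x = tf.keras.layers.Conv2D(num_classes, 1, padding="same", activation="elu")(x)      # 1x1 conv down to class maps
    x = tf.keras.layers.GlobalAveragePooling2D()(x)                                      # mean pooling over spatial positions
    outputs = tf.keras.layers.Softmax()(x)                                               # orange rectangle: softmax classifier
    return tf.keras.Model(inputs, outputs)

model = build_all_conv_net()
model.summary()
```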

Finally, I wish to give a simple version of the theory behind each new method (Nesterov Momentum Adam, SWATS, and Normalized Direction-preserving Adam).

Improving Generalization Performance by Switching from Adam to SGD → Simply switch from Adam to SGD in the middle of training. This way we can take advantage of Adam’s fast convergence at the beginning of training, and later let the model generalize better with SGD.
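As a rough illustration of the idea (not the paper's actual switching criterion, which decides the switch point and the SGD learning rate automatically), the sketch below simply trains with Adam for a fixed number of epochs and then recompiles the same model with SGD. The switch epoch, learning rates, and the tiny model are arbitrary placeholders, and this is done at the Keras level rather than inside the manually back-propagated graph used in the post.

```python
# Rough sketch of switching from Adam to SGD in the middle of training.
# The real method (SWATS) chooses the switch point and the SGD learning rate
# automatically; here the switch epoch, learning rates, and the tiny model
# are arbitrary placeholders.
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="elu"),
    tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="elu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Phase 1: take advantage of Adam's fast early convergence.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128)

# Phase 2: recompile with SGD (weights are kept) and keep training,
# hoping for better generalization in the later epochs.
model.compile(optimizer=tf.keras.optimizers.SGD(1e-2, momentum=0.9),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128)
```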

Normalized Direction-preserving Adam → One drawback that may explain why Adam is not as good at generalization is that, unlike SGD, it does not preserve the gradient direction. The authors of this paper propose a method to fix this drawback.
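Here is a rough NumPy sketch of my reading of the ND-Adam update for a single weight vector: the gradient component parallel to the (unit-length) weight vector is removed, the second moment is kept per vector rather than per element, and the vector is re-normalized after every step. Bias correction and other details may differ from the authors' reference implementation.

```python
# Rough NumPy sketch of one ND-Adam step for a single weight vector, as I
# understand the paper; details may differ from the authors' reference code.
import numpy as np

def nd_adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # Remove the component of the gradient parallel to w, so only the
    # direction of w can change (w is kept at unit L2 norm).
    g = grad - np.dot(grad, w) * w

    # First moment per element, second moment per *vector* (a scalar).
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * np.dot(g, g)

    # Standard Adam-style bias correction.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Update, then re-normalize so that ||w|| = 1 (direction-preserving).
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    w = w / np.linalg.norm(w)
    return w, m, v

# Toy usage: one step on a random unit vector with a random gradient.
rng = np.random.default_rng(0)
w = rng.normal(size=16); w /= np.linalg.norm(w)
m, v = np.zeros(16), 0.0
w, m, v = nd_adam_step(w, rng.normal(size=16), m, v, t=1)
```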

Incorporating Nesterov Momentum into Adam → In short, this paper takes the principle used to extend momentum to Nesterov momentum and applies it to Adam.
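Below is a compact NumPy sketch of one NAdam step, following the formulation used by common implementations of Dozat's paper. The momentum-schedule constants (the 0.96 base and the 0.004 decay) are assumptions taken from those implementations.

```python
# Rough NumPy sketch of one NAdam step ("Incorporating Nesterov Momentum into
# Adam"), following the formulation used by common implementations; the
# momentum-schedule constants are assumptions.
import numpy as np

def nadam_step(theta, grad, m, v, mu_product, t,
               lr=2e-3, beta1=0.9, beta2=0.999, eps=1e-8, decay=0.004):
    # Momentum schedule mu_t and its running product.
    mu_t  = beta1 * (1.0 - 0.5 * 0.96 ** (t * decay))
    mu_t1 = beta1 * (1.0 - 0.5 * 0.96 ** ((t + 1) * decay))
    mu_product_new  = mu_product * mu_t
    mu_product_next = mu_product_new * mu_t1

    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2

    g_hat = grad / (1.0 - mu_product_new)
    m_hat = m / (1.0 - mu_product_next)
    v_hat = v / (1.0 - beta2 ** t)

    # Nesterov-style "look ahead": mix the bias-corrected gradient and momentum.
    m_bar = (1.0 - mu_t) * g_hat + mu_t1 * m_hat
    theta = theta - lr * m_bar / (np.sqrt(v_hat) + eps)
    return theta, m, v, mu_product_new

# Toy usage on a 1-D quadratic loss f(x) = x^2 (gradient is 2x).
theta, m, v, mu_prod = np.array([3.0]), np.zeros(1), np.zeros(1), 1.0
for t in range(1, 201):
    theta, m, v, mu_prod = nadam_step(theta, 2.0 * theta, m, v, mu_prod, t)
```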

Results: Case a: stochastic gradient descent

Left Image → Testing Set Accuracy / Cost Over Time
Right Image → Training Set Accuracy / Cost Over Time

As expected, stochastic gradient descent performed extremely well, reaching a final accuracy of 85.3 percent on the CIFAR-10 data set in only 20 epochs. Not too bad for vanilla gradient descent.

Results: Case b: stochastic gradient descent momentum

Left Image → Testing Set Accuracy / Cost Over Time
Right Image → Training Set Accuracy / Cost Over Time

Although plain SGD performed well, SGD with momentum was able to outperform it slightly, with 85.7 percent accuracy.

Results: Case c: stochastic gradient descent with nesterov momentum

Left Image → Testing Set Accuracy / Cost Over Time
Right Image → Training Set Accuracy / Cost Over Time

Following the implementation from the online Stanford course (CS231n), stochastic gradient descent with Nesterov momentum gave pretty good results. However, it was disappointing to see this method outperformed by the regular momentum method.

Image from this website
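For reference, here is a NumPy sketch of the Nesterov momentum update in the form given in the CS231n notes, with classical momentum shown for comparison; the learning rate and momentum coefficient are arbitrary choices for illustration.

```python
# NumPy sketch of classical momentum vs. the Nesterov momentum form given in
# the Stanford CS231n notes; hyper-parameters are arbitrary.
import numpy as np

def momentum_step(w, grad, vel, lr=0.01, mu=0.9):
    vel = mu * vel - lr * grad                     # classical momentum velocity update
    return w + vel, vel

def nesterov_step(w, grad, vel, lr=0.01, mu=0.9):
    vel_prev = vel
    vel = mu * vel - lr * grad                     # same velocity update as above...
    w = w - mu * vel_prev + (1.0 + mu) * vel       # ...but a "look-ahead" parameter update
    return w, vel
```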

Results: Case d: Adam Optimizer

Left Image → Testing Set Accuracy / Cost Over Time
Right Image → Training Set Accuracy / Cost Over Time

I had already expected Adam to outperform most of the other optimization algorithms; however, that comes at the price of poorer generalization. The training accuracy was 97 percent, while the testing accuracy stalled at 87 percent. (This is the highest testing accuracy so far, but with the training accuracy at 97 percent there are only about 3 percentage points of room left for improvement.)
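For comparison with the NAdam and ND-Adam sketches earlier, here is a NumPy sketch of one standard Adam step with bias correction; the hyper-parameters are the commonly quoted defaults.

```python
# NumPy sketch of one standard Adam step with bias correction; the
# hyper-parameters are the commonly used defaults.
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-element adaptive step
    return w, m, v
```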

Results: Case e: Nesterov Momentum Adam

Left Image → Testing Set Accuracy / Cost Over Time
Right Image → Training Set Accuracy / Cost Over Time

For NAdam, I am 100 percent confident that my implementation is not fully done, since I did not incorporate the product term (the running product of the momentum-schedule coefficients, as in the sketch above). That might explain why the performance of this model is so bad. (It is not even training at all.)

Red Box → Image from the original paper, showing the terms that I am still working on

Results: Case f: Switching from Adam to SGD

Left Image → Testing Set Accuracy / Cost Over Time
Right Image → Training Set Accuracy / Cost Over Time

I believe that with optimal hyper-parameters this method could outperform every other method. However, without those hyper-parameter adjustments, I was only able to achieve 67 percent accuracy.

This was one of the most interesting optimization methods I have encountered so far. My implementation is messy, and hence not the greatest, but it was really fun and challenging trying to get this model to train. One of the reasons it was so interesting was all of the if conditions during back propagation.
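To give a flavor of those if conditions, here is a simplified TensorFlow sketch that uses tf.cond to pick between two update branches once a switch flag flips. This is a toy single-parameter example with placeholder step sizes, not the post's manual-back-propagation graph.

```python
# Toy TensorFlow sketch of choosing between two update branches with tf.cond,
# in the spirit of the "if conditions during back propagation"; the step sizes
# and the switch point are placeholders.
import tensorflow as tf

w = tf.Variable(3.0)
switched = tf.Variable(False)            # becomes True once we switch branches

@tf.function
def train_step():
    with tf.GradientTape() as tape:
        loss = tf.square(w)              # toy loss f(w) = w^2
    grad = tape.gradient(loss, w)
    # Select which update branch to apply at run time.
    delta = tf.cond(switched,
                    lambda: 0.01 * grad,     # "after the switch" branch
                    lambda: 0.001 * grad)    # "before the switch" branch
    w.assign_sub(delta)
    return loss

for step in range(100):
    if step == 50:
        switched.assign(True)            # flip the condition mid-training
    train_step()
```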

Results: Case g: Normalized Direction-preserving Adam

Despite my efforts of triple-checking my implementation and comparing it with the original code from the authors (found here), I was not able to successfully train this model using ND-Adam. (I am quite confident that I made a mistake here and there, since the authors were able to achieve over 90 percent accuracy, albeit over 80,000 epochs.)

Interactive Code

For Google Colab, you need a Google account to view the code, and you can’t run read-only scripts in Google Colab, so make a copy in your own playground. Finally, I will never ask for permission to access your files on Google Drive, just FYI. Happy coding! Also, for transparency, I have uploaded all of the training logs to my GitHub.

To access the code for case a click here, for the logs click here.
To access the code for case b click here, for the logs click here.
To access the code for case c click here, for the logs click here.
To access the code for case d click here, for the logs click here.
To access the code for case e click here, for the logs click here.
To access the code for case f click here, for the logs click here.
To access the code for case g click here, for the logs click here.

Final Words

I am very excited to start this series on generalization. However, I am quite sad that, despite my best efforts, I was not able to successfully train the model with Nesterov Momentum Adam or Normalized Direction-preserving Adam.

If any errors are found, please email me at jae.duk.seo@gmail.com, and if you wish to see the list of all of my writing, please view my website here.

Meanwhile, follow me on my Twitter here, and visit my website or my YouTube channel for more content. I also implemented Wide Residual Networks; please click here to view the blog post.

Reference

  1. Tensorflow: check if a scalar boolean tensor is True. (2018). Stack Overflow. Retrieved 10 June 2018, from https://stackoverflow.com/questions/43263933/tensorflow-check-if-a-scalar-boolean-tensor-is-true
  2. How to print part of a tensor using tf.Print?. (2018). Stack Overflow. Retrieved 10 June 2018, from https://stackoverflow.com/questions/47000828/how-to-print-part-of-a-tensor-using-tf-print
  3. Using tf.Print() in TensorFlow — Towards Data Science. (2018). Towards Data Science. Retrieved 10 June 2018, from https://towardsdatascience.com/using-tf-print-in-tensorflow-aa26e1cff11e
  4. tf.Print | TensorFlow. (2018). TensorFlow. Retrieved 10 June 2018, from https://www.tensorflow.org/api_docs/python/tf/Print
  5. How to pass parameters to functions inside tf.cond in Tensorflow?. (2018). Stack Overflow. Retrieved 10 June 2018, from https://stackoverflow.com/questions/38697045/how-to-pass-parmeters-to-functions-inside-tf-cond-in-tensorflow/39573566
  6. tf.cond(pred, fn1, fn2, name=None) | TensorFlow. (2018). TensorFlow. Retrieved 10 June 2018, from https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/cond
  7. How can I change tensorflow optimizer during training. (2018). Stack Overflow. Retrieved 10 June 2018, from https://stackoverflow.com/questions/48259650/how-can-i-change-tensorflow-optimizer-during-training
  8. Principal Component Analysis Pooling in Tensorflow with Interactive Code [PCAP]. (2018). Medium. Retrieved 10 June 2018, from https://medium.com/@SeoJaeDuk/principal-component-analysis-pooling-in-tensorflow-with-interactive-code-pcap-43aa2cee9bb
  9. tf.logical_and | TensorFlow. (2018). TensorFlow. Retrieved 10 June 2018, from https://www.tensorflow.org/api_docs/python/tf/logical_and
  10. Mathematical symbols list (+,-,x,/,=,<,>,…). (2018). Rapidtables.com. Retrieved 10 June 2018, from https://www.rapidtables.com/math/symbols/Basic_Math_Symbols.html
  11. Improving Generalization Performance by Switching from Adam to SGD · Issue #76 · kweonwooj/papers. (2018). GitHub. Retrieved 10 June 2018, from https://github.com/kweonwooj/papers/issues/76
  12. zj10/ND-Adam. (2018). GitHub. Retrieved 10 June 2018, from https://github.com/zj10/ND-Adam/blob/master/ndadam.py
  13. What is the meaning of super script 2 subscript 2 within the context of norms?. (2018). Cross Validated. Retrieved 10 June 2018, from https://stats.stackexchange.com/questions/181620/what-is-the-meaning-of-super-script-2-subscript-2-within-the-context-of-norms
  14. Linear Algebra 27, Norm of a Vector, examples. (2018). YouTube. Retrieved 10 June 2018, from https://www.youtube.com/watch?v=mKfn23Ia7QA
  15. How to convert tf.int64 to tf.float32?. (2018). Stack Overflow. Retrieved 10 June 2018, from https://stackoverflow.com/questions/35596629/how-to-convert-tf-int64-to-tf-float32
  16. tf.reduce_sum | TensorFlow. (2018). TensorFlow. Retrieved 10 June 2018, from https://www.tensorflow.org/api_docs/python/tf/reduce_sum
  17. Brownlee, J. (2017). Gentle Introduction to the Adam Optimization Algorithm for Deep Learning. Machine Learning Mastery. Retrieved 10 June 2018, from https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
  18. Stochastic gradient descent. (2018). En.wikipedia.org. Retrieved 10 June 2018, from https://en.wikipedia.org/wiki/Stochastic_gradient_descent
  19. Stochastic Gradient Descent with momentum — Towards Data Science. (2017). Towards Data Science. Retrieved 10 June 2018, from https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d
  20. [ ICLR 2015 ] Striving for Simplicity: The All Convolutional Net with Interactive Code [ Manual…. (2018). Towards Data Science. Retrieved 11 June 2018, from https://towardsdatascience.com/iclr-2015-striving-for-simplicity-the-all-convolutional-net-with-interactive-code-manual-b4976e206760
  21. CIFAR-10 and CIFAR-100 datasets. (2018). Cs.toronto.edu. Retrieved 11 June 2018, from https://www.cs.toronto.edu/~kriz/cifar.html
  22. CS231n Convolutional Neural Networks for Visual Recognition. (2018). Cs231n.github.io. Retrieved 11 June 2018, from http://cs231n.github.io/neural-networks-3/
