Outperforming Tensorflow’s Default Auto Differentiation Optimizers, with Interactive Code [Manual Back Prop with TF]

Jae Duk Seo
Towards Data Science
12 min read · Mar 3, 2018


Image from Pixabay

I have been thinking about this idea for a long time: is there a different (maybe even better) way to train a neural network? Frameworks such as TensorFlow, Keras, and PyTorch are amazing and very easy to use, thanks not only to their ability to perform auto differentiation for us but also to the wide selection of optimizers they provide. But that does not mean we have to rely only on their auto differentiation.

So let’s do something different: I’ll try to outperform TensorFlow’s auto differentiation with its default optimizers, of which there are 10 in total. The two techniques we are going to use to outperform auto differentiation (each sketched briefly below) are…

a. Google Brain’s Gradient Noise
b. Dilated Back Propagation with ADAM optimizer for each layer
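To make technique (a) concrete, below is a minimal NumPy sketch of the annealed gradient-noise rule from the Google Brain paper “Adding Gradient Noise Improves Learning for Very Deep Networks”: Gaussian noise with variance eta / (1 + t)^gamma is added to the gradient before each update. The eta and gamma values here are the paper’s suggested defaults, not necessarily the exact settings used in my experiments.

import numpy as np

def add_gradient_noise(grad, t, eta=0.01, gamma=0.55):
    # Annealed Gaussian gradient noise: sigma_t^2 = eta / (1 + t)^gamma,
    # so the noise shrinks as the training step t grows. eta and gamma are
    # the paper's defaults; the values used in this post are an assumption.
    sigma = np.sqrt(eta / (1.0 + t) ** gamma)
    return grad + np.random.normal(0.0, sigma, size=grad.shape)

# usage (hypothetical variable names): noisy_grad = add_gradient_noise(grad_w3, t=iteration)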

Network Architecture / Experiment Task

Our experiment is very simple: we are going to use a fully connected neural network (with 5 layers) to perform classification on TensorFlow’s MNIST data set. Above is how each layer is constructed. Since we are going to use different optimization methods, each layer needs different ways to perform back propagation, hence it has three: standard back propagation, Google Brain’s added gradient noise, and ADAM back propagation. (A sketch of such a layer follows the two notes below.) Also, please take note of two details.

1. We are going to use all of the data provided by the TensorFlow MNIST data set.

2. We are going to use vectorized (flattened) images.
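To show what “three ways to perform back propagation” means in practice, here is a rough sketch of one such layer, written in plain NumPy rather than the TensorFlow ops used in the actual code, purely for illustration. The tanh activation, weight scale, and ADAM constants are assumptions, not values taken from the experiments.

import numpy as np

def tanh(x): return np.tanh(x)
def d_tanh(x): return 1.0 - np.tanh(x) ** 2

class FCLayer:
    # One fully connected layer with three interchangeable back propagation
    # modes: standard gradient descent, gradient descent with added noise,
    # and a per-layer ADAM update.

    def __init__(self, in_dim, out_dim, lr=0.001):
        self.w = np.random.randn(in_dim, out_dim) * 0.1   # assumed init scale
        self.lr = lr
        self.m = np.zeros_like(self.w)                     # ADAM first moment
        self.v = np.zeros_like(self.w)                     # ADAM second moment

    def feedforward(self, x):
        self.x = x
        self.z = x @ self.w
        return tanh(self.z)

    def _grads(self, grad_out):
        grad_z = grad_out * d_tanh(self.z)
        grad_w = self.x.T @ grad_z      # gradient w.r.t. this layer's weights
        grad_x = grad_z @ self.w.T      # gradient passed back to the previous layer
        return grad_w, grad_x

    def backprop_standard(self, grad_out):
        grad_w, grad_x = self._grads(grad_out)
        self.w -= self.lr * grad_w
        return grad_x

    def backprop_noise(self, grad_out, t, eta=0.01, gamma=0.55):
        grad_w, grad_x = self._grads(grad_out)
        sigma = np.sqrt(eta / (1.0 + t) ** gamma)           # annealed noise std
        self.w -= self.lr * (grad_w + np.random.normal(0.0, sigma, grad_w.shape))
        return grad_x

    def backprop_adam(self, grad_out, t, b1=0.9, b2=0.999, eps=1e-8):
        grad_w, grad_x = self._grads(grad_out)
        self.m = b1 * self.m + (1.0 - b1) * grad_w
        self.v = b2 * self.v + (1.0 - b2) * grad_w ** 2
        m_hat = self.m / (1.0 - b1 ** t)                    # bias-corrected moments (t >= 1)
        v_hat = self.v / (1.0 - b2 ** t)
        self.w -= self.lr * m_hat / (np.sqrt(v_hat) + eps)
        return grad_x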

List of TensorFlow’s Optimizers

Screenshot from TensorFlow

Above is the full list of optimizers we are going to compare against Google Brain’s gradient noise and dilated back propagation. Now, to easily see which is which, let’s assign a color to each case.

List of Comparison Cases

As seen above, we have 17 cases in total, and each case has its own color assigned to it. Please see below for the exact colors I used.

If you wish to know more about matplotlib colors, please visit this page for more information. Now let’s assign a different optimization method to each case.

Case 0  → Google Brain's Added Gradient Noise + Standard Back Prop
Case 1  → Dilated ADAM Back Propagation Sparse Connection by Multiplication
Case 2  → Dilated ADAM Back Propagation Sparse Connection by Multiplication + Google Brain's Added Gradient Noise
Case 3  → Dilated ADAM Back Propagation Dense Connection by Addition
Case 4  → Dilated ADAM Back Propagation Dense Connection by Multiplication
Case 5  → Dilated ADAM Back Propagation Dense Connection by Addition (Different Decay / Proportion Rate)
Case 6  → Dilated ADAM Back Propagation Dense Connection by Addition (Different Decay / Proportion Rate)
Case 7  → tf.train.GradientDescentOptimizer
Case 8 → tf.train.AdadeltaOptimizer
Case 9 → tf.train.AdagradOptimizer
Case 10 → tf.train.AdagradDAOptimizer
Case 11 → tf.train.MomentumOptimizer
Case 12 → tf.train.AdamOptimizer
Case 13 → tf.train.FtrlOptimizer
Case 14 → tf.train.ProximalGradientDescentOptimizer
Case 15 → tf.train.ProximalAdagradOptimizer
Case 16 → tf.train.RMSPropOptimizer

In essence, cases 0 ~ 6 use manual back propagation and cases 7 ~ 16 use TensorFlow’s auto differentiation. (A rough sketch of the mapping to tf.train optimizers follows.)
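For the auto differentiation cases, the mapping to TensorFlow’s built-in optimizers might look roughly like the sketch below (TF 1.x API). The shared learning rate of 0.001 matches Trial 1, and the momentum value for case 11 is an assumption, since the post does not state it.

import tensorflow as tf  # TF 1.x API

learning_rate = 0.001                          # shared across cases (Trial 1 value)
global_step = tf.Variable(0, trainable=False)  # AdagradDA requires a step counter

auto_diff_optimizers = {
    7:  tf.train.GradientDescentOptimizer(learning_rate),
    8:  tf.train.AdadeltaOptimizer(learning_rate),
    9:  tf.train.AdagradOptimizer(learning_rate),
    10: tf.train.AdagradDAOptimizer(learning_rate, global_step),
    11: tf.train.MomentumOptimizer(learning_rate, momentum=0.9),  # momentum value is assumed
    12: tf.train.AdamOptimizer(learning_rate),
    13: tf.train.FtrlOptimizer(learning_rate),
    14: tf.train.ProximalGradientDescentOptimizer(learning_rate),
    15: tf.train.ProximalAdagradOptimizer(learning_rate),
    16: tf.train.RMSPropOptimizer(learning_rate),
}

# train_step = auto_diff_optimizers[case].minimize(cost)   # 'cost' is the network's loss tensor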

List of Different Trials / Fundamental Superiority / Random Initialization

Experiment Trials

If one of the optimization methods is fundamentally superior to the others, then every time we run the experiment that method should outperform the rest. To increase our chance of capturing this fundamental characteristic, let’s perform 3 different trials, and in each trial we will run 10 experiments. (**Note**: the hyper parameters are set differently for each trial.)

Moreover, so that no method’s superiority can be attributed to a lucky starting point, let’s give a random seed value to the weight initialization. (A small seeding sketch follows.)
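Here is a small sketch of how that seeding could be done in TF 1.x; the truncated-normal initializer and its stddev are assumptions made for illustration, and the actual initializer in the experiments may differ.

import numpy as np
import tensorflow as tf  # TF 1.x API

def seeded_weight(shape, seed):
    # Make the weight initialization reproducible for a given seed.
    np.random.seed(seed)       # seed NumPy (e.g. for any data shuffling)
    tf.set_random_seed(seed)   # seed TensorFlow's graph-level RNG
    return tf.Variable(tf.truncated_normal(shape, stddev=0.05, seed=seed))

# w1 = seeded_weight([784, 1024], seed=678)   # hypothetical seed value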

After every experiment, we are going to compare all of the cases to see which one has…

a. Lowest Cost Rate (or Error Rate)
b. Highest Accuracy on Training Images
c. Highest Accuracy on Testing Images

And when all of the experiments are done, we are going to look at a frequency bar graph of which case performed best on each criterion. (A sketch of how these tallies could be plotted follows.)
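As a rough sketch of how one of those frequency bar graphs could be built (not the exact plotting code from the post), the tally for the lowest-cost criterion might look like this; the tab20 colormap is an assumption, since the post only says each case gets its own matplotlib color.

import numpy as np
import matplotlib.pyplot as plt

num_cases = 17
case_colors = plt.cm.tab20(np.linspace(0, 1, num_cases))  # one color per case (assumed colormap)

def frequency_bar(final_costs_per_experiment, title):
    # final_costs_per_experiment: one length-17 array per experiment.
    # Count how often each case ended with the lowest cost, then plot the tally.
    wins = np.zeros(num_cases)
    for costs in final_costs_per_experiment:
        wins[int(np.argmin(costs))] += 1
    plt.bar(range(num_cases), wins, color=case_colors)
    plt.xticks(range(num_cases), ["Case %d" % i for i in range(num_cases)], rotation=90)
    plt.title(title)
    plt.show()

# For the accuracy criteria, swap np.argmin for np.argmax on the accuracy arrays.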

Trial 1 Results

Left Plot → Frequency Bar graph of 10 Experiments on Lowest Cost Rate
Middle Plot → Frequency Bar graph of 10 Experiments on Highest Accuracy on Training Images
Right Plot → Frequency Bar graph of 10 Experiments on Highest Accuracy on Testing Images

When the learning rate was set to 0.001 and the layers were wider, with 1024 neurons each, dilated back propagation seemed prone to over-fitting: those cases most frequently had the lowest cost rate as well as the highest accuracy on the training images, but did not show up as having the highest accuracy on the test images.

I see potential for them to outperform every other case with proper regularization.

Percentage of Best Performing at Lowest Cost Rate
1. Case 5: 60%
('Dilated ADAM Back Propagation Dense Connection by Addition (Different Decay / Proportion Rate)')
2. Case 3: 30%
('Dilated ADAM Back Propagation Dense Connection by Addition')
3. Case 0: 20%
('Google Brain's Added Gradient Noise + Standard Back Prop')
4. Case 1: 10%
('Dilated ADAM Back Propagation Sparse Connection by Multiplication')
-------------------------------------------------------------
Percentage of Best Performing at Highest Accuracy on Training Images
1. Case 3: 40%
('Dilated ADAM Back Propagation Dense Connection by Addition')
2. Case 5: 30%
('Dilated ADAM Back Propagation Dense Connection by Addition (Different Decay / Proportion Rate)')
3. Case 0: 20%
('Google Brain's Added Gradient Noise + Standard Back Prop')
4. Case 1: 10%
('Dilated ADAM Back Propagation Sparse Connection by Multiplication')
-------------------------------------------------------------
Percentage of Best Performing at Highest Accuracy on Testing Images
1. Case 0: 60%
('Google Brain's Added Gradient Noise + Standard Back Prop')
2. Case 16: 40%
('tf.train.RMSPropOptimizer')


Trial 2 Results

Left Plot → Frequency Bar graph of 10 Experiments on Lowest Cost Rate
Middle Plot → Frequency Bar graph of 10 Experiments on Highest Accuracy on Training Images
Right Plot → Frequency Bar graph of 10 Experiments on Highest Accuracy on Testing Images

When the learning rate was set a bit higher (0.0025) with narrower layers (824 neurons each), case 5 (mostly) outperformed every other case.

Percentage of Best Performing at Lowest Cost Rate
1. Case 3: 60%
('Dilated ADAM Back Propagation Dense Connection by Addition')
2. Case 5: 40%
('Dilated ADAM Back Propagation Dense Connection by Addition (Different Decay / Proportion Rate)')
-------------------------------------------------------------
Percentage of Best Performing at Highest Accuracy on Training Images
1. Case 3: 50%
('Dilated ADAM Back Propagation Dense Connection by Addition')
2. Case 5: 50%
('Dilated ADAM Back Propagation Dense Connection by Addition (Different Decay / Proportion Rate)')
-------------------------------------------------------------
Percentage of Best Performing at Highest Accuracy on Testing Images
1. Case 5: 70%
('Dilated ADAM Back Propagation Dense Connection by Addition (Different Decay / Proportion Rate)')
2. Case 3: 20%
('Dilated ADAM Back Propagation Dense Connection by Addition')
3. Case 1: 10%
('Dilated ADAM Back Propagation Sparse Connection by Multiplication')


Trial 3 Results

Left Plot → Frequency Bar graph of 10 Experiments on Lowest Cost Rate
Middle Plot → Frequency Bar graph of 10 Experiments on Highest Accuracy on Training Images
Right Plot → Frequency Bar graph of 10 Experiments on Highest Accuracy on Testing Images

When the learning rate was set to 0.001 with narrower layers (824 neurons each), Google Brain’s gradient noise method (mostly) outperformed every other case.

Percentage of Best Performing at Lowest Cost Rate
1. Case 0: 90%
('Google Brain's Added Gradient Noise + Standard Back Prop')
2. Case 3: 10%
('Dilated ADAM Back Propagation Dense Connection by Addition')
-------------------------------------------------------------
Percentage of Best Performing at Highest Accuracy on Training Images
1. Case 0: 90%
('Google Brain's Added Gradient Noise + Standard Back Prop')
2. Case 3: 10%
('Dilated ADAM Back Propagation Dense Connection by Addition')
-------------------------------------------------------------
Percentage of Best Performing at Highest Accuracy on Testing Images
1. Case 0: 70%
('Google Brain's Added Gradient Noise + Standard Back Prop')
2. Case 7: 10%
('tf.train.GradientDescentOptimizer')
3. Case 14: 10%
('tf.train.ProximalGradientDescentOptimizer')
4. Case 16: 10%
('tf.train.RMSPropOptimizer')


(Update Mar 6) Trial 4 Results

Left Plot → Frequency Bar graph of 10 Experiments on Lowest Cost Rate
Middle Plot → Frequency Bar graph of 10 Experiments on Highest Accuracy on Training Images
Right Plot → Frequency Bar graph of 10 Experiments on Highest Accuracy on Testing Images

When the learning rate was set to 0.0025 with wider layers (1024 neurons each), case 2 performed well.

Percentage of Best Performing at Lowest Cost Rate
1. Case 5: 60%
('Dilated ADAM Back Propagation Dense Connection by Addition (Different Decay / Proportion Rate)')
2. Case 3: 40%
('Dilated ADAM Back Propagation Dense Connection by Addition')
-------------------------------------------------------------
Percentage of Best Performing at Highest Accuracy on Training Images
1. Case 5: 60%
('Dilated ADAM Back Propagation Dense Connection by Addition (Different Decay / Proportion Rate)')
2. Case 3: 40%
('Dilated ADAM Back Propagation Dense Connection by Addition')
-------------------------------------------------------------
Percentage of Best Performing at Highest Accuracy on Testing Images
1. Case 3: 50%
('Dilated ADAM Back Propagation Dense Connection by Addition')
2. Case 1: 30%
('Dilated ADAM Back Propagation Sparse Connection by Multiplication')
3. Case 5: 20%
('Dilated ADAM Back Propagation Dense Connection by Addition (Different Decay / Proportion Rate)')

Archived Training Results

To increase the transparency of this experiment, I have made another blog post that contains plots for cost over time, accuracy on training images over time, and accuracy on testing images over time. To access it please click here.

Shortcomings

I haven’t had time to optimize all of the hyper parameters for every auto differentiation optimizer. I kept the learning rate exactly the same for every case, but there is a possibility that the learning rate I chose is not optimal for TensorFlow’s auto differentiation. If you run this experiment yourself (you can use the code I provided in the Interactive Code section) and find good hyper parameters for each setting, please let me know by commenting; I would love to see whether those cases can be outperformed as well.

That being said, I believe TensorFlow has heavily optimized its algorithms to make auto differentiation run fast, rather than to squeeze the highest performance out of every network. So this might be a fair comparison, overall.

Also, I noticed two things about dilated back propagation:
1. It does not perform well with shallow networks or a smaller number of neurons.
2. It performs well in the long term.

Red Box → Error rate of case 15 is lower than any other case
Blue Box → After the 100th epoch, case 5 starts to outperform

As seen above, for most of the first 100 epochs the auto differentiation methods had a lower error rate. However, after a certain number of epochs, such as 100 or 150, dilated back propagation started to outperform them.

Interactive Code

I have moved to Google Colab for the interactive code! You will need a Google account to view the notebooks, and since you can’t run read-only scripts in Google Colab, make a copy in your own playground. Finally, just FYI, I will never ask for permission to access your files on Google Drive. Happy coding!

To access the code for Trial 1, please click here.
To access the code for Trial 2, please click here.
To access the code for Trial 3, please click here.

Citation
(If you wish to use this implementation or any information please cite this blog post)

APA

Outperforming Tensorflow’s Default Auto Differentiation Optimizers, with Interactive Code [Manual…. (2018). Medium. Retrieved 3 March 2018, from https://medium.com/@SeoJaeDuk/outperforming-tensorflows-default-auto-differentiation-optimizers-with-interactive-code-manual-e587a82d340e

MLA

"Outperforming Tensorflow’S Default Auto Differentiation Optimizers, With Interactive Code [Manual…." Medium. N. p., 2018. Web. 3 Mar. 2018.

Harvard

Medium. (2018). Outperforming Tensorflow’s Default Auto Differentiation Optimizers, with Interactive Code [Manual…. [online] Available at: https://medium.com/@SeoJaeDuk/outperforming-tensorflows-default-auto-differentiation-optimizers-with-interactive-code-manual-e587a82d340e [Accessed 3 Mar. 2018].

Final Words

I want to conclude this post with two of my favorite quotes.

Jeff Bezos (CEO of Amazon): It’s all about the long-term…
Ginni Rometty (CEO of IBM): Growth And Comfort Don’t Co-Exist

If any errors are found, please email me at jae.duk.seo@gmail.com. If you wish to see the list of all of my writing, please view my website here.

Meanwhile, follow me on my Twitter here, and visit my website or my YouTube channel for more content. I also did a comparison of decoupled neural networks here, if you are interested.

Reference

  1. Training | TensorFlow. (2018). TensorFlow. Retrieved 3 March 2018, from https://www.tensorflow.org/api_guides/python/train#Optimizers
  2. Seo, J. D. (2018, February 05). Manual Back Prop with TensorFlow: Decoupled Recurrent Neural Network, modified NN from Google… Retrieved February 21, 2018, from https://towardsdatascience.com/manual-back-prop-with-tensorflow-decoupled-recurrent-neural-network-modified-nn-from-google-f9c085fe8fae
  3. Bort, J. (2014, October 07). IBM CEO Ginni Rometty: Growth And Comfort Dont Co-Exist. Retrieved February 21, 2018, from http://www.businessinsider.com/ibm-ceo-growth-and-comfort-dont-co-exist-2014-10
  4. Saljoughian, P. (2017, November 20). What I learned from Jeff Bezos after reading every Amazon shareholder letter. Retrieved February 21, 2018, from https://medium.com/parsa-vc/what-i-learned-from-jeff-bezos-after-reading-every-amazon-shareholder-letter-172d92f38a41
  5. P. (n.d.). Pinae/TensorFlow-MNIST-example. Retrieved February 22, 2018, from https://github.com/pinae/TensorFlow-MNIST-example/blob/master/fully-connected.py
  6. How do I print an integer with a set number of spaces before it? (n.d.). Retrieved February 22, 2018, from https://stackoverflow.com/questions/45521183/how-do-i-print-an-integer-with-a-set-number-of-spaces-before-it
  7. How to pretty-printing a numpy.array without scientific notation and with given precision? (n.d.). Retrieved February 22, 2018, from https://stackoverflow.com/questions/2891790/how-to-pretty-printing-a-numpy-array-without-scientific-notation-and-with-given
  8. Limiting floats to two decimal points. (n.d.). Retrieved February 22, 2018, from https://stackoverflow.com/questions/455612/limiting-floats-to-two-decimal-points
  9. How to prevent tensorflow from allocating the totality of a GPU memory? (n.d.). Retrieved February 23, 2018, from https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory
  10. Closing session in tensorflow doesn’t reset graph. (n.d.). Retrieved February 23, 2018, from https://stackoverflow.com/questions/42706761/closing-session-in-tensorflow-doesnt-reset-graph
  11. tf.reset_default_graph | TensorFlow. (2018). TensorFlow. Retrieved February 23, 2018, from https://www.tensorflow.org/api_docs/python/tf/reset_default_graph
  12. Creating a float64 Variable in tensorflow. (n.d.). Retrieved February 23, 2018, from https://stackoverflow.com/questions/35884045/creating-a-float64-variable-in-tensorflow
  13. What does global_step mean in Tensorflow? (n.d.). Retrieved February 23, 2018, from https://stackoverflow.com/questions/41166681/what-does-global-step-mean-in-tensorflow
  14. List of zeros in python. (n.d.). Retrieved February 23, 2018, from https://stackoverflow.com/questions/8528178/list-of-zeros-in-python
  15. Getting the index of the returned max or min item using max()/min() on a list. (n.d.). Retrieved February 23, 2018, from https://stackoverflow.com/questions/2474015/getting-the-index-of-the-returned-max-or-min-item-using-max-min-on-a-list
  16. How to prevent tensorflow from allocating the totality of a GPU memory? (n.d.). Retrieved February 23, 2018, from https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory
  17. How to get stable results with TensorFlow, setting random seed. (n.d.). Retrieved February 23, 2018, from https://stackoverflow.com/questions/36288235/how-to-get-stable-results-with-tensorflow-setting-random-seed
  18. Specifying Colors¶. (n.d.). Retrieved February 23, 2018, from https://matplotlib.org/users/colors.html
  19. Eisen, M. (n.d.). Set the bar colors for a plot with matplotlib. Retrieved February 24, 2018, from http://matthiaseisen.com/pp/patterns/p0178/
  20. Close Event¶. (n.d.). Retrieved February 24, 2018, from https://matplotlib.org/gallery/event_handling/close_event.html#sphx-glr-gallery-event-handling-close-event-py
  21. Module: tf.contrib.opt | TensorFlow. (2018). TensorFlow. Retrieved 3 March 2018, from https://www.tensorflow.org/api_docs/python/tf/contrib/opt
  22. Ginni Rometty. (2018). En.wikipedia.org. Retrieved 3 March 2018, from https://en.wikipedia.org/wiki/Ginni_Rometty
  23. Jeff Bezos. (2018). En.wikipedia.org. Retrieved 3 March 2018, from https://en.wikipedia.org/wiki/Jeff_Bezos
  24. Only Numpy: Implementing “ADDING GRADIENT NOISE IMPROVES LEARNING FOR VERY DEEP NETWORKS” from…. (2018). Becoming Human: Artificial Intelligence Magazine. Retrieved 3 March 2018, from https://becominghuman.ai/only-numpy-implementing-adding-gradient-noise-improves-learning-for-very-deep-networks-with-adf23067f9f1
