Weight Initialization for Neural Networks — Does it matter?

A comparative study of different weight initialization techniques

Arnab Das
Towards Data Science


Validation accuracy and loss comparison for different uniform Initialization | Image by Author

Machine learning and deep learning techniques have made their way into almost every domain you can think of, and with the increasing availability of digitized data and the growing computational power of modern hardware, they will likely keep flourishing in the near future. As part of that, more and more people are taking up the task of building and training models every day. As a relative newcomer to the domain, I too have had to build and train a few neural network models. Most of the time, my primary goal is to make a model that is both highly accurate and well generalized. To achieve that, I normally break my head over finding the right hyper-parameters, deciding which regularization techniques to apply, or wondering whether I need to make my model deeper, and so on. But I often forget to play around with the weight-initialization technique, and I believe this is the case for many others like me.

Wait, weight initialization? Does it matter at all? I dedicate this article to finding out the answer to that question. When I searched the internet, there was overwhelming information about it. Some articles discuss the mathematics behind these techniques, some compare them theoretically, and some focus on deciding whether uniform initialization is better than normal. In this situation, I resorted to a result-based approach: I tried to find out the impact of weight-initialization techniques through a small experiment. I applied a few of these techniques to a model and tried to visualize the training process itself, setting aside the goal of achieving high accuracy for a while.

The setup:

The Model | Image by Author

I have created a simple CNN model with a combination of Conv2D, MaxPool, Dropout, BatchNorm, and Dense layers, with ReLU activation everywhere except the last layer, which uses Softmax. I trained it on the classification task of the CIFAR-10 dataset, initialized the model with six different kernel-initialization methods, and analyzed the training phase. The model was trained for 30 epochs with a batch size of 512 using the SGD optimizer. The six initialization methods used in this experiment are:

  • Glorot Uniform
  • Glorot Normal
  • He Uniform
  • He Normal
  • Random Uniform
  • Random Normal
Code snippet for model build and training | Image by Author

Note: The default kernel initialization method for Keras layers is Glorot Uniform.
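
For reference, here is a minimal sketch of this kind of setup in TensorFlow/Keras. Only the six initializers, CIFAR-10, the 30 epochs, the batch size of 512, and the SGD optimizer follow the description above; the layer widths, dropout rates, and the use of the test set for validation are illustrative assumptions, not the exact architecture behind the snippet shown in the image.

```python
# A minimal sketch of the experiment setup in TensorFlow/Keras.
# Layer widths, dropout rates, and using the test set for validation
# are assumptions for illustration only.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(initializer):
    """Small CNN whose kernels are all initialized with `initializer`."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu",
                      kernel_initializer=initializer,
                      input_shape=(32, 32, 3)),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Conv2D(64, (3, 3), activation="relu",
                      kernel_initializer=initializer),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation="relu",
                     kernel_initializer=initializer),
        layers.Dense(10, activation="softmax",
                     kernel_initializer=initializer),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# CIFAR-10 data, scaled to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# The six kernel initializers compared in this experiment
initializers = ["glorot_uniform", "glorot_normal",
                "he_uniform", "he_normal",
                "random_uniform", "random_normal"]

histories = {}
for init in initializers:
    model = build_model(init)
    histories[init] = model.fit(x_train, y_train,
                                epochs=30, batch_size=512,
                                validation_data=(x_test, y_test),
                                verbose=0)
```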

Experiment results:

Let us first have a look at how the uniform and normal variants behaved for the three different methods.

Validation accuracy for different initialization methods| Image by Author
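
Plots like these can be drawn along the following lines. This is only a sketch: it assumes the `histories` dict from the training snippet above and matplotlib, and is not claimed to be the author's actual plotting code.

```python
# Sketch: side-by-side validation accuracy curves for the uniform vs. normal
# variant of each initializer family, using the `histories` dict built above.
import matplotlib.pyplot as plt

pairs = [("glorot_uniform", "glorot_normal"),
         ("he_uniform", "he_normal"),
         ("random_uniform", "random_normal")]

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, (uniform, normal) in zip(axes, pairs):
    ax.plot(histories[uniform].history["val_accuracy"], label=uniform)
    ax.plot(histories[normal].history["val_accuracy"], label=normal)
    ax.set_title(uniform.split("_")[0].capitalize())  # Glorot / He / Random
    ax.set_xlabel("Epoch")
    ax.legend()
axes[0].set_ylabel("Validation accuracy")
plt.show()
```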

For the first two methods, the validation accuracy curve for normal initialization followed the uniform one fairly closely, if not always. For the Glorot methods (first plot), the curve took a slow start and then began to rise. After the tenth epoch, both the normal and uniform curves became very jumpy. Over the last few epochs, the normal curve trended downward, whereas the overall trend for the uniform one remained upward.

For the Random methods (last plot), the two curves did not match at all. The overall validation accuracy for Random Normal remained very low, below 40% (except for two sharp upward spikes), whereas the uniform variant produced a better result, although its curve is very unstable and jumpy. After the 23rd epoch, both curves showed a drop in validation accuracy. Following the trend of the Glorot methods, here too we see a very slow start up to epoch 5, with validation accuracy as low as ~10%. For both the Glorot and Random methods, the validation accuracy curve never showed a proper convergence pattern and remained unstable throughout training.

The He methods (middle plot) performed better than the other two in this respect. The uniform and normal curves are similar and tracked each other until the end of the 30-epoch training. Both showed an overall upward trend, and the curves are comparatively less jumpy. Unlike the Glorot and Random methods, with the He methods the validation accuracy started to rise from the very first epoch, with no slow start.

Let us now take a deeper dive and analyze the performance of these methods across a few desirable aspects.

Comparison of validation accuracy and loss among uniform techniques| Image by Author
Comparison of validation accuracy and loss among normal techniques| Image by Author
  • Accuracy: As far as accuracy is concerned, the highest validation accuracy is achieved with the Glorot methods (blue curves), although, as the curves suggest, they failed to hold that position and showed very unstable behavior. The He methods (orange curves) are far more stable and reliable than the other two, and their validation accuracy is comparable with Glorot. The Random techniques, both uniform and normal, produced the lowest validation accuracy throughout training. The table below gives an overall summary of how these methods behaved.
Accuracy comparison table | Image by Author
  • Convergence: People are training very deep models these days; GPT-3 has 175 billion parameters to train, so fast and distinct convergence is always desirable. As already mentioned, the Glorot and Random curves are very unstable and jumpy, so it is difficult to say whether convergence was achieved. Looking at the overall trend, the Random initialization methods perform very poorly: training with Random Normal converged at around the 40% mark, Random Uniform stayed below 50%, and the Random curves took as many as 15–16 epochs to reach even that level of validation accuracy. For Glorot Uniform and Normal initialization, the validation accuracy converges between 50–60% (with some random spikes above 60%), and the convergence trend only started to take shape after 15 epochs. The He curves, after increasing steadily, crossed the 50% mark at around 12 epochs (the He Normal curve was faster); both He Uniform and Normal then continued upward and finished around the 60% mark. Also, looking at the loss curves, the starting loss for the Glorot Uniform and Random Uniform methods is lower than for He Uniform. This is probably because He initialization sets the initial weights a bit on the higher side (see the short scale check after the stability plots below).
  • Stability and randomness of training: The validation accuracy comparison curves show very clearly that the Glorot and Random methods are very jumpy, and as a consequence they hinder the designer's ability to decide on an early stop of the training process, which is a very useful trick for reducing over-fitting and building a well-generalized model. On the other hand, when the model is initialized with the He uniform or normal methods, the validation accuracy curves show very stable and consistent behavior, which also gives the designer confidence in the training process: training can be stopped at any desired point. To verify these findings, we trained the model five times with each of the initialization methods.
Validation accuracy comparison for He Uniform initialization(multiple runs)| Image by Author
Validation accuracy comparison for Glorot Uniform(multiple runs)| Image by Author
Validation accuracy comparison for Random Uniform(multiple runs)| Image by Author

The plots clearly show how jumpy the training process can be when the model is initialized with the Glorot and Random methods. The Random Uniform method failed to show any improvement up to the 5th epoch, and the story is the same for Glorot Uniform up to the 3rd epoch.
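
To make the earlier point about initialization scale concrete: He normal draws weights with standard deviation sqrt(2 / fan_in), while Glorot normal uses sqrt(2 / (fan_in + fan_out)), so He weights start out larger on average. The quick check below is only a sketch; the 256×256 layer size is an arbitrary assumption and is not part of the original experiment.

```python
# Compare the spread of He vs. Glorot initial weights for a 256x256 layer
# (an arbitrary example size, not from the article).
# He normal: std ~ sqrt(2 / fan_in); Glorot normal: std ~ sqrt(2 / (fan_in + fan_out)).
import numpy as np
import tensorflow as tf

fan_in = fan_out = 256
w_he = tf.keras.initializers.HeNormal()(shape=(fan_in, fan_out)).numpy()
w_glorot = tf.keras.initializers.GlorotNormal()(shape=(fan_in, fan_out)).numpy()

print("He normal std:    ", np.std(w_he))      # roughly sqrt(2/256) ~ 0.088
print("Glorot normal std:", np.std(w_glorot))  # roughly sqrt(2/512) ~ 0.063
```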

  • Generalization: Along with accuracy, we as designers want to train a model that is both accurate and well generalized, so that it performs well on unseen data. We therefore expect a minimal gap between training and validation performance.
Training vs Validation (Glorot Uniform)| Image by Author
Training vs Validation (Random Uniform)| Image by Author
Training vs Validation (He Uniform)| Image by Author

As the plots above show, the gap between training and validation is narrowest when the model is initialized with the He Uniform method; the validation and training loss curves almost superimpose on one another. The gap is biggest when the model is initialized with the Random Uniform technique, which ends up as a highly over-fitted model.

  • Environment: Yes, you read it correctly, the heading is environment. An article published in the MIT Technology Review says, “Training a single AI model can emit as much carbon as five cars in their lifetimes”.

As AI shapes the future of humankind, we must also reduce the carbon footprint generated by the computational power spent on training our models, to keep the world inhabitable. Since model training is an iterative process, we should look for techniques that converge fast. Looking at the training curves, the model initialized with the He methods converges quickly, and its smooth training curves allow the designer to stop early.

Conclusion:

Well, while the deep learning community is divided on which initialization method works best, and on whether to use uniform or normal, this experiment gives me confidence that the He initialization methods performed better on a few aspects under the setting of this task and model:

1) Very stable and smooth training progression

2) Fast to converge to a desirable validation accuracy

3) A consistent increase of validation accuracy from the very first epoch

4) Multiple runs produce similar curves, with less randomness in the training process

5) Clear convergence and a very thin generalization gap (training vs. validation) compared to the others.
