Random Initialization For Neural Networks : A Thing Of The Past

Aditya Ananthram
Towards Data Science
4 min read · Feb 25, 2018


Photo by NASA on Unsplash

Lately, neural nets have been the go-to solution for almost all of our machine learning problems, simply because of their ability to synthesize complex non-linearities and reach levels of accuracy that were previously impossible, almost all the time.

In the industry, neural nets are seen as black boxes. This is because they use the features of the given data to formulate more and more complex features as we move from one dense layer to the next. Researchers have tried to study these complex feature-generation processes, but to date have not made much progress, and neural networks continue to be the black boxes they have always been.

This bit is what makes Deep learning, Deep.

Courtesy: Machine Learning Memes for Convolutional Teens, Facebook

Some researchers are also against using neural nets in safety-critical fields like autonomous cars and drones. They argue that the decisions taken by a deep neural network cannot be justified, unlike the decision-making frameworks of, say, support vector machines or random forests. If anything goes wrong tomorrow, say an autonomous car drives off a cliff on the way to the grocery store, and support vector machines were controlling the car’s actions, the reason for the failure could be traced and corrected. Because of the heavily complex structure of a neural network, on the other hand, no one could really explain why the car drove off the cliff or why it took that decision.

But all that said, no other method today can learn from data with as much accuracy as neural networks. Neural networks are the reason why image recognition is what it is today. Complex convolutional nets are now being built that are becoming more and more accurate at identifying objects, even competing with humans at the task.

In a neural network, there are weights between every two layers. A linear transformation of these weights applied to the values of the previous layer is passed through a non-linear activation function to produce the values of the next layer. This happens layer by layer during forward propagation, and through back propagation the optimum values of these weights can be found, so that the network produces accurate outputs for a given input.
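
As a rough illustration, here is a minimal sketch of one such forward-propagation step in NumPy. The names sigmoid, forward_step, a_prev, w and b are my own illustrative choices, not code from the course or this article:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_step(a_prev, w, b):
    z = w @ a_prev + b   # linear transformation of the previous layer's values
    a = sigmoid(z)       # non-linear activation produces the next layer's values
    return a

# Example: a layer of 3 units feeding a layer of 2 units, batch of 4 examples
a_prev = np.random.randn(3, 4)
w = np.random.randn(2, 3) * 0.01
b = np.zeros((2, 1))
print(forward_step(a_prev, w, b).shape)   # (2, 4)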

Until now, machine learning engineers have been using randomly initialized weights as the starting point for this process. Until recently (around 2015), it was not widely appreciated just how big a role the initial values of these weights play in helping the cost function of a deep neural network reach a good minimum.

I’m currently doing the Deep Learning Specialization on Coursera by Andrew Ng, and the second course of the specialization deals with hyperparameter tuning of these deep neural networks.

Let’s look at three ways to initialize the weights between the layers before we start forward and backward propagation to find the optimum weights.

1: Zero initialization

2: Random initialization

3: He et al. initialization

Zero Initialization

Zero initialization serves no purpose. The neural net does not perform symmetry breaking. If we set all the weights to zero, then all the neurons of all the layers perform the same calculation and give the same output, thereby making the whole deep net useless. If the weights are zero, the complexity of the whole deep net would be the same as that of a single neuron, and the predictions would be nothing better than random.

# Zero initialization: every weight between layer l-1 and layer l is zero
w = np.zeros((layer_size[l], layer_size[l - 1]))
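
To see the symmetry problem concretely, here is a small sketch (with a made-up 4-unit hidden layer and a tanh activation, not this article’s own code) showing that zero-initialized units all produce the same output and would all receive the same gradient update:

import numpy as np

x = np.random.randn(3, 5)        # 5 examples with 3 input features

W1 = np.zeros((4, 3))            # zero-initialized hidden layer of 4 units
b1 = np.zeros((4, 1))
a1 = np.tanh(W1 @ x + b1)        # every unit outputs the same value (here, 0)

print(np.allclose(a1, a1[0]))    # True: all 4 units behave identically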

Courtesy: Machine Learning Memes for Convolutional Teens, Facebook

Random Initialization

This serves the purpose of symmetry breaking and gives much better accuracy. In this method, the weights are initialized randomly but very close to zero; the 0.01 factor below keeps the initial activations small, in the region where activation functions like sigmoid and tanh are most sensitive. Symmetry is broken, so every neuron is no longer performing the same computation.

# Random initialization: small random weights, scaled down by 0.01
w = np.random.randn(layer_size[l], layer_size[l - 1]) * 0.01
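
Repeating the sketch above with small random weights instead of zeros (again, my own illustrative code) shows that the units now compute different values, so symmetry is broken, while the activations stay small and unsaturated:

import numpy as np

x = np.random.randn(3, 5)

w = np.random.randn(4, 3) * 0.01          # small random weights
b = np.zeros((4, 1))
a = np.tanh(w @ x + b)

print(np.allclose(a, a[0]))               # False: the 4 units now differ
print(np.abs(a).max())                    # small values, tanh is not saturated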

Courtesy: Machine Learning Memes for Convolutional Teens, Facebook

He et al. Initialization

This method of initialization became famous through a paper submitted in 2015 by He et al., and it is similar to Xavier initialization, with the variance scaled by a factor of two. In this method, the weights are initialized keeping in mind the size of the previous layer, which helps the cost function reach a good minimum faster and more efficiently. The weights are still random, but their range differs depending on the size of the previous layer of neurons: each weight is scaled by the square root of 2 divided by the number of neurons feeding into it. This provides a controlled initialization and hence faster and more efficient gradient descent.

# He et al. initialization: scale by sqrt(2 / size of the previous layer)
w = np.random.randn(layer_size[l], layer_size[l - 1]) * np.sqrt(2 / layer_size[l - 1])
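
Putting it together, here is a sketch of He initialization applied to every layer of a network. It assumes layer_size is a list of layer widths (for example [5, 64, 32, 1]); the helper name initialize_he is my own, not from the paper or the course:

import numpy as np

def initialize_he(layer_size):
    params = {}
    for l in range(1, len(layer_size)):
        # Scale by sqrt(2 / fan_in), where fan_in is the previous layer's size
        params["W" + str(l)] = (np.random.randn(layer_size[l], layer_size[l - 1])
                                * np.sqrt(2.0 / layer_size[l - 1]))
        params["b" + str(l)] = np.zeros((layer_size[l], 1))
    return params

params = initialize_he([5, 64, 32, 1])
print(params["W1"].std())   # roughly sqrt(2 / 5) ≈ 0.63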

Read this article for more info on the topic.

By trying all three initialization techniques on the same data, I observed the following:

Zero initialization:

Cost after 15,000 iterations: 0.7

Accuracy: 0.5

Random initialization:

Cost after 15,000 iterations: 0.38

Accuracy: 0.83

He et al. initialization:

Cost after 15,000 iterations: 0.07

Accuracy: 0.96

This alone shows how much the initialization of weights affects the performance of a neural network.

Check out this rad page on Facebook.

As always, Happy Learning.
