
The Importance and Reasoning behind Initialisation

A critical component of neural networks illuminated

Neural networks work by learning the parameters that best fit a dataset to predict an output. To learn the best parameters, the ML engineer must initialise them and then optimise them using the training data. You may be considering a few options for initialisation, but the wrong choice can lead to a slow or even useless network!

Why we can’t initialise the weights to zero

This is a tempting option, though it does not work. If you initialise all weights to zero, or any identical number for that matter, the nodes of the network all behave in the same way. An example is explained with the diagram below.

Here all the weights and biases have been initialised to identical values w and b respectively. In this case we assume that the activation functions within any single layer are all the same. Whatever path an example datapoint takes to the output, it will produce the same result, and therefore the same cost. During backpropagation, the parameters in the same layer then receive identical gradient descent updates and remain identical throughout training. So the nodes in each layer offer nothing different from one another! When a new datapoint is passed through the network, it will produce the same output no matter what path it takes! We need to "break the symmetry" of the nodes…
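To see this concretely, here is a minimal NumPy sketch (the layer sizes, the values of w and b, and the Tanh activation are arbitrary choices for illustration) showing that identical weights force every node in a layer to behave identically:

```python
import numpy as np

# Toy hidden layer: 3 inputs, 4 nodes, every weight set to the same value w
w, b = 0.5, 0.1
W = np.full((4, 3), w)           # all weights identical
x = np.array([0.2, -1.3, 0.7])   # one example datapoint

hidden = np.tanh(W @ x + b)      # forward pass through the hidden layer
print(hidden)                    # all 4 nodes output exactly the same value

# Since every node produces the same activation, backpropagation also sends the
# same gradient to every row of W, so the nodes stay identical after every update.
```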

Why we can’t initialise the weights completely randomly

Though this would "break the symmetry", it gives rise to the problem of vanishing and exploding gradients. We want the average output of each node to be zero; that is, we want the activation functions to be zero-centred. This is explained in my article introducing activation functions, linked below.

The Importance and Reasoning behind Activation Functions

Consider the following many-layer network.

A datapoint X has to pass through several nodes before producing an output y. If the weights it is multiplied by are all large (or all small), this will result in very large (or very small) outputs by the final layer. The gradient of the cost will therefore explode (or vanish). In the case of an exploding gradient, the cost will never reach its minimum value because each gradient descent step is too large: it will keep overshooting the minimum and oscillating around it.

In the case of a vanishing gradient, the cost will never reach its minimum value because the steps it takes towards it are too small.
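A minimal NumPy sketch (the depth, width, and weight scales are arbitrary choices for illustration) makes the effect concrete by repeatedly multiplying an input by slightly-too-large or slightly-too-small random weight matrices:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(100)   # a made-up 100-dimensional input

# Hypothetical 50-layer linear network: the signal is repeatedly multiplied
# by weight matrices that are scaled a little too high or a little too low.
for scale, label in [(1.5, "large weights"), (0.5, "small weights")]:
    a = x.copy()
    for _ in range(50):
        W = scale * np.random.randn(100, 100) / np.sqrt(100)
        a = W @ a
    print(label, "-> std of final layer output:", a.std())

# Large weights blow the activations (and hence the gradients) up by orders of
# magnitude; small weights shrink them towards zero.
```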

What are the most common initialisation techniques, and why

To keep the gradient from getting too large or too small and slowing down gradient descent, we need to ensure two conditions are met: the average of any layer’s output is 0, and the variance stays relatively constant throughout the layers. The solution is to randomise the weights, but multiply them by a value dependent on the number of nodes feeding into the layer being initialised. The value is different for different activation functions.
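The reason the scaling factor depends on the number of incoming nodes is that the variance of a weighted sum grows with the number of terms being summed. A minimal NumPy sketch (the layer sizes are arbitrary) shows this, and how dividing by the square root of the fan-in keeps the output variance near 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 512, 1000          # arbitrary fan-in and fan-out
x = rng.standard_normal(n_in)    # input with mean 0 and variance 1

# Naive standard-normal weights: output variance is roughly n_in, not 1.
W_naive = rng.standard_normal((n_out, n_in))
print("naive :", (W_naive @ x).var())

# Scaling by 1/sqrt(n_in) keeps the output variance close to 1.
W_scaled = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
print("scaled:", (W_scaled @ x).var())
```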

He Initialisation

For nodes with a ReLU or Leaky ReLU activation function, we multiply the random number by a function of the number of nodes in the preceding layer. The formula is given below.
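The standard He formula draws each weight from a standard normal distribution and multiplies it by sqrt(2 / n^[l-1]), where n^[l-1] is the number of nodes in the preceding layer. A minimal NumPy sketch (the layer sizes are arbitrary examples):

```python
import numpy as np

def he_init(n_in, n_out, rng=np.random.default_rng()):
    """He initialisation for a layer with n_in incoming nodes and n_out nodes."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

W1 = he_init(n_in=784, n_out=256)   # e.g. a ReLU layer after a 784-pixel input
b1 = np.zeros(256)                  # biases are commonly started at zero
print(W1.std())                     # close to sqrt(2/784) ≈ 0.0505
```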

Xavier Initialisation

For nodes with a Tanh activation function, we use Xavier initialisation. This is very similar to He Initialisation, but the value we multiply our random numbers by has a 1 instead of a 2 in the numerator. The formula is given below.
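So the scaling factor becomes sqrt(1 / n^[l-1]). As a sketch under the same assumptions as above (note that some libraries implement a Glorot variant that also averages in the layer's own size):

```python
import numpy as np

def xavier_init(n_in, n_out, rng=np.random.default_rng()):
    """Xavier initialisation, as described above, for a Tanh layer."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)

W2 = xavier_init(n_in=256, n_out=64)   # arbitrary example layer sizes
print(W2.std())                        # close to sqrt(1/256) = 0.0625
```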

Summary

This is a short introduction to the methodology of initialisation. Both of these initialisation techniques keep the variance roughly constant across layers, and if we use a zero-centred activation function we can ensure that the output of each node has an average of zero, thus keeping gradient descent times reasonable.

