All the ways to initialize your neural network

In this article, I walk through the many ways of initializing neural network weights and the current best practices.

Akash Shastri
Towards Data Science

--

Zero Initialization

Initializing weights to zero DOES NOT WORK. So why have I mentioned it here? Because to understand the need for careful weight initialization, we first need to understand why initializing weights to zero won't work.

Fig 1. Simple Network. Image by the Author.

Let us consider a simple network like the one shown above. Each input is just a scalar: X₁, X₂, X₃. The weights of the two hidden neurons are W₁ and W₂, and each neuron's output is computed as below:

Out₁ = X₁*W₁ + X₂*W₁ + X₃*W₁
Out₂ = X₁*W₂ + X₂*W₂ + X₃*W₂

As you can see, if the weight matrix W = [W₁ W₂] is initialized to zero, then Out₁ and Out₂ are exactly the same.

Even if we add a non-zero random bias term to both, the weights are updated to non-zero values; however, they will remain identical to each other, so both neurons of the hidden layer compute the same thing. In other words, they are symmetric.

This is highly undesirable, as it amounts to wasted computation. This is why zero initialization does not work.
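We can see this symmetry directly in code. Below is a minimal sketch in PyTorch (the layer and batch sizes are just illustrative): a layer with all-zero weights produces identical outputs and identical gradients for both neurons, so they can never diverge from each other.

import torch

# A tiny layer with 3 inputs and 2 hidden neurons, all weights initialized to zero.
layer = torch.nn.Linear(3, 2, bias=False)
torch.nn.init.zeros_(layer.weight)

x = torch.randn(8, 3)
out = layer(x)
out.sum().backward()

print(torch.equal(out[:, 0], out[:, 1]))                        # True: both neurons compute the same thing
print(torch.equal(layer.weight.grad[0], layer.weight.grad[1]))  # True: identical gradients, so they stay identical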

Random Initialization

Now that we know the weights have to be different, the next idea was to initialize these weights randomly. Random initialization is much better than zero initialization, but can these random numbers be ANY number?

Let’s assume you are using a sigmoid non-linearity. The function is drawn below.

Sigmoid. Image by the author.

We can see that for values as large as 6 the sigmoid is already almost 1, and for values as small as -6 it is almost 0. This means that if our weights are initialized to values that are too large or too small, the pre-activations land in these flat, saturated regions, the gradient of the sigmoid is nearly zero, and almost no useful signal gets through.
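Here is a quick way to see this saturation in PyTorch (the inputs -6, 0, 6 are just example values): at ±6 the sigmoid is essentially flat, and the gradient flowing back through it is almost zero.

import torch

x = torch.tensor([-6.0, 0.0, 6.0], requires_grad=True)
y = torch.sigmoid(x)
y.sum().backward()

print(y)       # ~[0.0025, 0.5, 0.9975]: saturated at both ends
print(x.grad)  # ~[0.0025, 0.25, 0.0025]: almost no gradient at |x| = 6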

Saturation is less of an issue if we use a ReLU non-linearity, but weights that are initialized too large or too small cause other problems, as we will see below. There are better ways to initialize our weights.

Xavier Initialization

Xavier Initialization was proposed by Xavier Glorot and Yoshua Bengio in 2010. The main objective of this paper was to initialize weights such that the activations have a mean of zero and a standard deviation of 1. Consider a layer whose output is computed as shown below.

Z = WX + b

Here W is the weight matrix, X is the input from the previous layer, and b is the bias. Z is the output computed by the layer, often called its activations. We want Z to have a mean of 0 and a standard deviation of 1. (Strictly speaking, the activation is what comes out after a non-linearity such as ReLU is applied to Z.)

Why is a mean of zero and a standard deviation of 1 important?
Consider a deep neural network with 100 layers. At every layer, the weight matrix is multiplied with the activations from the previous layer. If the activations coming out of each layer are greater than one in magnitude, then after being scaled up repeatedly over 100 layers they keep getting larger and explode towards infinity. Similarly, if the activations are less than one, they shrink layer by layer and vanish towards zero. The same compounding happens to the gradients on the backward pass; this is the exploding and vanishing gradient problem.
We can see this in the image below: values even a little greater than 1 explode to very large numbers, and values even a little less than 1 vanish to zero.

Image by the author.

To avoid exploding and vanishing activations and gradients, we want the activations to have, on average, a mean of 0 and a standard deviation of 1. We can achieve this with a careful choice of our initial weights.
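The compounding effect is easy to reproduce with two lines of Python; the factors 1.1 and 0.9 are arbitrary stand-ins for per-layer scalings slightly above and below 1.

# Scaling by a factor slightly above or below 1, repeated over 100 layers.
print(1.1 ** 100)   # ~1.4e4: explodes
print(0.9 ** 100)   # ~2.7e-5: vanishes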

At the time this paper was released, the common practice was to draw weights from a uniform distribution on [-1, 1] and divide them by the square root of the input dimension. As it turns out, this is not such a good idea: gradients vanish, and training is very slow, if it works at all.

This was fixed by Xavier initialization, which draws the weights randomly from the uniform distribution shown below.

Xavier Init Uniform Distribution. Image by the author.

Nowadays, Xavier initialization is commonly done by drawing weights from a standard normal distribution and dividing each element by the square root of the input dimension. In PyTorch, the code looks as below.

torch.randn(n_inp, n_out)*math.sqrt(1/n_inp)
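As a quick sanity check of this scaling (dropping the nonlinearity so the arithmetic is exact; the width 512 and depth 100 are arbitrary choices), we can push unit-variance data through many such layers and watch the standard deviation:

import math
import torch

n = 512
x = torch.randn(1024, n)                      # activations with mean 0, std 1

for _ in range(100):
    w = torch.randn(n, n) * math.sqrt(1 / n)  # Xavier-style scaling
    x = x @ w
print(x.std().item())                         # stays on the order of 1

x = torch.randn(1024, n)
for _ in range(100):
    w = torch.randn(n, n)                     # no scaling: std grows by ~sqrt(n) per layer
    x = x @ w
print(x.std().item())                         # blows up to inf/nan within a few dozen layers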

Xavier initialization works fairly well for symmetric nonlinearities like sigmoid and tanh. However, it does not work as well for ReLU, which is now the most popular nonlinearity.

Kaiming Initialization

He et al. wrote a paper called Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification in 2015, in which they introduced what is now popularly known as Kaiming Init.

But why do we need Kaiming Init? What was the issue with Xavier Init for ReLU nonlinearities?

ReLU. Image by the author.

As you can see from the above graph, ReLU outputs 0 for all x < 0 and y = x for all x > 0. The function itself is 0 at x = 0; it is the derivative that is not defined there, and most modern frameworks simply pick a fixed value for it, such as 0.

Left: A normal distribution with mean 0 and standard deviation 1. Right: The normal distribution after being passed through ReLU. Image by the author.

Above we can see two scatter plots: on the left is the data before ReLU, and on the right is the same data after ReLU. It is clear from the images that after ReLU the variance is roughly halved and the mean becomes slightly positive. Since ReLU cuts the variance of the activations roughly in half, we need to double the variance of the pre-activations to recover the effect Xavier init was aiming for. Hence we scale the weights by an additional factor of √2. So in PyTorch, Kaiming init looks as shown below.

torch.randn(n_inp, n_out)*math.sqrt(2/n_inp)

In case you are still confused:
variance = (standard deviation)²
Hence, if you want to double the variance, you multiply the data (here, the weights) by √2.
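Here is a quick check of the √2 factor (the width 512 and one-layer setup are just for illustration): since the pre-activation has zero mean, the quantity the init tries to hold at 1 is the mean squared activation, and only the Kaiming scaling keeps it there after a ReLU.

import math
import torch

n = 512
x = torch.randn(1024, n)

w_xavier  = torch.randn(n, n) * math.sqrt(1 / n)
w_kaiming = torch.randn(n, n) * math.sqrt(2 / n)

print((torch.relu(x @ w_xavier)  ** 2).mean().item())   # ~0.5: ReLU zeroes half of the signal
print((torch.relu(x @ w_kaiming) ** 2).mean().item())   # ~1.0: the extra sqrt(2) compensates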

Fixup Initialization

Fixup is an initialization scheme proposed by Zhang et al. in 2019. Their observation was that Kaiming init and other standard initializations do not work well for networks with residual branches (a.k.a. residual networks): residual networks with standard initialization trained well only when BatchNorm was added.

Let us see why Kaiming init does not work well on residual networks. Consider the skip connection shown in the image below: X₂ = f(X₁) and X₃ = f(X₂) + X₁. We know that Kaiming init chooses weights such that the activations after each layer have mean 0 and variance 1, so X₁ has variance 1 and f(X₂) has variance 1. But Kaiming init does not account for the skip connection: X₁ and f(X₂) are roughly independent, so their variances add, and the variance of X₃ is roughly double that of X₁. This extra variance contributed by every residual branch is not accounted for in Kaiming init, which is why residual networks do not train well with standard init unless they have BatchNorm. Without BatchNorm, the output variance grows exponentially with depth:

Var[Xₗ₊₁] ≈ 2 Var[Xₗ]

Skip connection in a residual network. Image by the author.
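A minimal illustration of this doubling, treating the branch output f(X₂) as an independent unit-variance tensor (the shapes are arbitrary):

import torch

x1  = torch.randn(1024, 512)   # unit-variance input carried by the skip connection
fx2 = torch.randn(1024, 512)   # stand-in for f(X2), which Kaiming init also keeps at unit variance
x3  = fx2 + x1                 # the residual addition

print(x1.var().item(), fx2.var().item(), x3.var().item())   # ~1.0, ~1.0, ~2.0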

The authors make an important observation in the paper: the SGD update to the weights of each residual branch changes the network output in a highly correlated direction. This means that if the weights of the residual branches are all updated by some amount X, the network output also changes roughly proportionally to X, in the same direction as the weight updates.

The authors define the desired change in the network output after one SGD step to be Θ(η), where η is the learning rate. Since the residual branches on average contribute equally to the change in the output, and if we call the number of residual branches L, then on average each residual branch should change the output by Θ(η/L) to achieve a total change of Θ(η) in the output.

Next, the authors show how to initialize a residual branch of m layers so that its SGD update changes the output by Θ(η/L). They show that this can be done by rescaling the standard init of the weight layers inside the branch by the factor below:

Weight Scale factor. Image by the author.

The authors also discuss the utility of biases and multipliers. They found that adding a scalar bias, initialized at 0, before every convolution, linear layer, and element-wise activation led to a significant improvement in training. They also found that adding one multiplicative scalar per residual branch helped mimic the weight-norm dynamics of a network with normalization.

To summarize Fixup:

Fixup summary. Image taken from the Fixup paper (Zhang et al.).

Fixup is a little confusing, so if you have any questions feel free to ask in the comments and I will be happy to answer to the best of my ability.
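To make the recipe more concrete, here is a rough sketch of how these rules might look for a toy residual branch of m = 2 linear layers. The class name, layer sizes, and the use of linear layers instead of convolutions are my own illustrative assumptions, not code from the paper.

import torch
import torch.nn as nn

class FixupBranch(nn.Module):
    # Toy residual branch of m = 2 layers, initialized in the spirit of Fixup.
    def __init__(self, width, num_branches_L):
        super().__init__()
        self.fc1 = nn.Linear(width, width, bias=False)
        self.fc2 = nn.Linear(width, width, bias=False)

        # Standard (Kaiming) init for the first layer, rescaled by L^(-1/(2m-2)) with m = 2.
        nn.init.kaiming_normal_(self.fc1.weight)
        with torch.no_grad():
            self.fc1.weight *= num_branches_L ** (-1.0 / (2 * 2 - 2))

        # The last layer of every residual branch is initialized to zero.
        nn.init.zeros_(self.fc2.weight)

        # Scalar biases (init 0) before each layer/activation, one multiplier (init 1) per branch.
        self.b1 = nn.Parameter(torch.zeros(1))
        self.b2 = nn.Parameter(torch.zeros(1))
        self.b3 = nn.Parameter(torch.zeros(1))
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x):
        out = self.fc1(x + self.b1)
        out = torch.relu(out + self.b2)
        out = self.fc2(out + self.b3) * self.scale
        return x + out      # skip connection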

LSUV Initialization

LSUV (layer-sequential unit-variance) initialization was introduced in a 2016 paper by Mishkin et al. called All you need is a good init. LSUV is a data-driven approach that requires minimal calculation and has very low computational overhead. The initialization is a two-part process: first, initialize the weights to orthonormal matrices (as opposed to Gaussian noise, which is only approximately orthogonal); then, iterate over a mini-batch and scale the weights so that the variance of the activations is 1. The authors assert that the mini-batch size has a negligible influence on the resulting variance over a wide range of sizes.

In this paper, the authors lay out the steps for initialization as follows.

  1. Initialize the weights with Gaussian noise of unit variance.
  2. Decompose them to an orthonormal basis using either SVD or QR decomposition.
  3. Iterate through the network with the first mini-batch and, at each layer, scale the weights to bring the output variance closer to 1. Repeat until the output variance is (close to) 1 or the maximum number of iterations is reached.

In the paper, the authors scale the weights at each step by dividing them by √Var(Bₗ), where Bₗ is the output blob of layer L.

The authors also cap the maximum number of iterations to prevent infinite loops; however, in their experiments they found that unit variance was usually reached within 1–5 iterations.
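Below is a minimal sketch of these steps for a plain stack of linear layers; the function name, the tolerance, and the restriction to Linear layers are assumptions on my part, so treat it as an outline rather than the authors' implementation.

import torch

def lsuv_init(layers, x, tol=0.05, max_iters=10):
    # Steps 1-2: orthonormal initialization of every weight matrix.
    for layer in layers:
        torch.nn.init.orthogonal_(layer.weight)
    # Step 3: walk through the network with one mini-batch, rescaling each layer
    # by 1/sqrt(Var(B_L)) until its output variance is close to 1.
    with torch.no_grad():
        for layer in layers:
            for _ in range(max_iters):
                var = layer(x).var().item()
                if abs(var - 1.0) < tol:
                    break
                layer.weight /= var ** 0.5
            x = layer(x)

# Example usage with an (arbitrary) stack of 5 linear layers:
layers = [torch.nn.Linear(256, 256) for _ in range(5)]
lsuv_init(layers, torch.randn(64, 256))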

LSUV init can be thought of as a combination of orthonormal initialization and BatchNorm performed only on the first mini-batch. The authors show in their experiments that this is highly computationally efficient compared to full BatchNorm.

LSUV algorithm. Image taken from the LSUV paper.

Transfer Learning

Transfer learning is the practice of reusing the weights of a model that has already been trained on a similar task as the starting point for our new model. These weights have already learned a lot of useful information; we can simply fine-tune them for our specific goal and VOILA! We have a strong model without the hassle of initialization.
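For example, in PyTorch with torchvision this can be as little as a few lines (the 10-class head and the choice of ResNet-18 are just for illustration; older torchvision versions use pretrained=True instead of the weights argument):

import torch.nn as nn
import torchvision

# Load a ResNet-18 pre-trained on ImageNet and replace its head for a new 10-class task.
model = torchvision.models.resnet18(weights="DEFAULT")
model.fc = nn.Linear(model.fc.in_features, 10)

# Optionally freeze the pre-trained backbone and fine-tune only the new head.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc.")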

Using pre-trained weights is almost always the best approach when they are available. The only time we really have to initialize our own weights is when we are working on a network or problem that nobody has trained before, and in most practical scenarios that is rarely the case.
