Why careful initialization of deep neural networks matters
The mathematics behind Kaiming initialization.
Introduction
A significant part of the recent success in deep learning goes to the ReLU activation function, which has achieved state-of-the-art results in deep CNNs for image classification problems. In this blog, we’ll discuss a robust weight initialization method that helps deeper neural models converge faster. Kaiming He et al. proposed this method in the Delving Deep into Rectifiers paper (2015).
This blog takes inspiration from Fast.ai’s course Deep Learning for coders, Part 2, taught by Jeremy Howard at USF.
Parameters in neural networks
Parameters of a neural network include its weights and biases. These numbers are randomly initialized first; the model then learns them, meaning we use gradients in the backward pass to update them gradually.
The most widespread way to initialize parameters is to draw them from a Gaussian distribution with mean 0 and standard deviation 1.
If m is the input size and nh is the number of hidden units, the weights can be initialized as an m-by-nh matrix drawn from this standard normal distribution.
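As a quick sketch (in NumPy; the sizes m and nh below are just example values), this standard-normal initialization looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

m, nh = 784, 50  # e.g. a flattened 28x28 input and 50 hidden units (example sizes)
W = rng.standard_normal((m, nh))  # weights ~ N(0, 1): mean 0, std 1
b = np.zeros(nh)                  # biases start at 0

print(W.mean(), W.std())  # close to 0 and 1
```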
Why does careful initialization matter?
Deep neural networks are hard to train. Randomly initializing parameters too small or too large becomes problematic when backpropagating gradients all the way back to the initial layers.
What happens when we initialize weights too small (<1)? Their gradients tend to get smaller as we move backward through the hidden layers, which means that neurons in the earlier layers learn much more slowly than neurons in later layers and receive only minor weight updates. This phenomenon is called the vanishing gradient problem, where gradients shrink toward 0.
What if we initialize weights too large (>1)? The gradients get much larger in the earlier layers, which causes extremely large weight updates that overshoot the minimum. This is the exploding gradient problem, where gradients blow up to infinity (NaN). Both cases make it difficult for the neural network to converge.
Below are images from the experiment conducted by Glorot et al. in the paper Understanding the difficulty of training deep feedforward neural networks. The authors considered neural networks 5 layers deep with standard initialization: random weights drawn from a normal distribution (mean 0, variance 1).
The graph shows that, as the forward pass progresses from layer 1 to layer 5, the mean of the activation values becomes smaller (vanishing toward 0) in the later layers; in layer 5 the activations are almost 0.
Gradients are computed backward, from layer 5 to layer 1, and by layer 1 they have nearly vanished.
A bad initialization can really hinder the learning of a highly non-linear system. With random initialization, the first layer throws away most of the information about the input image, so even if we train the later layers extensively, they don’t have enough information to learn from the input image.
Careful weight initialization prevents both of these from happening and results in faster convergence of deep neural networks.
Preprocessing of Input data
A neural network works best when its input data is standardized (mean 0 and standard deviation 1). Then, when the input values are multiplied by the weights, the activations stay on a scale of about 1. What does this buy us? It makes optimization more graceful: the hidden activation functions don’t saturate as quickly, and therefore don’t produce near-zero gradients early in learning.
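A minimal sketch of this preprocessing step (NumPy; the raw mean and std below are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(1000, 784))  # raw, uncentered inputs

# Standardize: subtract the mean and divide by the std,
# so the inputs end up with mean 0 and std 1
x_std = (x - x.mean()) / x.std()

print(x_std.mean(), x_std.std())  # ~0.0 and ~1.0
```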
ReLU — Rectifier Linear Units
ReLU is a non-linear activation function, defined as f(x) = max(0, x).
Advantages of using ReLU activation -
1. ReLU mitigates the exploding and vanishing gradient issues, since it outputs a constant gradient of 1 for all inputs > 0.
2. This makes the neural net learn faster and also speeds up the convergence of training.
3. ReLU reaches better solutions sooner than conventional sigmoid units.
Kaiming et al. (1) derived a sound initialization method by carefully modeling the non-linearity of ReLUs, which enables extremely deep models (>30 layers) to converge.
Forward Propagation
The forward pass consists of activations produced by sequential matrix multiplications (between each layer’s inputs and weights) coupled with ReLU non-linearities. The output is passed on to subsequent layers that carry out similar operations.
For each convolutional layer l, the response is given by equation (1), y_l = W_l x_l + b_l, where
- x_l is an n-by-1 vector (the subscript l denotes layer l from here onwards). x represents k × k co-located pixels (length × width) in c input channels, so its length is n = k²c. We assume the patch is square (length = breadth).
If f is the activation function of the previous layer (l−1), we have x_l = f(y_(l−1)).
- W_l is a d-by-n weight matrix, where d is the number of filters and n = k²c is the length of x_l. The channels of the current layer l equal the number of filters of the previous layer (l−1).
- b_l = a vector of biases (initialized to 0)
- y_l = resulting vector after matrix multiplication of weights and inputs and addition of bias.
Assumptions
- Elements in W_l and x_l are mutually independent and share the same distribution.
- x_l and w_l are independent of each other.
Why does a convolutional layer perform a linear operation?
On a side note, if you are wondering how a convolutional layer can compute a linear equation like a linear layer: a conv layer does convolutions, but if you follow this blog from Matthew Kleinsmith, you will see that convolutions are just matrix multiplications.
Now back to our equation. Taking the variance of the linear equation (1) gives equation (2), Var[y_l] = n_l Var[w_l x_l],
where y_l, w_l, x_l are random variables representing each element of y_l, W_l, x_l respectively. We assume w_l has zero mean. Substituting that into Eqn. (2), the variance of a product of independent variables gives Eqn. (3), Var[y_l] = n_l Var[w_l] E[x²_l].
But how did we arrive at this? For random variables x_l and w_l, which are independent of each other, we can use the properties of expectation to show Eqn. (A),
Var[w_l x_l] = E[w²_l] E[x²_l] − (E[w_l])² (E[x_l])²
Since w_l has zero mean, E[w_l] = 0 and hence (E[w_l])² = 0.
This means that the second term of Eqn. (A) evaluates to zero. We are then left with Eqn. (B), Var[w_l x_l] = E[w²_l] E[x²_l].
Using the formula for variance, Var[w_l] = E[w²_l] − (E[w_l])²,
and the fact that E[w_l] = 0, we can conclude that Var[w_l] = E[w²_l].
With this conclusion, we can replace E[w²_l] in Eqn. (B) with Var[w_l] to get Eqn. (C), Var[w_l x_l] = Var[w_l] E[x²_l].
Gotcha! By substituting Eqn. (C) into Eqn. (2) we obtain Eqn. (3).
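We can sanity-check this identity numerically: for zero-mean w independent of x, Var[wx] should equal Var[w]·E[x²]. A NumPy sketch (the distribution parameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
w = rng.normal(0.0, 0.7, n)  # zero-mean weights
x = rng.normal(1.5, 2.0, n)  # inputs, deliberately NOT zero-mean

lhs = np.var(w * x)                # Var[wx]
rhs = np.var(w) * np.mean(x ** 2)  # Var[w] * E[x^2]

print(lhs, rhs)  # the two agree to within Monte-Carlo error
```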
Let’s focus on the term E[x²_l]. Here E(·) stands for the expectation of a given variable, which is its mean value. Note that E[x²_l] = Var[x_l] does not hold unless x_l has zero mean, and x_l cannot have zero mean, because it is the ReLU output of the previous layer (l−1).
If we further assume that w_(l−1) has a symmetric distribution around zero and b_(l−1) = 0, then y_(l−1) has zero mean and a symmetric distribution around zero. When f is ReLU, this gives Eqn. (4), E[x²_l] = ½ Var[y_(l−1)].
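Eqn. (4) can also be verified numerically: for a zero-mean, symmetric y, the second moment of ReLU(y) is half the variance of y. A NumPy sketch (the choice σ = 2 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 2.0, 1_000_000)  # zero-mean, symmetric pre-activations
x = np.maximum(0.0, y)               # x_l = ReLU(y_{l-1})

print(np.mean(x ** 2), 0.5 * np.var(y))  # both close to 2.0
```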
Where does ReLU’s factor of 1/2 come from?
For the family of ReLUs, we have the generic activation function f(y_i) = max(0, y_i) + a_i min(0, y_i), where
1) y_i is the input of the nonlinear activation f on the ith channel,
2) a_i is the coefficient controlling the slope of the negative part.
ReLU is obtained when a_i = 0. The resultant activation function is of the form f(y_i) = max(0, y_i).
ReLU thresholds at zero, which enables the network to have sparse representations. For example, after uniform initialization of the weights, around 50% of the hidden units’ continuous output values are exact zeros.
ReLU loses a lot of information (negative values get replaced by zeros), which amounts to aggressive data compression. To retain more information you can use PReLU or Leaky ReLU, which add a slope a_i on the negative side of the axis.
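The sparsity claim, and the Leaky-ReLU fix, are easy to see numerically (NumPy sketch; the slope 0.01 is an arbitrary example value for a_i):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(100_000)      # symmetric pre-activations

relu = np.maximum(0.0, y)             # f(y) = max(0, y)
leaky = np.where(y > 0, y, 0.01 * y)  # slope a_i = 0.01 on the negative side

# ReLU zeroes out roughly half of the units; Leaky ReLU keeps them all alive
print(np.mean(relu == 0))   # ~0.5
print(np.mean(leaky == 0))  # ~0.0
```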
Since the ReLU activation function retains only the positive half-axis, we have Eqn. (4) above. Substituting it into Eqn. (3), we obtain Eqn. (5),
Var[y_l] = ½ n_l Var[w_l] Var[y_(l−1)].
Now we have an equation relating the activations of layer l to those of layer (l−1). With L layers put together, starting from the last layer L, we have the product
Var[y_L] = Var[y_1] ( Π from l=2 to L of ½ n_l Var[w_l] )
Note: x_1 is the input of the network, which is why the product starts at l = 2. This product is the key to the initialization design. A proper initialization method should avoid reducing or magnifying the magnitudes of the input signals exponentially, so we expect this product to take a proper scalar value (e.g. 1).
A sufficient condition (1) for the forward pass is ½ n_l Var[w_l] = 1 for all l.
This leads to a zero-mean Gaussian distribution whose standard deviation is √(2/n_l).
We also initialize bias, b = 0.
For the first layer (l = 1), we should have only n_1 Var[w_1] = 1: since no ReLU is applied to the input, we don’t need to halve it. A factor of 1/2 existing on just one layer does not matter.
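The effect of condition (1) can be demonstrated with a small experiment: push random data through a stack of ReLU layers and watch the activation scale. This is a NumPy sketch, not the paper’s code; the widths, depth, and batch size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def final_std(scale, n=256, depth=30):
    """Run a batch through `depth` ReLU layers whose weights have std scale(n)."""
    x = rng.standard_normal((200, n))
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * scale(n)  # zero-mean Gaussian weights
        x = np.maximum(0.0, x @ W)                  # linear response + ReLU
    return x.std()

naive = final_std(lambda n: 1.0 / np.sqrt(n))    # no factor of 2: signal collapses
kaiming = final_std(lambda n: np.sqrt(2.0 / n))  # std sqrt(2/n_l): scale preserved

print(naive, kaiming)  # naive is vanishingly small; kaiming stays on the order of 1
```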
Backward Propagation
The backward pass computes gradients from the last layer back to the first one.
For back-propagation, the gradient of convolutional layer l is given by Eqn. (6), ∆x_l = Ŵ_l ∆y_l, where
- Ŵ is a c_l-by- n̂_l matrix where the filters are rearranged in the way of back-propagation. Note that W_l and Ŵ can be reshaped from each other.
- ∆x = ∂E/∂x is a c_l-by-1 vector representing the gradient at a pixel of layer l.
- ∆y = ∂E/∂y represents k-by-k co-located pixels (length × breadth) in d channels and is reshaped into a k²d-by-1 vector.
We denote the number of connections in this response by n̂ = k²d.
During backprop, we are moving through the network in the reverse (backward) direction, so the forward-pass count does not carry over: in general n̂_l ≠ n_l.
Assumptions
- w_l and ∆y_l are independent of each other and contain random values from a normal distribution.
- ∆x_l has zero mean for all l when w_l is initialized from a symmetric distribution around zero.
In backward propagation we further have ∆y_l = f′(y_l) ∆x_(l+1),
where f′ is the derivative of the activation function f.
For the ReLU case, f′(y_l) is either zero or one, with equal probabilities: Pr(0) = Pr(1) = 1/2.
We again assume that f′(y_l) and ∆x_(l+1) are independent of each other.
As we have seen in Eqn. (4), ReLU contributes a scalar 1/2. Also, ∆y_l has zero mean (since ∆x_(l+1) does), and f′(y_l)² = f′(y_l). Taking the expectation of the square of the equation ∆y_l = f′(y_l) ∆x_(l+1), we get
E[∆y²_l] = Var[∆y_l] = ½ Var[∆x_(l+1)]
Here, the scalar 1/2 is the result of ReLU.
Then we compute the variance of the gradient in Eqn. (6):
Var[∆x_l] = n̂_l Var[w_l] Var[∆y_l] = ½ n̂_l Var[w_l] Var[∆x_(l+1)]
With L layers put together, we have Eq. (7),
Var[∆x_2] = Var[∆x_(L+1)] ( Π from l=2 to L of ½ n̂_l Var[w_l] )
We have to define a sufficient condition (2) that makes sure the gradient is not exponentially large or small: ½ n̂_l Var[w_l] = 1 for all l.
Note, the only difference between the condition for the backward pass and the one for the forward pass is that we have n̂_l instead of n_l.
Here n̂_l = k²d, where k is the spatial filter size and d is the number of filters of layer l,
and the channels of the current layer, c_l, are the same as the number of filters of the previous layer, d_(l−1).
Also, this condition leads to a zero-mean Gaussian distribution whose standard deviation is √(2/n̂_l).
For the first layer (l = 1), we need not compute ∆x_1 because it represents the image. The factor from a single layer does not make the overall product exponentially large or small.
It is safe to use either condition (1) or (2) to make a neural model converge without having exploding/vanishing gradients.
Suppose we substitute condition (2) into Eqn. (7): we get Var[∆x_2] = Var[∆x_(L+1)], so the backward signal is preserved. Substituting condition (2) into the forward product instead gives
Var[y_L] = Var[y_1] · (c_2 / d_L)
Instead of the scalar 1, the constant to look at is c_2/d_L, i.e. the ratio between the number of channels at the beginning of the network and at the end of it. This is not an exponentially diminishing number, so the neural network does not face exploding or vanishing gradient problems. According to the authors, this properly scales both the forward and backward passes, making the neural net converge efficiently.
Xavier Initialization
Glorot and Bengio had earlier proposed the “Xavier” initialization, a method whose derivation was based on the assumption that the activations are linear. This assumption is invalid for ReLU. The main difference between Kaiming and Xavier initialization is that Kaiming addresses the ReLU non-linearity. The Xavier method initializes weights from a distribution with standard deviation √(1/n_i).
Here, i is nothing but our layer index l. Xavier’s normalized initialization, which keeps activations and back-propagated gradients stable while moving up and down the network, draws weights uniformly from the interval ±√6/√(n_i + n_(i+1)).
We can use the Xavier formula in PyTorch via torch.nn.init.xavier_uniform_ (or torch.nn.init.xavier_normal_).
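As a NumPy sketch of what the normalized (uniform) form computes (the layer sizes below are example values):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    """Glorot's normalized init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

W = xavier_uniform(784, 256)

# The variance of U(-a, a) is a^2 / 3 = 2 / (fan_in + fan_out)
print(W.var(), 2.0 / (784 + 256))
```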
Conclusion
Here are the graphs comparing the convergence of the Xavier and Kaiming initialization methods with ReLU non-linearity, on neural networks 22 and 30 (27 conv, 3 fc) layers deep.
As we see, both init methods help the 22-layer deep neural net converge with ReLU, but Kaiming’s starts decreasing the error rate much earlier than Xavier’s.
For the 30-layer net, only the Kaiming initialization method enables the deep model to converge. With Xavier’s, learning completely stalls: the gradients diminish, leading to no convergence at all.
Here is how to initialize weights using the Kaiming init strategy.
torch.nn implements Kaiming initialization by drawing weights from a zero-mean Gaussian with standard deviation √(2/fan), where fan is determined by the chosen mode.
There are 2 modes to initialize with -
1. fan_in (default) — preserves the magnitude of weights in the forward pass.
2. fan_out — preserves the magnitudes of the weights in the backward pass.
We used fan_out, since we need the back-propagated gradients to be stable. Kaiming init is a must for deeper rectified neural networks.
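In PyTorch this is torch.nn.init.kaiming_normal_(w, mode='fan_out', nonlinearity='relu'). A NumPy sketch of what that call computes for a ReLU network (the layer sizes below are example values):

```python
import numpy as np

rng = np.random.default_rng(0)

def kaiming_normal(fan_in, fan_out, mode="fan_in"):
    """Draw a (fan_out, fan_in) weight matrix with std sqrt(2 / fan),
    where fan is fan_in or fan_out depending on the chosen mode."""
    fan = fan_in if mode == "fan_in" else fan_out
    std = np.sqrt(2.0 / fan)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = kaiming_normal(512, 256, mode="fan_out")
print(W.std(), np.sqrt(2.0 / 256))  # the two agree closely
```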
Hope you liked it.
References
- Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, Kaiming He et al., Microsoft Research, 2015
- Understanding the difficulty of training deep feedforward neural networks, Glorot et al., Universite de Montreal, 2010
- Pierre Ouannes, How to initialize deep neural networks? Xavier and Kaiming initialization, Mar 22, 2019
- James Dellinger, Weight Initialization in Neural Networks: A Journey From the Basics to Kaiming
- Jefkine, Initialization of deep networks case of rectifiers, 8 August 2016