
Breaking Symmetry in Deep Learning

Initializing weights to zero matrices in an L-Layered Deep Learning Model may lead to a decrease in the cost but no change in weights…

Initializing weights to zero matrices in an L-Layered Deep Learning Model may lead to a decrease in the cost but no change in weights. This article discusses a simple but vital concept in deep learning: "breaking symmetry."

Photo from Unsplash by Ellicia@ellicia_

Deep learning algorithms have fascinated all of us with their accuracy. Field after field now uses deep learning models: self-driving cars, artificial intelligence, speech-to-text conversion, image recognition, sentiment analysis, and much more. Deep learning has improved the state of the art for many problems. But these are very complex technologies and are still evolving.

As there is always a starting point, trends show that most deep learning instructors start with "Logistic Regression as a Neural Network", which covers the fundamentals of deep learning: a single-layer model that uses cross-entropy as its cost function and the gradient descent algorithm to update the weights. With that, I have indirectly mentioned all the prerequisites for this article.

There is an assumption that is as small as an iron nail sunk in the sand and is usually ignored: we initialize the weights to zero. This assumption holds up well when we are dealing with a single-layer neural network, i.e. one that does not contain any hidden layer.

Let us consider the famous Titanic dataset problem, where we need to predict who survived given certain features about the passengers. We will not use the raw dataset but a cleaned dataset that is ready to be trained on.
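
As a minimal sketch of that preparation step (the file name titanic_cleaned.csv and the exact columns are hypothetical; any cleaned, fully numeric version of the data with a Survived column works), the features and labels can be arranged as follows:

import numpy as np
import pandas as pd

## hypothetical cleaned Titanic file: all columns numeric, 'Survived' is the label
df = pd.read_csv('titanic_cleaned.csv')

Y = df['Survived'].values.reshape(1, -1)    ## labels, shape (1, m)
X = df.drop(columns=['Survived']).values.T  ## features, shape (number_of_features, m)
m = X.shape[1]                              ## number of training examples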

CASE 0: Single-Layer Deep Learning Model

So, let us predict who survived the Titanic using a single-layer deep learning model with the sigmoid function as the activation. This is logistic regression, but with a backpropagation step using the gradient descent algorithm and a learning rate of 0.001.

import numpy as np

## sigmoid activation (the standard logistic function, defined here so the
## snippet is self-contained)
def sigmoid(z):
    return 1/(1 + np.exp(-z))

## X (features, shape: number_of_features x m), Y (labels, shape: 1 x m) and
## m (number of training examples) come from the cleaned Titanic dataset.
################# WEIGHTS ARE INITIALIZED TO ZERO ##################
W = np.zeros((1,X.shape[0]))
b = np.zeros((1,1))
################# WEIGHTS ARE INITIALIZED RANDOMLY #################
## (use only one of the two initializations per run; this one overwrites the first)
W = np.random.randn(1,X.shape[0])
b = np.zeros((1,1))
################### ONCE WEIGHTS ARE INITIALIZED ###################
## dict for storing costs
ini_single = {}
irts = 1000
while irts > 0:
    irts = irts - 1
    ## Forward Propagation
    Z = np.dot(W,X) + b
    A = sigmoid(Z)
    ## Cost estimation (cross-entropy)
    logerror = -(np.multiply(Y, np.log(A)) + np.multiply(1-Y, np.log(1 - A)))
    cost = logerror.sum()/m
    ini_single[1000-irts] = cost
    print('The cost of the function is: ' + str(cost))
    ## Backward Propagation
    dZ = A - Y
    dw = np.dot(dZ,X.T)/m
    db = np.sum(dZ)/m
    ## Updating Weights (learning rate = 0.001)
    W = W - 0.001*dw
    b = b - 0.001*db

Once we have executed the code, we can plot the cost against the number of iterations for the single-layer deep learning model under both weight-initialization techniques. After several iterations both techniques tend to the same cost, but the weights initialized to non-zero values give a lower cost initially.
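
To produce this comparison plot, one option is a small matplotlib sketch like the one below; cost_zero and cost_random are hypothetical names for the two cost dictionaries obtained by running the loop above once with each initialization.

import matplotlib.pyplot as plt

## cost_zero and cost_random: the cost dictionaries from the zero-initialized
## and randomly initialized runs of the loop above
plt.plot(list(cost_zero.keys()), list(cost_zero.values()), label='weights initialized to zero')
plt.plot(list(cost_random.keys()), list(cost_random.values()), label='weights initialized randomly')
plt.xlabel('Number of iterations')
plt.ylabel('Cost')
plt.legend()
plt.show()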

Single-Layer Deep Learning Model for the Titanic dataset (Published by Author).

The results for a single-layer model are satisfactory because both techniques converge to the same cost after some iterations; with a single output unit and no hidden layer, there are no hidden units that could end up learning identical functions. Now, let us introduce a hidden layer into the model, keeping the weights and biases initialized to zero.

CASE 1: Two-Layer Deep Learning Model, Weights Initialized to Zero Matrices

## dict for storing costs
ini_zero = {}
## initializing weights of a 2 layer Deep Learning model as zero matrices
W1 = np.zeros((4,X.shape[0]))
b1 = np.zeros((4,1))
W2 = np.zeros((Y.shape[0],4))
b2 = np.zeros((Y.shape[0],1))
n = 10               ## number of iterations
irts = n
while irts > 0:
    print('================ iteration number: ' + str(n-irts) + ' ================')
    irts = irts - 1
    ## Forward Propagation
    Z1 = np.dot(W1,X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2,A1) + b2
    A2 = sigmoid(Z2)
    ## Cost estimation (cross-entropy)
    logerror = -(np.multiply(Y, np.log(A2)) + np.multiply(1-Y, np.log(1 - A2)))
    cost = logerror.sum()/m
    ini_zero[n-irts] = cost
    print('The cost of the function is: ' + str(cost))
    ## Backward Propagation
    dz2 = A2 - Y
    dw2 = np.dot(dz2, A1.T)/m
    db2 = np.sum(dz2, axis=1, keepdims=True)/m
    derivative = 1 - np.tanh(Z1) * np.tanh(Z1)   ## derivative of tanh
    dz1 = np.multiply(np.dot(W2.T, dz2), derivative)
    dw1 = np.dot(dz1,X.T)/m
    db1 = np.sum(dz1, axis=1, keepdims=True)/m
    ## Updating Weights (learning rate = 0.01)
    W1 = W1 - 0.01*dw1
    b1 = b1 - 0.01*db1
    W2 = W2 - 0.01*dw2
    b2 = b2 - 0.01*db2

As stated, this is a two-layer deep learning model where the activation function at the hidden layer is tanh and at the output layer is sigmoid, with a learning rate of 0.01.

If we examine the output carefully, the cost is decreasing but the values of the weights W1 and W2 remain constant. This lack of change in the W's is called symmetry in deep learning. Initializing the weights to zero can cripple the model, which is why we need to initialize the weights randomly, hence "breaking symmetry."

Value of cost, W1, and W2 for 10 iterations when weights are initialized to zero (Published by Author).

But why, then, is the cost changing? A keen observation is that the W's remain zero, yet the cost still decreases. The same question puzzled me, and now I have an answer. The reason behind the changing cost is b2, the bias of the output layer. For the first iteration of gradient descent, all the predictions are the same, but we find:

dZ2 = A2 - Y, where dZ2 is the partial derivative of the cost function with respect to Z2, A2 is the output of the output layer, and Y is the vector of actual target values.
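
For reference, this expression comes from differentiating the cross-entropy cost L = -(Y log A2 + (1 - Y) log(1 - A2)) through the sigmoid output; a one-line derivation per training example is:

$$\frac{\partial L}{\partial Z_2}=\frac{\partial L}{\partial A_2}\cdot\frac{\partial A_2}{\partial Z_2}=\left(-\frac{Y}{A_2}+\frac{1-Y}{1-A_2}\right)A_2\,(1-A_2)=A_2-Y$$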

Because the weights start at zero, the first forward pass gives A1 = tanh(0) = 0 and A2 = sigmoid(0) = 0.5 for every example, so dZ2 = 0.5 - Y is non-zero, and its mean, db2 = 0.5 - mean(Y), is non-zero for imbalanced data such as the Titanic labels. At the same time, dW2 is zero (it is multiplied by A1 = 0) and dZ1 is zero (it is multiplied by W2 = 0). Hence the only parameter that changes during the iterations of gradient descent is b2, and that is why the cost is decreasing. This also shows how powerful the gradient descent algorithm is.
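
A quick numerical check makes this concrete. The snippet below is a minimal sketch that reuses X, Y, m, and the sigmoid helper from the code above and recomputes the first backward pass with zero initialization; every gradient except db2 comes out exactly zero.

## minimal check: with zero initialization, only db2 is non-zero
W1 = np.zeros((4, X.shape[0])); b1 = np.zeros((4, 1))
W2 = np.zeros((Y.shape[0], 4)); b2 = np.zeros((Y.shape[0], 1))

A1 = np.tanh(np.dot(W1, X) + b1)           ## all zeros, since Z1 = 0
A2 = sigmoid(np.dot(W2, A1) + b2)          ## all 0.5, since Z2 = 0

dz2 = A2 - Y                               ## 0.5 - Y, non-zero
dw2 = np.dot(dz2, A1.T)/m                  ## zero, because A1 = 0
dz1 = np.dot(W2.T, dz2) * (1 - A1**2)      ## zero, because W2 = 0
dw1 = np.dot(dz1, X.T)/m                   ## zero
db1 = np.sum(dz1, axis=1, keepdims=True)/m ## zero
db2 = np.sum(dz2, axis=1, keepdims=True)/m ## 0.5 - mean(Y), non-zero

print(np.allclose(dw1, 0), np.allclose(dw2, 0), np.allclose(db1, 0))  ## True True True
print(db2)                                 ## non-zero for imbalanced labels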

Now, let us see what happens if we initialize the weights with non-zero values.

CASE 2: Two-Layer Deep Learning Model, Weights Initialized to Non-Zero Values

## dict for storing costs
ini_nonzero = {}
## initializing weights of a 2 layer Deep Learning model as random value matrices
W1 = np.random.randn(4,X.shape[0])*0.01
b1 = np.zeros((4,1))
W2 = np.random.randn(Y.shape[0],4)*0.01
b2 = np.zeros((Y.shape[0],1))
n = 10               ## number of iterations (increase for the longer runs below)
irts = n
while irts > 0:
    irts = irts - 1
    ## Forward Propagation
    Z1 = np.dot(W1,X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2,A1) + b2
    A2 = sigmoid(Z2)
    ## Cost estimation (cross-entropy)
    logerror = -(np.multiply(Y, np.log(A2)) + np.multiply(1-Y, np.log(1 - A2)))
    cost = logerror.sum()/m
    ini_nonzero[n-irts] = cost
    print('The cost of the function is: ' + str(cost))
    ## Backward Propagation
    dz2 = A2 - Y
    dw2 = np.dot(dz2, A1.T)/m
    db2 = np.sum(dz2, axis=1, keepdims=True)/m
    derivative = 1 - np.tanh(Z1) * np.tanh(Z1)   ## derivative of tanh
    dz1 = np.multiply(np.dot(W2.T, dz2), derivative)
    dw1 = np.dot(dz1,X.T)/m
    db1 = np.sum(dz1, axis=1, keepdims=True)/m
    ## Updating Weights (learning rate = 0.01)
    W1 = W1 - 0.01*dw1
    b1 = b1 - 0.01*db1
    W2 = W2 - 0.01*dw2
    b2 = b2 - 0.01*db2

Here, we have initialized the weights randomly from a normal distribution; only the weights have changed, not the biases, and the learning rate is the same, but the difference is HUGE.

Value of cost, W1, and W2 for 10 iterations when weights are initialized non-zero (Published by Author).

This makes it clear how important weight initialization is. A keen observer will notice that the biases are still initialized to zero. As we have seen, gradient descent updates the biases directly, so they may be initialized to zero or not; unlike zero-initialized weights, this does not make a huge difference.

Another observation is that we have multiplied the weights by a factor of 0.01 (coincidentally equal to the learning rate) to ensure that the weights are small. If the weights are large and we use activation functions like sigmoid or tanh, the inputs to those activations will be large, and the slope of the activation there is very small, so the gradients will be tiny. Gradient descent will still work, but it will take far longer to reach good results.
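
To see the effect, here is a small illustrative sketch: the derivative of tanh collapses towards zero as the magnitude of its input grows, so large initial weights produce vanishingly small gradients.

import numpy as np

## derivative of tanh(z) is 1 - tanh(z)**2
for z in [0.1, 1.0, 5.0, 10.0]:
    grad = 1 - np.tanh(z)**2
    print(f"z = {z:5.1f}  ->  tanh'(z) = {grad:.6f}")
## z =   0.1  ->  tanh'(z) = 0.990066
## z =   1.0  ->  tanh'(z) = 0.419974
## z =   5.0  ->  tanh'(z) = 0.000182
## z =  10.0  ->  tanh'(z) = 0.000000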

Let us now run both versions of the code (CASE 1 and CASE 2) for 3,000 iterations, i.e. with n = 3000.

Cost vs Number of Iteration for 3000 Iterations (Published by Author).

This shows that CASE 1 and CASE 2 eventually converge, but by the time CASE 1 reaches that number of iterations, the model is likely to have started overfitting the data. So it is better to opt for CASE 2, where we obtain results in less time and the chances of overfitting are also lower.

Summing Up

It is worth examining small steps, because they can make a large difference.

Just the small step of how the weights are initialized can make such a difference, and it is important to note that the biases can still be initialized to zero. By paying attention to this, we can save training time and avoid problems like overfitting. The complete code file can be found here.

