
Why Regularization Works

The intuition behind regularization

Photo by Franki Chamaki on Unsplash

When we train a machine learning model or a neural network, we sometimes find that the model performs exceptionally well on the training data but fails to give the desired output on testing or validation data. One common cause of this gap is the large weights learned during training, which result in overfitting. Large weights make the model unstable: a small variation in the test data leads to a large error. They also cause problems in the gradient descent step of training. To discourage these large weights, we regularize them toward smaller values.
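As a concrete illustration (the function name and the `lam` value here are my own, not from the article), L2 regularization adds a penalty proportional to the squared weights to the loss, so gradient descent is nudged toward smaller weights:

```python
import numpy as np

def l2_regularized_loss(y_true, y_pred, weights, lam=0.01):
    """Squared error plus an L2 penalty that grows with the size of the weights."""
    mse = np.mean((y_true - y_pred) ** 2)
    penalty = lam * np.sum(weights ** 2)  # large weights are penalized quadratically
    return mse + penalty
```

Because the penalty grows with the magnitude of the weights, the same prediction error costs more when it is achieved with larger weights.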

But why does regularizing the weights to a lower value work?

Let’s explore the intuition behind why smaller weights are actually desirable.

First, Some Maths

The Gradient descent algorithm updates the weight w.r.t. the error made by the model.

Gradient Descent Function - Image by author

The derivative part in the above equation represents the slope of the error function or change in error function with respect to the weight.
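The update rule shown above can be sketched in a couple of lines (the function name and the default learning rate are my own):

```python
def gradient_descent_step(w, dE_dw, learning_rate=0.1):
    """One weight update: move the weight against the slope of the error function."""
    return w - learning_rate * dE_dw
```

A positive slope pushes the weight down, a negative slope pushes it up, so the weight always moves toward lower error.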

This derivative is usually calculated using the chain rule. Suppose we are using a squared error function to calculate the error, and applying the sigmoid activation function to our linear output to obtain the final output.

Linear Function with H as input - Image by author
Sigmoid Activation Function - Image by author
Squared Error Function - Image by author

By chain rule we will need to calculate:

  1. The slope of the error function w.r.t. the activation function output.
  2. The slope of activation function w.r.t. the linear output.
  3. Finally, the slope of the linear function w.r.t. the weight.
Chain Rule - Image by author
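The three steps above can be sketched in code (a minimal sketch assuming the setup in the equations: linear output `y = w * h`, sigmoid activation, and squared error `E = (target - out)**2`; the function names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_via_chain_rule(h, w, target):
    """dE/dw for E = (target - out)**2 with out = sigmoid(w * h)."""
    y = w * h                           # linear output
    out = sigmoid(y)                    # activation output
    dE_dout = -2.0 * (target - out)     # 1. slope of error w.r.t. activation output
    dout_dy = out * (1.0 - out)         # 2. slope of sigmoid w.r.t. linear output
    dy_dw = h                           # 3. slope of linear function w.r.t. weight
    return dE_dout * dout_dy * dy_dw
```

Multiplying the three partial slopes gives the full derivative used in the update step.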

The value of the weights of the model plays a very crucial role in the weight update step of the gradient descent.

Let’s understand this by a simple example.


The Main Play

The sigmoid function is a very common activation function used to determine the output of the model.

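A minimal sketch of such a plotting helper, matching the `plot_sigmoid(m)` calls below (the x range, sample count, and return value are my own assumptions, chosen so the small slopes used below produce visibly different curves):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_sigmoid(m, c=0.0):
    """Plot sigmoid(m*x + c) and return the sampled curve."""
    x = np.linspace(-10_000, 10_000, 1_000)   # wide range so tiny slopes still matter
    y = 1.0 / (1.0 + np.exp(-(m * x + c)))    # sigmoid of the linear output
    plt.plot(x, y)
    plt.xlabel("x")
    plt.ylabel("sigmoid(m*x + c)")
    plt.show()
    return x, y
```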
The `plot_sigmoid` function plots a simple sigmoid of a linear output of the form y = mx + c for given values of x. As usual, ‘m’ represents the slope of the line, or, as we say, the weight of our model.

Now we will see how different values of the slope affect our sigmoid function.

plot_sigmoid(0.001)

Output 1:

Output 1 - Image by author
plot_sigmoid(0.0005)

Output 2:

Output 2 - Image by author
plot_sigmoid(0.0002)

Output 3:

Output 3 - Image by author
plot_sigmoid(0.0001)

Output 4:

Output 4 - Image by author

From the above four plots, we can observe that as we decrease the slope (weight) in our linear function, the steepness of the sigmoid curve decreases as well.

For Output 1, we can see that the value of the sigmoid function changes rapidly from 0 to 1. Hence, at most points the function is saturated near 0 or 1, where its slope is almost zero, while near the transition the slope is very large.

This causes trouble in the gradient descent step. In the saturated regions the gradient is almost zero, so the weights barely change, learning slows down, and the model may never reach the minimum of the error function. Near the steep transition, by contrast, the gradient can be so large that an update overshoots the minimum entirely.

In the other three outputs, we successively decrease the value of the slope and observe that the steepness of the output function decreases too. The value of the sigmoid function now changes gradually from 0 to 1, avoiding the abrupt change in slope. This makes the gradient descent step much better behaved: the weights are updated in a more uniform manner, and the error function can reach its minimum.
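We can put a rough number on this saturation effect (the helper names, x range, and tolerance below are my own choices, matching the plot range assumed earlier): measure what fraction of the plotted range has a near-zero derivative for a steep versus a gentle sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def saturated_fraction(m, x_range=10_000, tol=1e-6):
    """Fraction of the plotted x range where sigmoid(m*x) has a near-zero slope."""
    x = np.linspace(-x_range, x_range, 100_001)
    s = sigmoid(m * x)
    grad = m * s * (1.0 - s)  # analytic derivative of sigmoid(m*x)
    return float(np.mean(grad < tol))
```

With these settings, a sizable fraction of the range is saturated for the steep curve (m = 0.001), while essentially none of it is for the gentle curve (m = 0.0001), which is exactly why the gentler curve gives gradient descent useful slopes everywhere.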

It is now evident why large weights need to be penalized and brought back to smaller values while training the model, using a suitable regularization technique.

