In the last post, we coded a deep dense neural network, but to make it a better and more complete network, we need it to be more robust and resistant to overfitting. The most commonly applied methods in deep neural networks, as you might have heard, are regularization and dropout. In this article, we will understand these two methods and implement them in Python.
(We will directly use the functions created in the last post in what follows; if some of the code confuses you, you may need to check the previous post.)

Regularization
Regularization helps to prevent the model from overfitting by adding an extra penalty term at the end of the loss function:

$$L = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log a^{(i)} + \big(1-y^{(i)}\big)\log\big(1-a^{(i)}\big)\Big] + \frac{\lambda}{2m}\sum_{l}\big\lVert W^{[l]}\big\rVert^2$$

where m is the batch size. The regularization shown here is called L2 regularization: while L2 applies the square of the weights, L1 regularization applies the absolute value, which has the form |W|.
The appended extra term enlarges the loss when there are either too many weights or weights that grow too large, and the adjustable factor λ controls how much we want to penalize the weights.
1. Why does penalizing the weights help to prevent overfitting?
An intuitive understanding is that in the process of minimizing the new loss function, some of the weights decrease close to zero, so the corresponding neurons have very little effect on our results, as if we were training a smaller neural network with fewer neurons.
Forward
In the forward process, we need only to change the loss function.
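As a reference, here is a minimal sketch of what the regularized cost could look like in numpy. The function name compute_cost_with_regularization and the layout of the parameters dictionary ("W1", "b1", ...) are assumptions for illustration and may differ from the code in the repo.

```python
import numpy as np

def compute_cost_with_regularization(A_last, Y, parameters, lambd):
    """Binary cross-entropy cost plus the L2 penalty term (illustrative sketch)."""
    m = Y.shape[1]
    # Standard cross-entropy part of the loss
    cross_entropy = -np.sum(Y * np.log(A_last) + (1 - Y) * np.log(1 - A_last)) / m
    # L2 penalty: sum of squared weights over every layer, scaled by lambda / (2m)
    weight_squares = sum(np.sum(np.square(W))
                         for name, W in parameters.items() if name.startswith("W"))
    return cross_entropy + lambd * weight_squares / (2 * m)
```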
Backward
The backward propagation of L2 regularization is actually straightforward: we only need to add the gradient of the L2 term, which is (λ/m) W^{[l]} for each layer's weight matrix.
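Below is a sketch of the linear part of the backward step with this extra term added; the function name and argument layout are assumed for illustration and are not necessarily the repo's exact code.

```python
import numpy as np

def linear_backward_with_regularization(dZ, A_prev, W, lambd):
    """Backward step for one layer with the L2 gradient term added (sketch)."""
    m = A_prev.shape[1]
    # The extra (lambd / m) * W comes from differentiating (lambd / 2m) * ||W||^2
    dW = np.dot(dZ, A_prev.T) / m + (lambd / m) * W
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db
```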

Training
As usual, we test our model on a binary classification case and compare the models trained with and without regularization.
Model without Regularization

Model with Regularization

Actually, as the number of iterations goes up, the model without regularization keeps overfitting until it causes an error in the divide operation, presumably because the result A in the forward process gets too close to 0.
In contrast, the model with regularization does not overfit. For the complete implementation and training process, please check my Github Repo.
Dropout
Dropout prevents overfitting by randomly shutting down some output units.
![Source: https://github.com/enggen/Deep-Learning-Coursera](https://towardsdatascience.com/wp-content/uploads/2020/11/12uIGSALpIRNxE8rU5FUtng.png)
In the process above, some units on layer [2] are randomly muted in each iteration, meaning fewer neurons are working in the forward process, so the overall structure of the neural network is simplified.
Meanwhile, the trained model becomes more robust: since the model can no longer rely on any specific neurons (they could be muted at any point), all of the other neurons need to learn during training.
Forward
You can think of dropout as adding an extra layer to the forward process.
In the previous posts, we had the forward equations as follows.
Without Dropout

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g\big(Z^{[l]}\big)$$

where g is the activation function. Now, with dropout, an extra layer is applied to A^{[l]}.
Dropout

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = D^{[l]} \odot g\big(Z^{[l]}\big)$$

where D^{[l]} is the dropout layer. The key factor in the dropout layer is the keep_prob parameter, which specifies the probability of keeping each unit. Say keep_prob = 0.8: then we have an 80% chance of keeping each output unit as it is and a 20% chance of setting it to 0.
The implementation amounts to adding an extra mask to the result A. Assume we have an output A^{[l]} with four elements, as follows:

$$A^{[l]} = \big(a_1, a_2, a_3, a_4\big)^{T}$$

If we want to mute the third unit while keeping the rest, all we need is a matrix of the same shape and an element-wise multiplication:

$$D^{[l]} = \big(1, 1, 0, 1\big)^{T}, \qquad A^{[l]} \odot D^{[l]} = \big(a_1, a_2, 0, a_4\big)^{T}$$
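A quick numpy illustration of this element-wise masking (the activation values here are made up):

```python
import numpy as np

A = np.array([[0.8], [1.5], [0.3], [2.1]])  # hypothetical activations for four units
D = np.array([[1], [1], [0], [1]])          # mask that mutes the third unit
print(A * D)                                # [[0.8] [1.5] [0. ] [2.1]]
```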
Forward
Some of the modules below are pre-imported; to check the complete code, please go to my Github Repo.
Here we have D initialized with the same shape as A, and we convert it to a matrix of 0s and 1s based on keep_prob.
Note that after dropout, the result A needs to be rescaled! Because some of the neurons are muted in the process, the remaining neurons need to be augmented correspondingly in order to match the expected value.
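A minimal sketch of this forward masking step, assuming inverted dropout as described above (the function name dropout_forward is illustrative, not the repo's exact code):

```python
import numpy as np

def dropout_forward(A, keep_prob):
    """Mask the activations A with an inverted-dropout layer (illustrative sketch)."""
    # D has the same shape as A; each entry is 1 with probability keep_prob, else 0
    D = (np.random.rand(*A.shape) < keep_prob).astype(int)
    A = A * D              # mute the dropped units
    A = A / keep_prob      # rescale so the expected value of A stays the same
    return A, D            # D is cached for the backward pass
```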
Backward
The backward process applies the same mask D to the corresponding dA.
The backward propagation equations remain the same as the ones introduced in the deep dense net implementation. The only difference lies in the matrix D: except for the last layer, every layer with dropout applies its corresponding mask D to dA.
Note that in back propagation, dA also needs to be rescaled by the same keep_prob factor.
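A matching sketch for the backward step (again with illustrative names, not necessarily the repo's exact code):

```python
def dropout_backward(dA, D, keep_prob):
    """Apply the cached mask D to dA and rescale (illustrative sketch)."""
    dA = dA * D            # gradients of the muted units become zero
    dA = dA / keep_prob    # match the rescaling applied in the forward pass
    return dA
```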
For the training and evaluation part with dropout, if you are interested, please check my Github link above.
Conclusion
Both regularization and dropout are widely adopted methods for preventing overfitting: regularization achieves this by adding an extra penalty term at the end of the loss function, while dropout randomly mutes some neurons in the forward process in order to simplify the network.