When I started my Data Science path, I ran into many difficulties with my first project. There was a sequence of steps to follow, but I didn’t yet have a clear overview, and I needed a lot of patience and time to understand the key concepts well. One of the first things I learned is that the dataset has to be divided into two separate sets, called the training set and the test set, before applying any model. At first, I couldn’t understand the real reason for this procedure, but after gaining experience on different problems, I began to see the point of this step. When we build a model, we should verify this requirement:
A model must be able to generalize well on unseen data.
This means that the model should make good predictions not only on the training set, but also on the test set. There are two problematic situations we can run into:
- Underfitting: if the model doesn’t make good predictions even on the training set, and therefore has high training errors, we have an underfitting problem. For example, we may fit the data with a linear regression model that is too simple to capture the complex patterns in the data because its underlying assumptions are violated. In short, we run into this issue when the model is too simple to learn the patterns in the training data. It can also happen when the training set is too small.
- Overfitting: it’s the opposite of underfitting, in the sense that the model is too complex and captures even the noise in the data. In this case, we observe a very large gap between the training and test evaluation metrics.
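As a concrete example, the train/test split mentioned above takes a single call with scikit-learn; the toy data below is only a placeholder for a real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset: 100 samples, 5 features
X = np.random.randn(100, 5)
y = np.random.randint(0, 2, size=100)

# Hold out 20% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)
```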
The goal is to reach an optimal model, one that strikes a balance between underfitting and overfitting. In this post, I am going to focus on the overfitting problem, which can be handled with several regularization techniques. These techniques play an important role because they limit the complexity of the model. Below are the most commonly used techniques that help avoid (or at least reduce) overfitting.
Regularization Techniques:
- Early Stopping
- L1 Regularization
- L2 Regularization
- Sparse Coding
- Dropout layer
- BatchNorm layer
1. Early Stopping

Early stopping is a form of regularization used to avoid overfitting when training a neural network. During the first epochs, both the training and test errors decrease. At some point, however, the training loss keeps decreasing while the test loss begins to increase. Early stopping interrupts training at that point, before the gap between the training and test evaluations grows: in other words, it stops the training of the model as soon as the test error starts to increase.
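A minimal, framework-agnostic sketch of the idea could look like the following; `train_one_epoch` and `evaluate` are placeholders for your own training and evaluation routines, and `patience` controls how many epochs without improvement are tolerated before stopping:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=100, patience=5):
    """Stop training once the held-out loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    best_model = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)       # one pass over the training set
        val_loss = evaluate(model)   # loss on the held-out set

        if val_loss < best_loss:
            best_loss = val_loss
            best_model = copy.deepcopy(model)  # remember the best model so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping early at epoch {epoch}")
                break

    return best_model
```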
2. L1 Regularization
L1 Regularization, L2 Regularization and Sparse Coding belong to a category of regularization called Model Regularization. In all these strategies, a term is added to the loss function that penalizes large network weights in order to reduce overfitting. Depending on the technique, some of the weight parameters may be estimated to be exactly zero.
In L1 regularization, we add a scaled version of the L1 norm of the weight parameters to the loss function:

Loss_total = Loss + λ · Σᵢ |wᵢ|
The L1 norm is simply the sum of the absolute values of the parameters, while lambda (λ) is the regularization parameter, which controls how strongly we penalize the weights: it is a non-negative hyperparameter, and the larger its value, the stronger the penalty.
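As an illustration, here is one way to add the L1 penalty to the loss in PyTorch; the tiny linear model and the value of λ are just placeholders:

```python
import torch
import torch.nn as nn

# Minimal sketch with a toy linear model; in practice `model`, `inputs`
# and `targets` would be your own network and data.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
lam = 1e-3  # regularization strength λ

inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

data_loss = criterion(model(inputs), targets)
l1_penalty = sum(p.abs().sum() for p in model.parameters())  # L1 norm of the weights
loss = data_loss + lam * l1_penalty  # this is the loss that gets minimized
loss.backward()
```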
3. L2 Regularization
L2 Regularization, also called Weight Decay, works in a similar way to L1 Regularization. Instead of adding the L1 norm, it adds the squared L2 norm of the weights to the loss function:

Loss_total = Loss + λ · Σᵢ wᵢ²
The L2 norm is the square root of the sum of the squared weight parameters; its square, the sum of the squared weights, is the term actually added to the loss.
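A similar sketch for L2: the penalty can be added to the loss by hand or, in practice, applied through the `weight_decay` argument that most PyTorch optimizers expose:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
lam = 1e-3  # regularization strength λ

inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

# Manual version: add the sum of squared weights to the data loss
data_loss = criterion(model(inputs), targets)
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = data_loss + lam * l2_penalty
loss.backward()

# Most optimizers expose an L2-style penalty as `weight_decay`
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=lam)
```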
4. Sparse Coding
Sparse Coding limits the overall number of units that can be active at the same time:

Loss_total = Loss + λ · Σᵢ |hᵢ|, where hᵢ are the unit activations
At first sight, this can appear similar to L1 Regularization, since the idea is again to sum absolute values. Unlike the L1 strategy, however, the penalty applies to the unit activations, not to the weight parameters!
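One common way to impose this kind of activation penalty is sketched below on a toy autoencoder; the architecture and the value of λ are purely illustrative:

```python
import torch
import torch.nn as nn

# Sparsity (activity) penalty: the L1 term is computed on the hidden
# activations, not on the weights.
encoder = nn.Sequential(nn.Linear(10, 64), nn.ReLU())
decoder = nn.Linear(64, 10)
criterion = nn.MSELoss()
lam = 1e-3  # sparsity strength

x = torch.randn(32, 10)
h = encoder(x)       # hidden activations
x_hat = decoder(h)   # reconstruction

reconstruction_loss = criterion(x_hat, x)
sparsity_penalty = h.abs().mean()  # pushes most units towards zero (inactive)
loss = reconstruction_loss + lam * sparsity_penalty
loss.backward()
```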
5. Dropout Layer

The idea of Dropout is to randomly remove input and hidden units while each training example is processed. Since every node in a fully connected layer is connected to the layer above, randomly dropping units also removes some of these connections during the training of the neural network.
It’s worth noting that Dropout may interfere with other techniques, such as Batch Normalization. For this reason, it’s often better to use only one of these two strategies.
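In practice, adding Dropout is as simple as inserting a dropout layer into the network; the sketch below uses PyTorch with an arbitrary drop probability of 0.5:

```python
import torch
import torch.nn as nn

# A dropout layer that randomly zeroes 50% of the units during training.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # active only in training mode
    nn.Linear(64, 1),
)

x = torch.randn(8, 20)
model.train()   # dropout is applied
out_train = model(x)
model.eval()    # dropout is disabled at test time
out_test = model(x)
```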
6. Batch Normalization
The goal of Batch Normalization is to prevent different batches from having different means and different standard deviations [1]. The trick is to normalize each activation value using the batch mean and batch standard deviation during training. Once training is done, the running statistics computed during training are used at test time.

The main steps to normalize the activations of a layer are:
- Calculate the batch mean and the batch variance of each activation xᵢ, where m is the size of the batch
- Normalize each activation xᵢ by making it have a mean of 0 and a variance of 1. For numerical stability, ε is added to the batch variance.
- Scale and shift each normalized activation using respectively the parameters γ and β, that are learned during training.
- Pass each scaled and shifted activation yᵢ to the next layer (a minimal sketch of these steps is shown below)
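The sketch below writes these steps out explicitly for a single batch and then shows the equivalent built-in PyTorch layer; the shapes and values are only illustrative:

```python
import torch
import torch.nn as nn

# One batch of activations x with shape (batch size m = 32, features = 16)
x = torch.randn(32, 16)
gamma = torch.ones(16)   # learned scale γ
beta = torch.zeros(16)   # learned shift β
eps = 1e-5               # ε for numerical stability

mean = x.mean(dim=0)                         # batch mean of each activation
var = x.var(dim=0, unbiased=False)           # batch variance of each activation
x_hat = (x - mean) / torch.sqrt(var + eps)   # zero mean, unit variance
y = gamma * x_hat + beta                     # scale and shift, passed to the next layer

# The built-in layer does the same during training and also tracks the
# running statistics used at test time.
bn = nn.BatchNorm1d(16)
y_builtin = bn(x)
```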
There are two main advantages to this regularization method: the cost function becomes smoother and better behaved, and, as a consequence, training the neural network is faster.
Final thoughts:
In this post, I provided an overview of the most popular methods for preventing overfitting. I hope this article gave you a more complete understanding of the role of regularization in Machine Learning and Deep Learning models. Personally, I have found these techniques very useful in my Data Science projects. Thanks for reading. Have a nice day!
More related articles:
K-Fold Cross Validation for Machine Learning Models
The Basics of Optimization Algorithms explained in simple words
References:
[1] Generative Adversarial Networks, Coursera, DeepLearning.AI