
Overfitting is the bane of any data scientist.
There is nothing more frustrating than meticulously building a model that scores highly on training data, only for it to deliver substandard performance on test data.
Users can avoid this outcome by using regularization, one of the most effective techniques for deterring overfitting.
Here, we delve into what regularization does and how it can be leveraged to train models to perform at a high level.
The overfitting problem
Prior to covering regularization, it’s worth discussing overfitting and the reason it is so undesirable in machine learning projects.
Overfitting is a term used to describe a model that fits the training data too closely, capturing patterns specific to that data and rendering the model unable to perform well on unseen data despite performing well on the training data.
This may sound odd at first. What does it mean for a model to "learn too much" from the training data?
The answer lies in the fact that these models learn by gauging performance with a loss function, which quantifies how much the predicted values of a model differ from the actual values in the training data. Naturally, the goal of training is to minimize this difference.
However, minimizing the loss alone doesn’t necessarily correspond to better model performance.
After all, a model that only cares about the difference between the actual and predicted values for the training data will also consider unwanted elements like noise and, as a result, incorporate additional complexities that won’t apply to unseen data.
A model trained with this arrangement is unable to generalize and will fail to perform adequately against unseen data, which is the whole purpose of the model.
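To make the idea of a loss function concrete, here is a minimal sketch of one common choice, the mean squared error used in regression. The function name and example values are illustrative, not from the article:

```python
import numpy as np

# Mean squared error: the average squared difference between
# the model's predictions and the actual training values
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 7.0])  # actual values
y_pred = np.array([2.5, 5.0, 8.0])  # model predictions

loss = mse_loss(y_true, y_pred)
print(loss)  # (0.25 + 0 + 1) / 3 ≈ 0.4167
```

A model that drives this quantity toward zero on the training data alone can end up memorizing noise, which is exactly the failure mode described above.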
Regularization
So, how does regularization help avoid the loathsome overfitting problem?
Simply put, it adds a "penalty" to the loss function.
Regularization ensures that the loss function not only considers the difference between the predicted and actual values but also considers how much importance is being assigned to the features.
With this technique, users can limit model complexity and train models that can make accurate predictions with unseen data.
Of course, there are multiple ways to penalize a model for incorporating too much complexity.
The two main regularization techniques are Lasso regularization (also known as L1 regularization) and Ridge regularization (also known as L2 regularization).
Overall, Lasso regularization penalizes a model based on the sum of the absolute values of its features’ coefficients, while Ridge regularization penalizes a model based on the sum of the squared values of its features’ coefficients.
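The two penalty terms can be sketched in a few lines. The coefficient vector and the regularization strength alpha below are made-up values for illustration:

```python
import numpy as np

# Hypothetical coefficient vector of a trained linear model
coef = np.array([0.5, -2.0, 0.0, 3.0])

alpha = 0.1  # regularization strength (assumed value)

# Lasso / L1: penalty grows with the sum of absolute coefficient values
l1_penalty = alpha * np.sum(np.abs(coef))

# Ridge / L2: penalty grows with the sum of squared coefficient values
l2_penalty = alpha * np.sum(coef ** 2)

print(l1_penalty)  # 0.1 * (0.5 + 2.0 + 0.0 + 3.0) = 0.55
print(l2_penalty)  # 0.1 * (0.25 + 4.0 + 0.0 + 9.0) = 1.325
```

Either penalty is added to the original loss, so large coefficients now cost the model something; with L1 this pressure can drive some coefficients exactly to zero.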
Implementing Regularization
Now that we’ve explored the benefits of regularization, how do we utilize this technique in our machine learning models?
Well, many algorithms in Python’s machine learning packages already implement regularization by default. For instance, the linear classifiers in the Scikit-learn module use L2 regularization if the user doesn’t explicitly assign a regularization technique.
So, since regularization is already an embedded feature in machine learning algorithms, are users properly utilizing this technique?
The short answer is: probably not.
This is because the default values assigned to the model parameters are rarely optimal.
After all, different machine learning tasks vary in terms of the type and strength of regularization needed to ensure that the trained model performs well on unseen data.
This raises the question: how does one determine the best type of regularization to use when training a model with the data of interest?
Case Study
To demonstrate how users can determine and implement the most effective regularization approach, we can build a linear classifier using a toy dataset from the Scikit-learn module.
First, we prepare the data, creating train and test sets.
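A minimal sketch of this preparation step, assuming the breast cancer toy dataset from Scikit-learn (the article does not name the dataset, so this choice and the split parameters are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load a toy dataset (breast cancer is assumed here for illustration)
X, y = load_breast_cancer(return_X_y=True)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale the features so the regularization penalty treats them comparably
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

Scaling matters here: since the penalty acts on coefficient magnitudes, features on very different scales would be regularized unevenly.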
Most linear classifiers in the Scikit-learn module allow users to modify the regularization technique with the penalty and C parameters.
The penalty parameter refers to the regularization technique incorporated by the algorithm.
The C parameter defines the strength of regularization used by the classifier. The value assigned to this parameter is inversely proportional to the strength of the regularization. In other words, the larger the value of C, the weaker the regularization.
For this case study, we will build a model with the LinearSVC class.
By default, the model assigns the penalty and C parameters the values ‘l2’ and 1.0, respectively. Let’s see how well a baseline model performs against the test set with these settings, based on the f1-score metric.
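A sketch of the baseline model, again assuming the breast cancer dataset and split used above; the exact score will differ from the article's figure, since the original dataset is not specified:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Assumed setup: breast cancer toy dataset, split and scaled
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Baseline LinearSVC: the defaults are penalty='l2' and C=1.0,
# written out explicitly here for clarity
baseline = LinearSVC(penalty='l2', C=1.0, random_state=42)
baseline.fit(X_train, y_train)

# Evaluate on the held-out test set with the f1-score metric
score = f1_score(y_test, baseline.predict(X_test))
print(score)
```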

The baseline model yields an f1-score of about 0.9012 against the testing set. The use of regularization played a significant role in enabling the model to perform well against the testing data.
That being said, the default values assigned to the model’s hyperparameters may not be optimal for the given dataset. It is possible that a different type or strength of regularization will result in greater performance.
Thus, it would be beneficial to fine-tune the model by considering other combinations of values for the penalty and C parameters and identifying the ones that yield the best performance. This can be achieved by carrying out a hyperparameter tuning procedure.
For this case, the Scikit-learn module offers the GridSearchCV class, which allows users to test different combinations of hyperparameters. One of the best features of the tool is its built-in cross-validation splitting strategy that further decreases the likelihood of overfitting.
Let’s use the GridSearchCV to determine the best hyperparameters for the model.
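A sketch of the search, under the same assumed dataset; the candidate grid values below are illustrative choices, not the article's exact grid:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Assumed setup: breast cancer toy dataset, split and scaled
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_train = StandardScaler().fit_transform(X_train)

# Candidate penalty/C combinations (illustrative grid);
# LinearSVC supports penalty='l1' only with dual=False
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
}
grid = GridSearchCV(
    LinearSVC(dual=False, random_state=42),
    param_grid,
    scoring="f1",  # rank candidates by f1-score
    cv=5,          # built-in 5-fold cross-validation
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```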

The optimized model still uses L2 regularization. However, it also assigns the C parameter the value 0.0001. This means that the model performance improves when using regularization of a higher strength than that of the baseline model.
Let’s see how a model with these parameters performs against the testing set.
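A sketch of the refit, plugging in the tuned values the article reports (penalty='l2', C=0.0001); the dataset and resulting score are assumptions, so the number printed will differ from the article's:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Assumed setup: breast cancer toy dataset, split and scaled
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Refit with the tuned hyperparameters: stronger regularization (C=0.0001)
tuned = LinearSVC(penalty='l2', C=0.0001, dual=False, random_state=42)
tuned.fit(X_train, y_train)

score = f1_score(y_test, tuned.predict(X_test))
print(score)
```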

The optimized model yields an f1-score of 0.9727 against the testing set, which is a considerable improvement over the baseline model.
Key Takeaways

Regularization helps combat overfitting by limiting model complexity.
Users can take full advantage of this technique by understanding their model’s hyperparameters and selecting the type and strength of regularization best suited to the task.
If you’re in doubt about whether or not you are making full use of regularization in your machine learning task, feel free to explore the documentation of the model of interest and see what tools are provided with regard to optimizing regularization.
I wish you the best of luck in your Data Science endeavors!