In-depth analysis of the regularized least-squares algorithm over the empirical risk minimization

Jaime Dantas
Towards Data Science
11 min read · Oct 29, 2020


Image by Author — Thunder Bay, Canada

This article will introduce key concepts about Regularized Loss Minimization (RLM) and Empirical Risk Minimization (ERM), and it’ll walk you through the implementation of the least-squares algorithm using MATLAB. The models obtained using RLM and ERM will then be compared and discussed against each other.

We’ll work through a polynomial curve-fitting problem, looking for the polynomial that best fits the data. The least-squares algorithm will be implemented step by step in MATLAB.

By the end of this post, you’ll understand the least-squares algorithm and be aware of the advantages and downsides of RLM and ERM. Additionally, we’ll discuss some important concepts about overfitting and underfitting.

Dataset

We’ll use a simple one-input dataset with N = 100 data points. This dataset was originally proposed by Dr. Ruth Urner on one of her assignments for a machine learning course. In the repository below, you’ll find two TXT files: dataset1_inputs.txt and dataset1_outputs.txt.

These files contain the input and output vectors. Using MATLAB, we’ll plot these data points in a chart. In MATLAB, I imported them via Home > Import Data. Then, I created the following script for plotting the data points.
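A minimal sketch of such a plotting script (equivalently to the Import Data tool, the files can be read with load; the column vectors x and t are my own variable names):

```matlab
% Load the input and output vectors from the dataset files
x = load('dataset1_inputs.txt');   % N x 1 vector of inputs
t = load('dataset1_outputs.txt');  % N x 1 vector of outputs

% Plot the data points
figure;
plot(x, t, 'o');
xlabel('x');
ylabel('t');
title('Dataset');
```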

You should see a chart like the one shown below.

Dataset

Least squares

The least-squares method is used to solve the ERM problem in linear regression with the squared loss. I recommend this quick article before moving on to the implementation itself.

Empirical Risk Minimization (ERM)

ERM is a widely known concept in machine learning, and I recommend going over this explanation about ERM before proceeding to the actual implementation. ERM is used to assess the performance of learning algorithms, and we can solve the ERM optimization problem by finding a vector w that minimizes the formula below [1].
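In matrix notation, the empirical square loss we want to minimize can be written as follows (any constant scaling, such as 1/N or 1/2N, leaves the minimizer unchanged; I use 1/2N here):

$$\hat{L}(\mathbf{w}) \;=\; \frac{1}{2N}\,\lVert X\mathbf{w} - \mathbf{t} \rVert_2^2$$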

In the formula above, X is the design matrix and t is the vector of outputs. We want to find the best fit for our polynomial fitting problem. In order to do so, we will calculate the polynomials of order W = 1, 2, …, 30 and analyze the empirical square loss to see what polynomial order best fits our data. The design matrix X for our problem is given by:
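For a polynomial of order W, each row of X holds the powers of one input value, from x⁰ up to x^W:

$$X = \begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^W \\
1 & x_2 & x_2^2 & \cdots & x_2^W \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_N & x_N^2 & \cdots & x_N^W
\end{bmatrix}$$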

We can find the polynomials w by solving the linear equation below [1].
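Setting the gradient of the empirical square loss to zero gives the normal equation:

$$X^\top X\,\mathbf{w} = X^\top \mathbf{t}$$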

The equation above has a unique solution when X’X is invertible [1]. Again, if you want to know why this is the case, you can read this explanation about the least-squares method. The solution is given by the equation below.
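$$\mathbf{w} = \bigl(X^\top X\bigr)^{-1} X^\top \mathbf{t}$$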

This is where MATLAB comes in handy. We need to solve the linear equation above to find and test our models.

In order to solve this equation in MATLAB, instead of using an efficient built-in solver for linear systems, we’ll implement the operations manually by multiplying and inverting the matrices. It is important to point out that this is not the ideal approach, since explicit inversion is less efficient (and less numerically stable) than MATLAB’s built-in solvers. To compute the solution, I first calculated the design matrix for W = 1 and computed the first polynomial.
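A sketch of this first step, continuing the script above and solving the normal equation by explicit inversion:

```matlab
% Design matrix for a polynomial of order W = 1: columns [1, x]
X1 = [ones(length(x), 1), x];

% First polynomial: normal equation solved by explicit inversion
% (w1 = X1 \ t would be the more efficient and stable alternative)
w1 = inv(X1' * X1) * (X1' * t);
```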

Then, I created a loop that calculates the remaining wᵢ and stores them in the cell array called polynomials_wi. The 30 design matrices are also stored in the cell array designmatrix.
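One way to write that loop, continuing from the snippet above (the implicit-expansion form x .^ (0:W) assumes MATLAB R2016b or newer):

```matlab
polynomials_wi = cell(30, 1);   % fitted coefficient vectors for W = 1..30
designmatrix   = cell(30, 1);   % design matrices for W = 1..30
polynomials_wi{1} = w1;
designmatrix{1}   = X1;

for W = 2:30
    X = x .^ (0:W);                              % columns x.^0, x.^1, ..., x.^W
    designmatrix{W} = X;
    polynomials_wi{W} = inv(X' * X) * (X' * t);  % normal equation, explicit inversion
end
```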

Now, we need to compute the empirical square loss for all the wᵢ polynomials in polynomials_wi. The number of data points for our problem is N = 100. Continuing the script, I created a loop that computes the empirical square loss for each polynomial wᵢ. I also printed the order of the polynomial that gives us the smallest empirical square loss.
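A sketch of that loop, reusing the cell arrays from above (the 1/(2N) scaling of the loss is my choice; any constant factor gives the same ranking of the orders):

```matlab
N = 100;            % number of data points
E = zeros(30, 1);   % empirical square loss for each polynomial order

for W = 1:30
    X = designmatrix{W};
    w = polynomials_wi{W};
    E(W) = (1 / (2 * N)) * sum((X * w - t) .^ 2);
end

[minLoss, bestW] = min(E);
fprintf('Smallest empirical square loss %.4f at order W = %d\n', minLoss, bestW);
```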

The output of our algorithm is shown in the figure below.

The output of the ERM.m script

So, the polynomial of order 21 has the smallest empirical square loss. Now, let’s analyze the plot of the empirical square loss for ERM of all polynomials in the figure below. For this analysis, I plotted the empirical square loss vector E from W = 1 to W = 30.

Empirical Square Loss in reduced scale

Looking at the empirical square loss for ERM, we can see that the polynomial W = 21 is indeed the best fit for this dataset by this criterion. We can also see that after W = 6 the empirical loss becomes almost stable, decreasing only slightly up to W = 21, and that after W = 23 our models begin to overfit in such a way that the empirical loss skyrockets. When we execute the command [ERM,W] = max(E) in the Command Window of MATLAB, as shown in the figure below, we see that for order 27 the square loss was over 395.

Maximum Empirical Square Loss

The figure below shows the plot of the polynomial of order W = 21 against the dataset.

The polynomial of Order 21 and Dataset

After analyzing the curve itself in the figure above, we can see that although this polynomial of order W = 21 has the smallest empirical loss, it overfits the data. So, using the empirical loss as the sole indicator for choosing the best fit may not be the best approach, since it can lead us to pick an overfitting model.

Regularized Loss Minimization (RLM)

Now, let’s repeat the previous step using regularized least-squares polynomial regression. I recommend going over this explanation about RLM before going through this part. For RLM, we use a regularizer λ to calculate the vector w. For regularized least squares regression, we can calculate w by using the equation below [1].
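Minimizing the empirical square loss plus the penalty term λ‖w‖² leads to the regularized normal equation below (any constant scaling of the data-fit term can be absorbed into λ):

$$\bigl(X^\top X + \lambda I\bigr)\,\mathbf{w} = X^\top \mathbf{t}$$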

Note that we use the regularizer λ multiplied by an identity matrix of the same order as X’X.

If we isolate w, we end up with the following solution:
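$$\mathbf{w} = \bigl(X^\top X + \lambda I\bigr)^{-1} X^\top \mathbf{t}$$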

Again, I will implement the operations manually in MATLAB to solve this linear equation. The design matrix for this case remains the same as defined for ERM.

Importance of the Regularizer λ

Let’s first analyze the consequences of having a regularizer in the computation of our machine learning models. In order to do this, let’s analyze the polynomial of order W = 30 and see how it behaves when using different values of λ.

We will do this analysis for values of λ such that ln(λ) = −1, −2, …, −30. Thus, the value of λ will be given by
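$$\lambda_i = e^{-i}, \qquad i = 1, 2, \ldots, 30$$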

First, I computed the design matrix for the polynomial of order 30.

Similarly to what I did for ERM, I created a loop to compute the wᵢ for all values of λ. I also calculated the empirical square loss for RLM, which is defined in the same way as for ERM.

For each iteration, the value of λ is updated.
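A compact sketch of this computation, assuming x and t are loaded as before and using the same explicit inversion (the variable names w_rlm and E_rlm are my own):

```matlab
W = 30;
X30 = x .^ (0:W);               % design matrix for the order-30 polynomial
N = 100;                        % number of data points

w_rlm = cell(30, 1);            % RLM coefficient vectors, one per lambda
E_rlm = zeros(30, 1);           % empirical square loss for each lambda

for i = 1:30
    lambda = exp(-i);           % ln(lambda) = -i
    % Regularized least-squares solution by explicit inversion
    w_rlm{i} = inv(X30' * X30 + lambda * eye(W + 1)) * (X30' * t);
    % Empirical square loss, same definition as for ERM
    E_rlm(i) = (1 / (2 * N)) * sum((X30 * w_rlm{i} - t) .^ 2);
end

[minLoss, bestI] = min(E_rlm);
fprintf('Smallest empirical square loss %.4f at i = %d\n', minLoss, bestI);
```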

The polynomial of order 30 with regularizer

The output of our algorithm is shown in the figure below.

The output of the RLM.m script

So, the polynomial of order 30 where ln(λ) = −30, i.e. i = 30, is the best fit for this data. The figure below shows the plot of the polynomial of order W = 30 with i = 30 against the dataset.

The polynomial of order 30 with i = 30 and dataset

As with ERM, our RLM model of order 30 and i = 30 also overfits the data.

ERM vs RLM

Now, let’s compare the results we obtained for ERM and RLM. When we analyze the equation for RLM, we can conclude that for large values of λ, i.e. small values of i, we have a larger empirical square loss. This means that for small values of i we get an underfitting model. On the other hand, for small values of λ, i.e. large values of i, we get a smaller empirical square loss. This behaviour, however, can lead to an overfitting model. We can note this in the right graph of the figure below. In addition, large values of λ also help to reduce the overfitting problem of polynomials of large orders. This can be seen in the right chart of the figure below.

ERM for polynomials of order W = 2 (underfitting) and W = 27 (overfitting) on the left, and RLM polynomial of order 30 with i = 2 and i = 26. (Plot from the RLM_ERM.m script)

When it comes to the ERM curve (represented in orange in the Empirical Square Loss figure), we have a similar outcome as well. For this case, however, we are varying the order of the polynomial up to W = 30. For polynomials with small orders, we get an underfitting model with a large empirical square loss, whereas for large orders (W > 23 in this case) we can overfit our data even though we get a small empirical square loss. This can be seen in the left chart in the figure above. Note that the regularizer in the RLM helped to reduce the overfitting for polynomials of larger orders. Therefore, analyzing the chart of the empirical square loss for both solutions proved to be paramount for choosing the model that best fits our dataset.

Additionally, if we analyze the regularized least squares in the limit λ→0, i.e. the limit i→∞, we see that the regularization term of the RLM equation disappears, making RLM the same as ERM. This is evidenced when we analyze the RLM polynomial of order 10 with i = 12. Since we have a very small value of λ in this case (about 0.000006144), we can notice that the curve of this polynomial is approximately identical to the ERM polynomial of order 10.

The ERM polynomial of order 10 and the RLM polynomial of order 10 and i = 12. Note that the curves overlap with each other. (Plot from the RLM_ERM.m script)

As a final point, the model of order W = 30 and i = 30 for RLM has a significantly smaller empirical square loss than the ERM model of order W = 30 (0.0453 and 4.7873, respectively). This supports the conclusion that the regularizer helps to reduce the overfitting. Additionally, the polynomial of degree W = 30 and i = 30 for RLM has approximately the same empirical square loss as the degree W = 21 polynomial for ERM.

After all, which model is the best?

Now that we have all this data in hand, we can choose the best model for our problem. The goal is always to find a polynomial of small order so we don’t overfit our data. In addition, the order has to be large enough so we don’t underfit it.

Let’s plot and look at the curves for ERM and RLM for the orders W = 1, 5, 10, 20, 30. This time, we’ll fix the regularizer at λ = 0.0025. The script Visualization.m was used for plotting both curves.
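A possible sketch of such a script, refitting both models for each order on a dense plotting grid (the original Visualization.m may be organized differently; xs, w_erm and w_reg are my own names):

```matlab
lambda = 0.0025;
orders = [1 5 10 20 30];
xs = linspace(min(x), max(x), 500)';   % dense grid for smooth curves

figure;
hold on;
plot(x, t, 'k.');                      % data points
for W = orders
    X  = x  .^ (0:W);                  % design matrix on the data
    Xs = xs .^ (0:W);                  % design matrix on the plotting grid
    w_erm = inv(X' * X) * (X' * t);                        % ERM fit
    w_reg = inv(X' * X + lambda * eye(W + 1)) * (X' * t);  % RLM fit
    plot(xs, Xs * w_erm, '--');        % dashed: ERM
    plot(xs, Xs * w_reg, '-');         % solid: RLM
end
hold off;
title('ERM (dashed) and RLM (solid) polynomials');
```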

Polynomials of order 1, 5, 10, 20 and 30 for ERM and the dataset on the left, and polynomials of order 1, 5, 10, 20 and 30 for RLM and the dataset on the right

When we analyze the graphs for ERM and RLM, we can definitely notice the difference between them. The ERM models of larger orders have wild variations and adapt themselves too much to the training data. This means that, as the order of the models increases, the risk of overfitting also increases. This is especially true when we look at the polynomials of order W = 10. While the RLM polynomial doesn’t have huge variations and variance, the ERM one adapts itself to the data in a way that leads to overfitting. As a result, the empirical loss for the ERM model looks smaller in comparison to RLM. This is easily observed when we compare both empirical losses.

In order to better visualize this behaviour, we can analyze the bar chart below, where the empirical loss for both ERM and RLM for the polynomials of order W = 1, 5, 10, 20 and 30 is shown. Looking at it, we can see that, with the exception of the W = 30 polynomial, the empirical square loss for ERM is smaller than for RLM.

Empirical Square Loss for RLM with regularizer λ = 0.0025 and ERM for W = 1, 5, 10, 20 and 30

The regularizer in the RLM models plays an important role when it comes to representing the data. The models with the regularizer generalize better because they don’t overfit the data as the ERM models do. This is especially evident in the figure below. On the left-hand side, we have the empirical square loss for RLM with different values of λ. Note that, as we decrease the value of λ, we also decrease the empirical loss for W > 6.

Empirical Square Loss for RLM with different regularizers on the left in reduced scale, and polynomial of order 10 for RLM with different regularizers on the right. (Plot from the RML_For_W_Risk.m and RML_For_W_Poly.m scripts)

On the right side of the figure above we can see the polynomial of order W= 10 for different values of λ. As we decrease the value of λ, we increase the adaptability of our model. Thus, very small values of λ can lead to overfitting while large values of λ may cause underfitting.

Given these facts, the model that best fits this dataset is the RLM polynomial of order W = 5. The polynomial is presented below.

0.6721 − 0.5555x − 6.5428x² − 0.9711x³ + 6.5883x⁴ + 0.7647x⁵

This is because this polynomial has a small order and a small empirical loss. Also, when we analyze the plots of the curves for ERM and RLM, we can see that this model doesn’t overfit the data. In addition, the ERM polynomial of order W = 5 also seems to fit the data without overfitting.

The polynomial of order W = 5 with λ = 0.0025

Conclusion

We saw that using RLM can reduce the overall complexity of our model and avoid overfitting. However, since we now have a new parameter λ to determine, RLM models are more complex to compute.

Another important tool we should use in our analysis when selecting a model is the curve of the empirical square loss. We can also conclude that the model with the minimum empirical square loss is not always the best solution for our data. Furthermore, finding a balance between the parameter λ and the order of the polynomial we are predicting proved to be paramount in this analysis.

Underfitting and overfitting are extremely common problems data scientists face, and finding the ideal solution for a given problem requires a detailed analysis of the candidate models.

Thanks for reading!

About Me

I’m an M.A.Sc. student at York University and a software engineer at heart. Over the past decade, I’ve worked in several industries in areas such as software development, cloud computing and systems engineering. Currently, I’m doing research on cloud computing and distributed systems.

You can check my work on my website if you want to.

References

[1] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. DOI: 10.1017/CBO9781107298019. URL: https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf
