
Building machine learning intuition through play

Why do we minimize the sum of the squared residuals in linear regression?

Using maximum likelihood to show why we minimize the sum of squared residuals in linear regression

Photo by Allen Taylor on Unsplash

Summary

While trying to understand machine learning's mathematical concepts more deeply, I've found that it's pretty easy to get lost in the equations. My goal is to demonstrate how you can experiment with plots and basic math to build machine learning intuition. Specifically, I'd like to show why we minimize the sum of the squared residuals in linear regression through a simple example.

Prerequisites

This article assumes you understand the basics of maximum likelihood and will provide additional intuition around the concept as it relates to linear regression. If you need a primer, check out this video: StatQuest – Maximum Likelihood.

Let’s get started

In linear regression, we use the input data to estimate the coefficients that maximize the likelihood of the observed data under a Gaussian error model. The model takes the following form, with the error terms (e) independently and identically distributed (iid) according to a normal distribution (N) with mean 0 and variance sigma².
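For concreteness, with a single predictor x the model can be written as:

y_i = beta_0 + beta_1·x_i + e_i,   where the e_i are iid N(0, sigma²)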

Since our errors are normally distributed, the probability density function of the error term is, you guessed it, the probability density function of the normal distribution:
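f(e_i) = 1 / (sigma·√(2π)) · exp( −e_i² / (2·sigma²) )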

The likelihood function L(theta) is the joint probability density of the errors, built from the density of each error term f(e). Because the errors are independent, this joint density is simply the densities of all the observations multiplied together.
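Written out for n observations, that product is:

L(theta) = f(e_1) · f(e_2) · … · f(e_n) = Π_i [ 1 / (sigma·√(2π)) · exp( −e_i² / (2·sigma²) ) ]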

Typically we would solve for our coefficients by maximizing this function (maximum likelihood estimation): take the log and set the partial derivatives with respect to each coefficient to zero. Instead of doing all that math, let's try to intuitively understand what is happening.

First, what relationship does the residual have with the e-term (the exponential factor in the density above)?

The chart below plots the e-term vs the residuals. Here is a link to the chart. I encourage you to play around with the numbers in the chart to get a feel for what is happening.

Residuals closer to zero produce a higher e-term; when the residual is exactly zero, the e-term reaches its maximum of 1. Now, let's think about the likelihood function and how we would maximize it.
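If you want to reproduce the chart's numbers yourself, here is a minimal Python sketch. It assumes the e-term is the exponential factor of the density with sigma² = 0.5, which roughly matches the values in the linked chart.

import numpy as np

# The "e-term": the exponential factor exp(-residual^2 / (2 * sigma^2)).
# It equals 1 when the residual is 0 and shrinks as the residual grows.
# sigma^2 = 0.5 is an assumption chosen to roughly match the chart.
sigma2 = 0.5

def e_term(residual):
    return np.exp(-residual**2 / (2 * sigma2))

for r in [0.0, 0.61, 0.70, 0.85]:
    print(f"residual = {r:.2f} -> e-term = {e_term(r):.2f}")
# prints e-terms of 1.00, 0.69, 0.61, and 0.49 respectively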

Let’s assume n=3 for example. Then the likelihood function is the product of three terms:
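Dropping the constant 1/(sigma·√(2π)) factors, which don't depend on the coefficients, the likelihood is proportional to the three e-terms multiplied together:

L(theta) ∝ exp(−e_1²/(2·sigma²)) · exp(−e_2²/(2·sigma²)) · exp(−e_3²/(2·sigma²))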

Try multiplying three e-term numbers together, such as 0.7 × 0.7 × 0.7, then try other numbers. For example:

0.7 × 0.7 × 0.7 = 0.343

0.7 × 0.6 × 0.6 = 0.252

0.4 × 0.6 × 0.4 = 0.096

What you’ll find is that the larger the e-terms, the larger the resulting product.

What we’ve found:

  1. Residuals closer to zero result in a larger e-term
  2. The product of larger e-terms results in a higher likelihood

We can thus conclude that we maximize the likelihood function (the product of the e-terms) by keeping each residual as close to zero as possible.

But what about minimizing the sum of squared residuals?

The expression "minimizing the sum of squared residuals" comes from the typical next step in calculating maximum likelihood – taking the log of the likelihood function. Taking the log converts the product into sums.

But let's forget about all that math for a second and experiment with some values. Here are the residuals and e-terms from our chart.

Let’s calculate the sum of the squared residuals for two of the previous examples.

Example 1:

e-terms: 0.7, 0.7, 0.7 | residuals: 0.61, 0.61, 0.61

product of e-terms: 0.7 × 0.7 × 0.7 = 0.343

sum of squared residuals: 0.61² + 0.61² + 0.61² = 1.1163

Example 2:

e-terms: 0.4, 0.6, 0.4 | residuals: 0.85, 0.70, 0.85

product of e-terms: 0.4 × 0.6 × 0.4 = 0.096

sum of squared residuals: 0.85² + 0.70² + 0.85² = 1.935

Example 1 had the largest product of e-terms (maximum likelihood) and also had the smallest sum of squared residuals. Feel free to take any set of residuals or e-terms and perform these computations. You will observe that we maximize the likelihood by minimizing the sum of squared residuals.
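As a final check, here is a short Python sketch, using a made-up dataset and two hypothetical coefficient pairs, showing that the coefficients with the smaller sum of squared residuals also produce the larger likelihood:

import numpy as np

# Made-up data, purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.2, 3.9])
sigma = 1.0  # assumed standard deviation of the error term

def likelihood_and_ssr(b0, b1):
    residuals = y - (b0 + b1 * x)
    ssr = np.sum(residuals ** 2)
    # Likelihood: product of the normal densities evaluated at the residuals
    likelihood = np.prod(np.exp(-residuals ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi)))
    return likelihood, ssr

# Two hypothetical candidate lines
for b0, b1 in [(0.2, 0.95), (1.0, 0.5)]:
    lik, ssr = likelihood_and_ssr(b0, b1)
    print(f"b0={b0}, b1={b1}: likelihood={lik:.4f}, SSR={ssr:.3f}")
# The pair with the smaller SSR has the larger likelihood.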

Conclusion

Many times we can derive and understand challenging concepts through basic plots and calculations. By experimenting, we were able to arrive at one of the most important foundational results in machine learning: maximizing the likelihood in linear regression is equivalent to minimizing the sum of the squared residuals. I encourage you to think of how you might use the techniques demonstrated here to learn other concepts.

If you want to dive deep into the math for log likelihood, check out this article.
