Using maximum likelihood to show why we minimize the sum of squared residuals in linear regression

Summary
While trying to understand the mathematics behind machine learning more deeply, I’ve found that it’s pretty easy to get lost in the equations. My goal is to demonstrate how you can experiment with plots and basic math to build machine learning intuition. Specifically, I’d like to show, through a simple example, why we minimize the sum of the squared residuals in linear regression.
Prerequisites
This article assumes you understand the basics of maximum likelihood and aims to provide additional intuition around the concept as it relates to linear regression. If you need a primer, check out this video: StatQuest – Maximum Likelihood.
Let’s get started
In linear regression, we use the input data to estimate the coefficients that make the observed data most likely, under the assumption that the errors follow a Gaussian distribution. Linear regression takes the following form, with the error terms eᵢ independently and identically distributed (iid) according to a normal distribution with mean 0 and variance σ²:

yᵢ = β₀ + β₁xᵢ + eᵢ,  with eᵢ ~ iid N(0, σ²)
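To make the model concrete, here is a minimal sketch that generates data from it; β₀ = 1, β₁ = 2, and σ = 0.5 are values I made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed coefficients and noise level, chosen only for illustration
beta0, beta1, sigma = 1.0, 2.0, 0.5

x = np.linspace(0, 5, 50)
e = rng.normal(loc=0.0, scale=sigma, size=x.shape)  # iid N(0, sigma^2) errors
y = beta0 + beta1 * x + e                           # the linear model
```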
Since our errors are normally distributed, the probability density function of the error term is, you guessed it, the probability density function of the normal distribution:

f(eᵢ) = (1 / √(2πσ²)) · exp(−eᵢ² / (2σ²))
The likelihood function L(θ) is the joint probability of the errors, i.e. the product of the probability density f(eᵢ) evaluated at each of the n observations. Joint probability here just means we multiply the individual densities together (which works because the errors are independent):

L(θ) = f(e₁) · f(e₂) · ⋯ · f(eₙ)
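In code, the likelihood is just a product of normal densities evaluated at the residuals. A minimal sketch, assuming σ = 1 and using made-up residuals:

```python
import numpy as np
from scipy.stats import norm

def likelihood(residuals, sigma=1.0):
    """Product of the normal densities evaluated at each residual."""
    return np.prod(norm.pdf(residuals, loc=0.0, scale=sigma))

# Three made-up residuals: the smaller they are, the larger the likelihood
print(likelihood(np.array([0.2, -0.5, 0.1])))
print(likelihood(np.array([1.2, -1.5, 1.1])))
```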
Typically we would solve for the coefficients by maximizing this function (maximum likelihood estimation): take the log, then set the partial derivative with respect to each coefficient to zero. Instead of doing all that math, let’s try to understand intuitively what is happening.

First, what relationship does the residual have with the e-term? (By “e-term” I mean the exponential part of the density above, exp(−eᵢ² / (2σ²)), evaluated at the residual.)

The chart below plots the e-term against the residual. I encourage you to play around with the numbers to get a feel for what is happening.
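If you want to recreate the chart yourself, here is a minimal sketch, assuming σ = 1 so the e-term is exp(−e²/2) (the original chart may use a slightly different scale):

```python
import numpy as np
import matplotlib.pyplot as plt

sigma = 1.0  # assumed value, just to draw a concrete curve

residuals = np.linspace(-3, 3, 200)
e_term = np.exp(-residuals**2 / (2 * sigma**2))  # exponential part of the normal density

plt.plot(residuals, e_term)
plt.xlabel("residual")
plt.ylabel("e-term: exp(-e^2 / (2*sigma^2))")
plt.title("Smaller residuals give a larger e-term (maximum of 1 at residual 0)")
plt.show()
```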

Smaller residuals result in a higher e-term; when the residual is zero, the e-term reaches its maximum of 1. Now, let’s think about the likelihood function and how we would maximize it.

Let’s assume n = 3, for example. Then the likelihood function is the product of three terms:

L(θ) = f(e₁) · f(e₂) · f(e₃)
Since the 1/√(2πσ²) factor is the same for every observation, we can focus on the e-terms. Try multiplying three e-term numbers together, such as 0.7 × 0.7 × 0.7, then try other numbers. For example:
0.7 × 0.7 × 0.7 = 0.343
0.7 × 0.6 × 0.6 = 0.252
0.4 × 0.6 × 0.4 = 0.096
What you’ll find is that the larger the e-terms, the larger the resulting product.
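If you’d rather let the computer do the multiplying, here is a quick sketch of the same experiment:

```python
import numpy as np

for e_terms in ([0.7, 0.7, 0.7], [0.7, 0.6, 0.6], [0.4, 0.6, 0.4]):
    print(e_terms, "->", round(float(np.prod(e_terms)), 3))
```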
What we’ve found:
- Smaller residuals result in a larger e-term
- A product of larger e-terms results in a higher likelihood
We can thus conclude that we will maximize the likelihood function (the product of the e-terms) by minimizing each residual.
But what about minimizing the sum of squared residuals?
The expression “minimizing the sum of squared residuals” comes from the typical next step in maximum likelihood estimation: taking the log of the likelihood function. Taking the log converts the product into a sum.
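To spell that out (this is the standard textbook step, nothing specific to this article’s numbers), taking the log of the likelihood above gives:

log L(θ) = −(n/2) · log(2πσ²) − (1/(2σ²)) · (e₁² + e₂² + ⋯ + eₙ²)

The first term doesn’t depend on the coefficients, and σ² is positive, so maximizing the log-likelihood over the coefficients is exactly the same as minimizing e₁² + e₂² + ⋯ + eₙ², the sum of squared residuals.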
But let’s forget about all that math for a second and experiment with some values. Here are the residuals and e-terms from our chart:

residual | e-term
0.61     | 0.7
0.70     | 0.6
0.85     | 0.4
Let’s calculate the sum of the squared residuals for two of the previous examples.
Example 1:
e-terms: 0.7, 0.7, 0.7 | residuals: 0.61, 0.61, 0.61
product of e-terms: 0.7 × 0.7 × 0.7 = 0.343
sum of squared residuals: 0.61² + 0.61² + 0.61² = 1.1163
Example 2:
e-terms: 0.4, 0.6, 0.4 | residuals: 0.85, 0.70, 0.85
product of e-terms: 0.4 × 0.6 × 0.4 = 0.096
sum of squared residuals: 0.85² + 0.70² + 0.85² = 1.935
Example 1 had the larger product of e-terms (the higher likelihood) and also the smaller sum of squared residuals. Feel free to take any set of residuals or e-terms and repeat these computations. You will observe that we maximize the likelihood by minimizing the sum of squared residuals.
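If you want to rerun the comparison in code, here is a minimal sketch using the chart’s (approximate) values:

```python
import numpy as np

examples = {
    "Example 1": {"e_terms": [0.7, 0.7, 0.7], "residuals": [0.61, 0.61, 0.61]},
    "Example 2": {"e_terms": [0.4, 0.6, 0.4], "residuals": [0.85, 0.70, 0.85]},
}

for name, vals in examples.items():
    product = np.prod(vals["e_terms"])          # stand-in for the likelihood
    ssr = np.sum(np.square(vals["residuals"]))  # sum of squared residuals
    print(f"{name}: product of e-terms = {product:.3f}, sum of squared residuals = {ssr:.4f}")
```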
Conclusion
Many times we can derive and understand challenging concepts through basic plots and calculations. By experimenting, we arrived at one of the most important foundational results in machine learning: maximum likelihood for linear regression with Gaussian errors is equivalent to minimizing the sum of the squared residuals. I encourage you to think about how you might use the techniques demonstrated here to learn other concepts.
If you want to dive deep into the math for log likelihood, check out this article.