
Recently, I wrote an article about Linear Regression and how it is used in Data Science. As a general overview, that article didn’t dive into much detail on the tools or methods Linear Regression relies on. One such tool is the Sum of Squares. At first, I was going to add just a brief explanation to the previous article. However, there are a few different formulas used to define the Sum of Squares, so squeezing them into a sentence or two would have been difficult. Instead, and because I found the answers interesting, I figured it was time to write another blog. So with that in mind, in today’s article we’ll be looking at the Sum of Squares. We’ll first describe what the Sum of Squares is and why it’s used, and then we’ll look at the formulas involved and what they do. So, without any further delay, let’s take a deeper dive into the Sum of Squares.
What is the Sum of Squares?
The Sum of Squares is used not only to describe the relationship between the data points and the Linear Regression line, but also to measure how accurately that line describes the data. You use a series of formulas to determine whether the regression line portrays the data well, in other words how "good" or "bad" that line is.
One important note is to make sure you are actually describing regression and not correlation. Here is a simple checklist for telling the difference:
- Regression emphasizes how one variable affects another, not simply the relationship between the variables.
- Correlation does not capture causality, whereas Regression is based on it. This matters because Regression should show cause and effect, not just the degree of connection.
- In Correlation, the correlation between x and y is the same as between y and x. In Regression, regressing y on x and x on y yields different results.
- Finally, Correlation is graphically represented by a single point, whereas Regression is graphically represented by a line.
Now that we know a little more about Sum of Squares, let’s take a look at the formulas needed.
Sum of Squares Total
The first formula we’ll look at is the Sum of Squares Total (denoted as SST or TSS). TSS is the sum of the squared differences between each observed value and the mean.
TSS = Σ(yi − ȳ)²
yi – the _i_th term in the set
ȳ – the mean of all items in the set
What this means is that for each observation, you take its value, subtract the mean, and square the result, then add those squared results together. This gives you the squared distance of each observation from the mean, totaled over the whole dataset. You could also describe TSS as the dispersion of the observed values around the mean, which is closely related to the variance. So, the goal of TSS is to measure the total variability of the dataset.
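To make that concrete, here is a minimal sketch in Python (the small dataset, the variable names, and the use of NumPy are my own illustration, not something from the original article) of how TSS could be computed:

```python
import numpy as np

# Hypothetical observed values (made up purely for illustration)
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

y_mean = y.mean()                    # ȳ, the mean of all items in the set
tss = np.sum((y - y_mean) ** 2)      # Σ(yi − ȳ)²

print(tss)                           # total variability of this small dataset
```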
Sum of Squares Regression
The next formula we’ll talk about is the Sum of Squares Regression (denoted as SSR), also known as the Explained Sum of Squares (denoted as ESS). SSR is the sum of the squared differences between the value predicted by the regression line and the mean of the dependent variable.
SSR = Σ(ŷi − ȳ)²
ŷi – the value estimated by the regression line
ȳ – the mean value of a sample
To start, we will again need the mean. The estimated value is the one that lies on the regression line: instead of the actual value of each observation, take the value the regression line predicts for it. This tells us how well the line fits the data. If the SSR matches the TSS, that line would be a perfect fit.
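Continuing the same hypothetical sketch, SSR could be computed by first fitting a simple least-squares line (np.polyfit is just one convenient way to do that) and then comparing its predictions to the mean:

```python
import numpy as np

# Same hypothetical data as before, now with an independent variable x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Fit a simple least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept        # ŷi, the values estimated by the regression line

y_mean = y.mean()                    # ȳ
ssr = np.sum((y_hat - y_mean) ** 2)  # Σ(ŷi − ȳ)²

print(ssr)
```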
Sum of Squares Error
The final formula to discuss is the Sum of Squares Error (denoted SSE), also known as the Residual Sum of Squares (RSS). SSE is the sum of the squared differences between the observed, or actual, value of the variable and the estimated value, which is what it should be according to the line of regression.
SSE = Σ(yi − ŷi)²
Where:
yi – the observed value
ŷi – the value estimated by the regression line
In the case of a perfect fit, the error would be 0, meaning each estimated value is the same as the actual value. Any value above 0 shows the error, or the degree to which the line is inaccurate for the observed values. The lower the value, the better the line of regression fits the data. A high residual sum of squares shows that the model poorly represents the data.
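As a hedged sketch with the same made-up data and fitted line as above, SSE simply compares the observed values to the values the line predicts:

```python
import numpy as np

# Same hypothetical data and fitted line as in the earlier sketches
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept        # ŷi, values on the regression line

sse = np.sum((y - y_hat) ** 2)       # Σ(yi − ŷi)²

print(sse)                           # 0 would mean a perfect fit
```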
Now that we have all three explained, we can represent their relationship:
TSS = SSR + SSE
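As a quick sanity check on that relationship, the same hypothetical example can confirm that the three quantities add up as expected:

```python
import numpy as np

# Same hypothetical data as in the sketches above
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

tss = np.sum((y - y.mean()) ** 2)      # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)  # variability explained by the line
sse = np.sum((y - y_hat) ** 2)         # variability left unexplained

print(np.isclose(tss, ssr + sse))      # True: TSS = SSR + SSE for this fit
```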
Conclusion
In today’s article, we talked about the Sum of Squares. First, we described what it is and why it’s used. Next, we listed the differences between Correlation and Regression. Finally, we looked at the formulas used (TSS, SSR, and SSE) and found a formula to represent the relationship between them. I hope that the Sum of Squares is a little clearer and that you found this description helpful and interesting. We don’t necessarily need to calculate all the formulas by hand. Languages like R have functions to calculate each one, so you can determine whether the line of regression is a good fit without the extra legwork. I hope you enjoyed this explanation and, as always, I’ll see you in the next one. Cheers!
Check out some of my recent articles:
Linear Regression in Data Science
Getting Started With Seq in Python
Javascript CDNs and How To Use Them Offline