Understanding linear regression using only basic algebra and your own intuition.
Data Science, like so many other technical fields, often seems intimidating at first simply because of the terminology it uses. Terms like regression, mean squared error, bias, and variance sound complicated and unintuitive, but fancy words are sometimes just that: fancy words. The underlying concepts may be much more straightforward than the dressed-up names suggest! In fact, with a little unpacking and some concrete examples, I hope you’ll see that these concepts are surprisingly intuitive.
If you know nothing about data science, you have still probably heard of predictive modeling, at least in passing. Simply put, predictive modeling is the process of using existing information to determine the outcome of a hypothetical or future event. Even more simply put: if you and I are playing rock-paper-scissors, and you notice that I, in an ill-conceived attempt at strategy, have chosen paper in ten consecutive rounds, chances are that if you play scissors in the eleventh round, you will win. By using the previous ten rounds as data, you were able to predict the outcome of the next round with a high degree of certainty.
Naturally, the applications of predictive modeling involve a bit more numerical precision than that, but a solid conceptual understanding makes the numbers much easier to interpret. In the example above, there were three distinct outcomes I could play: rock, paper, or scissors, and it was clear that I had established a pattern, i.e. I wasn’t playing each round randomly. But say that instead of distinct categories of outcomes, we were trying to determine the value of a continuous variable, like the cost of a house. And say that instead of somebody arbitrarily deciding the price, as I had decided on playing paper, there were certain other variables, such as size, location, condition, etc., that affected the price at which the house sells. That sounds… actually pretty accurate, doesn’t it? Intuitively, those factors should have an effect on the price of a house, and we can use a tool called Linear Regression to estimate how much of an effect each of those factors has.
To demonstrate, I’ll use a dataset of house sale prices for King County, courtesy of Kaggle. The dataset covers several features of homes sold in 2014–2015, including square footage, condition, zip code, number of bedrooms, and more. We’ll start by looking at a home’s square footage.
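If you’d like to follow along in code, here is a minimal sketch of how that scatter plot might be produced with pandas and matplotlib. The file name (kc_house_data.csv) and column names (sqft_living, price) are assumptions based on the standard Kaggle download, so adjust them to match your copy of the data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the King County house sales data (file name assumed from the Kaggle download).
homes = pd.read_csv("kc_house_data.csv")

# Plot price (in millions of dollars) against living-area square footage.
plt.scatter(homes["sqft_living"], homes["price"] / 1e6, alpha=0.3, s=10)
plt.xlabel("Square footage (living area)")
plt.ylabel("Price (millions of dollars)")
plt.title("King County home sales, 2014-2015")
plt.show()
```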

In the graph above, each dot represents a home that was sold, with its square footage on the x-axis and its price (in millions of dollars) on the y-axis. We can see that there is a sort of pattern here; the dots fan out and seem to "flow" up and to the right. Next, we want to ask: does this numerical pattern line up with what our intuition tells us? I say it does; as a home increases in size, its price should increase as well.
But if our endgame is prediction, how well can we predict the price of a home with a certain square footage based on this graph alone? Not super well, since all we have is a vague shape. Since the data points on this graph are individual measurements, there is a natural "messiness" to them when we plot them. You may recall from algebra that if we want to determine a y-value (which in this case is price) given a certain x-value (square footage), we need a formula. The simplest way to get one is to approximate a line that cuts through these data points, sort of taking the averages of the points along the way. We call this the "line of best fit", since it is literally that: a line that best fits the given data.
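One simple way to find that line in code is with NumPy’s polyfit, which fits a straight line by ordinary least squares. This sketch reuses the homes DataFrame and the assumed column names from the earlier snippet.

```python
import numpy as np
import matplotlib.pyplot as plt

# Fit a degree-1 polynomial (a straight line): price ≈ slope * sqft + intercept.
slope, intercept = np.polyfit(homes["sqft_living"], homes["price"], 1)
print(f"slope ≈ {slope:.0f} dollars per square foot")

# Overlay the fitted line (in red) on the scatter plot.
xs = np.linspace(homes["sqft_living"].min(), homes["sqft_living"].max(), 100)
plt.scatter(homes["sqft_living"], homes["price"] / 1e6, alpha=0.3, s=10)
plt.plot(xs, (slope * xs + intercept) / 1e6, color="red")
plt.xlabel("Square footage (living area)")
plt.ylabel("Price (millions of dollars)")
plt.show()
```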

Also recall from algebra that the equation of a line is y = mx + b, where (x, y) is a point on the line, m is its slope, and b is its y-intercept. Let’s think about what slope means in this context. The slope of the red line turns out to be around 280. Since slope is rise over run, a slope of 280 means the price rises by about $280 for every additional square foot. That is how we can numerically describe the average relationship between square footage and price in this dataset!
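To make that slope concrete, here is a tiny sketch of how the fitted line turns a square footage into a predicted price. It reuses the slope and intercept computed above, and the exact numbers will depend on your copy of the data.

```python
# Predict a price with the fitted line: y = m * x + b.
def predict_price(sqft, slope, intercept):
    return slope * sqft + intercept

# With a slope of roughly 280, an extra 100 square feet adds roughly
# 100 * 280 = $28,000 to the predicted price, regardless of the intercept.
bigger = predict_price(2100, slope, intercept)
smaller = predict_price(2000, slope, intercept)
print(f"predicted difference: ${bigger - smaller:,.0f}")
```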
But you may be looking at that graph and thinking: sure, maybe the red line is the average prediction, but there are some points that deviate pretty significantly from it, especially near the top of the graph. It turns out there is a way to measure how far off our line is from each individual point in our dataset. We take the difference between the y-value of each data point and the corresponding y-value on our regression line, square those differences, add them all up, and take the average. That quantity is called the mean squared error, since it is the average squared distance by which our prediction missed the mark.
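In code, that calculation might look like the sketch below, again reusing the slope and intercept fitted earlier. Note the units: because the errors are squared, the result comes out in squared dollars.

```python
# Mean squared error: the average of the squared gaps between the actual
# prices and the prices our red line predicts.
predicted = slope * homes["sqft_living"] + intercept
squared_errors = (homes["price"] - predicted) ** 2
mse = squared_errors.mean()
print(f"mean squared error ≈ {mse:.3e} (in squared dollars)")
```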
One possible reason our mean squared error may be high is that we are only looking at one feature, square footage, to determine the price of a home. What if a home is gigantic, but has serious structural flaws? Or what if a home is tiny, but it’s one of those cool tiny homes that people are really into for some reason these days? We know there are many factors other than size that affect a home’s cost, and our red-line prediction doesn’t take those into account. For that reason, we would say that this model has high bias, since it is biased towards square footage as the only predictor of price.
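As a rough sketch of what "looking at more than one feature" means in practice, here is how one might fit a linear regression on several columns at once with scikit-learn. The extra column names (bedrooms, bathrooms, condition) are assumptions based on the Kaggle dataset; this reuses the homes DataFrame from the earlier snippets.

```python
from sklearn.linear_model import LinearRegression

# Use several features instead of square footage alone.
features = homes[["sqft_living", "bedrooms", "bathrooms", "condition"]]
prices = homes["price"]

model = LinearRegression()
model.fit(features, prices)

# Each feature gets its own coefficient, playing the role the single
# slope of ~280 played when square footage was the only predictor.
for name, coef in zip(features.columns, model.coef_):
    print(f"{name}: {coef:,.0f}")
```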
On the flip side, if we were to model every single imaginable feature of a house, we may end up with the opposite problem: predicting based on relationships that don’t really exist. For example, let’s say we discovered that the number of carpeted rooms in a home, divided by the year the home was built, multiplied by the number of ceiling fans, is correlated with the price of the house. However, if we are trying to build a model that will predict the price of other houses we haven’t looked at yet, there’s a pretty good chance that relationship won’t hold up. For that reason, we’d say that this model has high variance.
The two examples above are meant to illustrate the "bias-variance tradeoff": the more complicated a model gets, the lower its bias tends to be, but the higher its variance. Bias and variance aren’t quite as intuitive as the other concepts we’ve covered, but hopefully the notion that too few features result in high bias, while too many complicated features result in high variance, seems reasonable.
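One common way to see the variance side of the tradeoff in practice is to hold some houses out of the fitting step, mimicking the "houses we haven’t looked at yet," and compare the errors. Here is a minimal sketch with scikit-learn, reusing the features and prices from the previous snippet; a high-variance model will tend to score much better on the homes it was fit on than on the held-out ones.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hold out 25% of the homes so the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    features, prices, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Compare the mean squared error on the homes the model was fit on
# with the error on the held-out homes.
print("training MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("held-out MSE:", mean_squared_error(y_test, model.predict(X_test)))
```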
If you found this interesting and want to learn more, there is no shortage of data science resources online! I recommend watching some StatQuest videos to get started.