
How to Code Linear Regression from Scratch

A numpy implementation based on the normal equation

A sample Linear Regression Fit (Image by Author)

These days it’s easy to fit just about any model you can think of with one library or another, but how much do you really learn by calling .fit() and .predict()? While it’s certainly more practical to use a framework like Python’s statsmodels or scikit-learn for everyday work, when you’re learning data science it makes a lot of sense to get a feel for how these models actually work under the hood. Below, we’ll use numpy to implement a basic linear regression model from the ground up. Let’s get started!


It’s All About the Coefficients

Think back to your first algebra class: do you remember the equation for a line? If you said "y = mx + b", you’re absolutely right. It’s helpful to start in two dimensions because, without using any matrices or vectors, we can already see that given inputs x and outputs y, we are actually looking for not one but two coefficients: m and b.

"But wait!" you might say. "That’s the slope coefficient m, and the intercept b." And you’d be right again! But in order to find the line that best fits our data, we need not just a slope, but an intercept as well, otherwise we’d be looking at infinitely many lines of best fit, instead of just the one that we’re after. Rather than thinking about b as being added to our x term, it’s useful to rewrite this simple equation as "y = mx + b1". This makes the next little bit of Linear Algebra much easier to understand.

Just Enough about Matrices and Vectors

Let’s imagine a very simple dataset for our slope intercept equation where the line of best fit is actually a perfect fit. We’ll say we have the points:

(1, 3)

(2, 5)

(3, 7)

(4, 9)

We’d like to solve for the coefficients m and b that minimize some cost function we’ll define in a moment, but in order to do this efficiently, we’ll first rewrite our equation one more time. We’ll define a single vector y = [3, 5, 7, 9], and we’ll be looking for a vector of coefficients (often denoted by Θ, theta, or β, beta). How many elements our coefficient vector has depends on how many features there are in our feature space X (note that we are switching to a capital X to denote a matrix, which we’re going to discuss now). Instead of a single vector like the one we used for our y term, we’re going to add a column of ones to the column of x values from above. By convention we’ll put the column of ones in front of the x values, since you can think of the constant coefficient as having lower degree than the coefficient on x. That will look something like this:

X =
[1  1]
[1  2]
[1  3]
[1  4]

(Image by Author)
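
In numpy, building that design matrix takes only a line or two. Here’s a minimal sketch (the variable name x_raw is just an illustrative choice):

import numpy as np

x_raw = np.array([1, 2, 3, 4])
y = np.array([3, 5, 7, 9])

# Prepend a column of ones so the intercept gets its own coefficient
X = np.column_stack((np.ones_like(x_raw, dtype=float), x_raw))
print(X)
# [[1. 1.]
#  [1. 2.]
#  [1. 3.]
#  [1. 4.]]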

Now our equation looks like this:

XΘ = y

The next thing we’re going to do is employ a little trick. Not all of matrix algebra works exactly like the algebra you might be used to, but it’s generally fair game to multiply both sides of an equation by the same term, as long as the dimensions are compatible, which is what we’ll do. We multiply both sides on the left by the transpose of X, which looks like this:

Xᵀ =
[1  1  1  1]
[1  2  3  4]

(Image by Author)

And our new equation can be written like this:

(XᵀX)Θ = Xᵀy

This is the point of our funky manipulation. The result is called the normal equation, and it happens to have some special properties. Wolfram defines it as "that which minimizes the sum of the square distances between the left and right sides" of the equation Ax = b, which, though they use different notation, is exactly what we’re looking for.

Our last trick is going to be to isolate Θ, which we’ll do by multiplying both sides on the left by the inverse of (XᵀX), resulting in the following equation, which yields a solution in all cases where (XᵀX) is invertible. We’ll skip some of the details, but so long as the columns of X are linearly independent, this ought to work.

Θ = (XᵀX)⁻¹Xᵀy
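
To make this concrete, here’s a quick sketch that solves the normal equation for the toy points above; since those points lie exactly on y = 2x + 1, we should recover an intercept of 1 and a slope of 2:

import numpy as np

x_raw = np.array([1, 2, 3, 4], dtype=float)
y = np.array([3, 5, 7, 9], dtype=float)
X = np.column_stack((np.ones_like(x_raw), x_raw))

# Solve (XᵀX)Θ = Xᵀy; np.linalg.solve is more numerically stable
# than forming the inverse explicitly
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [1. 2.]  ->  intercept b = 1, slope m = 2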


The Code

Here’s our solution written out in Python; feel free to try it out! I’ll claim this code works about the same as scikit-learn’s LinearRegression. It will yield the same results on, for example, the Boston Housing dataset, which you can retrieve with sklearn.datasets.load_boston. Can you think of some cases where it breaks? (Hint: check the last paragraph of the preceding section.)
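
A minimal sketch of such an implementation might look like the following; the class name SimpleLinearRegression and its attribute names are illustrative choices that loosely mirror scikit-learn’s fit/predict interface:

import numpy as np

class SimpleLinearRegression:
    """Linear regression fit via the normal equation Θ = (XᵀX)⁻¹Xᵀy."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Prepend a column of ones for the intercept term
        X_b = np.column_stack((np.ones(X.shape[0]), X))
        # Solve (XᵀX)Θ = Xᵀy; this fails if XᵀX is singular,
        # i.e. if the columns of X are not linearly independent
        self.theta_ = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
        self.intercept_ = self.theta_[0]
        self.coef_ = self.theta_[1:]
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        X_b = np.column_stack((np.ones(X.shape[0]), X))
        return X_b @ self.theta_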

And lastly, for those who would like a simple convenience function to compare the outputs of this method against an implementation from another library, feel free to use something like this:
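
One possible sketch, assuming the SimpleLinearRegression class above (the function name compare_to_sklearn is just an illustrative choice), fits both models on the same data and checks whether their predictions agree:

import numpy as np
from sklearn.linear_model import LinearRegression

def compare_to_sklearn(X, y):
    """Fit our normal-equation model and scikit-learn's LinearRegression,
    then report whether their predictions agree."""
    X_2d = np.asarray(X, dtype=float).reshape(len(y), -1)  # sklearn expects 2-D input
    ours = SimpleLinearRegression().fit(X, y)
    theirs = LinearRegression().fit(X_2d, y)
    return np.allclose(ours.predict(X), theirs.predict(X_2d))

For example, on the toy data from earlier:

x = np.array([1, 2, 3, 4])
y = np.array([3, 5, 7, 9])
print(compare_to_sklearn(x, y))  # True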

