
Simple Linear Regression From Scratch in NumPy

Machine Learning doesn't have to be complex – if explained in simple terms.

Linear Regression is most probably the first ‘machine learning’ algorithm you’ve learned, or intend to learn. It is a simple algorithm, initially developed in the field of statistics and studied as a model for understanding the relationship between input and output variables.

As the name suggests, it’s a linear model, meaning it assumes a linear relationship between the input variables (X) and the single (continuous) output variable (y). To be more precise, y can be calculated from a linear combination of the input variables.

When there’s only one input variable, the method is referred to as simple linear regression, and that is the topic of this article. You could argue that in the real world you have more than one input variable, and that’s true, but as always, it’s a good idea to start from the basics.

The linear equation assigns one scale factor, called a coefficient, to each input value; coefficients are commonly represented by the Greek letter beta (β). One more coefficient is added to give the line an additional degree of freedom (moving the line upwards or downwards), and it is called the intercept or the bias coefficient.

A simple linear regression can be expressed as:
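y = β₀ + β₁x

where β₀ is the intercept (bias coefficient) and β₁ is the coefficient of the input variable x.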

If you have more than one input variable, the regression ‘line’ is called a plane or a hyperplane. Naturally, you would also have more of those beta coefficients, one multiplying each input variable. If a beta coefficient is zero, it tells you that the variable at that position has no influence on the model.

Learning a linear regression model means estimating the values of these coefficients from the data you have available.


Assumptions of Linear Regression

When preparing data for use with linear regression, here’s what you should keep in mind:

  1. Linear Assumption – the model assumes that the relationship between the variables is linear
  2. No Noise – the model assumes that the input and output variables are not noisy – so remove outliers if possible
  3. No Collinearity – the model will overfit when you have highly correlated input variables
  4. Normal Distribution – the model will make more reliable predictions if your input and output variables are normally distributed. If that’s not the case, try using some transforms on your variables to make them more normal-looking
  5. Rescaled Inputs – rescale the input variables with standardization or normalization to make more reliable predictions

Beta Coefficient Formulas

In simple linear regression, there are two coefficients: beta zero and beta one. They don’t have to be ‘learned’; you can calculate them with simple formulas (which hold only for simple linear regression):
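β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

β₀ = ȳ − β₁x̄

where x̄ and ȳ are the mean values of x and y.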

You can do the calculations by hand or with Python. I will use Python.


Dataset Introduction

I decided not to download some arbitrary dataset from the web, but instead to make one on my own. It’s made up of 300 arbitrary points:
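For example, something like this (the slope, intercept, and noise level here are arbitrary choices, so your numbers will differ):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# 300 points with a roughly linear relationship plus some noise
x = np.random.uniform(low=1, high=10, size=300)
y = 2 * x + 5 + np.random.normal(loc=0, scale=1.5, size=300)
```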

A quick scatter plot will uncover a clear linear trend among the variables:
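A couple of lines of Matplotlib are enough:

```python
plt.scatter(x, y, s=10)
plt.title('Arbitrary dataset with a linear trend')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```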

You can now plug both x and y into the formulas from above. First, I will calculate the beta one coefficient:
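The formula translates to NumPy almost verbatim:

```python
x_mean, y_mean = x.mean(), y.mean()

# beta one: sum of (x - x̄)(y - ȳ) divided by the sum of (x - x̄)²
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
print(beta1)
```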

That wasn’t hard. The calculation of beta zero, or the bias coefficient, will be even simpler:
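It’s a one-liner:

```python
# beta zero: mean of y minus beta one times the mean of x
beta0 = y_mean - beta1 * x_mean
print(beta0)
```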

And that’s it. You are now ready to make predictions. To do so, I decided to declare a function, calc_predictions(), which takes the x term as input. You can then calculate the predictions easily:
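A minimal version of that function:

```python
def calc_predictions(x):
    """Return predictions of the simple linear regression model."""
    return beta0 + beta1 * x

y_preds = calc_predictions(x)
```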

Now those y_preds can be used to plot a regression line:
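For example:

```python
plt.scatter(x, y, s=10, label='Data points')
plt.plot(x, y_preds, color='red', label='Regression line')
plt.legend()
plt.show()
```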

That was cool, and quick too. But you might now be wondering: is there a simpler and quicker way to calculate the coefficients?


Simpler Way

You don’t have to use the formulas above to obtain the coefficients; there’s a shorter way. It still involves a formula, but a much shorter one. The formula for the bias (intercept) stays the same, but the one for beta one changes:
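β₁ = corr(x, y) × (std(y) / std(x))

where corr(x, y) is the correlation coefficient between x and y, and std(·) is the standard deviation.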

And here’s how you would implement it in Python:
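A possible implementation, reusing the same x and y:

```python
# np.corrcoef returns a 2x2 correlation matrix, so [0][1] picks out
# the correlation between x and y
beta1_alt = np.corrcoef(x, y)[0][1] * (y.std() / x.std())
beta0_alt = y.mean() - beta1_alt * x.mean()

print(beta0_alt, beta1_alt)  # identical to the values calculated earlier
```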

Note how you must use array indexing to obtain the correlation coefficient, since np.corrcoef returns a whole correlation matrix; your task is to explore what would happen if you didn’t use it. You can see that the coefficient values are the same as the ones calculated earlier, so everything works (hurray).


Model Evaluation

There are many ways to evaluate a regression model, but I will use the Root Mean Squared Error (RMSE). It’s calculated as follows:
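RMSE = √((1/n) × Σ(Pᵢ − yᵢ)²)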

where Pᵢ is the predicted value and yᵢ is the actual value.

To use it in Python there are two options:

  1. Import MSE from Scikit-Learn and take the square root of it
  2. Write it from scratch

I will use the second option because, well, the calculation is utterly simple:
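A from-scratch version might look like this:

```python
def rmse(y_actual, y_predicted):
    """Root mean squared error, computed with plain NumPy."""
    return np.sqrt(np.mean((y_actual - y_predicted) ** 2))
```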

The model can now be evaluated:
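One function call does it:

```python
print(rmse(y, y_preds))  # the exact value depends on the generated data
```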


Final Words

It’s time to say goodbye. This was a rather short article, but I would say it is a good introduction to linear regression. Later down the road, I will publish an article on multiple linear regression from scratch, which has an actual application in the real world, because your dataset probably has more than one input variable.

Until then, try to use these equations on your own dataset, and then compare the results with LinearRegression from Scikit-Learn.
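For reference, such a comparison could look like this (Scikit-Learn expects a 2D feature array, hence the reshape):

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x.reshape(-1, 1), y)

print(model.intercept_, model.coef_[0])  # should match beta zero and beta one
```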

