
Linear Regression Explained in 5 Minutes

Arguably the most fundamental machine learning model explained as simply as possible.

Linear regression is one of the most widely used approaches for modeling the relationship between two or more variables. It can be applied almost anywhere, from forecasting sales for inventory planning to determining the impact of greenhouse gases on global temperatures to predicting crop yield based on rainfall.

In this post, we’ll go over what linear regression is, how it works, and create a Machine Learning model to predict the average life expectancy of a person based on a number of factors.

What is Linear Regression?

According to Wikipedia, linear regression is a linear approach to modeling the relationship between a dependent variable and one or more independent variables. In simpler terms, it is the ‘line of best fit’ that represents a dataset.

Below is an example of a line that best fits the data points. By creating a line of best fit, you can predict where future points may be and identify outliers. For example, assume that this graph represents the price of diamonds based on weight. If we look at the red dot, we can see that this particular diamond is overvalued because it costs much more given the same weight as other diamonds. Similarly, the green dot is undervalued because it costs much less than other diamonds with similar weights.

So how do you find the line of best fit? Let’s find out.

How Simple Linear Regression works

We’re going to focus on simple linear regression. The line of best fit, or the equation that represents the data, is found by minimizing the squared distance between the points and the line of best fit, also called the squared error.

To give an example, two candidate lines of best fit are shown above: the red line and the green line. Notice how the error for the green line (the vertical distances between the line and the points) is much greater than for the red line. The goal of regression is to find the equation for which the sum of the squared errors is minimized.

If you want to know the math behind it, you can watch Khan Academy’s videos here, where they find the partial derivatives of m and b.
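As a rough sketch of what that minimization produces, the closed-form least-squares solution for the slope m and intercept b can be computed directly with NumPy (the data values here are made up for illustration):

```python
import numpy as np

# Toy data: y roughly follows 2x + 1 with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Closed-form least-squares estimates:
# m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

print(m, b)  # slope and intercept of the line of best fit
```

This is exactly what setting the partial derivatives with respect to m and b to zero gives you.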

If you want to use simple linear regression, you can use the LinearRegression class from the scikit-learn library.

from sklearn.linear_model import LinearRegression
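As a minimal sketch of how that class is used (the data values here are invented), you fit on a 2D feature array and then predict:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: scikit-learn expects X as a 2D array (samples x features)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])  # y = 2x exactly

model = LinearRegression()
model.fit(X, y)

print(model.coef_[0], model.intercept_)  # slope and intercept of the fitted line
print(model.predict([[5.0]])[0])         # prediction for a new point
```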

Multiple Linear Regression

Simple linear regression is useful when you want to find an equation that represents two variables, the independent variable (x) and the dependent variable (y). But what if you have many independent variables? For example, the price of a car is probably based on multiple factors, like its horsepower, the size of the car, and the value of the brand itself.

This is when multiple regression comes in. Multiple regression is used to explain the relationship between a dependent variable and more than one independent variable.

The image below shows a plot between income (y) and seniority and years of education (x). When there are two independent variables, a plane of best fit is found instead of a line of best fit.
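In code, multiple regression looks the same as simple regression, just with more than one column in X. Here is a sketch with two made-up predictors standing in for seniority and years of education (the values and the generating formula are illustrative, not from any real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two independent variables per row: [seniority, years_of_education]
X = np.array([
    [1.0, 12.0],
    [3.0, 14.0],
    [5.0, 16.0],
    [7.0, 12.0],
    [9.0, 18.0],
])
# Made-up income values generated as 5*seniority + 2*education + 10
y = 5 * X[:, 0] + 2 * X[:, 1] + 10

model = LinearRegression().fit(X, y)
print(model.coef_)       # one coefficient per independent variable
print(model.intercept_)
```

The two coefficients plus the intercept define the plane of best fit.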

Polynomial Regression

What if you have a set of data whose line of best fit is not linear (like the image below)? This is when you would want to use polynomial regression. Using Wikipedia again, it’s defined as a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial in x. In simpler terms, it fits a non-linear relationship between x and y.

When you want to use polynomial regression, a few extra lines of code are needed:

from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = n) # where n is the degree of the polynomial that you want
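Putting it together, here is a sketch fitting a quadratic (degree 2) curve to made-up data that follows 2x² + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 3.0, 9.0, 19.0, 33.0])  # follows 2x^2 + 1 exactly

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(x)  # expands each x into columns for 1, x, x^2

# A linear model on the expanded features is a polynomial model in x
model = LinearRegression().fit(X_poly, y)
print(model.predict(poly.transform([[5.0]]))[0])
```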

Example: Predicting average life expectancy

To demonstrate how to build a regression model in Python, I used the ‘Life Expectancy (WHO)’ dataset on Kaggle here. My goal was to create a model that could predict the average life expectancy of a person in a given country in a given year based on a number of variables. Keep in mind that this is a very basic model – my next post will go through different methods to improve a regression model.

In terms of prepping the data, I more or less followed the steps that I laid out in my EDA blog posts.

Part 1 here.

Part 2 here.

There are a couple of new topics that I introduced in this model, like converting categorical data (countries) into dummy variables and evaluating the Variance Inflation Factor (VIF) of all variables. Again, I’m going to go through all of these new topics in my next blog post.
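Those two steps can be sketched as follows on a toy DataFrame (the column names and values are illustrative, not the exact ones from the WHO dataset):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy frame standing in for the WHO data
df = pd.DataFrame({
    "Country": ["Peru", "Chad", "Peru", "Japan"],
    "GDP": [6.1, 0.7, 6.3, 39.0],
    "Schooling": [13.0, 7.0, 13.2, 15.3],
})

# Convert the categorical column into 0/1 dummy columns,
# dropping one level to avoid perfect multicollinearity
df = pd.get_dummies(df, columns=["Country"], drop_first=True)

# VIF for each numeric predictor; values above ~10 often flag multicollinearity
X = df[["GDP", "Schooling"]]
vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(df.columns.tolist())
print(vifs)
```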

One thing that I wanted to share was the correlation heatmap because there are some really interesting correlations here:

  • There is a strong positive correlation between ‘Schooling’ and ‘Life Expectancy’ of 0.73. This may be because education is more established and prevalent in wealthier countries, which tend to have less corruption and better infrastructure, healthcare, welfare, and so forth.
  • Similarly to the point above, there is a moderate positive correlation between ‘GDP’ and ‘Life Expectancy’ of 0.44, most likely due to the same reason.
  • Surprisingly, there’s a moderate positive correlation between ‘Alcohol’ and ‘Life Expectancy’ of 0.40. I’m guessing this is because only wealthier countries can afford alcohol, or because alcohol consumption is more prevalent among wealthier populations.
# Create Model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Expand the features into degree-2 polynomial terms
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X)

# Split into training and test sets, then fit the model
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=.30, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

After cleaning the data, I executed the code above to create my polynomial multiple regression model, which achieved an MAE of 8.22 on a target variable with a range of 44.4 years. In my next blog post, I’ll introduce several methods to improve a regression model (also applicable to most machine learning models) using the same dataset that I used here.
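For reference, the MAE above is just the average absolute difference between predicted and actual test values; a sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Illustrative true vs. predicted life expectancies (years)
y_test = np.array([71.0, 65.0, 82.0, 59.0])
y_pred = np.array([69.5, 66.0, 80.0, 62.0])

mae = mean_absolute_error(y_test, y_pred)
print(mae)  # mean of |y_test - y_pred|
```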


Related Articles