
An Introduction to Linear Regression for Data Science

An overview of one of the most popular machine learning algorithms


Introduction

In the world of machine learning – where humans find creative ways for machines to find creative ways to solve problems – few algorithms are as popular as linear regression.

To put it simply, linear regression is to machine learning as Yoda is to Star Wars: it’s super old, it’s easy to ignore, but it is very powerful. It would behoove you to befriend it if you are to save the galaxy.

In this article, I will explain linear regression in a way that I hope feels intuitive and accessible. Along the way, I’ll introduce some key ideas that I think are useful if you’re starting to learn Data Science. I’ll also include the complete Python code for applying linear regression to a real-world example so you can follow along.

Table of Contents

The article is divided into three parts:

  • Part I: Why relationships matter
  • Part II: Why modeling relationships matters
  • Part III: Why evaluating models of relationships matters

Let’s begin this journey!

Part I: Why relationships matter

A good starting question for linear regression is: why do we care about relationships? There’s a reason the Harvard Business Review called data scientist "the sexiest job of the 21st century," and it comes down to three simple facts:

  • we have limited information about the world
  • we need to make some really important decisions (such as the dose of a drug we should administer to a patient or how much we should sell a house for)
  • we don’t like leaving things to chance

Since we can’t access information from the future, the only thing left for us to do is to make an educated decision based on the data we already know; more specifically, based on the relationships between the data we know. That’s where linear regression comes in.

Predicting the price of a house

Imagine that you’re hired by a real-estate company to determine the price of a house based on its features.

The challenge is real: if you choose a price that is too high, you risk not selling the house, but if you set the price too low, you lose out on a potential gain. So how do you set the right price?

Loading the King County Housing Data Set

To make this example concrete, we’ll use a popular real-world data set of housing prices from King County, WA, which includes homes sold between May 2014 and May 2015.

You can access the data set here (I’ve modified the original Kaggle data set to make it easier to examine the data). I’ll be running Python code on Google Colab, which I highly recommend (you can set up a Python notebook on your browser with one click).

First, let’s load some libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

  • Pandas: a library for manipulating tabular data (e.g. data in rows and columns)
  • NumPy: a library that helps perform mathematical functions on arrays
  • Matplotlib: a library for creating plots and visualizations
  • Seaborn: a data visualization library that builds on top of matplotlib to offer a wider variety of visualizations (I often prefer seaborn plots over matplotlib plots)

We can use the pandas read_csv() method to load the csv data set as a DataFrame (which is structured in rows and columns). We’ll save it as ‘houses’ and print the first 5 rows:

url = 'https://github.com/dhyan6/data-science-projects/blob/main/kc_house_data.csv?raw=true' 
houses = pd.read_csv(url)
print(houses.head())

We get a sense of what this data set is like, but we can do better. Applying .info() to houses gives us a nice summary:

print(houses.info())

We have data on 21,613 houses that were sold in King County and a list of 10 variables that could potentially help us predict the price of a house. A note on variables that are not self-explanatory:

  • sqft_living: Square footage of the apartment’s interior living space
  • sqft_lot: Square footage of the land space
  • condition: An index from 1 to 5 on the condition of the apartment
  • grade: An index from 1 to 13 on quality of construction and design
  • sqft_above: The square footage of the interior housing space that is above ground level

The question is: which feature or explanatory variable has the strongest relationship with the target variable price? First, we need a way to talk about relationships.

Correlation and the Pearson’s correlation coefficient

Correlation describes the relationship between two variables, and the standard measure of the strength and direction of that relationship is Pearson’s correlation coefficient, also called Pearson’s r, or just r for short (see the snippet after the list below for how to compute it).

The correlation coefficient can take any value from -1 to 1.

  • r = 1 means that two variables are perfectly correlated (a positive change in one variable perfectly predicts a positive change in the other)
  • r = 0 means that two variables are not correlated (a change in one variable does not predict a change in the other)
  • r = -1 means that two variables are negatively correlated (a positive change in one variable perfectly predicts a negative change in the other)
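To make this concrete, here is a quick way to check Pearson’s r between two of our columns with pandas (this assumes the houses DataFrame we loaded earlier):

# Pearson's r between living area and price (pandas computes Pearson by default)
r = houses['sqft_living'].corr(houses['price'])
print(round(r, 2))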

Here’s a neat chart that shows different correlation coefficients (r values) for different scatter plots:

Two important facts about r

Note that r does not translate directly to a percentage of accuracy. An r value of .6 does not mean that 60% of the data are close to the line of best fit, nor that the model is 60% accurate. The coefficient r is simply a measure of how strong the linear relationship is and whether it’s positive or negative.

  1. The sign of r indicates whether the linear relationship is positive or negative.
  2. The value of r (without the sign) indicates the strength of the linear relationship.

Display a Heat Map

Let’s display the correlation coefficients between all possible variables in our houses data set using a heat map.

A few pointers on how to read a heat map:

  • the value for each box represents the correlation coefficient for the variables in the corresponding row and column
  • in our current color scheme, lighter boxes represent a higher correlation coefficient
  • the diagonal values are all 1 because every variable has a perfect correlation with itself

correlations = houses.corr()
sns.heatmap(correlations, annot=True)

To find the strongest relationship, we look for the box with the lightest color or the highest value. The variable sqft_living has the strongest relationship with price (r = 0.7).

Scatter plot

Let’s visualize this relationship with a scatter plot.

plt.title("House Price vs Square Feet")
plt.xlabel("Square Feet")
plt.ylabel("House Price (in millions)")
sns.scatterplot(x='sqft_living', y='price', data=houses)

Visually, there appears to be a general linear relationship between these variables: house prices tend to increase as the square footage increases (we’ll talk more about how to describe this soon). It’s time to build a model!

Part II: Why modeling relationships matters

By now, hopefully you subscribe to the idea that relationships really matter in data science (also in real life, but let’s stick to data science for now). By building models of relationships, we can make predictions about data that we’re interested in.

A linear model is the equation of a line that describes the relationship between a predictor variable X and an outcome variable Y.

Here, we’ve introduced new terms. You can think of a linear model as a function f that receives some input X and returns an output Y. We call X a predictor variable, and Y the target or outcome variable.

In this case, we are building a linear model, so our equation will be the slope-intercept equation of a line, y = mx + b.

As a mathematical convention, we’ll write the previous equation as:

ŷ = β0 + β1·x

where β0 and β1 are the model’s coefficients or parameters. Finding the best model for our data is equivalent to finding the parameters β0 (y-intercept) and β1 (slope) that minimize the distance between the values estimated by the model and the actual values in our data.
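If it helps to see this in code, a linear model is just a function of one input. Here is a minimal sketch with placeholder coefficients (made-up values, not the fitted ones we’ll compute below):

# A linear model as a plain Python function.
# beta_0 and beta_1 are placeholder values, not fitted coefficients.
def linear_model(x, beta_0=50_000, beta_1=250.0):
    return beta_0 + beta_1 * x

print(linear_model(2_000))  # 50,000 + 250 * 2,000 = 550,000.0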

Finding the line of best fit

How do we describe how well our linear model fits the data? One way to think about accuracy is to think about error. The difference between the predicted value and the actual value is called the residual (visually, the vertical distance between a data point and the fitted line):

eᵢ = yᵢ − ŷᵢ

where:

  • ŷᵢ (read "y hat") is the value predicted by the model
  • yᵢ is the actual value in the data set

One way to find the line of best fit is to try to minimize the Mean Squared Error.

Mean Squared Error (MSE)

The Mean Squared Error is a measure of the average of the squares of the residuals. A line of best fit is a line whose coefficients β0 (y-intercept) and β1 (slope) minimize the mean squared error.

Mathematically, this can be expressed as:

MSE = (1/n) · Σ (yᵢ − ŷᵢ)²
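As a quick sanity check, here is a tiny NumPy sketch of that formula on a handful of made-up prices (the numbers are purely illustrative):

# The MSE formula by hand, on toy values
import numpy as np

y_actual = np.array([300_000, 450_000, 620_000])     # observed prices (made up)
y_predicted = np.array([320_000, 430_000, 600_000])  # model estimates (made up)
mse = np.mean((y_actual - y_predicted) ** 2)          # average of the squared residuals
print(mse)  # 400000000.0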

Building a Simple Linear Regression Model

We’ll use the Python library sklearn to build a simple linear regression model that finds the line of best fit. Again, we are trying to calculate the coefficients β0 and β1 that minimize the mean squared error.

Because we’ll want to evaluate the model’s predictions on data it hasn’t seen before (to give us a sense of accuracy), we’ll train the model on a subset of the data and test it on another subset.

Step 1: Define our train and test data

# Import the module train_test_split
from sklearn.model_selection import train_test_split
# Define our predictor and target variables
X = houses[['sqft_living']]
Y = houses['price']
# Create four groups using train_test_split. By default, 75% of data is assigned to train, the other 25% to test.
x_train, x_test, y_train, y_test = train_test_split(X, Y)

Step 2: Build and fit the model

# Import the library
from sklearn.linear_model import LinearRegression
# Initialize a linear regression model object
lr = LinearRegression() 
# Fit the linear regression model object to our data
lr.fit(x_train, y_train)
# Print the intercept and the slope of the model
print(lr.intercept_) 
print(lr.coef_) 
# Show line of best fit
plt.plot(x_train, lr.coef_*x_train + lr.intercept_, '-r', label='Intercept: -39,163\nSlope: 279.4')
plt.legend()

The coefficient β1 of our model tells us that the price increases by approximately $280 for every additional square foot in a house. So now, let’s say we want to predict the price of a house with 4,600 square feet.

We can use the .predict() method on our model lr to obtain the price $1,246,552.

lr.predict([[4600]])
>> 1246552

Part III: Why evaluating models of relationships matters

Once you have a model of the relationship between a predictor variable X and an outcome variable Y, a logical question to ask is: how good is this model? After all, the decisions you make in real life will depend on how much you trust your model.

A few things to keep in mind:

  • You may think your model is good, but that’s not enough. You will probably need to convince other stakeholders (your boss, professor, investors, etc.) that your model is actually good. You’ll need quantitative proof to do this.
  • The model should fit the data well enough that you feel comfortable making predictions on data that it hasn’t seen yet.

Model evaluation is a fascinating topic on its own, which I can’t possibly do justice in this section. For now, I want to leave you with a few pointers on how to determine the accuracy of a linear regression model.

Root Mean Squared Error

A popular way to measure the error of a model is the Root Mean Squared Error (RMSE), which is essentially a measure of the typical distance between our model’s predicted points (on the line of best fit) and the actual data points.

The RMSE is the square root of the average of the squares of the residuals. This can be expressed as:

RMSE = √( (1/n) · Σ (yᵢ − ŷᵢ)² )

In other words, we are taking the square root of the Mean Squared Error defined earlier.

We can use a module from sklearn to calculate the MSE, and then apply the square root. Note that we’ll use the y_test values below, which are the prices the model has never seen before:

# Define a set of predictions for y based on subset x_test
y_pred = lr.predict(x_test)
# Import module
from sklearn.metrics import mean_squared_error
# We pass the test values and the predicted values
mse = mean_squared_error(y_test, y_pred)
# Let's take the square root
rmse = np.sqrt(mse)
# Print the result
print('Root Mean Squared Error: ' + str(rmse))

Root Mean Squared Error: $266,725! As you can tell, this number is a bit on the high end. We would ideally like the RMSE to be smaller.

Given that the price of a house depends on more factors than just the square footage, it’s not a big surprise that the RMSE is large. One way to address this is to include more predictor variables in the regression model to make it more accurate, as sketched below (I’ll be writing more about this soon).
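As a preview, here is a hedged sketch of a multiple linear regression using the same sklearn API. It assumes the grade and condition columns in our data set are numeric, and the exact RMSE will vary with the random train/test split:

# A sketch of multiple linear regression: same API, more predictor columns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X_multi = houses[['sqft_living', 'grade', 'condition']]  # assumed numeric columns
y = houses['price']
x_train_m, x_test_m, y_train_m, y_test_m = train_test_split(X_multi, y)

lr_multi = LinearRegression()
lr_multi.fit(x_train_m, y_train_m)
rmse_multi = np.sqrt(mean_squared_error(y_test_m, lr_multi.predict(x_test_m)))
print('RMSE with three predictors:', rmse_multi)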

R-Squared

A second popular measure of how well the fitted regression line fits the data is called r-squared (or r²).

R-squared, also known as the coefficient of determination, is the proportion of the variation in the outcome variable that is explained by the linear model. In general, a higher r-squared value represents a better fit to the data.

Let’s calculate R-Squared. We’ll provide as inputs the subset y_test and the subset y_pred, which we obtained by applying our model to x_test.

# Import r2_score module
from sklearn.metrics import r2_score
# Print R2 Score
print(r2_score(y_test, y_pred))

R-Squared = .48. This means that 48% of the variation in the data is explained or predicted by the linear model. It’s a good start.
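To connect this number back to the "proportion of variation explained" definition, here is a hedged manual computation that should match r2_score (it reuses the y_test and y_pred values from the RMSE section):

# Manual R² to mirror the definition above
ss_res = np.sum((y_test - y_pred) ** 2)          # variation left unexplained by the model
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total variation in the observed prices
print(1 - ss_res / ss_tot)                       # should match r2_score(y_test, y_pred)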

Conclusion

I hope this article was useful as an introduction to why we care about building and evaluating linear models. In a future post, I’ll explore the advantages of using multiple linear regression and discovering non-linear relationships.

In short, linear regression is a powerful supervised Machine Learning algorithm that can help us model linear relationships between two variables. Simple linear regression is often a good starting point for exploring our data and thinking about how to build more complex models.

If you want to check out more resources, I highly recommend:

  • An Introduction to Statistical Learning by Gareth James (PDF)
  • Machine Learning by Andrew Ng (Coursera)

Thanks for reading! If you enjoyed this piece, feel free to request new topics or start a conversation thread below.

I’ll be explaining a new data science concept each week in a way that I hope is fun and intuitive.

