
Simple Linear Regression Model using Python: Machine Learning

Learning how to build a simple linear regression model in machine learning using Jupyter notebook in Python

Photo by Kevin Ku on Unsplash

In the previous article, Linear Regression Model, we saw how the linear regression model works theoretically using Microsoft Excel. In this article, we’ll see how to build a linear regression model using Python in a Jupyter notebook.

Simple Linear Regression

To predict the relationship between two variables, we’ll use a simple linear regression model.

In a simple linear regression model, we’ll predict the outcome of a variable known as the dependent variable using only one independent variable.

We’ll directly dive into building the model in this article. More about the linear regression model and the factors we have to consider are explained in detail here.

Building a linear regression model

To build a linear regression model in Python, we’ll follow five steps:

  1. Reading and understanding the data
  2. Visualizing the data
  3. Performing simple linear regression
  4. Residual analysis
  5. Predictions on the test set

Reading and understanding the data

In this step, first, we’ll import the necessary libraries to import the data. After that, we’ll perform some basic commands to understand the structure of the data.

We can download the sample dataset we’ll be using in this article from here.

Let’s assume we have a company’s data containing the amount spent on different types of advertisements and the subsequent sales.

Import libraries

We’ll import the numpy and pandas libraries in the Jupyter notebook and read the data using pandas.
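A minimal sketch of this step, assuming the dataset is saved locally as advertising.csv (the file name and the variable name advertising are illustrative):

    # Import the libraries and read the data
    # (assumes the dataset is saved as "advertising.csv")
    import numpy as np
    import pandas as pd

    advertising = pd.read_csv("advertising.csv")
    advertising.head()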

The dataset looks like this. Here our target variable is the Sales column.

Advertising data of a company

Understand the data

Let’s perform some tasks to understand the data, like shape, info, and describe.
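Something like the following, reusing the advertising DataFrame from above:

    # Basic commands to understand the structure of the data
    advertising.shape       # dimensions of the DataFrame
    advertising.info()      # column types and null-value counts
    advertising.describe()  # summary statistics of the numeric columns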

The shape of our dataset is,

(200, 4)

Using info, we can see whether there are any null values in the data. If there are, we’ll have to do some data manipulation.

Info of the dataset

As we can observe, there are no null values present in the data.

Using describe, we’ll see whether there is any sudden jump in the data’s values.

Describing the dataset

The values present in the columns are pretty consistent throughout the data.

Visualizing the data

Let’s now visualize the data using the matplotlib and seaborn libraries. We’ll make a pairplot of all the columns and see which columns are the most correlated to Sales.
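A sketch of the pairplot, assuming the column names TV, Radio, Newspaper, and Sales from the dataset above:

    # Scatter plots of each feature against Sales
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.pairplot(advertising, x_vars=["TV", "Radio", "Newspaper"],
                 y_vars="Sales", height=4, aspect=1, kind="scatter")
    plt.show()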

It is always better to use a scatter plot between two numeric variables. The pairplot for the above code looks like,

Pairplot of each column w.r.t. the Sales column

If we cannot determine the correlation using a scatter plot, we can use the seaborn heatmap to visualize the data.
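A minimal sketch, using the correlation matrix of the DataFrame:

    # Heatmap of the pairwise correlations between the columns
    sns.heatmap(advertising.corr(), cmap="YlGnBu", annot=True)
    plt.show()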

The heatmap looks like this,

Heatmap of all the columns in the data

As we can see from the above graphs, the TV column seems most correlated to Sales.

Let’s perform the simple linear regression model using TV as our feature variable.

Performing Simple Linear Regression

The equation of simple linear regression is y = c + m * X.

In our case: y = c + m * TV, where y is Sales. Here c (the intercept) and m (the slope) are known as the model coefficients or model parameters.

We’ll perform simple linear regression in four steps.

  1. Create X and y
  2. Create Train and Test set
  3. Train your model
  4. Evaluate the model

Create X and y

First, we’ll assign our feature variable/column TV as X and our target variable Sales as y.

To generalize,

X represents the independent variable, and y represents the target variable in a simple linear regression model.
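For our dataset, that is simply:

    # Feature (X) and target (y)
    X = advertising["TV"]
    y = advertising["Sales"]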

Create Train and Test sets

We need to split our variables into training and testing sets. We’ll build the model using the training set and evaluate it on the testing set. We’ll split the data into training and testing sets in a 7:3 ratio.

We’ll split the data by importing train_test_split from the sklearn.model_selection module.
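A sketch of the split (the random_state value is an arbitrary seed for reproducibility):

    # Split X and y into train and test sets in a 7:3 ratio
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.7, test_size=0.3, random_state=100)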

Let’s take a look at the training dataset,

X_train data looks like this after splitting.

X_train data after splitting

y_train data looks like this after splitting.

y_train data after splitting

Building and training the model

Using either of the following two packages, we can build a simple linear regression model:

  • statsmodels
  • sklearn

First, we’ll build the model using the statsmodels package. To do that, we need to import statsmodels.api to perform linear regression.

By default, statsmodels fits a line that passes through the origin. But if we observe the simple linear regression equation y = c + m * X, it has an intercept value c. So, to have an intercept, we need to add a constant manually using the add_constant method.

Once we’ve added the constant, we can fit the regression line using the OLS (Ordinary Least Squares) method in statsmodels. After that, we’ll look at the parameters of the straight line, i.e., c and m.
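A sketch of these steps with statsmodels:

    # Add a constant to X_train so the fitted line has an intercept,
    # then fit the regression line with Ordinary Least Squares
    import statsmodels.api as sm

    X_train_sm = sm.add_constant(X_train)
    lr = sm.OLS(y_train, X_train_sm).fit()
    lr.params  # c (intercept) and m (slope)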

The output is,

Intercept and Slope of the line

Let’s look at the summary of the fitted regression line: all the different parameters, the probability of the F-statistic, and the p-values.
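The full summary is available on the fitted results object:

    # Summary statistics of the fitted regression line
    print(lr.summary())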

The statistics for the above regression line are,

All the statistics for the above best-fit line

So, the statistics we are mainly concerned with to determine whether the model is viable or not are:

  1. The coefficients and their p-values (significance)
  2. The R-squared value
  3. The F-statistic and its significance

Statistics we need to look at

  1. The coefficient for TV is 0.054, and its corresponding p-value is very low, almost 0. That means the coefficient is statistically significant.

We have to make sure the p-value is low (typically below 0.05) for a coefficient to be significant.

  2. The R-squared value is 0.816, which means that 81.6% of the variance in Sales can be explained by the TV column using this line.
  3. The F-statistic has a very low probability, practically zero, which tells us that the model fit is statistically significant.

Since the fit is significant, let’s go ahead and visualize how well the straight line fits the scatter plot of the TV and Sales columns.

From the parameters, we got the values of the intercept and the slope for the straight line. The equation of the line is,

Sales = 6.948 + 0.054 * TV
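A sketch of the plot, using the intercept and slope recovered above:

    # Scatter plot of the training data with the fitted line on top
    plt.scatter(X_train, y_train)
    plt.plot(X_train, 6.948 + 0.054 * X_train, "r")
    plt.xlabel("TV")
    plt.ylabel("Sales")
    plt.show()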

The graph looks like this,

Best-fit regression line

This is how we build a simple linear regression model using training data. Now, before evaluating the model on the test data, we have to perform residual analysis.

Residual Analysis

One of the major assumptions of the linear regression model is that the error terms are normally distributed.

Error = Actual y value - y predicted value

Now we have to predict the y values from the training data using the predict method, and then compute the error terms (residuals) from the predicted values.

Next, let’s plot a histogram of the residuals and see whether it looks like a normal distribution or not.
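A sketch of both steps, reusing the fitted results object lr and X_train_sm from above:

    # Predict y on the training data and compute the residuals
    y_train_pred = lr.predict(X_train_sm)
    res = y_train - y_train_pred

    # Histogram of the residuals
    plt.hist(res, bins=15)
    plt.xlabel("Residuals")
    plt.show()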

The histogram of the residuals looks like,

Residuals distribution

As we can see, the residuals follow a normal distribution with a mean of 0.

Now, we should make sure that the residuals do not follow any specific pattern, which we can check with a scatter plot (see the sketch below).
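One way to check is to plot the residuals against the feature values:

    # Scatter plot of the residuals to check for patterns
    plt.scatter(X_train, res)
    plt.xlabel("TV")
    plt.ylabel("Residuals")
    plt.show()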

The scatter plot looks like,

Scatter plot of Residual values

Since the residuals follow a normal distribution and do not follow any specific pattern, we can use the linear regression model we have built to make predictions on the test data.

Predictions on the Test data or Evaluating the model

Now that we have fitted the regression line on our train dataset, we can make some predictions on the test data. Similar to the training dataset, we have to add a constant to the test data and predict the y values using the predict method in statsmodels.
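A sketch of the prediction step:

    # Add a constant to X_test and predict y values on the test set
    X_test_sm = sm.add_constant(X_test)
    y_test_pred = lr.predict(X_test_sm)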

The predicted y-values on test data are,

Predicted y-values

Now, let’s calculate the R² value for the above-predicted y-values. We can do that by simply importing the r2_score function from the sklearn.metrics module.
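Something like:

    # R² on the test set
    from sklearn.metrics import r2_score

    r2 = r2_score(y_true=y_test, y_pred=y_test_pred)
    print(r2)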

The R² value by using the above code = 0.792

If we remember from the training data, the R² value = 0.816.

Since the R² value on the test data is within 5% of the R² value on the training data, we can conclude that the model is pretty stable: what the model has learned on the training set generalizes to the unseen test set.

Let’s visualize the line on the test data.
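A sketch of the plot, reusing the same line equation:

    # Scatter plot of the test data with the fitted line on top
    plt.scatter(X_test, y_test)
    plt.plot(X_test, 6.948 + 0.054 * X_test, "r")
    plt.xlabel("TV")
    plt.ylabel("Sales")
    plt.show()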

The scatter plot with the best-fit line looks like,

Best-fit line on test data

This is how we build a linear regression model using the statsmodels package.

Apart from statsmodels, we can also build a linear regression model using sklearn. Using the linear_model module from sklearn, we can make the model.

Similar to statsmodels, we’ll split the data into train and test sets.

For simple linear regression with sklearn, we need to add a column dimension to X, i.e., reshape it into a 2-D array, for the regression fit to work properly.
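A sketch of the reshape, reusing the earlier train/test split:

    # sklearn expects a 2-D feature array, so reshape the 1-D
    # series into single-column arrays
    X_train_lm = X_train.values.reshape(-1, 1)
    X_test_lm = X_test.values.reshape(-1, 1)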

The shape of X_train before adding the column is (140,); after reshaping, it is (140, 1).

Now, let’s fit the line by importing LinearRegression from sklearn.linear_model.
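A sketch of the fit:

    # Fit a linear regression model on the reshaped training data
    from sklearn.linear_model import LinearRegression

    lm = LinearRegression()
    lm.fit(X_train_lm, y_train)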

Now, let’s find the coefficients of the model.
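Assuming the fitted model is named lm as above:

    # Intercept (c) and slope (m) of the fitted line
    print(lm.intercept_)
    print(lm.coef_)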

The value of intercept and slope is,

Coefficient Values

The straight-line equation we get for the above values is:

Sales = 6.948 + 0.054 * TV

If we observe, the equation we got here is the same as the one we got with statsmodels.

After that, we’ll make predictions on the train and test data and evaluate the model by comparing the R² values.
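A sketch, reusing r2_score from earlier:

    # Predict on the train and test data and compare R² values
    y_train_pred = lm.predict(X_train_lm)
    y_test_pred = lm.predict(X_test_lm)

    print(r2_score(y_true=y_train, y_pred=y_train_pred))
    print(r2_score(y_true=y_test, y_pred=y_test_pred))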

The R² values of the train and test data are:

R² (train data) = 0.816
R² (test data) = 0.792

Same as with statsmodels, the R² value on the test data is within 5% of the R² value on the training data. We can apply the model to the unseen test set in the future.

Conclusion

As we have seen, we can build a linear regression model using either statsmodels or sklearn.

We have to make sure to follow these five steps to build the simple linear regression model:

  1. Reading and understanding the data
  2. Visualizing the data
  3. Performing simple linear regression
  4. Residual analysis
  5. Predictions on the test set

In the next article, we’ll see how the multiple linear regression model works.

Thank you for reading and happy coding!!!

Check out my previous articles here
