
In the previous article, Linear Regression Model, we saw how the linear regression model works theoretically, using Microsoft Excel. In this article, we'll see how to build a linear regression model using Python in a Jupyter notebook.
Simple Linear Regression
To model the relationship between two variables, we’ll use a simple linear regression model.
In a simple linear regression model, we’ll predict the outcome of a variable known as the dependent variable using only one independent variable.
We’ll directly dive into building the model in this article. More about the linear regression model and the factors we have to consider are explained in detail here.
Building a linear regression model
To build a linear regression model in python, we’ll follow five steps:
- Reading and understanding the data
- Visualizing the data
- Performing simple linear regression
- Residual analysis
- Predictions on the test set
Reading and understanding the data
In this step, first, we’ll import the necessary libraries to import the data. After that, we’ll perform some basic commands to understand the structure of the data.
We can download the sample dataset, which we’ll be using in this article, from here.
Let’s assume we have a company’s data containing the amount spent on different types of advertisements and the subsequent sales.
Import libraries
We’ll import the `numpy` and `pandas` libraries in the Jupyter notebook and read the data using `pandas`.
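A minimal sketch of this step (the file name advertising.csv is an assumption; use the path of your downloaded file):

```python
import numpy as np
import pandas as pd

# Read the advertising dataset into a DataFrame
advertising = pd.read_csv("advertising.csv")
advertising.head()
```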
The dataset looks like this. Here our target variable is the Sales column.

Understand the data
Let’s perform some tasks to understand the data, like `shape`, `info`, and `describe`.
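Something like the following, run cell by cell, covers these checks:

```python
advertising.shape       # number of rows and columns
advertising.info()      # column dtypes and non-null counts
advertising.describe()  # summary statistics for each numeric column
```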
The `shape` of our dataset is,
(200, 4)
Using `info`, we can check whether there are any null values in the data. If there are, we’ll have to do some data manipulation.

As we can observe, there are no null values present in the data.
Using `describe`, we’ll check whether there are any sudden jumps in the data’s values.

The values present in the columns are pretty consistent throughout the data.
Visualizing the data
Let’s now visualize the data using the `matplotlib` and `seaborn` libraries. We’ll make a pairplot of all the columns and see which columns are the most correlated to `Sales`.
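A sketch of the pairplot code (the Radio and Newspaper column names are assumptions, based on the usual advertising dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plots of each advertising channel against Sales
sns.pairplot(advertising, x_vars=["TV", "Radio", "Newspaper"],
             y_vars=["Sales"], kind="scatter")
plt.show()
```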
It is always better to use a scatter plot between two numeric variables. The pairplot for the above code looks like,

If we cannot determine the correlation using a scatter plot, we can use the seaborn heatmap to visualize the data.
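A minimal heatmap sketch:

```python
# Heatmap of pairwise correlations; annot=True prints the values in each cell
sns.heatmap(advertising.corr(), cmap="YlGnBu", annot=True)
plt.show()
```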
The heatmap looks like this,

As we can see from the above graphs, the TV column seems most correlated to Sales.
Let’s perform the simple linear regression model using TV as our feature variable.
Performing Simple Linear Regression
Equation of simple linear regression
y = c + mX
In our case:
y = c + m * TV
Here, c (the intercept) and m (the slope) are known as the model coefficients or model parameters.
We’ll perform simple linear regression in four steps.
- Create X and y
- Create Train and Test set
- Train your model
- Evaluate the model
Create X and y
First, we’ll assign our feature column `TV` as `X` and our target variable `Sales` as `y`.
To generalize: in a simple linear regression model, `X` represents the independent variable and `y` represents the target variable.
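For example:

```python
# Feature variable and target variable
X = advertising["TV"]
y = advertising["Sales"]
```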
Create Train and Test sets
We need to split our variables into training and testing sets. We’ll build the model using the training set and then evaluate it on the testing set. We’ll split the data into training and testing sets in a 7:3 ratio.
To do this, we’ll import `train_test_split` from the `sklearn.model_selection` library.
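A sketch of the split (the random_state value is an arbitrary choice, fixed for reproducibility):

```python
from sklearn.model_selection import train_test_split

# 70% of the rows go to the training set, 30% to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=100
)
```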
Let’s take a look at the training dataset,
X_train data looks like this after splitting.

y_train data looks like this after splitting.

Building and training the model
We can build a simple linear regression model using either of the following two packages:
- statsmodels
- sklearn
First, we’ll build the model using the `statsmodels` package. To do that, we need to import the `statsmodels.api` library to perform linear regression.
By default, the `statsmodels` library fits a line that passes through the origin. But if we look at the simple linear regression equation y = c + mX, it has an intercept term c. So, to get an intercept, we need to add it manually using the `add_constant` attribute. Once we’ve added the constant, we can fit the regression line using the OLS (Ordinary Least Squares) method available in `statsmodels`. After that, we’ll look at the parameters of the straight line, i.e., c and m.
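Putting those steps together, a sketch:

```python
import statsmodels.api as sm

# Add a constant column of 1s so that OLS fits an intercept
X_train_sm = sm.add_constant(X_train)

# Fit the regression line by Ordinary Least Squares
lr = sm.OLS(y_train, X_train_sm).fit()

# Parameters of the fitted line: const (c) and TV (m)
lr.params
```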
The output is,

Let’s look at the summary of the fitted regression line, including statistics like R², the probability of the F-statistic, and the p-values.
The statistics for the above regression line are,

So, the statistics we are mainly concerned with to determine whether the model is viable are:
- The coefficients and their p-values (significance)
- The R-squared value
- The F-statistic and its significance

- The coefficient for TV is 0.054, and its corresponding p-value is very low, almost 0. That means the coefficient is statistically significant. (The p-value has to be low for a coefficient to be significant.)
- The R-squared value is 0.816, which means that 81.6% of the variance in `Sales` can be explained by the `TV` column using this line.
- Prob (F-statistic) has a very low p-value, practically zero, which tells us that the model fit is statistically significant.
Since the fit is significant, let’s go ahead and visualize how well the straight line fits the scatter plot between the `TV` and `Sales` columns.
From the parameters, we got the values of the intercept and the slope of the straight line. The equation of the line is,
Sales = 6.948 + 0.054 * TV
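A plotting sketch using the fitted intercept and slope:

```python
# Training scatter plot with the fitted straight line on top
plt.scatter(X_train, y_train)
plt.plot(X_train, 6.948 + 0.054 * X_train, "r")
plt.xlabel("TV")
plt.ylabel("Sales")
plt.show()
```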
The graph looks like this,

This is how we build a simple linear regression model using training data. Now before evaluating the model on test data, we have to perform residual analysis.
Residual Analysis
One of the major assumptions of the linear regression model is that the error terms are normally distributed.
Error = Actual y value - y predicted value
Now, from the dataset, we have to predict the y values for the training data X using the `predict` attribute. After that, we’ll compute the error terms (residuals) from the predicted data.
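A sketch of these two steps:

```python
# Predicted y values on the training set, and the residuals
y_train_pred = lr.predict(X_train_sm)
res = y_train - y_train_pred
```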
Now, let’s plot a histogram of the residuals and see whether it looks like a normal distribution or not.
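A minimal sketch (using seaborn’s histplot, which is available in recent seaborn versions):

```python
# Histogram of the residuals
sns.histplot(res, bins=15)
plt.title("Error Terms")
plt.xlabel("Residuals")
plt.show()
```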
The histogram of the residuals looks like,

As we can see, the residuals follow a normal distribution with a mean of 0.
Next, we have to make sure that the residuals do not follow any specific pattern, as shown in the plot below.
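For instance, plotting the residuals against the training feature:

```python
# Residuals against the training feature; we want to see no pattern
plt.scatter(X_train, res)
plt.show()
```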
The scatter plot looks like,

Since the residuals follow a normal distribution and do not follow any specific pattern, we can use the linear regression model we have built to evaluate the test data.
Predictions on the Test data or Evaluating the model
Now that we have fitted the regression line on our train dataset, we can make predictions on the test data. As with the training dataset, we have to add a constant to the test data using `add_constant` and predict the y values using the `predict` attribute in `statsmodels`.
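A sketch of those two steps:

```python
# Add a constant to the test features and predict on the test set
X_test_sm = sm.add_constant(X_test)
y_test_pred = lr.predict(X_test_sm)
y_test_pred
```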
The predicted y-values on test data are,

Now, let’s calculate the R² value for the predicted y-values above. We can do that by simply importing `r2_score` from the `sklearn.metrics` package.
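For example:

```python
from sklearn.metrics import r2_score

# R² between the actual and predicted Sales on the test set
r2_score(y_test, y_test_pred)
```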
The R² value from the above code is 0.792.
Recall that on the training data, the R² value was 0.816.
Since the R² value on the test data is within 5% of the R² value on the training data, we can conclude that the model is quite stable. In other words, what the model has learned on the training set generalizes well to the unseen test set.
Let’s visualize the line on the test data.
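A plotting sketch:

```python
# Test scatter plot with the predicted line on top
plt.scatter(X_test, y_test)
plt.plot(X_test, y_test_pred, "r")
plt.show()
```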
The scatter-plot with best-fit line looks like,

This is how we build a linear regression model using the `statsmodels` package.
Apart from `statsmodels`, we can build a linear regression model using `sklearn`. Using the `linear_model` library from `sklearn`, we can build the model.
Similar to `statsmodels`, we’ll split the data into train and test sets.
For simple linear regression with `sklearn`, we need to reshape X into a 2-D array (a single column) for the regression fit to work properly.
The shape of X_train before reshaping is (140,).
After reshaping, the shape of X_train is (140, 1); the test data is reshaped the same way.
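One way to do this (a sketch using numpy reshaping):

```python
# sklearn expects a 2-D feature array, so turn the 1-D series
# into a single column
X_train_lm = X_train.values.reshape(-1, 1)
X_test_lm = X_test.values.reshape(-1, 1)
```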
Now, let’s fit the line by importing `LinearRegression` from `sklearn.linear_model`.
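A minimal sketch:

```python
from sklearn.linear_model import LinearRegression

# Fit the model on the reshaped training data
lm = LinearRegression()
lm.fit(X_train_lm, y_train)
```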
Now, let’s find the coefficients of the model.
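For example:

```python
# Intercept (c) and slope (m) of the fitted line
print(lm.intercept_)
print(lm.coef_)
```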
The value of intercept and slope is,

The straight-line equation we get for the above values is,
Sales = 6.948 + 0.054 * TV
If we observe, the equation we got here is the same as the one we got with `statsmodels`.
After that, we’ll make predictions on the data and evaluate the model by comparing the R² values.
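A sketch, reusing the `r2_score` imported earlier:

```python
# Predict on both sets and compare the R² values
y_train_pred = lm.predict(X_train_lm)
y_test_pred = lm.predict(X_test_lm)

print(r2_score(y_train, y_train_pred))
print(r2_score(y_test, y_test_pred))
```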
The R² values of the train and test data are:
R² train_data = 0.816
R² test_data = 0.792
Just as with `statsmodels`, the R² value on the test data is within 5% of the R² value on the training data. We can apply the model to unseen data in the future.
Conclusion
As we have seen, we can build a linear regression model using either `statsmodels` or `sklearn`.
We have to make sure to follow these five steps to build the simple linear regression model:
- Reading and understanding the data
- Visualizing the data
- Performing simple linear regression
- Residual analysis
- Predictions on the test set
In the next article, we’ll see how the multiple linear regression model works.
Thank you for reading and happy coding!!!
Check out my previous articles here
- Linear Regression Model: Machine Learning
- Exploratory Data Analysis(EDA): Python
- Central Limit Theorem(CLT): Data Science
- Inferential Statistics: Data Analysis
- Seaborn: Python
- Pandas: Python
- Matplotlib: Python
- NumPy: Python