
Linear regression performs a regression task on a target variable based on independent variables in the given data. It is a machine learning algorithm often used to find the relationship between the target and the independent variables.
The simple linear regression model predicts the target variable using one independent variable.
When one variable/column in a dataset is not sufficient to build a good model and make accurate predictions, we use a multiple linear regression model instead of a simple linear regression model.
The line equation for the multiple linear regression model is:
y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + ... + βₚXₚ + e
Before proceeding further with building the model in Python, we need to consider a few things:
- Adding more variables isn’t always helpful, because the model may ‘over-fit’ and become too complicated. An over-fitted model doesn’t generalize to new data; it only works well on the data it was trained on.
- All the variables/columns in the dataset may not be independent of one another. This condition is called multicollinearity, where there is an association between the predictor variables.
- We have to select the appropriate variables to build the best model. This process of selecting variables is called feature selection.
We’ll discuss points 2 & 3 using python code.
Now, let’s dive into the Jupyter Notebook
and see how we can build the Python model.
Reading and Understanding the Dataset
We read the data into our system and check whether it has any anomalies.
For the remainder of the article, we are using a dataset that can be downloaded from here.
The target variable/column in the dataset is Price.
We’ll import the necessary libraries to read the data and convert it into a pandas dataframe.
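Here is a minimal sketch of that step, assuming the downloaded file is saved locally as Housing.csv (the file name is an assumption; use whatever path you saved the data to):

```python
import numpy as np
import pandas as pd

# Read the CSV into a pandas dataframe
housing = pd.read_csv('Housing.csv')

# Peek at the first few rows
housing.head()
```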
The sample data frame looks like this,

Let’s check for any null values in the dataset using .info(), and check for any outliers using .describe().
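A quick sketch of those two checks:

```python
# Non-null counts and dtypes for every column
housing.info()

# Summary statistics; a max far beyond the 75% quartile hints at outliers
housing.describe()
```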
The output is,

Observe that there are no null values in the data and no obvious outliers.
Data Preparation
If we observe the dataset, there are numeric columns and columns with values of ‘Yes’ or ‘No.’ But to fit a regression line, we need numeric values, so we’ll convert the ‘Yes’ and ‘No’ entries to 1s and 0s.
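A sketch of that conversion; the binary column names below are assumptions read off the dataset screenshots, and the values are assumed to be lowercase ‘yes’/‘no’:

```python
# Columns holding yes/no values (names assumed from the dataset screenshots)
binary_cols = ['mainroad', 'guestroom', 'basement',
               'hotwaterheating', 'airconditioning', 'prefarea']

# Map yes -> 1 and no -> 0 in each of those columns
housing[binary_cols] = housing[binary_cols].apply(
    lambda col: col.map({'yes': 1, 'no': 0}))
```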
Let’s see the dataset now,

The furnishingstatus column has three levels: furnished, semi-furnished, and unfurnished.
We need to convert this column to numeric as well. To do that, we’ll use dummy variables.
When you have a categorical variable with n levels, the idea of dummy encoding is to build n-1 indicator variables representing those levels.
We can create dummy variables using the get_dummies method in pandas.
Let’s see what the furnishingstatus column looks like in the dataset.
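A minimal sketch:

```python
# One indicator column per level of furnishingstatus
status = pd.get_dummies(housing['furnishingstatus'])
status.head()
```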

Now, we don’t need all three columns. We can drop the furnished column, as it can be identified from the values of the last two columns, where:
- 00 corresponds to furnished
- 01 corresponds to unfurnished
- 10 corresponds to semi-furnished
Let’s drop the furnished column and add the status dummies to our original dataset. After that, we’ll drop the furnishingstatus column from the dataset.
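A sketch of those three steps:

```python
# Drop the 'furnished' dummy; the remaining two columns encode all three levels
status = status.drop('furnished', axis=1)

# Add the two dummy columns to the original dataframe
housing = pd.concat([housing, status], axis=1)

# The original text column is no longer needed
housing = housing.drop('furnishingstatus', axis=1)
```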
The modified dataset looks like,

Now let’s build the model. As we have seen in the simple linear regression model article, the first step is to split the dataset into train and test data.
Splitting the Data into Two Different Sets
We’ll split the data into training and test sets in a 7:3 ratio.
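A sketch using scikit-learn’s train_test_split (the random_state value is an arbitrary choice that just pins the shuffle for reproducibility):

```python
from sklearn.model_selection import train_test_split

# 70% of the rows go to training, 30% to testing
df_train, df_test = train_test_split(housing, train_size=0.7,
                                     random_state=100)
```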
Re-scaling the Features
We can see that all the columns in the dataset have small integer values except the area column, so it is important to re-scale the variables to give them all a comparable scale. Without comparable scales, some of the regression coefficients would be on different units than the others, making them hard to compare.
To do that, we use the MinMax scaling method.
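A sketch, assuming the numeric column names shown in the screenshots; whether to scale the target price as well is a modeling choice, and it is included here. Note the scaler is fitted on the training data only:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Columns on larger numeric scales (names assumed from the dataset)
num_cols = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking', 'price']

# Fit the scaler on the training data, then transform it
df_train[num_cols] = scaler.fit_transform(df_train[num_cols])
```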
The training dataset looks like this,

Building a linear model
Before building the model, we need to divide the data into X and Y sets.
First, we’ll add all the variables except the target variable to the model.
Adding all the variables to the model
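A sketch using statsmodels, assuming the target column is named price after reading the data:

```python
import statsmodels.api as sm

# Separate the target from the predictors
y_train = df_train.pop('price')
X_train = df_train

# statsmodels does not add an intercept automatically
X_train_sm = sm.add_constant(X_train)

# Fit ordinary least squares with all the variables
lr_model = sm.OLS(y_train, X_train_sm).fit()
lr_model.summary()
```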
The summary of the model is,

If we look at the p-values of some of the variables, they seem pretty high, which means those variables aren’t significant. That means we can drop them from the model.
Before dropping the variables, as discussed above, we have to check the multicollinearity between the variables. We do that by calculating the VIF value.
The Variance Inflation Factor, or VIF, is a quantitative value that says how strongly the feature variables are correlated with each other. It is an extremely important parameter for testing our linear model. The formula for VIF is:
VIF = 1 / (1 - Rᵢ²)
where Rᵢ² is the R² obtained by regressing the i-th feature on all the other features.
In Python, we can calculate the VIF values by importing variance_inflation_factor from statsmodels.
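A sketch of the usual VIF loop:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# One VIF per predictor column
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i)
              for i in range(X_train.shape[1])]
vif.sort_values(by='VIF', ascending=False)
```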
The VIF values for every column are,

We generally keep variables with a VIF value of less than 5. If we observe the output above closely, there are some variables we need to drop.
While dropping variables, the first preference goes to the one with the highest p-value. Also, we have to drop one variable at a time.
Dropping the variable and updating the model
As we can see from the summary and the VIF values, some variables are still insignificant. One of these variables is semi-furnished, as it has a very high p-value of 0.938. Let’s go ahead and drop this variable.
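A sketch of that step (the dummy column name below should match whatever get_dummies actually produced):

```python
# Drop the insignificant variable and refit the model
X_train_2 = X_train.drop('semi-furnished', axis=1)

X_train_2_sm = sm.add_constant(X_train_2)
lr_model_2 = sm.OLS(y_train, X_train_2_sm).fit()
lr_model_2.summary()
```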
The summary of the newly created model is,

Model summary after dropping the semi-furnished variable
Now, let’s calculate the VIF values for the new model.
The VIF values for the new model are,

Now, the variable bedrooms has a high VIF (6.6) and a high p-value (0.206). Hence, it isn’t of much use and should be dropped from the model. We’ll repeat the same process as before.
The summary of the model,

The next step is calculating VIF,
The VIF values are as follows,

We’ll repeat this process until every column’s p-value is < 0.05 and every VIF is < 5.
After dropping all the necessary variables one by one, the final model will be,
The summary for the final model looks like,

The next step is calculating the VIFs for the final model.
The VIF values are as follows,

As we can see, the p-values and VIFs are in the acceptable range. It’s time for us to go ahead and make predictions using the final model. This is how we select the feature variables, the process we discussed earlier.
Now, before making predictions, we have to check whether the error terms are normally distributed. We’ll do that using residual analysis.
Error term = y_actual - y_predicted
The difference between the actual y-value and the predicted y-value using the model at that particular x-value is the error term.
Residual Analysis of the train data
We have to check whether the error terms are normally distributed (which is one of the major assumptions of linear regression), so let’s plot a histogram of the error terms.
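A sketch of the plot, where final_model and X_train_final_sm are hypothetical names for the final fitted model and its predictor matrix:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Residuals on the training data: actual minus predicted
y_train_pred = final_model.predict(X_train_final_sm)  # hypothetical names
residuals = y_train - y_train_pred

# Histogram of the error terms with a smoothed density curve
sns.histplot(residuals, kde=True)
plt.title('Error Terms')
plt.xlabel('Residuals')
plt.show()
```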
The histogram looks like the following,

As we can see, the error terms closely resemble a normal distribution, so we can move ahead and make predictions using the model on the test dataset.
Making Predictions Using the Final Model
We have fitted the model and checked the normality of error terms. Let’s make predictions using the final model.
Just as we did with the training dataset, we first have to scale the test data.
The test dataset looks like this,

We divide the test data into X and Y; after that, we’ll drop the unnecessary variables from the test data based on our final model.
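A sketch of those steps, reusing the scaler fitted on the training set; X_train_final and final_model are hypothetical names for the selected training columns and the final fitted model:

```python
# Transform (never re-fit) the test data with the training scaler
df_test[num_cols] = scaler.transform(df_test[num_cols])

# Split the test data into target and predictors
y_test = df_test.pop('price')
X_test = df_test

# Keep only the columns that survived feature selection
X_test_final = X_test[X_train_final.columns]  # hypothetical name

# Add the constant and predict
X_test_final_sm = sm.add_constant(X_test_final)
y_test_pred = final_model.predict(X_test_final_sm)
```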
Now, we have to see how well the final model fits unseen data. To do that, we’ll calculate the R² value for the predictions on the test set.
We do that by importing the r2_score function from sklearn.
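A minimal sketch:

```python
from sklearn.metrics import r2_score

# R-squared between actual and predicted test values
r2_score(y_true=y_test, y_pred=y_test_pred)
```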
The R² value for the test data = 0.660134403021964; the R² value for the train data = 0.667 (we can see it in the final model summary above).
Since the R² values for the train and test data are almost equal, the model we built generalizes well rather than over-fitting.
This is one type of process to build the multiple linear regression model where we select and drop the variables manually. There is another process called Recursive Feature Elimination (RFE).
Recursive Feature Elimination (RFE)
RFE is an automatic process where we don’t need to select variables manually. We follow the same steps we did earlier, up through re-scaling the features and dividing the data into X and Y.
We will use the LinearRegression function from sklearn together with RFE (which is also a utility from sklearn).
Now we have to run the RFE. In the code, we have to provide the number of variables RFE should consider while building the model.
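A sketch of running RFE; keeping 10 variables is an arbitrary choice here, and the article’s screenshots may use a different count:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, y_train)

# Ask RFE to keep the 10 strongest variables
rfe = RFE(lm, n_features_to_select=10)
rfe = rfe.fit(X_train, y_train)

# support_ flags the kept columns; ranking_ orders the rest
list(zip(X_train.columns, rfe.support_, rfe.ranking_))
```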
The output for the above code is,

As we can see, the variables marked True are essential for the model, while the variables marked False are not needed. If we ever want to add a False variable back to the model, each one has an associated rank that tells us the order in which to add them.
Building the Model
Now, we build the model using statsmodels to get detailed statistics.
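A sketch of refitting the RFE-selected columns with statsmodels:

```python
import statsmodels.api as sm

# Keep only the columns RFE marked as True
rfe_cols = X_train.columns[rfe.support_]
X_train_rfe = sm.add_constant(X_train[rfe_cols])

# Refit with OLS to get the detailed summary table
lr_rfe = sm.OLS(y_train, X_train_rfe).fit()
lr_rfe.summary()
```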
The summary of the model is,

Since the bedrooms column is insignificant compared to the other variables, it can be dropped from the model.
The summary of the model after dropping the bedrooms variable is,

Now, we calculate the VIFs for the model.
The VIF values for the model are,

Since the p-values and VIF are in the desired range, we’ll move forward with the analysis.
The next step is the residual analysis of error terms.
Residual Analysis
So, let’s check if the error terms are also normally distributed using a histogram.
The histogram looks like,

Evaluating the model on test data
We apply the scaling to the test set and divide the data into X and Y.
After that, let’s evaluate the model,
The R² value for the test data = 0.6481740917926483, which is pretty close to the train value.
Since the R² values for both the train and test data are almost equal, the model we built generalizes well.
Conclusion
We built a basic multiple linear regression model in machine learning, both manually and using the automatic RFE approach. Most of the time, we use multiple linear regression instead of a simple linear regression model because the target variable usually depends on more than one variable.
So, it is crucial to learn how multiple linear regression works in machine learning, and without knowing simple linear regression first, it is challenging to understand the multiple linear regression model.
Thank you for reading and happy coding!!!
Check out my previous articles here
- Simple Linear Regression Model using Python: Machine Learning
- Linear Regression Model: Machine Learning
- Exploratory Data Analysis(EDA): Python
- Central Limit Theorem(CLT): Data Science
- Inferential Statistics: Data Analysis
- Seaborn: Python
- Pandas: Python
- Matplotlib: Python
- NumPy: Python