
Multiple Linear Regression Model Using Python: Machine Learning

Learning how to build a basic multiple linear regression model in machine learning using Python

Image by Gordon Johnson from Pixabay

Linear regression is a machine learning algorithm that performs a regression task: it models a target variable as a function of the independent variables in a given dataset, and it is often used to find the relationship between the target and the independent variables.

The simple linear regression model predicts the target variable using a single independent variable.

When one variable/column in a dataset is not sufficient to create a good model and make accurate predictions, we use a multiple linear regression model instead of a simple linear regression model.

The equation for the multiple linear regression model is:

y = β0 + β1X1 + β2X2 + β3X3 + … + βpXp + e

where β0 is the intercept, β1 … βp are the coefficients of the p independent variables, and e is the error term.

Before proceeding further on building the model using python, we need to consider some things:

  1. Adding more variables isn’t always helpful, because the model may ‘over-fit’ and become too complicated. An over-fitted model doesn’t generalize to new data; it works well only on the data it was trained on.
  2. The variables/columns in the dataset may not all be independent of one another. This condition is called multicollinearity: an association between the predictor variables.
  3. We have to select the appropriate variables to build the best model. This process of selecting variables is called feature selection.

We’ll discuss points 2 and 3 using Python code.

Now, let’s dive into the Jupyter Notebook and see how we can build the model in Python.

Reading and Understanding the Dataset

We read the data into our system and check whether it has any anomalies.

For the remainder of the article, we’ll use the dataset, which can be downloaded from here.

The target variable/column in the dataset is Price.

We’ll import the necessary libraries to read the data and convert it into a pandas dataframe.
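A minimal sketch of that step; the file name Housing.csv is an assumption, since the article only links the dataset:

```python
import pandas as pd

# File name is an assumption; use the path of the downloaded dataset.
housing = pd.read_csv("Housing.csv")
housing.head()
```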

The sample data frame looks like this,

Image by Author – Sample Dataset

Let’s check for null values in the dataset using .info(), and check for outliers using .describe().
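The two checks look like this:

```python
housing.info()      # dtypes and non-null counts per column
housing.describe()  # summary statistics, to eyeball the ranges for outliers
```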

The output is,

Image by Author – Checking for Null Values and Outliers

Observe that there are no null values and no obvious outliers in the data.

Data Preparation

If we look at the dataset, some columns are numeric while others hold ‘Yes’ or ‘No’ values. To fit a regression line, we need numeric values, so we’ll convert ‘Yes’ and ‘No’ to 1s and 0s.
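A sketch of the conversion; the list of binary column names below is an assumption about this dataset:

```python
# Hypothetical names for the Yes/No columns in this dataset.
binary_cols = ["mainroad", "guestroom", "basement",
               "hotwaterheating", "airconditioning", "prefarea"]

# Map Yes/No to 1/0 (lower-casing first in case of mixed capitalisation).
housing[binary_cols] = housing[binary_cols].apply(
    lambda col: col.str.lower().map({"yes": 1, "no": 0})
)
```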

Let’s see the dataset now,

Image by Author – Converting the category variables into numeric variables

The furnishingstatus column has three levels: furnished, semi-furnished, and unfurnished.

We need to convert this column into numerical as well. To do that, we’ll use dummy variables.

When a categorical variable has n levels, the idea of dummy variables is to build n − 1 indicator variables for those levels.

We can create dummy variables using the get_dummies function in pandas.
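For example, continuing with the housing dataframe from above:

```python
# One indicator column per furnishing level.
status = pd.get_dummies(housing["furnishingstatus"])
status.head()
```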

The resulting furnishingstatus dummy columns look like this in the dataset,

Image by Author – furnishingstatus dummy columns in the dataset

Now, we don’t need all three columns. We can drop the furnished column, as its information is captured by the last two columns’ values, where:

  • 00 corresponds to furnished
  • 01 corresponds to unfurnished
  • 10 corresponds to semi-furnished

Let’s drop the furnished column, add the status dummies to our original dataset, and then drop the original furnishingstatus column.
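A sketch of those three steps (a drop_first=True argument to pd.get_dummies would collapse them into one call); the level name ‘furnished’ is assumed to match the string in the data:

```python
# Drop the first level manually; the 00 pattern now encodes "furnished".
status = status.drop("furnished", axis=1)

# Append the two remaining dummies and remove the original column.
housing = pd.concat([housing, status], axis=1)
housing = housing.drop("furnishingstatus", axis=1)
```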

The modified dataset looks like,

Image by Author – Sample dataset after adding dummy variables

Now let’s build the model. As we have seen in the simple linear regression model article, the first step is to split the dataset into train and test data.

Splitting the Data into two different sets

We’ll split the data into training and test sets in a 7:3 ratio.
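A sketch using scikit-learn’s train_test_split; the random_state value is arbitrary and only pins the split for reproducibility:

```python
from sklearn.model_selection import train_test_split

# 70% train / 30% test.
df_train, df_test = train_test_split(housing, train_size=0.7,
                                     test_size=0.3, random_state=100)
```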

Re-scaling the Features

We can see that all the columns in the dataset have small values except the area column, so it is important to re-scale the variables so that they are all on a comparable scale. Without comparable scales, some of the regression coefficients will be on very different scales from the others, which makes them hard to compare and interpret.

To do that, we use the MinMax scaling method.
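A sketch with scikit-learn’s MinMaxScaler; the list of numeric column names is an assumption about this dataset:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Scale the numeric columns (names are assumptions) to the [0, 1] range.
num_cols = ["area", "bedrooms", "bathrooms", "stories", "parking", "price"]
df_train[num_cols] = scaler.fit_transform(df_train[num_cols])
```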

The training dataset looks like this,

Image by Author – Training dataset after re-scaling

Building a linear model

Before building the model, we need to divide the data into X and Y sets.

First, we’ll add the variables except for the target variable to the model.

Adding all the variables to the model
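A sketch of that first fit using statsmodels, assuming the scaled target column is named price:

```python
import statsmodels.api as sm

# Separate the target from the predictors.
y_train = df_train.pop("price")
X_train = df_train

# statsmodels does not add an intercept by default.
X_train_lm = sm.add_constant(X_train)
lr_1 = sm.OLS(y_train, X_train_lm).fit()
print(lr_1.summary())
```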

The summary of the model is,

Image by Author – Summary of the model

If we look at the p-values of some of the variables, the values are pretty high, which means those variables aren’t statistically significant. That means we can drop them from the model.

Before dropping the variables, as discussed above, we have to see the multicollinearity between the variables. We do that by calculating the VIF value.

Variance Inflation Factor (VIF) is a quantitative measure of how strongly each feature variable is correlated with the other feature variables. It is an extremely important check for a linear model. The formula for the VIF of the i-th feature is:

VIF_i = 1 / (1 − Ri²)

where Ri² is the R² obtained by regressing the i-th feature on all the other features.

In Python, we can calculate the VIF values by importing variance_inflation_factor from statsmodels.
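A small helper along those lines (the function name vif_table is ours):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    """Return one VIF per feature, largest first."""
    vif = pd.DataFrame()
    vif["Features"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i)
                  for i in range(X.shape[1])]
    return vif.sort_values("VIF", ascending=False)

vif_table(X_train)
```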

The VIF values for every column are,

Image by Author – VIF Values for every variable

Variables with a VIF below 5 are generally considered acceptable. If we look at the image above, there are clearly some variables we need to drop.

When dropping variables, the first preference goes to the p-value. Also, we have to drop one variable at a time.

Dropping the variable and updating the model

As we can see from the summary and the VIF, some variables are still insignificant. One of these variables is semi-furnished, as it has a very high p-value of 0.938. Let’s go ahead and drop this variable.
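A sketch of the drop-and-refit step; the exact dummy column name depends on the level string in the data:

```python
# Drop the insignificant dummy and refit on the reduced set.
X_train_2 = X_train.drop("semi-furnished", axis=1)
X_train_2_lm = sm.add_constant(X_train_2)
lr_2 = sm.OLS(y_train, X_train_2_lm).fit()
print(lr_2.summary())
```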

The summary of the newly created model is,

Image by Author – Summary of the model after dropping semi-furnished variable

Now, let’s calculate the VIF values for the new model.

The VIF values for the new model are,

Image by Author – New VIFs after dropping semi-furnished column

Now, the bedrooms variable has a high VIF (6.6) and a high p-value (0.206). Hence, it isn’t of much use and should be dropped from the model. We’ll repeat the same process as before.

The summary of the model,

Image by Author – Summary of the model

The next step is calculating VIF,

The VIF values are as follows,

Image by Author – VIF values after dropping bedrooms column

We’ll repeat this process until every remaining variable’s p-value is below 0.05 and its VIF is below 5.

After dropping all the necessary variables one by one, the final model will be,

The summary for the final model looks like,

Image by Author – Final model after dropping all the necessary variables

The VIF values for the final model are as follows,

Image by Author – VIF values for the final model

As we can see, the p-values and VIFs are all in the acceptable range. It’s time for us to go ahead and make predictions using the final model. This is the manual feature-selection process we discussed earlier.

Now, before making predictions, we have to check whether the error terms are normally distributed. We’ll do that using residual analysis.

Error term = y_actual − y_predicted

The error term is the difference between the actual y-value and the y-value the model predicts at that particular x-value.

Residual Analysis of the train data

We have to check whether the error terms are normally distributed (one of the major assumptions of linear regression), so let’s plot a histogram of the error terms.
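A sketch of the histogram, assuming the model that survived the elimination loop and its design matrix are stored as lr_final and X_train_final_lm:

```python
import matplotlib.pyplot as plt

# Residuals of the final model (names lr_final / X_train_final_lm assumed).
residuals = y_train - lr_final.predict(X_train_final_lm)

plt.hist(residuals, bins=20)
plt.xlabel("Errors")
plt.title("Error Terms")
plt.show()
```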

The histogram looks like the following,

Image by Author – Histogram of Error terms

As we can see, the error terms closely resemble a normal distribution, so we can move ahead and make predictions on the test dataset using the model.

Making Predictions Using the Final Model

We have fitted the model and checked the normality of error terms. Let’s make predictions using the final model.

As with the training dataset, we first have to scale the test data.
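Crucially, we only transform the test data with the scaler that was fitted on the training data:

```python
# Transform only; never re-fit the scaler on test data.
df_test[num_cols] = scaler.transform(df_test[num_cols])
```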

The test dataset looks like this,

Image by Author – Test Dataset

We divide the test data into X and Y, and then drop from X the variables our final model doesn’t use.

Now, we have to see how well the final model fits unseen data. To do that, we’ll calculate the R² value on the test set.

We do that by importing the r2_score function from sklearn.metrics.
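A sketch, again assuming the final manual model is stored as lr_final:

```python
from sklearn.metrics import r2_score

y_test = df_test.pop("price")
X_test = df_test

# Keep only the predictors the final model was trained on (the list in
# lr_final.model.exog_names, minus the constant), then add the constant.
X_test_lm = sm.add_constant(X_test[lr_final.model.exog_names[1:]])
y_test_pred = lr_final.predict(X_test_lm)
r2_score(y_test, y_test_pred)
```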

The R² value for the test data is 0.660134403021964, and the R² value for the train data is 0.667, which we can read from the final model summary above.

Since the R² values for the train and test data are almost equal, the model generalizes well to unseen data.

This is one type of process to build the multiple linear regression model where we select and drop the variables manually. There is another process called Recursive Feature Elimination (RFE).

Recursive Feature Elimination (RFE)

RFE is an automated process in which we don’t need to select variables manually. We follow the same steps as before, up to re-scaling the features and dividing the data into X and Y.

We will use the LinearRegression estimator from sklearn for RFE.

Next, we run the RFE. In the code, we have to provide the number of variables RFE should keep when building the model.
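A sketch of both steps; the choice of 10 features is illustrative:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

lm = LinearRegression()

# Recursively eliminate features until 10 remain (the count is a choice).
rfe = RFE(estimator=lm, n_features_to_select=10)
rfe = rfe.fit(X_train, y_train)

# True/False support flags and the elimination ranking per column.
list(zip(X_train.columns, rfe.support_, rfe.ranking_))
```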

The output for the above code is,

Image by Author – RFE values for all the variables

As we can see, the variables marked True are essential for the model, and the variables marked False are not needed. If we want to add a False variable back into the model, each one also carries a rank that tells us the order in which to add them.

Building the Model

Now, we build the model using statsmodel for detailed statistics.
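A sketch, reusing the rfe object fitted above:

```python
# Refit on only the RFE-selected columns, using statsmodels for the summary.
rfe_cols = X_train.columns[rfe.support_]
X_train_rfe = sm.add_constant(X_train[rfe_cols])
lm_rfe = sm.OLS(y_train, X_train_rfe).fit()
print(lm_rfe.summary())
```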

The summary of the model is,

Image by Author – Summary of the initial model

Since the bedrooms column is statistically insignificant compared to the other variables (it has a high p-value), it can be dropped from the model.

The summary of the model after dropping the bedrooms variable is,

Image by Author – Summary of model after dropping bedrooms column

Now, we calculate the VIFs for the model.

The resulting VIF values are,

Image by Author – VIF values

Since the p-values and VIF are in the desired range, we’ll move forward with the analysis.

The next step is the residual analysis of error terms.

Residual Analysis

So, let’s check if the error terms are also normally distributed using a histogram.

The histogram looks like,

Image by Author

Evaluating the model on test data

We apply the scaling to the test set and divide the data into X and Y, exactly as in the manual approach.

After that, let’s evaluate the model,

The R² value for the test data is 0.6481740917926483, which is pretty close to the value for the train data.

Since the R² values for the train and test data are almost equal, this model also generalizes well to unseen data.

Conclusion

We built a basic multiple linear regression model in machine learning, both manually and using the automated RFE approach. Most of the time we use multiple linear regression rather than simple linear regression, because the target variable usually depends on more than one variable.

So, it is crucial to learn how multiple linear regression works in machine learning, and it is challenging to understand the multiple linear regression model without first knowing simple linear regression.

Thank you for reading and happy coding!!!

Check out my previous articles here
