Multiple Linear Regression

A complete study: Model Interpretation → Hypothesis Testing → Feature Selection

Sangeet Aggarwal
Towards Data Science



Linear Regression, one of the most popular and widely discussed models, is certainly the gateway to going deeper into Machine Learning (ML). Such a simple, straightforward approach to modeling is worth learning as one of your first steps into ML.

Before moving forward, let us recall that Linear Regression can be broadly classified into two categories.

  • Simple Linear Regression: It’s the simplest form of Linear Regression that is used when there is a single input variable for the output variable.

If you are new to regression, I strongly suggest you first read about Simple Linear Regression from the link below, where you'll learn the underlying maths and the approach to this model through interesting data and hands-on coding.

  • Multiple Linear Regression: It’s a form of linear regression that is used when there are two or more predictors.

We will see how multiple input variables together influence the output variable, while also learning how the calculations differ from those of the Simple Linear Regression model. We will also build a regression model using Python.

Finally, we will go deeper into Linear Regression and learn things like Collinearity, Hypothesis Testing, Feature Selection, and much more.

Now one might wonder: couldn't we just use Simple Linear Regression to study our output against each independent variable separately? That would make life much easier, right?

No, it wouldn’t.

Why Multiple Linear Regression?

“ To predict the outcome from multiple input variables. Duh!”. But, is that it? Well, hold that thought.

Consider this: suppose you have to estimate the price of a certain house you want to buy. You know the floor area, the age of the house, its distance from your workplace, the crime rate of the area, and so on.

Now, some of these factors will affect the price of the house positively. For example, the larger the floor area, the higher the price. On the other hand, factors like the distance from the workplace and the crime rate can influence your estimate of the house negatively (unless you are a rich criminal with an interest in Machine Learning looking for a hideout, and yeah, I don't think so).

Disadvantages of Simple Linear Regression → Running a separate simple linear regression for each input variable gives a separate outcome for each, when we are really interested in a single one that accounts for them all. Besides that, an input variable may itself be correlated with, or dependent on, some other predictor. Ignoring this can lead to wrong predictions and unsatisfactory results.

This is where Multiple Linear Regression comes into the picture.

Mathematically…

Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ

Here, Y is the output variable, and the X terms are the corresponding input variables. Notice that this equation is just an extension of Simple Linear Regression: each predictor has a corresponding slope coefficient (β).

The first β term (β₀) is the intercept constant and is the value of Y in the absence of all predictors (i.e. when all X terms are 0). It may or may not hold any significance in a given regression problem; it's generally there to give a relevant nudge to the line/plane of regression.

Let’s now understand this with the help of some data.

Visualizing the data

We are going to use Advertising data which is available on the site of USC Marshall School of Business. You can download it here.

If you have read my post on Simple Linear Regression, then you are already familiar with this data. If you haven’t, let me give you a quick brief.

The advertising data set consists of the sales of a product in 200 different markets, along with advertising budgets for three different media: TV, radio, and newspaper. Here's what it looks like:

Sales (*1000 units) vs Advertising budget (*1000 USD)

The first row of the data says that the advertising budgets for TV, radio, and newspaper were $230.1k, $37.8k, and $69.2k respectively, and the corresponding number of units that were sold was 22.1k (or 22,100).
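If you'd like to take a quick look at the data yourself before modeling, a short pandas preview does the job. This is just a convenience snippet; it assumes the downloaded file Advertising.csv sits in your working directory.

# Loading and previewing the advertising data
import pandas as pd
ad = pd.read_csv("Advertising.csv")
print(ad.head())   # first five rows: TV, radio, newspaper budgets and sales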

In Simple Linear Regression, we could see how each advertising medium affects sales when applied in isolation from the other two. However, in practice, all three might be working together to impact net sales; the simple models did not consider this combined effect of the media on sales.

Multiple Linear Regression solves the problem by taking all the variables into account in a single expression. Hence, our Linear Regression model can now be expressed as:

sales = β₀ + β₁·TV + β₂·radio + β₃·newspaper

Finding the values of these constants (β) is what the regression model does: it minimizes an error function and fits the best line or hyperplane (depending on the number of input variables).

This is done by minimizing the Residual Sum of Squares (RSS), which is obtained by summing the squared differences between the actual and predicted outcomes: RSS = Σ(yᵢ − ŷᵢ)².

Ordinary Least Squares

Because this method finds the least sum of squares, it is also known as the Ordinary Least Squares (OLS) method. In Python, there are two primary ways to implement the OLS algorithm.

  • SciKit Learn: Just import the Linear Regression module from the Sklearn package and fit the model on the data. This method is pretty straightforward and you can see how to use it below.
from sklearn.linear_model import LinearRegression

# 'data' here stands for the advertising DataFrame loaded with pandas
model = LinearRegression()
model.fit(data.drop('sales', axis=1), data.sales)
  • StatsModels: Another way is to use the Statsmodels package to implement OLS. Statsmodels is a Python package that allows performing various statistical tests on the data. We will use it here so that you can learn about this great Python library, and because it will be helpful for us in the later sections.

Building the model and interpreting the coefficients

# Importing required libraries
import pandas as pd
import statsmodels.formula.api as sm
# Loading data - You can give the complete path to your data here
ad = pd.read_csv("Advertising.csv")
# Fitting the OLS on data
model = sm.ols('sales ~ TV + radio + newspaper', ad).fit()
print(model.params)

You should get the following output.

Intercept    2.938889
TV           0.045765
radio        0.188530
newspaper   -0.001037

I encourage you to run the regression model using Scikit Learn as well and find the above parameters using model.coef_ and model.intercept_. Did you get the same results?
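Here's a minimal sketch of that check with scikit-learn (selecting the three budget columns explicitly). Since both libraries solve the same least-squares problem, the estimates should match the Statsmodels output above.

# Fitting the same model with scikit-learn and reading off the parameters
from sklearn.linear_model import LinearRegression
X = ad[['TV', 'radio', 'newspaper']]
y = ad['sales']
sk_model = LinearRegression().fit(X, y)
print(sk_model.intercept_)   # should be close to 2.9389
print(sk_model.coef_)        # close to [0.0458, 0.1885, -0.0010] (TV, radio, newspaper)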

Now that we have these values, how to interpret them? Here’s how:

  • If we fix the budget for TV & newspaper, then increasing the radio budget by $1000 will lead to an increase in sales by around 189 units (0.189 × 1000).
  • Similarly, by fixing the radio & newspaper budgets, we infer an approximate rise of 46 units sold per $1000 increase in the TV budget.
  • However, for the newspaper budget, the coefficient is negligible (close to zero), so it's evident that newspaper spending is not affecting sales. In fact, it's slightly negative (-0.001), which, if the magnitude were big enough, could have meant that this medium was causing sales to fall. But we cannot make that kind of inference from such a negligible value.
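As a quick sanity check of these interpretations (my own back-of-the-envelope calculation, not part of the original analysis), we can plug the first row of the data into the fitted equation:

# Predicting sales for the first market by hand, using the fitted coefficients
tv, radio, newspaper = 230.1, 37.8, 69.2   # budgets from the first row of the data
pred = 2.938889 + 0.045765*tv + 0.188530*radio - 0.001037*newspaper
print(round(pred, 1))   # roughly 20.5 thousand units, versus the actual 22.1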

Let me tell you an interesting thing here. If we run Simple Linear Regression using just the newspaper budget against sales, we'll observe a coefficient value of around 0.055, which is quite significant in comparison to what we saw above. Now, why is that?
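Before we answer that, you can verify the 0.055 figure yourself with a one-line fit (using the same ad DataFrame and the sm alias from above):

# Simple linear regression of sales on the newspaper budget alone
print(sm.ols('sales ~ newspaper', ad).fit().params)   # the newspaper coefficient comes out near 0.055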

Collinearity

To understand this, let’s see how these variables are correlated with each other.

ad.corr()
Correlation Matrix for advertising data

Let’s visualize these numbers using a heatmap.

import matplotlib.pyplot as plt
%matplotlib inline

# Plotting the correlation matrix as a heatmap
plt.imshow(ad.corr(), cmap=plt.cm.GnBu, interpolation='nearest')
plt.colorbar()
tick_marks = [i for i in range(len(ad.columns))]
plt.xticks(tick_marks, ad.columns, rotation=45)
plt.yticks(tick_marks, ad.columns, rotation=45)
Correlation Heatmap for Advertising Data

Here the darker squares represent a strong correlation (close to 1) while the lighter ones represent a weaker correlation (close to 0). That's why all the diagonal squares are dark blue: every variable is perfectly correlated with itself.

Now, the thing worth noticing here is that the correlation between newspaper and radio is 0.35. This indicates a fair relationship between the newspaper and radio budgets. Hence, it can be inferred that when the radio budget is increased for a product, there's a tendency to spend more on newspapers as well.

This is called collinearity: a situation in which two or more input variables are linearly related.

Hence, even though the Multiple Regression model shows no impact of newspaper on sales, the Simple Regression model does, due to this collinearity and the absence of the other input variables.

Sales & Radio → probable causation

Newspaper & Radio → multicollinearity

Sales & Newspaper → transitive correlation

Alright! We understood Linear Regression, we built the model, and we even interpreted the results. What we learned so far were the fundamentals of Linear Regression. However, while dealing with real-world problems, we generally go beyond this point to statistically analyze our model and make the necessary changes if required.

Hypothesis Test for Predictors

One of the fundamental questions that should be answered while running Multiple Linear Regression is whether or not at least one of the predictors is useful in predicting the output.

We saw that the three predictors TV, radio, and newspaper have different degrees of linear relationship with sales. But what if the relationship is just by chance and none of the predictors actually impacts sales?

The model can only give us numbers to establish a close enough linear relationship between the response variable and the predictors. However, it cannot prove the credibility of these relationships.

To have some confidence, we take help from statistics and do something known as a Hypothesis Test. We start by forming a Null Hypothesis and a corresponding Alternative Hypothesis.

Since our goal is to find out whether at least one predictor is useful in predicting the output, we are in a way hoping that at least one of the coefficients (not the intercept) is non-zero, not just by random chance but due to an actual effect.

To do this, we start by forming a Null Hypothesis: All the coefficients are equal to zero.

General Null Hypothesis for Multiple Linear Regression: H₀: β₁ = β₂ = … = βₚ = 0

Null Hypothesis for the Advertising data: H₀: β_TV = β_radio = β_newspaper = 0

Hence the Alternative Hypothesis would be: at least one coefficient is not zero. We establish it by finding statistical evidence strong enough to reject the Null Hypothesis.

Alternative Hypothesis: Hₐ: at least one βⱼ ≠ 0

The hypothesis test is performed using the F-statistic. Its formula combines the Residual Sum of Squares (RSS) and the Total Sum of Squares (TSS), F = [(TSS − RSS)/p] / [RSS/(n − p − 1)], where p is the number of predictors and n is the number of samples. We don't have to compute it by hand, because the Statsmodels package takes care of it. The summary of the OLS model that we fit above contains all such statistics and can be obtained with this simple line of code:

print(model.summary2())

If the value of F-statistic is equal to or very close to 1, then the results are in favor of the Null Hypothesis and we fail to reject it.

But as we can see, the F-statistic here is many times larger than 1, providing strong evidence against the Null Hypothesis (that all coefficients are zero). Hence, we reject the Null Hypothesis and are confident that at least one predictor is useful in predicting the output.
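If you'd rather pull just these two numbers instead of scanning the whole summary, the fitted Statsmodels results expose them directly:

# The overall F-statistic and its p-value from the fitted model
print(model.fvalue)     # many times larger than 1 for this data
print(model.f_pvalue)   # practically zero, i.e. strong evidence against the Null Hypothesis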

Note that the F-statistic is not suitable when the number of predictors (p) is large, or if p is greater than the number of data samples (n).

Hence, we can say that at least one of the three advertising agents is useful in predicting sales.

But which one or which two are important? Are all of them important? To find this out, we will perform Feature Selection or variable selection. Now one way of doing this is trying all possible combinations i.e.

  • Only TV
  • Only radio
  • Only newspaper
  • TV & radio
  • TV & newspaper
  • radio & newspaper
  • TV, radio & newspaper

Here, it still looks feasible to try all 7 combinations, but with more predictors the number of combinations grows exponentially: with p predictors there are 2^p - 1 non-empty subsets to try. For example, adding just one more predictor to our case study raises the total to 15 combinations. Just imagine having a dozen predictors.
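A tiny loop makes this growth obvious (just arithmetic, nothing model-specific):

# Number of non-empty predictor subsets for p predictors is 2**p - 1
for p in (3, 4, 12):
    print(p, 2**p - 1)   # prints 3 7, then 4 15, then 12 4095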

Hence we need more efficient ways to perform Feature Selection.

Feature Selection

Two of the most popular approaches to do feature selection are:

  • Forward Selection: We start with a model without any predictors, just the intercept term. We then perform a simple linear regression for each predictor and keep the best performer (lowest RSS). Next, we add a second variable and again pick the best 2-variable combination by the lowest RSS, then the best 3-variable combination, and so on. The approach stops when some stopping rule is satisfied (a compact sketch of such a loop appears right after this list).
  • Backward Selection: We start with all variables in the model, and remove the variable that is the least statistically significant (greatest p-value: check the model summary above to find the p-values of the variables). This is repeated until a stopping rule is reached. For instance, we may stop when there is no further improvement in the model score.
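For reference, here is a compact sketch of what an automated forward-selection loop could look like with Statsmodels. This is my own illustration, not the author's code: it assumes the ad DataFrame and the sm alias from earlier, and uses R² with a simple improvement threshold as the stopping rule. In the rest of this post we will perform the same selection manually, step by step.

# A simple forward-selection loop: greedily add the predictor that improves
# R² the most, and stop when the improvement becomes negligible
def forward_selection(data, response, candidates, min_improvement=0.005):
    remaining = list(candidates)
    selected = []
    best_r2 = 0.0
    while remaining:
        scores = []
        for candidate in remaining:
            formula = response + ' ~ ' + ' + '.join(selected + [candidate])
            scores.append((sm.ols(formula, data).fit().rsquared, candidate))
        r2, best_candidate = max(scores)
        if r2 - best_r2 < min_improvement:   # stopping rule
            break
        selected.append(best_candidate)
        remaining.remove(best_candidate)
        best_r2 = r2
    return selected

print(forward_selection(ad, 'sales', ['TV', 'radio', 'newspaper']))
# expected to end up with ['TV', 'radio']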

In this post, I’ll walk you through the forward selection method. To begin with, let’s understand how we are going to select or reject the added variable.

We are going to use two measures to evaluate our new model after each addition: RSS and R².

We are already familiar with RSS, the Residual Sum of Squares, which is calculated by summing the squared differences between the actual and predicted outputs. The lower the RSS, the better the model fits the data.

R² is the measure of the degree to which the variance in the data is explained by the model. Mathematically, R² = 1 − RSS/TSS, which for a linear model with an intercept equals the square of the correlation between the actual and predicted outcomes. An R² closer to 1 indicates that the model is good and explains the variance in the data well, while a value closer to zero indicates a poor model.

Luckily, it’s calculated for us by the OLS module in Statsmodels. So let’s begin.

# Defining a function to evaluate a model
def evaluateModel(model):
    print("RSS = ", ((ad.sales - model.predict())**2).sum())
    print("R2 = ", model.rsquared)

Let’s first evaluate models with single predictors one by one, starting with TV.

# For TV
model_TV = sm.ols('sales ~ TV', ad).fit()
evaluateModel(model_TV)

RSS = 2102.5305831313512
R^2 = 0.611875050850071

# For radio
model_radio = sm.ols('sales ~ radio', ad).fit()
evaluateModel(model_radio)

RSS = 3618.479549025088
R^2 = 0.33203245544529525

# For newspaper
model_newspaper = sm.ols('sales ~ newspaper', ad).fit()
evaluateModel(model_newspaper)

RSS = 5134.804544111939
R^2 = 0.05212044544430516

We observe that for model_TV, the RSS is the lowest and the R² value is the highest among the three models. Hence we select model_TV as our base model to move forward.

Now, we will add the radio and newspaper one by one and check the new values.

# For TV & radio
model_TV_radio = sm.ols('sales ~ TV + radio', ad).fit()
evaluateModel(model_TV_radio)

RSS = 556.9139800676184
R^2 = 0.8971942610828957

As we can see, the values have improved tremendously: RSS has decreased and R² has increased, compared to model_TV. It's a good sign. Let's now check the same for TV and newspaper.

# For TV & newspaper
model_TV_newspaper = sm.ols('sales ~ TV + newspaper', ad).fit()
evaluateModel(model_TV_newspaper)

RSS = 1918.5618118968275
R^2 = 0.6458354938293271

The values have improved by adding newspaper too, but not as much as with the radio. Hence, at this step, we will proceed with the TV & radio model and will observe the difference when we add newspaper to this model.

# For TV, radio & newspaper
model_all = sm.ols('sales ~ TV + radio + newspaper', ad).fit()
evaluateModel(model_all)

RSS = 556.8252629021872
R^2 = 0.8972106381789522

The values have not improved significantly. Hence, it's best not to add newspaper, and to finalize the model with TV and radio as the selected features.

So our final model can be expressed as below:

sales = β₀ + β₁·TV + β₂·radio
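The fitted values of these β coefficients come straight from the TV & radio model we already trained above:

# Coefficients of the final model (Intercept, TV, radio)
print(model_TV_radio.params)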

Plotting the variables TV, radio, and sales in a 3D graph, we can visualize how our model has fit a regression plane to the data.

3D Plot to understand the regression plane. Image by Sangeet Aggarwal
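If you'd like to reproduce a figure along these lines, here is a minimal matplotlib sketch (my own illustration, not the exact code behind the image above) that scatters the data and overlays the plane implied by model_TV_radio:

# 3D scatter of the data with the fitted regression plane
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D   # registers the 3D projection on older matplotlib

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(ad.TV, ad.radio, ad.sales)

# Evaluate the fitted plane over a grid of TV and radio budgets
tv_grid, radio_grid = np.meshgrid(
    np.linspace(ad.TV.min(), ad.TV.max(), 20),
    np.linspace(ad.radio.min(), ad.radio.max(), 20))
b0, b_tv, b_radio = model_TV_radio.params   # order: Intercept, TV, radio
ax.plot_surface(tv_grid, radio_grid, b0 + b_tv*tv_grid + b_radio*radio_grid, alpha=0.3)

ax.set_xlabel('TV budget')
ax.set_ylabel('radio budget')
ax.set_zlabel('sales')
plt.show()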

That’s it for Multiple Linear Regression. You can find the full code behind this post here. I hope you had a good time reading and learning. For more, stay tuned.

If you are new to Data Science and Machine Learning and are wondering where to begin your journey, do check the link below, where I have laid out a step-by-step method to learn Data Science, with lots of sources for you to choose from.

Can't wait? If you want to dive right into a course, check out the career tracks in Data Science that suit you, from the link below.
