The Complete Guide to Linear Regression Analysis

This article is about understanding linear regression and the statistical terms behind it.

Abhay Jidge

Published in

Towards Data Science

11 min read · May 24, 2020

Introduction

In this article, we will analyse a business problem with linear regression in a step-by-step manner and interpret the statistical terms at each step to understand its inner workings. Although the linear regression algorithm is simple, a proper analysis requires interpreting the statistical results.

First, we will take a look at simple linear regression, and then extend the problem to multiple linear regression.

For easier understanding, follow along with the Python notebook side by side.

What is Linear Regression?

Regression is a statistical approach for finding the relationship between variables. Linear regression, accordingly, assumes that this relationship is linear. Depending on the number of input variables, a regression problem is classified as

1) Simple linear regression

2) Multiple linear regression

Business problem

In this article, we are using the Advertisement dataset.

Consider a company that wants to improve the sales of its product. The company spends money on different advertising media such as TV, radio, and newspaper to increase sales. It records the money spent on each advertising medium (in thousands of dollars) and the number of units of product sold (in thousands of units).

We have to help the company find the most effective way to spend money on advertising so that it can improve sales next year with a smaller advertising budget.

Simple Linear Regression

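Simple linear regression is an approach for predicting a quantitative response Y on the basis of a single predictor variable X. It assumes the relationship

Y = β0 + β1 * X + epsilon

Apart from the error term epsilon, this is the equation of a straight line with slope β1 and intercept β0.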

Let’s start the regression analysis of the advertising data with simple linear regression. First, we will model sales as a function of the money spent on TV advertising.

Then the mathematical equation becomes 𝑆𝑎𝑙𝑒𝑠 = 𝛽0 + 𝛽1 * 𝑇𝑉.

Step 1: Estimating the coefficients: (Let’s find the coefficients)

To estimate sales for a given advertising budget, we have to know the values of β0 and β1. For the best estimate, the difference between the actual sales and the predicted sales (called the residual) should be as small as possible.

Residuals can be negative or positive, so simply adding them up lets them cancel each other out. The total would then understate the real error and lead to a non-optimal estimate of the coefficients. To overcome this, we square each residual before summing, which gives the residual sum of squares (RSS):

RSS = (y1 − ŷ1)² + (y2 − ŷ2)² + ... + (yn − ŷn)²

where yi is the actual and ŷi the predicted sales for the i-th observation.

With a simple calculation, we can find the values of β0 and β1 that minimise the RSS.
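As a minimal sketch of that calculation, the least squares estimates can be computed by hand from the means of X and Y. The five data points below are made up for illustration and are not taken from the Advertising dataset, and the function name `fit_simple_ols` is hypothetical.

```python
def fit_simple_ols(x, y):
    """Return (beta0, beta1) that minimise the residual sum of squares."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point (x_mean, y_mean).
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


# Hypothetical toy data (NOT the Advertising dataset).
tv = [1.0, 2.0, 3.0, 4.0, 5.0]     # TV budget, thousands of dollars
sales = [2.0, 4.0, 5.0, 4.0, 5.0]  # units sold, thousands

beta0, beta1 = fit_simple_ols(tv, sales)
print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}")
```

For this toy data the fitted line is sales = 2.2 + 0.6 * tv.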

With the statsmodels library in Python, we can estimate the coefficients:

Table 1: Simple regression of sales on TV

Values for β0 and β1 are 7.03 and 0.047 respectively. Then the relation becomes, Sales = 7.03 + 0.047 * TV.

This means that every additional 1,000 dollars spent on TV advertising is associated with an increase in sales of about 47 units.
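The arithmetic behind that interpretation can be checked with a small sketch. The coefficients are the ones reported above; the baseline budget of 100 (thousand dollars) and the function name `predict_sales` are assumptions chosen only for illustration.

```python
BETA0 = 7.03   # intercept reported in the article
BETA1 = 0.047  # TV coefficient reported in the article


def predict_sales(tv_budget):
    """Predicted sales (thousands of units) for a TV budget (thousands of dollars)."""
    return BETA0 + BETA1 * tv_budget


# Raising the TV budget by 1 (that is, 1,000 dollars) changes the
# prediction by BETA1 thousand units, whatever the starting budget is.
extra_units = (predict_sales(101.0) - predict_sales(100.0)) * 1000
print(round(extra_units))
```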

The coefficient β1 therefore tells us how strongly TV advertising is associated with sales.

Step 2: Assessing the Accuracy of the Coefficient Estimates (How accurate are these coefficients?)

Why are the coefficients not perfect estimates?

The true relationship may not be perfectly linear, so there is an error that can be reduced by using a more complex model such as the polynomial regression model. These types of errors are called reducible errors.

On the other hand, errors may be introduced by inaccurate measurements or by outside conditions, such as the office being closed for a week due to heavy rain, which affects sales. These types of errors are called irreducible errors.

Because of these errors, we can say that the coefficients are not perfect estimates.

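Now, how do we quantify these errors?

To measure the uncertainty in the coefficient estimates we use the standard error (SE). The SE of a coefficient estimates how much that coefficient would vary if we refitted the model on different samples of data. The smaller the standard error of a coefficient, the more precisely the model has estimated it.

The SEs of the intercept and of the TV coefficient are given by

SE(β0)² = σ² * [ 1/n + x̄² / Σ(xi − x̄)² ]

SE(β1)² = σ² / Σ(xi − x̄)²

where σ² is the variance of the error term epsilon, n is the number of observations, and x̄ is the mean of the TV budgets. In practice σ is unknown and is estimated from the residuals as sqrt(RSS / (n − 2)).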
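The standard errors of the two coefficients can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the notebook's code: the data is a made-up toy sample, the function name `coefficient_standard_errors` is hypothetical, and the error variance is estimated from the residuals with n − 2 degrees of freedom.

```python
import math


def coefficient_standard_errors(x, y):
    """Return (se_beta0, se_beta1) for a simple linear regression of y on x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / s_xx
    beta0 = y_mean - beta1 * x_mean
    # Estimate the error variance from the residuals (n - 2 degrees of freedom).
    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)
    se_beta0 = math.sqrt(sigma2 * (1.0 / n + x_mean ** 2 / s_xx))
    se_beta1 = math.sqrt(sigma2 / s_xx)
    return se_beta0, se_beta1


# Hypothetical toy data (NOT the Advertising dataset).
tv = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.0, 4.0, 5.0, 4.0, 5.0]

se_beta0, se_beta1 = coefficient_standard_errors(tv, sales)
print(f"SE(beta0) = {se_beta0:.3f}, SE(beta1) = {se_beta1:.3f}")
```

Note how SE(β1) shrinks when the X values are more spread out (larger Σ(xi − x̄)²): a wider range of budgets pins down the slope more precisely.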

The standard error is used to compute,

1] Confidence Interval:

A 95% confidence interval is a range, defined by a lower and an upper limit, constructed so that if we repeated the sampling many times, about 95% of such ranges would contain the true unknown value of the parameter. The 95% confidence interval for a coefficient β is approximately β ± 2*SE(β).

From Table 1, the 95% confidence interval for β0 is [6.130, 7.935], which shows that in the absence of any advertising, sales will, on average, fall somewhere between 6,130 and 7,935 units.

The 95% confidence interval for β1 is [0.042, 0.053], which shows that for each $1,000 increase in TV advertising, there will be an average increase in sales of between 42 and 53 units.
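The rule of thumb "estimate plus or minus two standard errors" is short enough to sketch directly. The estimate and standard error passed in below are hypothetical numbers, not values from Table 1, and the function name is an assumption.

```python
def confidence_interval_95(estimate, standard_error):
    """Approximate 95% interval using the 'estimate +/- 2 * SE' rule of thumb."""
    half_width = 2.0 * standard_error
    return estimate - half_width, estimate + half_width


# Hypothetical estimate and standard error, for illustration only.
low, high = confidence_interval_95(0.6, 0.1)
print(f"[{low:.2f}, {high:.2f}]")
```

The multiplier 2 is an approximation; the exact value comes from the t distribution and depends on the number of observations.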

2] To Perform Hypothesis Testing:

Now, with the help of hypothesis testing, let’s find out whether there is a real relationship between sales and the TV advertising budget or whether we got these results by chance.

Let’s define the hypothesis for the model.

H0: There is no relationship between sales and TV advertising (β1 = 0).

Ha: There is a relationship between sales and TV advertising (β1 ≠ 0).

To show that there is a real relationship between sales and the TV advertising budget, we need to determine whether our estimate of β1 is far enough from zero, given its SE, that we can be confident the true β1 is non-zero.

How far is far enough depends on the accuracy of the estimate, that is, on SE(β1). We therefore use the t statistic, which measures how many standard errors the estimate lies away from zero:

t = (β1 − 0) / SE(β1)

From Table 1, the t statistic for β1 is 17.668.

Let’s take the significance level α = 0.01. This is the probability of rejecting the null hypothesis when it is actually true.

We can perform hypothesis testing with two methods

1] Critical value method:

For a two-tailed test at α = 0.01, the critical value is approximately ±2.60 (t distribution with n − 2 = 198 degrees of freedom for the 200 observations in the Advertising data). This means that a total area of 0.01 lies in the two tails beyond t-scores of ±2.60, as shown in the figure.

The t value calculated from the above formula is 17.668. Since its magnitude is far greater than the critical value, it falls in the rejection region, as shown in the diagram.

So we have enough evidence to reject the null hypothesis.

So β1 ≠ 0.

2] P-value method

The P-value for a t statistic of ±17.668 is less than 0.0001.

That is, assuming the null hypothesis (β1 = 0) is true, the probability of getting a t value at least as extreme as 17.668 is below 0.0001.

A significance level of α = 0.01 means that we fail to reject the null hypothesis only if there is at least a 1 in 100 chance of getting a t value as extreme as 17.668. Since the P-value (< 0.0001) is far smaller than α (0.01), we can reject the null hypothesis. Simply put, the P-value is the tail area under the distribution beyond the observed test statistic.

So, from the above results, we can conclude that β0 ≠ 0 and β1 ≠ 0.
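Both decision rules can be sketched with the Python standard library, using the t statistic and significance level quoted above. One assumption to flag: the standard library has no t distribution, so this sketch substitutes the standard normal distribution, which is a close approximation when there are a couple of hundred observations but gives a slightly smaller critical value than the t distribution would.

```python
import math
from statistics import NormalDist

t_stat = 17.668  # t statistic for beta1 reported in the article
alpha = 0.01     # significance level used in the article

# Critical value method: two-tailed test, so alpha / 2 goes in each tail.
# ASSUMPTION: standard normal used in place of the t distribution.
critical_value = NormalDist().inv_cdf(1 - alpha / 2)
reject_by_critical_value = abs(t_stat) > critical_value

# P-value method: area in both tails beyond |t_stat|.
# math.erfc(z / sqrt(2)) is the two-sided tail probability of a standard normal.
p_value = math.erfc(abs(t_stat) / math.sqrt(2))
reject_by_p_value = p_value < alpha

print(f"critical value = {critical_value:.3f}, reject H0: {reject_by_critical_value}")
print(f"p-value below 0.0001: {p_value < 0.0001}, reject H0: {reject_by_p_value}")
```

The two methods always agree, because comparing the statistic with the critical value and comparing the P-value with α are two views of the same tail area.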

For a detailed understanding of hypothesis testing, you can read this article.

Step 3: Accuracy of Model (How well does the model fit the data?)

After verifying the coefficients, we now want to quantify how well the model fits the data. This can be assessed with the residual standard error (RSE) and the R squared statistic.

Residual standard error (RSE):

Even if we knew the true values of the unknown coefficients (β0 and β1), our predictions would still be off, on average, by an amount equal to the RSE because of the irreducible error (epsilon as defined before). It is given by

RSE = sqrt( RSS / (n − 2) )

For the advertising data, the simple linear regression on TV has an RSE of 3.26, which means that actual sales deviate from the true regression line by approximately 3,260 units, on average.

The RSE is a measure of the lack of fit of the model to the data, in the units of Y. The lower the RSE, the better the model fits the data (in this case, the closer the data is to a linear relationship).
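A minimal sketch of the RSE calculation follows. The actual and fitted values are a made-up toy example (not the Advertising data), and the function name and its `n_predictors` parameter are assumptions; with one predictor the denominator is n − 2.

```python
import math


def residual_standard_error(y_actual, y_predicted, n_predictors=1):
    """RSE = sqrt(RSS / (n - p - 1)) for a model with p predictors and an intercept."""
    n = len(y_actual)
    rss = sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_predicted))
    return math.sqrt(rss / (n - n_predictors - 1))


# Hypothetical toy values (NOT the Advertising dataset).
sales = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]  # from the toy line sales = 2.2 + 0.6 * tv

rse = residual_standard_error(sales, fitted)
print(f"RSE = {rse:.3f}")
```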

The RSE has no fixed scale because it is measured in the units of Y (sales), so it is not always clear what counts as a good value. To overcome this, we use the R squared statistic.

R squared statistic:

R² = (TSS − RSS) / TSS = 1 − RSS / TSS

where TSS = Σ(yi − ȳ)² is the total sum of squares and RSS = Σ(yi − ŷi)² is the residual sum of squares.

The R squared statistic measures the proportion of variability in Y that can be explained using X. An R squared statistic close to 1 shows that a large proportion of the variability in the response has been explained by the regression. For a least squares fit with an intercept, the R squared statistic always lies between 0 and 1.

The model has an R squared statistic of 0.61, which means that just 61% of the variability in sales is explained by linear regression on TV.
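The R squared calculation can be sketched in the same way. The values are again a hypothetical toy example rather than the Advertising data, and `r_squared` is an illustrative name.

```python
def r_squared(y_actual, y_predicted):
    """R^2 = 1 - RSS / TSS: share of the variability in y explained by the model."""
    y_mean = sum(y_actual) / len(y_actual)
    tss = sum((ya - y_mean) ** 2 for ya in y_actual)
    rss = sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_predicted))
    return 1.0 - rss / tss


# Hypothetical toy values (NOT the Advertising dataset).
sales = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]  # from the toy line sales = 2.2 + 0.6 * tv

r2 = r_squared(sales, fitted)
print(f"R squared = {r2:.2f}")
```

Unlike the RSE, this value does not change if sales are recorded in units instead of thousands of units.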

We have now analysed the relationship between TV advertising and sales with the help of simple linear regression. With a similar approach, let’s analyse the relationship of radio advertising and of newspaper advertising with sales.

Simple linear regression of sales on radio

Simple linear regression of sales on newspaper

From the above results, we can see that none of the simple linear regressions explains the variability in sales well, so these models do not work well on their own.

Let’s see how multiple regression works.

Multiple Linear Regression:

In multiple linear regression, we will analyse the relationship between sales and three advertising media collectively.

𝑆𝑎𝑙𝑒𝑠 = 𝛽0 + 𝛽1 * 𝑇𝑉 + 𝛽2 * Radio+ 𝛽3 * Newspaper + epsilon

Now let’s follow the steps similar to the simple linear regression,

1] Estimating the Coefficients:

Table 4: Multiple linear regression of sales on TV, radio, newspaper

The above table shows the multiple regression coefficient estimates when TV, radio, and newspaper advertising budgets are used to predict product sales using the Advertising data.

𝑆𝑎𝑙𝑒𝑠 = 2.94 + 0.045 * 𝑇𝑉 + 0.189 * Radio + (- 0.001) * Newspaper

We can see that the coefficient estimate for newspaper is close to zero and is no longer significant: its p-value is around 0.86, far above α = 0.01. This shows that, once TV and radio spending are taken into account, money spent on newspaper advertising has no detectable relation to the sales of the product.

Q) Why does money spent on newspaper advertising show no relation to sales in multiple linear regression, when the same variable is highly significant in simple linear regression?

Now to understand why this is happening, let’s analyse the correlation matrix.

*Correlation matrix for Advertising data*

The correlation between radio and newspaper is 0.354, which reveals a tendency to spend more on newspaper advertising in markets where more is spent on radio. The correlation between sales and newspaper advertising is comparatively weak, which is consistent with newspaper advertising having no direct effect on sales.

Markets that spend more on newspaper advertising also tend to spend more on radio advertising, and it is the radio spending that increases sales. Simple linear regression only examines sales versus newspaper, so newspaper gets credit for the effect of radio on sales.
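This effect can be reproduced on a deliberately artificial example. In the sketch below, all numbers are invented: sales are constructed to depend on radio only, while newspaper spending merely moves together with radio. The helper names are hypothetical, and the two-predictor fit solves the normal equations on mean-centred data with Cramer's rule.

```python
def centered(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


# Hypothetical data: sales depend on radio ONLY, but newspaper spend
# tends to be high wherever radio spend is high.
radio = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
newspaper = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
sales = [3.0 + 2.0 * r for r in radio]

r_c, n_c, s_c = centered(radio), centered(newspaper), centered(sales)

# Simple regression of sales on newspaper alone.
simple_newspaper_slope = dot(n_c, s_c) / dot(n_c, n_c)

# Multiple regression of sales on radio and newspaper:
# solve the 2x2 normal equations with Cramer's rule.
s_rr, s_nn, s_rn = dot(r_c, r_c), dot(n_c, n_c), dot(r_c, n_c)
det = s_rr * s_nn - s_rn ** 2
radio_coef = (dot(r_c, s_c) * s_nn - s_rn * dot(n_c, s_c)) / det
newspaper_coef = (s_rr * dot(n_c, s_c) - s_rn * dot(r_c, s_c)) / det

print(f"newspaper alone: slope = {simple_newspaper_slope:.3f}")
print(f"with radio: radio = {radio_coef:.3f}, newspaper = {newspaper_coef:.3f}")
```

On its own, newspaper appears to have a large positive slope. Once radio is in the model, the newspaper coefficient drops to zero and radio recovers its true coefficient of 2.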

So, from the above analysis, we can say that newspaper advertising does not increase the sales of the product. Let’s therefore build a model without the newspaper variable.

Table 5: Multiple linear regression of sales on TV and radio

𝑆𝑎𝑙𝑒𝑠 = 2.92 + 0.045 * 𝑇𝑉 + 0.188 * Radio

Model Comparison

From the above table, we can say that the multiple linear regression of sales on TV and radio gives the best estimate of sales among the models considered.

Business Plan:

From the above regression analysis, let’s create a business plan to help the company to spend money wisely.

1] Which media do not contribute to sales?

Once TV and radio are accounted for, the money spent on newspaper advertising has no significant effect on sales.

2] How large is the effect of each medium on sales?

Because all budgets are measured in the same unit (thousands of dollars), the advertising medium with the larger coefficient estimate has the larger effect on sales per dollar spent.

From the multiple linear regression of sales on TV and radio (Table 5), radio advertising has the highest effect on sales. Every additional 1,000 dollars spent on radio advertising increases the sales of the product by about 188 units, and every additional 1,000 dollars spent on TV advertising increases them by about 45 units.
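The per-medium effects can be read off the fitted TV and radio model with a short sketch. The coefficients are those of the equation given earlier for this model; the baseline budgets of 100 and 20 (thousand dollars) are assumptions picked only to have something to compare against.

```python
def predict_sales(tv, radio):
    """Sales (thousands of units) from the article's TV + radio model."""
    return 2.92 + 0.045 * tv + 0.188 * radio


# Hypothetical baseline budget, in thousands of dollars.
base = predict_sales(tv=100.0, radio=20.0)

# Effect of one extra thousand dollars on each medium, in units sold.
gain_from_tv = (predict_sales(101.0, 20.0) - base) * 1000
gain_from_radio = (predict_sales(100.0, 21.0) - base) * 1000
print(round(gain_from_tv), round(gain_from_radio))
```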

3] How strong is the relationship?

Table 6 shows a comparison of the measures of model accuracy. The RSE for the multiple linear regression of sales on TV and radio is 1.67 (1,670 units). The mean value of sales is 14.022 (14,022 units), so the percentage error is 1670/14022 ≈ 12%.

The R squared value is 0.90, which shows that 90% of the variance in sales is explained by the multiple linear regression of sales on TV and radio.

4] How accurately can we predict future sales?

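For this, we use the confidence interval and the prediction interval. A confidence interval quantifies the uncertainty in the average sales for a given budget, while a prediction interval quantifies the uncertainty in sales for one particular market. The prediction interval is always wider because it also accounts for the irreducible error.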
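The difference between the two intervals can be sketched for the simple (one-predictor) case. Several assumptions apply: the data is a made-up toy sample, the multiplier 2 stands in for the exact t quantile (a rough choice for so few points), and the variable names are illustrative.

```python
import math

# Hypothetical toy data (NOT the Advertising dataset).
tv = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(tv)
x_mean = sum(tv) / n
y_mean = sum(sales) / n
s_xx = sum((x - x_mean) ** 2 for x in tv)
beta1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(tv, sales)) / s_xx
beta0 = y_mean - beta1 * x_mean
rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(tv, sales))
rse = math.sqrt(rss / (n - 2))

x_new = 3.0  # budget at which we want to predict
prediction = beta0 + beta1 * x_new
leverage = 1.0 / n + (x_new - x_mean) ** 2 / s_xx

# Confidence interval: uncertainty in the AVERAGE sales at x_new.
ci_half_width = 2.0 * rse * math.sqrt(leverage)
# Prediction interval: adds the irreducible error of a single observation.
pi_half_width = 2.0 * rse * math.sqrt(1.0 + leverage)

print(f"prediction = {prediction:.2f}")
print(f"confidence interval: +/- {ci_half_width:.2f}")
print(f"prediction interval: +/- {pi_half_width:.2f}")
```

The only difference between the two formulas is the extra 1 under the square root, which is the contribution of the irreducible error.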

5] Is there a relationship between sales and advertising budget?

In the case of simple linear regression, we performed hypothesis testing using the t statistic to see whether there is any relationship between TV advertising and sales.

In the same manner, for multiple linear regression, we can perform the F test with the hypotheses

H0: β1 = β2 = · · · = βp = 0

Ha: At least one βj is non-zero.
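The F statistic for this test compares the variability explained per predictor with the variability left unexplained per degree of freedom. The sketch below uses the standard formula; the sums of squares passed in are hypothetical toy values, and the function name is an assumption.

```python
def f_statistic(tss, rss, n, p):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1)) for a model with p predictors."""
    explained_per_predictor = (tss - rss) / p
    unexplained_per_dof = rss / (n - p - 1)
    return explained_per_predictor / unexplained_per_dof


# Hypothetical sums of squares for a model with 1 predictor and 5 observations.
f_value = f_statistic(tss=6.0, rss=2.4, n=5, p=1)
print(f"F = {f_value:.2f}")
```

An F statistic close to 1 is what we would expect if H0 were true; a value much larger than 1 is evidence that at least one coefficient is non-zero.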

Resources

Lastly, I would like to mention a few great resources which you can use to learn more about linear regression.

An Introduction to Statistical Learning: with Applications in R (Book)

Khan Academy Statistics Course (Video lectures)

Linear Regression (Statsmodels Documentation)

Conclusion

In this article, we went over what linear regression is, how it works, and how we can analyse the results at each step of model building with a Python implementation.

If you have any questions, feel free to DM me on LinkedIn or leave a comment.