The world’s leading publication for data science, AI, and ML professionals.

Linear Regression Model: Machine Learning

Learning about the linear regression model in machine learning for predictive analysis

Anscombe Quartet - Correlations can be fickle.
Image: Anscombe.svg by Schutz (labels by Avenue), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=9838454

Linear regression is one of the most important regression models used in machine learning. In a regression model, the output variable to be predicted should be a continuous variable, such as the weight of a person in a class.

The regression model also follows the supervised learning approach: to build the model, we use past data with labels, which helps predict the output variable in the future.

Linear Regression

Using the linear regression model, we predict the relationship between two factors/variables. The variable we are trying to predict is called the dependent variable, and the variable we use to predict it is called the independent variable.

The linear regression model is of two types:

  • Simple linear regression: contains only one independent variable, which we use to predict the dependent variable with a single straight line.
  • Multiple linear regression: includes more than one independent variable.

In this article, we’ll concentrate on the Simple linear regression model.

Simple Linear Regression

We have data from a company containing the amount spent on Marketing and its sales corresponding to that marketing budget.

The data looks like this,

Sample Marketing Data

Download the above Excel data from here.

Using Microsoft Excel charts, we can create the following scatter plot for the above data.

Scatter Plot for the above data

The above plot shows all the data points from our dataset. Now, we have to fit a straight line through these points that helps us predict future sales.

We know that a straight line is represented as:

y = mx + c

Here, we call this line the Regression Line, which is represented as:

Y = β0 + β1X

Now, many straight lines can pass through the data points. We have to find the best-fit line that can serve as a model for future predictions.

To find the best-fit line among all these lines, we'll introduce a quantity called the Residual (e).

The residual is the difference between the actual Y value and the Y value predicted by the straight-line equation for a particular X.

Let’s say we have the scatter plot and straight line like the following figure,

Image by Author – Calculating Residual value using the graph

Now, using the above figure, the residual value for x = 2 is:

Residual(e) = Actual value of Y - the predicted value of Y using the line

e = 3 - 4 = -1

So, the residual for the value x = 2 is -1.

Similarly, we have a residual value for every data point, which is the difference between the actual Y value and predicted Y value.

eᵢ = yᵢ - ŷᵢ

So, to find the best-fit line, we'll use a method called Ordinary Least Squares (OLS), which minimizes the Residual Sum of Squares (RSS).

RSS = e₁² + e₂² + e₃² + ... + eₙ²

The best-fit line is the one with the least RSS value.
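To make the RSS computation concrete, here is a minimal Python sketch; the function name and the toy values are mine, not taken from the article's dataset:

```python
def rss(x, y, b0, b1):
    """Residual sum of squares for the candidate line y = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data: for the line y = 1 + 1*x the residuals are 0, 0, 1
print(rss([1, 2, 3], [2, 3, 5], 1, 1))  # prints 1
```

Each term of the sum is one squared residual, exactly as in the RSS formula above.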

Cost Function

Typically, machine learning models define a Cost Function for a particular problem, which we then minimize or maximize depending on the requirement. In the above regression model, RSS is the cost function; we want to minimize it and thereby find the β0 and β1 of the straight-line equation.
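Minimizing RSS with respect to β0 and β1 has a well-known closed-form solution; here is a plain-Python sketch (the helper name and toy data are mine):

```python
def ols_fit(x, y):
    """Slope and intercept that minimize RSS (ordinary least squares)."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    # b1 = covariance(x, y) / variance(x); b0 follows from the means
    b1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) \
         / sum((xi - x_mean) ** 2 for xi in x)
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Perfectly linear toy data recovers the line y = 1 + 2x
print(ols_fit([1, 2, 3], [3, 5, 7]))  # prints (1.0, 2.0)
```

Excel's Trendline, used below, performs this same minimization under the hood.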

Now, let’s come back to our marketing dataset in the Excel sheet. Using the Linear Forecast option under Trendline for the above scatter plot, we can directly get the best-fit line without manually calculating the residual values.

Best-fit line using Microsoft Excel scatter plot options

As we can see, Slope (β1) = 0.0528 and Intercept (β0) = 3.3525.

Let’s calculate the predicted sales (Y) for all the data points (X) using the above straight-line equation.

The Predicted sales will be,

Predicting the sales using the (y = 0.0528x + 3.3525) equation
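With the slope and intercept Excel reported, each prediction is a one-line formula; a small sketch (the function name and the spend value of 100 are illustrative, not from the dataset):

```python
B0, B1 = 3.3525, 0.0528  # intercept and slope from the Excel trendline

def predict_sales(marketing_spend):
    """Predicted sales for a given marketing spend, using y = B0 + B1*x."""
    return B0 + B1 * marketing_spend

print(round(predict_sales(100), 4))  # prints 8.6325
```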

After that, let’s also calculate the Residual Square value for each data point.

Residual Square = (Actual Y value – Predicted Y value)²

Let’s see the Excel sheet after applying the above formula to calculate the residual squares.

Dataset after calculating the Residual Squares

Now, RSS is the sum of all the Residual square values from the above sheet.

RSS = 28.77190461

Since this is the best-fit line, the RSS value we got here is the minimum.

If we observe the RSS value here, it is an absolute quantity. If we later change the problem setting, say, measuring sales in billions instead of millions, the RSS value will change.

So, we need a relative measure instead of an absolute one. That measure is the Total Sum of Squares (TSS). Using TSS, we'll calculate the R² value, which determines whether the model is viable.

TSS = (Y₁ - Ȳ)² + (Y₂ - Ȳ)² + (Y₃ - Ȳ)² + ... + (Yₙ - Ȳ)²

Where Y₁, Y₂, Y₃, ..., Yₙ are the actual values from the data points, and Ȳ is the average of all Y values.
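The TSS formula translates directly to Python; a minimal sketch with toy values (the function name and data are mine):

```python
def tss(y):
    """Total sum of squares: spread of y around its mean."""
    y_mean = sum(y) / len(y)
    return sum((yi - y_mean) ** 2 for yi in y)

# Toy values with mean 2: (1-2)² + (2-2)² + (3-2)² = 2
print(tss([1, 2, 3]))  # prints 2.0
```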

Now, after calculating TSS, we will compute R²:

R² = 1 - (RSS/TSS)

The R² value always lies between 0 and 1. If R² is close to 1, the model fits the data well and can be used for prediction; if it is close to 0, the model is not suitable for predictive analysis.
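Putting RSS and TSS together, R² can be computed as follows; a sketch with hypothetical values (names are mine):

```python
def r_squared(y_actual, y_predicted):
    """R² = 1 - RSS/TSS for paired actual and predicted values."""
    y_mean = sum(y_actual) / len(y_actual)
    rss = sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_predicted))
    tss = sum((ya - y_mean) ** 2 for ya in y_actual)
    return 1 - rss / tss

# A perfect fit has zero residuals, so R² = 1
print(r_squared([1, 2, 3], [1, 2, 3]))  # prints 1.0
```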

Now, let’s calculate the TSS in our Excel dataset.

First, we’ll find the (Yₙ - Ȳ)² value for every data point; the average Y value (Ȳ) is 15.56470588.

Now, the dataset looks like,

Computing the Sum of Squares using the Y-value and average of all Y-values

TSS = Sum of all the Sum of Squares from the dataset

TSS = 297.5188235

We have already calculated the RSS above, so let’s find the value of R²: R² = 1 - (RSS/TSS) = 0.903293834.
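We can sanity-check this result directly from the RSS and TSS values computed above:

```python
rss = 28.77190461   # residual sum of squares from the sheet
tss = 297.5188235   # total sum of squares from the sheet

r2 = 1 - rss / tss
print(round(r2, 4))  # prints 0.9033
```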

If we look at the scatter plot with the best-fit line above, Excel has already calculated the R² value as 0.9033, shown below the straight-line equation, which matches our manual calculation.

Since the R² value is more than 90%, this model is well-suited for predicting future sales.

Conclusion

The regression model is one of the essential models in machine learning; using it, we can predict the value of a continuous output variable. If the output variable is categorical, we use another type of model called a Classification model.

In the next article, we’ll see how to use the Linear Regression model in Python.

Thank you for reading and happy coding!!!

Check out my previous articles about Python here
