Linear regression is one of the most important regression models used in machine learning. In a regression model, the output variable to be predicted should be a continuous variable, such as the weight of a person in a class.
The regression model follows the supervised learning method, which means that to build the model, we'll use past labeled data, which helps us predict the output variable in the future.
Linear Regression
Using the linear regression model, we'll predict the relationship between two factors/variables. The variable we are trying to predict is called the dependent variable, and the variable we use to predict it is called the independent variable.
The linear regression model is of two types:
- Simple linear regression: contains only one independent variable, which we use to predict the dependent variable with one straight line.
- Multiple linear regression: includes more than one independent variable.
In this article, we’ll concentrate on the Simple linear regression model.
Simple Linear Regression
We have data from a company containing the amount spent on Marketing and its sales corresponding to that marketing budget.
The data looks like this,

Download the above Excel data from here.
Using Microsoft Excel charts, we can make a scatter plot that looks like the following for the above data.

The above plot shows all the data points from our given data. Now, we have to fit a straight line through these points that helps us predict future sales.
We know that a straight line is represented as:
y = mx + c
In regression, this line is called the Regression Line, and it is represented as:
Y = β0 + β1X
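In code, the regression line is nothing more than a function of X given the two coefficients. A minimal Python sketch (the coefficient values used below are placeholders for illustration, not fitted from data):

```python
def regression_line(x, beta0, beta1):
    """Predicted Y for a given X on the line Y = beta0 + beta1 * X."""
    return beta0 + beta1 * x

# Placeholder coefficients, just to show the shape of the model:
print(regression_line(2, 1.0, 0.5))  # 1.0 + 0.5 * 2 = 2.0
```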
Now, many straight lines can pass through the data points. We have to find the best-fit line that we can use as a model for future predictions.
To find the best-fit line among all these lines, we'll introduce a quantity called the Residual (e).
The residual is the difference between the actual Y value and the Y value predicted by the straight-line equation for that particular X.
Let’s say we have the scatter plot and straight line like the following figure,

Now, using the above figure, the residual value for x = 2 is:
Residual(e) = Actual value of Y - the predicted value of Y using the line
e = 3 - 4 = -1
So, the residual for the value x = 2 is -1.
Similarly, we have a residual value for every data point, which is the difference between the actual Y value and predicted Y value.
eᵢ = yᵢ - ŷᵢ
So, to find the best-fit line, we'll use a method called Ordinary Least Squares (OLS), which minimizes the Residual Sum of Squares (RSS).
RSS = e1²+e2²+e3²+......+en²
The best-fit line is the one with the smallest RSS value.
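Putting the residual and RSS definitions together, here is a short Python sketch (the actual and predicted values below are made up for illustration, not taken from the marketing dataset):

```python
def rss(actual, predicted):
    """Residual Sum of Squares: e1² + e2² + ... + en²."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Hypothetical actual Y values and predictions from some line:
print(rss([3, 5, 7], [4, 5, 6]))  # (-1)² + 0² + 1² = 2
```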
Cost Function
Typically, machine learning models define a Cost Function for a particular problem, and we then try to minimize (or, in some formulations, maximize) that cost function. In the above regression model, the RSS is the cost function; we want to minimize it and thereby find the β0 and β1 for the straight-line equation.
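For simple linear regression, the β0 and β1 that minimize the RSS have a well-known closed form: β1 = Σ(xᵢ-x̄)(yᵢ-ȳ) / Σ(xᵢ-x̄)² and β0 = ȳ - β1·x̄. A small Python sketch of this (using toy points that lie exactly on y = 1 + 2x):

```python
def fit_ols(xs, ys):
    """Closed-form OLS estimates that minimize RSS for one independent variable."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    beta1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    # Intercept: the line passes through the point of means (x̄, ȳ).
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

b0, b1 = fit_ols([1, 2, 3], [3, 5, 7])
print(b0, b1)  # 1.0 2.0
```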
Now, let's come back to our marketing dataset in the Excel sheet. Using the Linear Forecast option in Trendline for the above scatter plot, we directly get the best-fit line without manually calculating the residual values.

As we can see, Slope (β1) = 0.0528 and Intercept (β0) = 3.3525.
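With these two numbers, predicting sales for any marketing spend is a single expression. A quick Python sketch (the spend value of 200 is an arbitrary illustration, not a point from the dataset):

```python
beta0, beta1 = 3.3525, 0.0528  # intercept and slope from the Excel trendline

def predicted_sales(marketing_spend):
    """Predicted sales using Y = β0 + β1·X."""
    return beta0 + beta1 * marketing_spend

# 200 is a hypothetical example spend:
print(round(predicted_sales(200), 4))  # 3.3525 + 0.0528 * 200 = 13.9125
```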
Let’s calculate the predicted sales(Y) for all the data points(X) using the above straight-line equation.
The Predicted sales will be,

After that, let’s also calculate the Residual Square value for each data point.
Residual Square = (Actual Y value – Predicted Y value)²
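The same per-point calculation that the Excel formula performs can be sketched in Python (the actual and predicted values below are hypothetical, not the marketing dataset's values):

```python
actual    = [10.0, 12.5, 9.0]  # hypothetical actual sales
predicted = [ 9.5, 13.0, 9.2]  # hypothetical predictions from the line

# One residual square per data point: (actual Y - predicted Y)²
residual_squares = [(a - p) ** 2 for a, p in zip(actual, predicted)]
print([round(r, 4) for r in residual_squares])  # [0.25, 0.25, 0.04]
```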
Let’s see the excel sheet after applying the above formula to calculate residual square.

Now, RSS is the sum of all the Residual square values from the above sheet.
RSS = 28.77190461
Since this is the best-fit line, the RSS value we got here is the minimum.
If we observe the RSS value here, it is an absolute quantity. If we later change the problem setting and measure sales in billions instead of millions, the RSS value will change.
So, we need to define an alternate measure that is relative and not an absolute quantity. That alternate measure is called Total Sum of Squares (TSS). Using TSS, we’ll calculate the R² value, which will determine if the model is viable or not.
TSS = (Y1-Ȳ)² + (Y2-Ȳ)² + (Y3-Ȳ)² + ....... + (Yn-Ȳ)²
where Y1, Y2, Y3, ….., Yn are the actual values from the data points and Ȳ is the average of the Y-axis column.
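The TSS formula translates directly to code. A small Python sketch with hypothetical Y values (their mean is 4):

```python
def tss(ys):
    """Total Sum of Squares: squared deviations of Y from its mean Ȳ."""
    y_mean = sum(ys) / len(ys)
    return sum((y - y_mean) ** 2 for y in ys)

# Hypothetical Y values:
print(tss([2, 4, 6]))  # (2-4)² + (4-4)² + (6-4)² = 8
```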
Now, after calculating TSS, we will compute R².
R² = 1-(RSS/TSS)
The R² value always lies between 0 and 1. If R² is close to 1, the model fits the data well, and we can use it for prediction. If the value is close to 0, the model is not suitable for predictive analysis.
Now, let’s calculate the TSS in our excel dataset.
First, we'll find the (Yn-Ȳ)² value for every data point; the Ȳ (average Y value) is 15.56470588.
Now, the dataset looks like,

TSS = sum of all the (Yn-Ȳ)² values from the dataset
TSS = 297.5188235
We have already calculated the RSS above, so let's find the value of R²: R² = 1-(RSS/TSS) = 0.903293834.
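We can double-check this arithmetic in a couple of lines of Python, using the RSS and TSS values computed above:

```python
rss_value = 28.77190461   # RSS of the best-fit line, computed above
tss_value = 297.5188235   # TSS of the dataset, computed above

r_squared = 1 - rss_value / tss_value
print(round(r_squared, 4))  # 0.9033
```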
If we look at the scatter plot with the best-fit line above, Excel has already displayed the R² value below the straight-line equation as 0.9033, which matches what we got from our calculations.
Since the R² value is more than 90%, this model is well suited for predicting future sales.
Conclusion
The regression model is one of the essential models in machine learning. Using it, we can predict the outcome of a continuous variable. If the output variable is categorical, we use another type of model, called a Classification model.
In the next article, we’ll see how to use the Linear Regression model in Python.
Thank you for reading and happy coding!!!
Check out my previous articles about Python here
- Joins, Views, and CTEs: MySQL Workbench
- Data Analysis using basic commands: MySQL Workbench
- Exploratory Data Analysis(EDA): Python
- Central Limit Theorem(CLT): Data Science
- Inferential Statistics: Data Analysis
- Seaborn: Python
- Pandas: Python
- Matplotlib: Python
- NumPy: Python