Linear Regression Simplified - Ordinary Least Squares vs Gradient Descent
What is Linear Regression?
Linear regression is a statistical method of finding the relationship between independent and dependent variables. Let us take a simple dataset to explain the linear regression model.
Why do we call them Independent and Dependent variables?
In our example dataset, the “Years of Experience” column holds the independent variable and the “Salary in 1000$” column holds the dependent variable. The years-of-experience values are independent because we cannot mathematically determine them from anything else in the dataset. But we can determine / predict the salary values (the dependent variable) based on years of experience. If you look at the data, the dependent values (Salary in 1000$) increase or decrease with years of experience.
Total Sum of Squares (SST): The SST is the sum of all squared differences between the mean of a sample and the individual values in that sample. It is represented mathematically with the formula.
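In standard notation (with ȳ the mean of the sample values), this formula is:

```latex
\mathrm{SST} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2
```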
The total sum of squares (SST) for our data is 5226.19. To find the best-fit line, we apply a linear regression model that reduces the sum of squared errors (SSE) as much as possible. To identify the slope and intercept, we use the equation
y = mx + b,
- ‘m’ is the slope
- ‘x’ is the independent variable
- ‘b’ is the intercept
We will use the Ordinary Least Squares method to find the best line intercept (b) and slope (m).
Ordinary Least Squares (OLS) Method
To use the OLS method, we apply the formula below to find the equation.
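In standard notation, the OLS formulas for the slope and intercept are:

```latex
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\bar{x}
```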
We need to calculate the slope ‘m’ and the intercept ‘b’. Below is a simple table for calculating those values.
m = 1037.8 / 216.19
m = 4.80
b = 45.44 - 4.80 * 7.56 = 9.15
Hence, substituting into y = mx + b:
y = 4.80x + 9.15
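As a quick sketch, the same slope and intercept formulas can be reproduced in Python. The helper name and the toy data here are illustrative only (the toy points lie exactly on y = 2x + 1; this is not the article’s salary dataset):

```python
# Sketch of the OLS formulas:
# m = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2),  b = y_mean - m * x_mean
def ols_fit(x, y):
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    m = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) \
        / sum((xi - x_mean) ** 2 for xi in x)
    b = y_mean - m * x_mean
    return m, b

# Toy data lying exactly on y = 2x + 1 (illustrative, not the article's dataset)
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
m, b = ols_fit(x, y)
print(m, b)  # → 2.0 1.0
```

Running the same formulas over the article’s dataset is what produces m ≈ 4.80 and b ≈ 9.15.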
Let’s compare our OLS method result with MS-Excel. Yes, we can test our linear regression best-fit line in Microsoft Excel.
Wonderful! Our OLS method is pretty much the same as MS-Excel’s output of ‘y’.
- Our OLS method output → y = 4.80x + 9.15
- MS-Excel Linear Reg. Output → y = 4.79x + 9.18
Let us calculate the SSE again using our output equation.
The Sum of Squared Errors has now been reduced significantly, from 5226.19 to 245.38.
The Ordinary Least Squares method looks simple and its computation is easy. But OLS, as shown here, works for a univariate dataset: a single independent variable and a single dependent variable. A multivariate dataset, which contains multiple independent variables, requires us to use a machine learning algorithm called “Gradient Descent”.
Gradient Descent Algorithm
The gradient descent algorithm’s main objective is to minimise the cost function. It is one of the best optimisation algorithms for minimising errors (the difference between actual and predicted values).
Let’s represent the hypothesis h, the function our learning algorithm produces to map inputs to predictions.
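For univariate linear regression, this hypothesis is conventionally written as:

```latex
h_\theta(x) = \theta_0 + \theta_1 x
```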
The goal is similar to the operation above, where we found the best-fit line with slope ‘m’ and intercept ‘b’. With the gradient descent algorithm, we also search for a minimal cost function, applying various parameter values for theta 0 and theta 1 and checking the resulting slope and intercept until the process reaches convergence.
A real-world analogy is finding the best direction to take a step downhill.
We take a step in the downhill direction. After each step, we look for the direction again, to descend faster and reach the bottom quickly. A similar approach is used in this algorithm to minimise the cost function.
We can measure the accuracy of our hypothesis function by using a cost function, whose formula is given below.
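In the usual notation (m training examples, hypothesis h_θ), this cost function is the mean squared error:

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
```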
Gradient Descent for Linear Regression
Why do we use partial derivatives in the equation? A partial derivative represents the rate of change of a function as one variable changes. In our case, we change the values of theta 0 and theta 1 and identify the rate of change. The equations below apply these rates of change to theta 0 and theta 1 on each epoch.
To find the minimum, we repeat these steps, applying new values for theta 0 and theta 1 each time. In other words, we repeat the steps until convergence.
- where alpha (α) is the learning rate: how big a step we take downhill.
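In standard notation, the simultaneous update rules for linear regression expand to:

```latex
\theta_0 := \theta_0 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)
\qquad
\theta_1 := \theta_1 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x^{(i)}
```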
Types of Gradient Descent Algorithms
There are three types of Gradient Descent Algorithms:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-Batch Gradient Descent
Batch Gradient Descent
- In batch gradient descent, calculating the gradient of the cost function requires summing over all training examples at each step
- If we have 3 million samples (m training examples), then the gradient descent algorithm must sum over all 3 million samples for every epoch. To move a single step, we have to compute across 3 million examples!
- Batch Gradient Descent is not a good fit for large datasets
- Below is a Python implementation of the Batch Gradient Descent algorithm.
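A minimal sketch of batch gradient descent for the univariate model h(x) = theta0 + theta1·x. The function name and toy data are illustrative assumptions, not the article’s salary dataset:

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.05, epochs=5000):
    """Fit y = theta0 + theta1 * x, summing over ALL samples at every step."""
    n = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        error = (theta0 + theta1 * x) - y       # residuals over the whole batch
        # gradients of J = (1/2n) * sum(error^2); both use the same residuals,
        # so the parameters are updated simultaneously
        theta0 -= alpha * error.sum() / n
        theta1 -= alpha * (error * x).sum() / n
    return theta0, theta1

# Toy data on the exact line y = 2x + 1 (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
theta0, theta1 = batch_gradient_descent(x, y)
```

Note that every epoch touches every sample once, which is exactly why this variant scales poorly to millions of examples.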
Stochastic Gradient Descent (SGD)
- In Stochastic Gradient Descent, we use one training sample at each iteration instead of summing over the whole dataset at every step
- SGD is widely used for training on larger datasets; it is computationally faster and can be trained in parallel
- We need to randomly shuffle the training examples before each pass
- A Python implementation of SGD is below.
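A sketch of SGD under the same assumptions as the batch example (hypothetical function name, toy data on the exact line y = 2x + 1):

```python
import random

def stochastic_gradient_descent(x, y, alpha=0.05, epochs=2000, seed=0):
    """Fit y = theta0 + theta1 * x using ONE random sample per update."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(x)))
    for _ in range(epochs):
        rng.shuffle(indices)                    # shuffle before each pass
        for i in indices:
            error = (theta0 + theta1 * x[i]) - y[i]
            theta0 -= alpha * error             # gradient from a single sample
            theta1 -= alpha * error * x[i]
    return theta0, theta1

# Toy data on the exact line y = 2x + 1 (illustrative only)
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0 * xi + 1.0 for xi in x]
theta0, theta1 = stochastic_gradient_descent(x, y)
```

Because each update uses a single sample, the parameter path is noisier than batch gradient descent, but each step is far cheaper.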
Mini-Batch Gradient Descent
- It is similar to SGD, but it uses a mini-batch of n samples instead of 1 at each iteration.
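The same sketch, adapted to mini-batches (again with a hypothetical function name and toy data, not the article’s dataset):

```python
import numpy as np

def minibatch_gradient_descent(x, y, alpha=0.05, epochs=2000,
                               batch_size=2, seed=0):
    """Fit y = theta0 + theta1 * x, averaging the gradient over a mini-batch."""
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        order = rng.permutation(len(x))         # shuffle, then walk in batches
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            error = (theta0 + theta1 * x[idx]) - y[idx]
            theta0 -= alpha * error.mean()      # average gradient over the batch
            theta1 -= alpha * (error * x[idx]).mean()
    return theta0, theta1

# Toy data on the exact line y = 2x + 1 (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
theta0, theta1 = minibatch_gradient_descent(x, y)
```

Mini-batches trade off between the two extremes: smoother gradients than SGD, much cheaper steps than full batch.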
Summary:
In summary, we explained the following topics in detail:
- Linear regression’s independent and dependent variables
- Ordinary Least Squares (OLS) method and Sum of Squared Errors (SSE) details
- Gradient descent for the linear regression model and the types of gradient descent algorithms