Linear Regression Simplified - Ordinary Least Squares vs Gradient Descent

Prabhu Raghav
Towards Data Science
6 min read · May 15, 2018


What is Linear Regression?
Linear regression is a statistical method of finding the relationship between independent and dependent variables. Let us take a simple dataset to explain the linear regression model.

Table 1: Sample Salary Data
Figure 1: Scatter Plot Diagram

Why do we call them independent and dependent variables?

Table 2: Sample dataset with independent & dependent variables hints

In our example dataset, the “Years of Experience” column holds the independent variables and the “Salary in 1000$” column holds the dependent variables. The independent variables are “independent” because we cannot mathematically determine the years of experience from anything else in the data. But we can determine / predict the salary values (the dependent variables) based on years of experience. If you look at the data, the dependent values (Salary in 1000$) increase or decrease with years of experience.

Total Sum of Squares (SST): The SST is the sum of all squared differences between the individual values in a sample and the mean of that sample. It is represented mathematically with the formula:

SST = Σ (yᵢ − ȳ)²

where yᵢ is each observed value and ȳ is the sample mean.

Table: SST Calculation
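
As a quick illustration of the same calculation, here is a minimal Python sketch; the salary values below are placeholders, not the article's exact table:

```python
import numpy as np

def total_sum_of_squares(y):
    # Sum of squared differences between each value and the sample mean
    return ((y - y.mean()) ** 2).sum()

# Illustrative "Salary in 1000$" values, standing in for the table above
salaries = np.array([25.0, 32.0, 41.5, 45.0, 53.0, 62.5, 70.0])
print(total_sum_of_squares(salaries))
```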

The total sum of squares (SST) output is 5226.19. To fit the best line, we need to apply a linear regression model that reduces the sum of squared errors (SSE) to a minimum. To identify the slope and intercept, we use the equation

y = mx + b,

  • ‘m’ is the slope
  • ‘x’ is the independent variable
  • ‘b’ is the intercept

We will use the Ordinary Least Squares method to find the best intercept (b) and slope (m).

Ordinary Least Squares (OLS) Method

To use the OLS method, we apply the below formulas to find the slope ‘m’ and the line intercept ‘b’:

m = Σ (xᵢ − x̄)(yᵢ − ȳ) / Σ (xᵢ − x̄)²
b = ȳ − m · x̄

where x̄ and ȳ are the means of the independent and dependent variables. Below is a simple table to calculate those values.

Table 4: OLS method calculations

m = 1037.8 / 216.19 = 4.80
b = 45.44 − 4.80 × 7.56 = 9.15

Hence, y = mx + b becomes:

y = 4.80x + 9.15
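
Here is a minimal Python sketch of the same OLS computation; the x and y arrays are illustrative placeholders for the “Years of Experience” and “Salary in 1000$” columns:

```python
import numpy as np

def ols_fit(x, y):
    x_mean, y_mean = x.mean(), y.mean()
    # m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
    m = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    # b = ȳ − m·x̄
    b = y_mean - m * x_mean
    return m, b

# Illustrative data, standing in for the article's table
x = np.array([2.0, 3.1, 4.7, 6.0, 7.5, 9.1, 10.4, 12.0])       # years of experience
y = np.array([20.0, 26.0, 33.0, 39.0, 45.0, 52.0, 60.0, 67.0])  # salary in 1000$
m, b = ols_fit(x, y)
print(f"y = {m:.2f}x + {b:.2f}")
```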

Let’s compare our OLS method result with MS-Excel. Yes, we can test our linear regression best-fit line in Microsoft Excel.

Figure 2: Linear Regression using MS-Excel

Wonderful! Our OLS method output is pretty much the same as MS-Excel’s output for ‘y’.

  • Our OLS method output → y = 4.80x + 9.15
  • MS-Excel Linear Reg. Output → y = 4.79x + 9.18

Let us calculate SSE again by using our output equation.

Table 5: SSE calculations again after OLS method implementation

The sum of squared errors has now been reduced significantly, from 5226.19 to 245.38.
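
In code, the recomputation looks like this; x and y again stand in for the table columns, while 4.80 and 9.15 come from the OLS fit above:

```python
import numpy as np

def sse(x, y, m=4.80, b=9.15):
    # Squared differences between observed values and the fitted line
    y_pred = m * x + b
    return ((y - y_pred) ** 2).sum()
```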

The Ordinary Least Squares method looks simple, and the computation is easy. But this OLS method works comfortably only for a univariate dataset, that is, a single independent variable and a single dependent variable. A multivariate dataset, which contains multiple independent variables, makes this computation expensive and requires us to use a machine learning algorithm called “Gradient Descent”.

Gradient Descent Algorithm

The gradient descent algorithm’s main objective is to minimise the cost function. It is one of the best optimisation algorithms for minimising errors (the difference between actual and predicted values).

Let’s represent the hypothesis h, which is the function that our learning algorithm learns. For linear regression with a single variable:

hθ(x) = θ0 + θ1x

The goal is similar to the operation above, where we found the best-fit line’s intercept ‘b’ and slope ‘m’. Using the gradient descent algorithm, we figure out the minimal cost function by applying various values for theta 0 and theta 1 and checking the slope and intercept until we reach convergence.

A real-world analogy is finding the best direction to take a step downhill.

Figure 3: Gradient Descent 3D diagram. Source: Coursera — Andrew Ng

We take a step in the direction that leads down. After each step, we look around for the direction again, so we can get downhill faster. The same approach is used in this algorithm to minimise the cost function.

We can measure the accuracy of our hypothesis function by using a cost function. The formula is:

J(θ0, θ1) = (1 / 2m) Σ (hθ(xᵢ) − yᵢ)²

where m here is the number of training examples.
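
A sketch of this cost function in Python (theta0 and theta1 are the parameters we are searching over):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    m = len(y)                          # number of training examples
    predictions = theta0 + theta1 * x   # hypothesis h(x)
    return ((predictions - y) ** 2).sum() / (2 * m)
```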

Gradient Descent for Linear Regression

Why do we use partial derivatives in the equation? A partial derivative represents the rate of change of a function as one variable changes. In our case, we change the values of theta 0 and theta 1 and identify the rate of change. To apply these rates of change, the below are the update equations for theta 0 and theta 1, applied on each epoch:

θ0 := θ0 − α · (1/m) Σ (hθ(xᵢ) − yᵢ)
θ1 := θ1 − α · (1/m) Σ (hθ(xᵢ) − yᵢ) · xᵢ

To find the best minimum, repeat these steps, applying updated values for theta 0 and theta 1. In other words, repeat the steps until convergence.

Figure 4: Gradient Descent for linear regression. Source: http://sebastianraschka.com/ — Python Machine Learning, 2nd Edition
  • where alpha (α) is the learning rate: how big a step we take downhill.

Types of Gradient Descent Algorithms

There are three types of Gradient Descent Algorithms:

1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-Batch Gradient Descent

Batch Gradient Descent

  • In batch gradient descent, to calculate the gradient of the cost function, we need to sum over all training examples at each step
  • If we have 3 million samples (m training examples), then the gradient descent algorithm sums over all 3 million samples for every epoch. To move a single step, we have to process 3 million samples!
  • Batch gradient descent is not a good fit for large datasets
  • Below is a Python sketch of the batch gradient descent algorithm.
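
A minimal sketch, assuming illustrative data and hyperparameters (alpha and the number of epochs are arbitrary choices here):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, epochs=5000):
    m = len(y)                    # number of training examples
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        error = (theta0 + theta1 * x) - y        # h(x) − y over ALL m samples
        theta0 -= alpha * error.sum() / m        # step along ∂J/∂θ0
        theta1 -= alpha * (error * x).sum() / m  # step along ∂J/∂θ1
    return theta0, theta1

# Illustrative data, standing in for the article's table
x = np.array([2.0, 3.1, 4.7, 6.0, 7.5, 9.1, 10.4, 12.0])
y = np.array([20.0, 26.0, 33.0, 39.0, 45.0, 52.0, 60.0, 67.0])
print(batch_gradient_descent(x, y))
```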

Stochastic Gradient Descent (SGD)

  • In stochastic gradient descent, we use one example or one training sample at each iteration, instead of summing over the whole dataset at every step
  • SGD is widely used for training on larger datasets; it is computationally faster and can be trained in parallel
  • We need to randomly shuffle the training examples before each pass
  • A Python sketch for SGD is below.
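
A minimal sketch of SGD under the same illustrative assumptions; note the shuffle before each pass and the single-sample update:

```python
import numpy as np

def stochastic_gradient_descent(x, y, alpha=0.01, epochs=100):
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        # Randomly shuffle the training examples before each pass
        for i in np.random.permutation(m):
            error = (theta0 + theta1 * x[i]) - y[i]  # ONE sample only
            theta0 -= alpha * error
            theta1 -= alpha * error * x[i]
    return theta0, theta1
```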

Mini-Batch Gradient Descent

  • It is similar to SGD, but it uses n samples instead of 1 at each iteration, as sketched below.
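
A sketch of the mini-batch variant, again with arbitrary hyperparameters (a batch size of 4 here); each update averages the gradient over n samples rather than 1:

```python
import numpy as np

def mini_batch_gradient_descent(x, y, alpha=0.01, epochs=200, batch_size=4):
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        idx = np.random.permutation(m)
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]   # n samples per update
            error = (theta0 + theta1 * x[batch]) - y[batch]
            theta0 -= alpha * error.mean()
            theta1 -= alpha * (error * x[batch]).mean()
    return theta0, theta1
```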

Summary:

In summary, we covered the following topics in detail:

  • Linear regression’s independent and dependent variables
  • Ordinary Least Squares (OLS) method and Sum of Squared Errors (SSE) details
  • Gradient descent for the linear regression model and the types of gradient descent algorithms.
