
Simple Linear Regression: What’s inside?

Let's dive into the mathematical significance of simple linear regression and implement it.

Step 0: Getting started

Regression is a statistical approach for predicting a dependent variable (the target) from one or more independent variables (the data). It is one of the best-known and most widely understood statistical methods.

Linear regression is a model that assumes a linear relationship between its dependent and independent variables. It further branches out into Simple Linear Regression (SLR) and Multiple Linear Regression (MLR). We will explore Simple Linear Regression, regression with one dependent and one independent variable, because of its simplicity. SLR’s math is the base of many other Machine Learning models.

Here I will elaborate on Simple Linear Regression to gain intuition on how it works. I will use an NBA Game Score dataset (link below) to demonstrate SLR and finally compare it to Scikit-learn’s Linear Regression model.

Simple Linear Regression

To understand SLR, let’s break down the concepts we must go through:

  • SLR line and its coefficients
  • Loss function
  • Gradient descent
  • Deriving the coefficients (optional)

SLR Line and its coefficients

The slope-intercept form of a line is Y = MX + B.

Y is the dependent variable (the goal), X is the independent variable (the data), and M and B are the characteristics of the line. The slope (M) tells us how strongly X and Y are related, i.e. how much Y changes for a unit change in X, and the intercept (B) gives us the value of the dependent variable when X is zero.

Source: onlinemath4all

In SLR the equation is written as y = b0 + b1·x, where b0 and b1 are the intercept and slope respectively. They are determined by the closed-form formulas below to find the line of best fit.
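Written out, the standard least-squares estimates of the slope and intercept are:

b_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}

where \bar{x} and \bar{y} are the means of the independent and dependent variables.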

But these closed-form formulas are not always how the coefficients are found in regression. Let’s understand why in the gradient descent section. If you are curious about how we arrived at them, do check out the optional section.

Loss Function

The loss function is a metric that tells us how much the predicted value deviates from the actual value. There are plenty of loss functions available; here we will look at Mean Squared Error (MSE).

MSE, as the name suggests, squares the difference between the actual and predicted value for each record, sums it up and divides it by the number of records. Our goal is to find a model that yields the smallest loss.
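For a dataset with N records, this can be written as:

\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

where y_i is the actual value and \hat{y}_i = b_0 + b_1 x_i is the predicted value for record i.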

Gradient Descent

Gradient descent is an optimization algorithm that updates the parameters iteratively to find the model with the smallest loss. The loss function of a model with one or two parameters can be partially differentiated and solved analytically to find the minimum. But as the number of dimensions increases, it becomes hard to visualize the parameters, let alone check the curvature (the eigenvalues of the Hessian) at each candidate solution. With multiple local minima, we would have to examine every such stationary point to be sure we found the global minimum. Although it does not fully solve the global-minimum problem, gradient descent helps find a minimum for models of higher order.

But we haven’t yet looked at the root of the problem: the loss function itself. MSE, being a quadratic function, guarantees there will always be a point on the curve whose gradient is zero. Other loss functions do not guarantee such a point, or the point with zero gradient might not be the global minimum. Gradient descent is employed to cope with these cases.

Specific to our situation, we can either find the coefficients with the formulae presented above, or start with random non-zero values and let the algorithm work its way to the best fit. The mathematical significance of the gradient descent algorithm deserves an article of its own. For now, I will go through the intuition required to implement it. The mathematical approach is similar to that of the coefficients, so I figured it would be redundant to include it (I will link it at the end if you are curious).

Now, let’s see what gradient descent is all about. Imagine a person hiking downhill blindfolded; their goal is to reach the bottom of the valley. Intuitively, they take a step forward, and as long as the slope goes downward they keep moving until they feel a change in slope. Once they feel no change in elevation while moving, they stop.

Source: kdnuggets

But as described above, it wouldn’t make sense to take steps of a fixed length and only then evaluate for course correction, as the person could step past the minimum only to realize they moved in the wrong direction. This is where the learning rate comes into the picture: it scales down the steps to make sure the person does not step over the minimum.

How does the person choose the learning rate?

Unfortunately, there’s no one-size-fits-all value for the learning rate. One way to get a rough estimate is trial and error. The issues to look out for are learning rates that are too high (the loss oscillates or diverges) and too low (the loss barely moves). Both waste computation, so always run the model for a few iterations at the start and examine how the loss moves, as in the sketch below.
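As a rough illustration, here is a minimal sketch of that trial-and-error loop. It assumes X and y are the 1-D NumPy arrays loaded in the implementation section below, and the candidate learning rates are arbitrary picks, not recommendations.

import numpy as np

# assumes X and y are 1-D NumPy arrays (see the Implementation section below)
def probe_learning_rates(X, y, alphas, n_iters=5):
    # run a few gradient steps per candidate learning rate and report the loss trend
    N = X.shape[0]
    for alpha in alphas:
        b0, b1 = 0.0001, 0.0001
        losses = []
        for _ in range(n_iters):
            y_pred = b0 + b1 * X
            losses.append(np.mean((y - y_pred) ** 2))
            b0 -= alpha * (-2 / N) * np.sum(y - y_pred)
            b1 -= alpha * (-2 / N) * np.sum(X * (y - y_pred))
        print("alpha={}: loss went {} -> {}".format(alpha, round(losses[0], 2), round(losses[-1], 2)))

# hypothetical candidate values; tune for your own data
# probe_learning_rates(X, y, alphas=[1e-3, 1e-4, 1e-5])

If the loss explodes within a few iterations the rate is too high; if it barely changes, the rate is too low.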

One analogy for a high learning rate is a metal ball bearing released into a bowl. When the ball is released from the brim of the bowl, its velocity keeps increasing because the force points towards the minimum. When it passes the minimum, the ball moves away from the minimum while the force acting on it points in the opposite direction. Through a series of lossy oscillations, it finally attains stable equilibrium at the bottom of the bowl. The lossy oscillations are the wasted computational power and time.

Source: deeplearningwizard

Now, let’s merge this with SLR. The person we were referring to is the pair of coefficients b0 and b1. The valley is the surface of MSE plotted against the b0 and b1 parameters. The learning rate (alpha) gives us the step size, i.e. how much to modify the parameters on each pass without skipping over the minimum.
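Putting it together, each iteration computes the gradient of MSE with respect to the two coefficients and moves them a small step in the opposite direction; these are the same update rules used in the implementation below:

\frac{\partial \text{MSE}}{\partial b_0} = -\frac{2}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i), \qquad \frac{\partial \text{MSE}}{\partial b_1} = -\frac{2}{N}\sum_{i=1}^{N} x_i (y_i - \hat{y}_i)

b_0 \leftarrow b_0 - \alpha \frac{\partial \text{MSE}}{\partial b_0}, \qquad b_1 \leftarrow b_1 - \alpha \frac{\partial \text{MSE}}{\partial b_1}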

Deriving the Coefficients (Optional)

Although not crucial, it’s fun to explore the mechanics of the model. Let’s first establish the basic equations and notation for the derivation. Before we start, make sure you are familiar with the basics of calculus.
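Writing the prediction for each record as \hat{y}_i = b_0 + b_1 x_i, the error function E minimized here is the sum of squared errors (dropping MSE’s 1/N factor, which does not change where the minimum lies):

E = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} (y_i - b_0 - b_1 x_i)^2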

Now that we’re all set with the equations, let’s dive into how b0 and b1 are obtained. If we plot E against either coefficient (keeping the other constant), we get an upward-opening parabola with a single minimum.

The minimum of such a parabola can be found by partially differentiating the function w.r.t. the coefficient and equating the gradient to zero. Now our goals are:

  • To find the point where the gradient is zero (approximately) for both coefficients
  • To find an equation which is a function only of the given data.

Find b0

Now, to find the minimum, we equate the gradient to zero.
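Differentiating E with respect to b0 and setting the gradient to zero gives:

\frac{\partial E}{\partial b_0} = -2 \sum_{i=1}^{N} (y_i - b_0 - b_1 x_i) = 0 \;\Rightarrow\; b_0 = \bar{y} - b_1 \bar{x}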

Although we did find the value of b0, it is dependent on b1.

Find b1

Substitute the value of b0 and partially differentiate

Similar to b0, equate the gradient to zero
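Substituting b_0 = \bar{y} - b_1 \bar{x} into E and differentiating with respect to b1:

\frac{\partial E}{\partial b_1} = -2 \sum_{i=1}^{N} (x_i - \bar{x})\bigl((y_i - \bar{y}) - b_1 (x_i - \bar{x})\bigr) = 0 \;\Rightarrow\; b_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}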

This checks both our goals; we found an equation for b1 and it is only dependent on data that is available to us. We were able to find the values using the basic properties of curves.

Implementation

First, let’s process the data and then implement the models using:

  • Coefficients
  • Gradient descent
  • Scikit-learn’s linear regression

Data

Using standard Python libraries, let’s import our data and visualize the spread.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#load data
df=pd.read_csv('teampts_fg.csv')
X=df['TeamPoints'].values
y=df['FieldGoals'].values
#plot data
plt.rcParams["figure.figsize"] = (12,8)
plt.xlim(40,160)
sns.scatterplot(x='TeamPoints', y='FieldGoals', data=df);
#define the loss function
def loss(data, data_pred):
    N=data.shape[0]
    loss=np.sum(np.square(data-data_pred))/N
    return loss

Implementing the model using coefficients

Bc1=sum((X-np.mean(X)) * (y - np.mean(y)))/sum((X-np.mean(X))**2)
Bc0=np.mean(y) - Bc1*(np.mean(X))
x_coeff_model=np.linspace(40,160,1000)
y_coeff_model= Bc1*x_coeff_model + Bc0
#plot the line with the original data
plt.rcParams["figure.figsize"] = (12,8)
plt.xlim(40,160)
sns.scatterplot(x='TeamPoints', y='FieldGoals', data=df);
plt.plot(x_coeff_model, y_coeff_model,c='r');
plt.show()
print("The value of b0 is {} and b1 is {}".format(Bc0, Bc1))
Line of best fit using coefficients

Implementing the model using gradient descent

loss_history=[]
#define gradient descent
def gradient_descent(epochs, X , y , alpha):
    B0=0.0001
    B1=0.0001
    N=X.shape[0]
    for i in np.arange(epochs):
        y_pred= B0 + B1*X
        loss_history.append(loss(y,y_pred))
        dB0=(-2/N)*np.sum(y-y_pred)        
        dB1=(-2/N)*np.sum(X*(y-y_pred))
        B0= B0- alpha*dB0
        B1= B1- alpha*dB1
    return [B0,B1]
#call the gradient_descent function for 30 iterations on the data
Bgd0,Bgd1=gradient_descent(30, X, y, 0.00001)

Now plot the line from the obtained parameters.

x_gd_model=np.linspace(40,160,1000)
y_gd_model= Bgd1*x_gd_model + Bgd0
plt.rcParams["figure.figsize"] = (12,8)
plt.xlim(40,160)
sns.scatterplot(x='TeamPoints', y='FieldGoals', data=df);
plt.plot(x_gd_model, y_gd_model,c='r')
plt.show()
print("The value of b0 is {} and b1 is {}".format(Bc0, Bc1))
Line of best fit using gradient descent

I also stored the parameter values over a few hundred iterations with a small learning rate to visualize how the model matures, and used pyplot to save a plot at each iteration so they could be stitched into a gif, roughly as sketched below.
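Here is a minimal sketch of how such a gif can be produced. The per-iteration fitting and plotting mirrors the gradient descent code above (it refits from scratch and assumes X and y are still in memory), while the gif assembly uses the imageio library; the filenames, frame interval, and iteration count are arbitrary choices for illustration.

import numpy as np
import matplotlib.pyplot as plt
import imageio

frames = []
b0, b1, alpha = 0.0001, 0.0001, 0.00001
N = X.shape[0]
x_line = np.linspace(40, 160, 1000)
for i in range(300):  # a few hundred iterations
    y_pred = b0 + b1 * X
    b0 -= alpha * (-2 / N) * np.sum(y - y_pred)
    b1 -= alpha * (-2 / N) * np.sum(X * (y - y_pred))
    if i % 10 == 0:  # save every 10th frame to keep the gif small
        plt.scatter(X, y)
        plt.plot(x_line, b0 + b1 * x_line, c='r')
        plt.xlim(40, 160)
        fname = "frame_{:03d}.png".format(i)  # hypothetical filenames
        plt.savefig(fname)
        plt.close()
        frames.append(imageio.imread(fname))
imageio.mimsave("slr_training.gif", frames)  # stitch the saved frames into a gif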

Let’s now plot the loss history of the model.

plt.plot(loss_history);
plt.xlabel("Number of iterations")
plt.ylabel("Loss")
plt.show()
print("Loss at iteration 25 is {}".format(loss_history[24]))

Implementing the model using scikit-learn

from sklearn.linear_model import LinearRegression
lr= LinearRegression()
lr.fit(X.reshape(-1,1),y.reshape(-1,1))
x_lr_model=np.linspace(40,160,1000)
y_lr_model=lr.predict(x_lr_model.reshape(-1,1)).reshape(1,-1)[0]
#plot the line
plt.rcParams["figure.figsize"] = (12,8)
plt.xlim(40,160)
sns.scatterplot(x='TeamPoints', y='FieldGoals', data=df);
plt.plot(x_lr_model, y_lr_model,c='r')
plt.show()
Line of best fit using scikit-learn

Comparison

plt.plot(x_lr_model, y_lr_model,c='b', label="LinearRegression");
plt.plot(x_gd_model, y_gd_model, c='r', label="Gradient Descent");
plt.plot(x_coeff_model, y_coeff_model, c='g', label="Coefficient");
plt.legend()
plt.show()
#let's print the loss between lines to see the difference
print("The loss between Coeff model and Gradient descent {}".format(loss(y_coeff_model,y_gd_model)))
print("The loss between Coeff model and Linear Regression Model {}".format(loss(y_coeff_model,y_lr_model)))
print("The loss between Gradient descent and Linear Regression Model {}".format(loss(y_gd_model,y_lr_model)))

In the comparison plot above, we can see the LinearRegression line and the Coefficient line overlap. The gradient descent line, on the other hand, is slightly skewed relative to the LinearRegression line. What does this mean? Let’s investigate.

loss_gd=loss_history[-1]
loss_lr=loss(y, lr.predict(X.reshape(-1,1)).reshape(1,-1)[0])
print("The loss of the LinearRegression line is {}".format(loss_lr))
print("The loss of the GradientDescent line is {}".format(loss_gd))

Although there is a measurable difference between the lines themselves, the loss between the predictions and the actual values is quite similar for all the models. We can safely infer that the gradient descent line, although it does not have the same parameters, has positioned itself in a way that is approximately equal to the ideal fit.
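To make the parameter difference concrete, we can print the intercept and slope each approach found. lr.intercept_ and lr.coef_ are scikit-learn’s fitted attributes, and the indexing below assumes the two-dimensional shapes produced by the fit above.

#compare the fitted parameters of the three approaches
print("Coefficient formulas: b0={}, b1={}".format(Bc0, Bc1))
print("Gradient descent:     b0={}, b1={}".format(Bgd0, Bgd1))
print("scikit-learn:         b0={}, b1={}".format(lr.intercept_[0], lr.coef_[0][0]))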

Step steps_count[-1]: Wrapping up

Finally, this is what’s inside simple linear regression. Thank you.


References

How To Implement Simple Linear Regression From Scratch With Python – Machine Learning Mastery

Linear Regression using Gradient Descent

NBA Team Game Stats from 2014 to 2018

All pictures other than the cited ones are produced by the author

