
Machine Learning Algorithms from Start to Finish in Python: Linear Regression


Learn, understand, and implement the most important and fundamental algorithm in all of Data Science and Machine Learning.

Photo by Julian Ebert on Unsplash

Probably one of the most common algorithms around, Linear Regression is a must-know for Machine Learning practitioners. This is usually a beginner's first exposure to a real Machine Learning algorithm, and knowing how it operates on a deeper level is crucial to understanding it properly.

So, briefly, let's break down the real question: what really is Linear Regression?

Linear Regression Definition

Linear Regression is a supervised learning algorithm that takes a linear approach to modelling the relationship between a dependent variable and one or more independent variables. In other words, it aims to fit a linear trendline that best captures the relationship in the data and, from this line, predict what the target values may be.

Photo by Wikipedia

Great, I know the definition, but how does it really work? Great question! In order to answer it, let's run through a step-by-step process of how Linear Regression really operates:

  1. A trendline is fit to the data (as illustrated above).
  2. The distance between each point and the line (the red dots in the figure being the points, and the green lines being the distances) is calculated, squared, and then summed (the values are squared so that negative distances do not cancel out positive ones and skew the calculation). This is the error of the algorithm, better known as the residual sum of squares.
  3. The error for the iteration is stored.
  4. Based on an optimisation algorithm, the line is slightly "shifted" so that it may fit the data better.
  5. Steps 2–4 are repeated until a desirable result is reached, or the error has stopped decreasing meaningfully.

This method of fitting a line is known as Least Squares.
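To make the idea concrete, here is a minimal numpy sketch of what a single least-squares evaluation computes; the numbers and variable names are purely illustrative:

import numpy as np

# Illustrative data: four points that roughly follow a line
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# A candidate trendline (the current guess for slope and intercept)
slope, intercept = 2.0, 0.0
predictions = slope * x + intercept

# The sum of squared distances between the line and the points
residual_error = np.sum((y - predictions) ** 2)
print(residual_error)  # an optimiser would "shift" the line to reduce this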

The math behind Linear Regression

Note: Feel free to skip this part, as it unfortunately does contain some bizarre-looking LaTeX. However, if you want to really understand what is going on, I suggest you risk it for a biscuit and continue reading!

Photo by Antoine Dautry on Unsplash

The Linear Regression algorithm is the following:

$$h(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

Which can be shortened to:

$$h(x) = \theta^T x$$

The algorithm basically does the following:

  1. Take in a vector y (the labels of your data: house price, stock price, etc.):

$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

This is your target vector, which will be used to evaluate your model's predictions (more on that later).

  2. Take in a matrix X (the features of the data):

$$X = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & \dots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \dots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_0^{(m)} & x_1^{(m)} & \dots & x_n^{(m)} \end{bmatrix}$$

This is your data's features, e.g. age, gender, height, etc. This is the data that the algorithm will actually use to make predictions. Note how there is a feature x0: this is called your intercept term, and it is always equal to 1.

  3. Take in a vector of weights, and transpose it:

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \qquad \theta^T = \begin{bmatrix} \theta_0 & \theta_1 & \dots & \theta_n \end{bmatrix}$$

This is the magic bit of the algorithm. All your feature vectors will get multiplied by these weights. This is known as a dot product. Essentially, you will be trying to find the best combination of these values for a given dataset. This is known as optimisation.

  4. And get a vector of outputs:

$$\hat{y} = X\theta = \begin{bmatrix} h(x^{(1)}) \\ h(x^{(2)}) \\ \vdots \\ h(x^{(m)}) \end{bmatrix}$$

This is the prediction vector that your model outputs. You can then use it to evaluate your model's performance with a cost function.
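To make this concrete, here is a minimal numpy sketch (with made-up numbers) of the entire computation as a single matrix-vector product:

import numpy as np

# Three training examples, each with the intercept term x0 = 1 plus two features
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])
thetas = np.array([0.5, 1.0, 2.0])  # [theta_0, theta_1, theta_2]

predictions = X.dot(thetas)  # computes h(x) = theta^T x for every example at once
print(predictions)           # one prediction per training example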


This is essentially the whole algorithm denoted mathematically. Now you should have a solid understanding of how Linear Regression functions. But some questions remain: what is an optimisation algorithm? How do we pick the optimal weights? And how do we evaluate performance?

Cost Functions

A cost function is essentially a formula that measures the loss, or the "cost" of your model. If you have ever done any Kaggle competitions, you may have come across some of them. A few common ones include:

  • Mean Squared Error
  • Root Mean Squared Error
  • Mean Absolute Error

These functions are essential to model training and development as they answer the fundamental question of "how well is my model predicting new instances?". Keep this in mind, as this ties in with our next topic.
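As a quick sketch, all three of the cost functions above can be written in a few lines of numpy (assuming y and preds are arrays of true and predicted values):

import numpy as np

def mean_squared_error(y, preds):
    return np.mean(np.square(y - preds))

def root_mean_squared_error(y, preds):
    return np.sqrt(np.mean(np.square(y - preds)))

def mean_absolute_error(y, preds):
    return np.mean(np.abs(y - preds))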

Optimisation Algorithms

Optimisation is usually defined as the process of improving something so that it operates at its full potential. This is also applicable in Machine Learning. In the world of ML, optimisation is essentially trying to find the best combination of parameters for a certain dataset. This is essentially the "learning" bit of Machine Learning.

While many optimisation algorithms exist, I will discuss two of the most common ones: Gradient Descent and the Normal Equation.

Gradient Descent

Gradient Descent is an optimisation algorithm that aims to find the minimum of a function. It achieves this goal by iteratively taking steps in the negative direction of the slope. In our example, gradient descent continuously updates the weights by moving along the slope of the line tangent to the function. Well, fantastic, sounds great. English please? 🙂

A concrete example of Gradient Descent

Photo by Lucas Clara on Unsplash

To better illustrate Gradient Descent, let's go through a simple example. Imagine a person at the top of a mountain who wants to get to the bottom. They might look around and see in which direction they should take a step in order to get down faster. Then, they might take a step in that direction, and now they are closer to their goal. However, they have to be careful on the way down, as they might get stuck at a certain point, so they have to choose their step sizes accordingly.


Similarly, the objective of gradient descent is to minimise a function. In our case, it is to minimise the cost of our model. It does this by finding the line tangent to the function and moving in that direction. The size of the algorithm's "step" is defined by what is known as the learning rate, which essentially controls how far down we move. With this parameter, we have to be careful of two cases:

  1. If the learning rate is too large, the algorithm might not converge (reach a minimum) and instead bounce around the minimum without ever reaching it.
  2. If the learning rate is too small, the algorithm will take too long to reach the minimum, and it might also get "stuck" at a sub-optimal point.

We also have a parameter that controls the number of times the algorithm iterates over the dataset.

Visually, the algorithm would do something like this:

Photo By Wikipedia

Because this algorithm is so essential to Machine Learning, let’s recap what it does:

  1. The weights are randomly initialised. This is called (you guessed it) random initialisation.
  2. The model then makes predictions using these random weights.
  3. The model's predictions are evaluated through a cost function.
  4. The model then runs gradient descent, by finding the line tangent to the function and taking a step along its slope.
  5. The process is repeated for N iterations, or until a stopping criterion is met.
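To see this loop in its simplest possible form, here is a toy sketch of gradient descent minimising f(w) = w², whose minimum is at w = 0 (purely illustrative, not our regression code):

# Toy example: gradient descent on f(w) = w^2
w = 10.0             # our "random" initialisation
learning_rate = 0.1  # the size of each step

for i in range(50):
    gradient = 2 * w                   # derivative of w^2 with respect to w
    w = w - learning_rate * gradient   # step in the negative direction of the slope

print(w)  # very close to 0, the minimum of the function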

Advantages and Disadvantages of Gradient Descent

Advantages:

  1. Is very likely to reduce the cost function to a global minimum (very close to, or equal to, zero)
  2. One of the most effective optimisation algorithms

Disadvantages:

  1. Can be slow on large datasets, as it uses the whole dataset to compute the gradients
  2. Is liable to get stuck at a sub-optimal point (a local minimum)
  3. The user has to manually choose the learning rate and the number of iterations, which can be time-consuming

Now that you have been introduced to Gradient Descent, let’s introduce the Normal Equation.

Normal Equation

If we were to go back to our example: instead of taking steps down the mountain iteratively, we would be able to get to the bottom immediately. This is the case with the Normal Equation. It leverages linear algebra to produce weights that perform just as well as those found by Gradient Descent, in a fraction of the time.
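Concretely, the Normal Equation solves for the weights in a single closed-form step:

$$\theta = (X^T X)^{-1} X^T y$$

We will implement exactly this formula from scratch later in the article.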

Advantages and Disadvantages of the Normal Equation

Advantages

  1. No need to choose a learning rate or the number of iterations
  2. Extremely fast

Disadvantages

  1. Does not scale well to datasets with many features, as computing the matrix inverse is expensive
  2. Fails when the matrix being inverted is not invertible (for example, when features are redundant)

Whew, a lot to take in! I suggest you take a break and grab a quick coffee before moving on 🙂

Photo by Helena Lopes on Unsplash

Feature Scaling

This is an important preprocessing step for many Machine Learning algorithms, especially those that use distance metrics and calculations (like Linear Regression and Gradient Descent). It essentially scales our features so that they fall within a similar range. Think of it like a house and a scaled model of that house: the shape of both is the same (they are both houses), but the size is different (5m != 500m). We do this for the following reasons:

  1. It speeds up algorithms
  2. Some algorithms are sensitive to scale. In other words, if the features have different scales, there is a chance that higher weight is given to features with higher magnitude. This will impact the performance of the machine learning algorithm, and obviously, we do not want our algorithm to be biased towards one feature.

To demonstrate this, suppose we have three data points, named A, B and C, and we compute the Euclidean distances AB and BC before and after scaling. Before scaling, the distances are dominated by whichever feature has the larger magnitude; after scaling, every feature contributes comparably, so the distances become much more balanced and unbiased. If you want a great tutorial on feature scaling, please check out this blogpost by Analytics Vidhya, which works through these distance calculations with concrete numbers.
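As a rough sketch of this effect, here is a numpy example with made-up points, where one feature is small and the other is in the tens of thousands:

import numpy as np

# Made-up points: the second feature dwarfs the first in magnitude
points = np.array([[1.0, 20000.0],   # A
                   [2.0, 50000.0],   # B
                   [3.0, 30000.0]])  # C

def euclidean(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

# Before scaling, the distances are dominated by the large feature
print(euclidean(points[0], points[1]), euclidean(points[1], points[2]))

# Standardise each feature (column) to zero mean and unit variance
scaled = (points - points.mean(axis=0)) / points.std(axis=0)

# After scaling, both features contribute comparably to the distances
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[1], scaled[2]))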

Whew! That is a lot of information to take in. So, I suggest you take a break, grab a coffee, enjoy life, and when you feel ready, you can dive into actually coding this algorithm from scratch.


Coding Linear Regression from Scratch

Photo by Chris Ried on Unsplash

Ok, now the moment you have been waiting for: the implementation! Without further ado, let's begin!

Note: all the code can be downloaded from this Github repo. However, I recommend you follow along with the tutorial before you do, because then you will gain a better understanding of what you are actually coding!

First, let’s do some basic imports:

import numpy as np
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2, so this import requires
# an older version (e.g. pip install "scikit-learn<1.2")
from sklearn.datasets import load_boston

Yep, that's all the imports! All we are using is numpy for the mathematical implementation, matplotlib to plot graphs, and the Boston housing dataset from scikit-learn.

Next, let’s load our data and define our features and our labels:

# Load and split data
data = load_boston()
X,y = data['data'],data['target']

Next, let’s create a custom train_test_split function to split our data into a training and test set:

# Custom train test split
def train_test_divide(X,y,test_size=0.3,random_state=42):
    np.random.seed(random_state)
    train_size = 1 - test_size
    arr_rand = np.random.rand(X.shape[0])
    # Mark roughly train_size*100% of the rows (chosen at random) as training rows
    split = arr_rand < np.percentile(arr_rand,(100*train_size))

    X_train = X[split]
    y_train = y[split]
    X_test =  X[~split]
    y_test = y[~split]

    return X_train, X_test, y_train, y_test

X_train,X_test,y_train,y_test = train_test_divide(X,y,test_size=0.3,random_state=42)

Basically, we are just:

  1. Taking in a test size.
  2. Setting a random seed to make sure our results are repeatable.
  3. Obtaining the train set size based on the test set size.
  4. Picking random samples from our features.
  5. Splitting the randomly selected instances into train and test sets.

Our cost function

We will implement MSE, or Mean Squared Error, a common cost function used for regression tasks:

$$MSE = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$

def mse(preds,y):
    m = len(y)
    return 1/(m) * np.sum(np.square((y - preds)))

  • m refers to the number of training examples
  • y^(i) refers to a single instance in our label vector
  • preds refers to the predictions our model made

In order to write clean, repeatable and efficient code, as well as adhere to software development practices, we will create a Linear Regression class:

class LinReg:
    def __init__(self,X,y):
        self.X = X
        self.y = y
        self.m = len(y)
        self.bgd = False
  • The bgd boolean is a parameter that defines whether or not we use Batch Gradient Descent.

We will now create a method to add an intercept term:

def add_intercept_term(self,X):
        # Insert a column of ones at the start of the features: the intercept term x0
        X = np.insert(X,0,np.ones(X.shape[0]),axis=1)
        return X
  • This inserts a column of ones at the beginning of our features.
  • If we did not add this, we would force the hyperplane to pass through the origin, causing it to tilt considerably and hence not fit the data properly.

We will now scale our features:

$$x' = \frac{x - \mu}{\sigma}$$

def feature_scale(self,X):
        # Standardise each feature (column) to zero mean and unit variance
        X = (X - X.mean(axis=0)) / (X.std(axis=0))
        return X

Next, we will randomly initialise the weights:

def initialise_thetas(self):
        # Random initialisation: one random weight per feature (including the intercept)
        np.random.seed(42)
        self.thetas = np.random.rand(self.X.shape[1])

We will now code the normal equation from scratch using the following formula:

$$\theta = (X^T X)^{-1} X^T y$$
def normal_equation(self):
        A = np.linalg.inv(np.dot(self.X.T,self.X))
        B = np.dot(self.X.T,self.y)
        thetas = np.dot(A,B)
        return thetas

Essentially, we split the algorithm into 3 parts:

  1. We get the inverse of the dot product of X transposed and X
  2. We get the dot product of X transposed and our labels
  3. We get the dot product of our two calculated values

So, that’s the Normal Equation! Not too bad! Now, we will implement Batch Gradient Descent using the following formula:

$$\theta := \theta - \alpha \cdot \frac{1}{m} \cdot X^T (X\theta - y)$$

def batch_gradient_descent(self,alpha,n_iterations):
        self.cost_history = [0] * (n_iterations)
        self.n_iterations = n_iterations

        for i in range(n_iterations):
            # Predictions with the current weights
            h = np.dot(self.X,self.thetas.T)
            # Gradient of the MSE cost, scaled by the learning rate
            gradient = alpha * (1/self.m) * ((h - self.y)).dot(self.X)

            self.thetas = self.thetas - gradient
            self.cost_history[i] = mse(np.dot(self.X,self.thetas.T),self.y)

        return self.thetas

Here, we do the following:

  1. We take in alpha (the learning rate) and the number of iterations
  2. We create a list to store our cost function's history, to later plot it in a line plot
  3. We loop over the dataset n_iterations times
  4. We obtain the predictions, referred to as h(x), and calculate the gradient (the slope of the tangent line to the function)
  5. We update the weights to move down the gradient, by subtracting the actual values from our predictions (h - y) and multiplying by each feature
  6. We record the cost using our custom MSE function
  7. Repeat, and when finished, return our results

Let’s define a fit function to fit our data:

def fit(self,bgd=False,alpha=0.158,n_iterations=4000):
        # Scale the features first, then add the intercept term;
        # scaling after adding it would destroy the constant column of ones
        self.X = self.feature_scale(self.X)
        self.X = self.add_intercept_term(self.X)
        if bgd == False:
            self.thetas = self.normal_equation()
        else:
            self.bgd = True
            self.initialise_thetas()
            self.thetas = self.batch_gradient_descent(alpha,n_iterations)

Here, we just check whether the user wants Gradient Descent, and do our steps accordingly.

Let's build a function to plot our cost function:

def plot_cost_function(self):

        if self.bgd == True:
            plt.plot(range((self.n_iterations)),self.cost_history)
            plt.xlabel('No. of iterations')
            plt.ylabel('Cost Function')
            plt.title('Gradient Descent Cost Function Line Plot')
            plt.show()
        else:
            print('Batch Gradient Descent was not used!')

And finally a method that predicts unlabelled instances:

def predict(self,X_test):
        # Apply the same preprocessing as in fit: scale, then add the intercept.
        # Strictly speaking, the test set should be scaled with the training
        # set's mean and std; we scale it on its own here for simplicity.
        X_test = self.feature_scale(X_test.copy())
        X_test = self.add_intercept_term(X_test)
        predictions = np.dot(X_test,self.thetas.T)
        return predictions

Now, let’s see which optimisation produced better results. First, let’s try Gradient Descent:

lin_reg_bgd = LinReg(X_train,y_train)
lin_reg_bgd.fit(bgd=True)
mse(y_test,lin_reg_bgd.predict(X_test))
OUT:
28.824024414708344

Let’s plot our function and see how the cost function decreased and if Gradient Descent converged:

[Plot by author: the Gradient Descent cost function decreasing over the iterations]

We can see that the cost began to converge at around 1,000 iterations.

And now the Normal Equation:

lin_reg_normal = LinReg(X_train,y_train)
lin_reg_normal.fit()
mse(y_test,lin_reg_normal.predict(X_test))
OUT:
22.151417764247284

So we can see that the Normal Equation slightly outperformed Gradient Descent. This could be because the dataset is small, or because we have not picked the best value for the learning rate.

Activities

  1. Try to increase the learning rate greatly. What happens?
  2. Don’t apply feature scaling. Is there a difference?
  3. Do some research and see if you can implement an even better optimisation algorithm. Evaluate your model on the test set.

It’s been really fun writing this, and though it’s a bit long, I hope you have learned something today. Stay tuned and keep reading!

Photo by Jonathan Kemper on Unsplash
