Performing Linear Regression Using the Normal Equation

It is not always necessary to run an optimization algorithm to perform linear regression. You can instead solve a specific algebraic equation, the normal equation, to get the results directly. Although it is far from computationally optimal for big datasets, it's still an option worth being aware of.

Robert Kwiatkowski
Towards Data Science


1. Introduction

Linear regression is one of the most important and popular predictive techniques in data analysis. It is also one of the oldest: C.F. Gauss was already using it at the beginning of the 19th century in astronomy to calculate orbits (more).

Its objective is to fit the best line (or hyperplane) to a set of given points (observations) by calculating the regression function parameters that minimize a specific cost function (error), e.g. the mean squared error (MSE).
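For reference, if m denotes the number of observations, yᵢ the true label and ŷᵢ the model's prediction (symbols introduced here for clarity), the MSE cost function can be written as:

\mathrm{MSE}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2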

As a reminder, below is the linear regression equation in its expanded form.

ŷ = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

Eq. 1: A linear regression equation

In vectorized form it looks like this:

ŷ = Xθ

Eq. 2: A linear regression equation in a vectorized form

where θ is a vector of parameter weights and X is the matrix of features.

Usually, finding the best model parameters is done by running some kind of optimization algorithm (e.g. gradient descent) to minimize a cost function. However, it is also possible to obtain the values (weights) of these parameters directly, by solving an algebraic equation called the normal equation. It is defined below.

θ = (XᵀX)⁻¹ Xᵀ y

Eq. 3: The normal equation

2. Hand calculations

In this article, we will perform linear regression on a very basic case so that we can avoid lengthy hand calculations. By the way, if you think you need to refresh your linear algebra skills, there are many good resources on the Internet (e.g. a YouTube series I recommend).

In this example, there are only three points (observations), each with a single feature (x₁): (1, 1), (2, 3) and (3, 2). On a graph, they look like this:

A scatter plot showing points used in this example

In this case, the linear regression equation has the form:

ŷ = θ₀ + θ₁x₁

Eq. 4: A linear regression equation for this example

Features (X) and labels (y) are:

Features and Labels matrices
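Concretely, for the three points above (with a bias column of 1s added to X, as explained below), they are:

X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}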

Note that we add the bias term as a column of 1s in X; its weight (the intercept θ₀) is found during the calculations along with the other parameters. Not adding this term will lead to a wrong solution.

Step 1: Transposition of matrix X

This is a relatively simple task — rows become new columns.
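Applied to the X defined above, this step gives:

X^\top = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix}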

Step 2: Multiplication of the transposed matrix and matrix X
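Carrying out this multiplication for our example:

X^\top X = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix}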

Step 3: Inversion of the resulting matrix

To invert a simple 2×2 matrix we can use the formula:
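For a general 2×2 matrix, the formula reads:

\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}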

Therefore, we get:
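Applying this formula to the XᵀX obtained in Step 2:

(X^\top X)^{-1} = \frac{1}{3 \cdot 14 - 6 \cdot 6} \begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix} = \frac{1}{6} \begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix}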

Note: for bigger matrices (larger than 3×3), inverting them by hand becomes much more cumbersome, and in practice an algorithmic approach such as Gaussian elimination is used. This is important to remember!

Step 4: Multiplication of the inverted matrix with X transposed
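Continuing with the matrices from the previous steps:

(X^\top X)^{-1} X^\top = \frac{1}{6} \begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} = \frac{1}{6} \begin{bmatrix} 8 & 2 & -4 \\ -3 & 0 & 3 \end{bmatrix}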

Step 5: Final multiplication to obtain the vector of best parameters
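Multiplying the result of Step 4 by y gives the parameter vector:

\theta = \frac{1}{6} \begin{bmatrix} 8 & 2 & -4 \\ -3 & 0 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix} = \frac{1}{6} \begin{bmatrix} 6 \\ 3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0.5 \end{bmatrix}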

Finally, our linear regression equation takes the form:

ŷ = 1 + 0.5x₁

Eq. 5: A linear regression equation with the best weights

Plotting this line onto the previous graph looks like this:

A scatter plot with original points and a regression line (red)

3. Implementation in Python

The same calculations can be implemented in Python using the NumPy library, which contains a set of linear algebra functions in the numpy.linalg module.

import numpy as np

X = np.c_[[1,1,1],[1,2,3]]  # defining features
y = np.c_[[1,3,2]]  # defining labels
theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)  # normal equation
print(theta)
Running the code above prints the weight vector θ₀ = 1, θ₁ = 0.5, matching the hand calculations.

Now we can define new features we would like to predict values for.

X_new = np.c_[[1,1,1,1],[0, 0.5,1.5,4]]  # new features

By applying equation 2, we obtain the predicted values.

y_pred = X_new.dot(theta)  # making predictions
print(y_pred)
The predicted values come out as 1, 1.25, 1.75 and 3 for the new feature values 0, 0.5, 1.5 and 4, respectively.

4. Comments

As you can see, the normal equation is pretty easy to use and to implement in Python: it's only one line of code. So why is it not commonly used?

The problem is its numerical complexity. Solving this equation requires inverting a matrix, which is a computationally expensive operation: depending on the implementation, in big O notation it is O(n³) or slightly less. This means it scales up horribly; in practice, when you double the number of features, the computation time increases by roughly 2³ = 8 times. There is also some chance that the result of Step 2 is not invertible at all, which causes big trouble. These are the reasons why this approach is uncommon in practice.
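One common way to sidestep a non-invertible XᵀX (not used in the calculations above) is the Moore-Penrose pseudoinverse, available in NumPy as np.linalg.pinv. It is computed via SVD and always exists; a minimal sketch, reusing the same data as before:

import numpy as np

X = np.c_[[1,1,1],[1,2,3]]  # same features as above
y = np.c_[[1,3,2]]          # same labels as above

# pinv computes the pseudoinverse via SVD, so it works even when X.T @ X is singular
theta = np.linalg.pinv(X).dot(y)
print(theta)  # should again give weights close to 1 and 0.5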

On the bright side, this approach gives the result in a single step and you don't have to choose a learning rate parameter. Additionally, in terms of memory usage this approach is linear, O(m), meaning it handles huge datasets effectively, as long as they fit into your computer's memory.
