Machine Learning 101: Linear Regression

Going Back to the Very Foundations of Data Science

Y. Natsume
Towards Data Science

--

Machine learning and data science have come a long way since being described as the “sexiest job of the 21st century” — we now have very powerful deep learning models capable of driving automobiles autonomously or seamlessly translating between different languages. Right at the foundation of all these powerful deep learning models is a humble model which I want to explore in depth today. That model is linear regression!

Despite its simplicity, a good understanding of linear regression is a prerequisite for understanding how more advanced models work! After all, one must learn to walk before one can run! Today in this article I will explore in detail:

  • The basic mathematics behind linear regression.
  • Loss functions.
  • The analytical solution for ordinary least squares regression.
  • How to create a Python ordinary least squares solver 🐍

If you are just starting out in machine learning and data science, or you just want to revise some fundamental concepts which you have since left behind in your advanced career, go on ahead and read till the end of this article!

Mount Taranaki, Egmont National Park, New Zealand. Photo by David Billings on Unsplash.

Mathematical Linear Relationships in Data

The linear regression model assumes that the dependent variable d has a linear relationship with m explanatory variables g₁, g₂, … gₘ:

d = c₁g₁ + c₂g₂ + … + cₘgₘ,

where c₁, c₂, … cₘ are the coefficients associated with the explanatory variables. For example, the dependent variable d could be something like the price of a house, and the explanatory variables g₁, g₂, … gₘ could be the geographical coordinates, the age and the floor area etc. of that house.

The linear equation above can be written compactly as the following inner product:

d = Σᵢ cᵢgᵢ = mᵀg,

where m is a vector of length m: m = [c₁, c₂, … cₘ]ᵀ, and g is also a vector of length m: g = [g₁, g₂, … gₘ]ᵀ.

Since m contains the vector of coefficients which allow us to make predictions on d using g, m is also called the vector of model parameters defining the linear relationship between d and g.

In data analysis we usually have to deal with a set of dependent variables rather than a single value. Therefore, we usually work with a vector containing n dependent variables: d = [d₁, d₂, … dₙ]ᵀ, where the uth element in d has a linear relationship with a corresponding set of m explanatory variables Gᵤ₁, Gᵤ₂, … Gᵤₘ:

dᵤ = Gᵤ₁c₁ + Gᵤ₂c₂ + … + Gᵤₘcₘ = Σᵥ Gᵤᵥcᵥ.

This linear relationship for all n elements in the vector d = [d₁, d₂, … dₙ]ᵀ can be written compactly as the following matrix product:

d = Gm,

where G is a matrix of size n×m with the structure:

G = [[G₁₁, G₁₂, … G₁ₘ],
     [G₂₁, G₂₂, … G₂ₘ],
      …
     [Gₙ₁, Gₙ₂, … Gₙₘ]],

with the uth row of G containing the vector of explanatory variables for the uth element in d. As with above, m is a vector of length m: m = [c₁, c₂, … cₘ]ᵀ containing the model parameters.

Therefore, given some set of explanatory variables G and corresponding dependent variables d, the goal in linear regression is to determine m which best describes the linear relationship between d and G.
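
To make this concrete, here is a minimal numpy sketch of the forward relationship d = Gm; the numbers are made up purely to illustrate the shapes involved:

import numpy as np

# Hypothetical example: n = 3 observations, m = 2 model parameters.
G = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])  # each row holds the explanatory variables of one observation
m = np.array([0.5, 2.0])    # model parameters [c₁, c₂]

d = np.dot(G, m)            # one dependent variable per row of G
print(d)                    # [ 4.5  6.5 10.5]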

Over and Under Determined Linear Systems

Now, if n, the number of data points in d, is larger than m, the number of model parameters in m, then the data provides enough information to uniquely determine all the model parameters. We say that this system is over determined.

However, if m is larger than n, then unfortunately the data does not provide sufficient information to uniquely determine all the model parameters. We say that this system is under determined.

For the rest of this article, we shall deal only with over determined linear systems, which are what we usually encounter in data science projects. For under determined systems, Lagrange multipliers are required to obtain the solution, which makes the mathematics involved a little more complicated.

Prediction Errors and Loss Functions

In order to determine the best m, we need to minimize the prediction error e between the observations d and the predictions Gm:

e = d - Gm.

If m perfectly describes the relationship between d and G, then Gm will equal d, and the prediction error e will be 0.

In order to further quantify the overall prediction error across all n elements in d, we reduce the error vector e to a scalar. This is usually done by using some sort of norm of the vector e. Commonly used norms are:

  1. The L1 norm: Σᵤ|eᵤ|
  2. The L2 norm: √(Σᵤeᵤ²)
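
As a quick illustration, both norms are one-liners in numpy (the error values below are made up):

import numpy as np

e = np.array([0.5, -1.0, 2.0])  # hypothetical prediction errors

l1 = np.sum(np.abs(e))          # L1 norm: sum of absolute errors
l2 = np.sqrt(np.sum(e ** 2))    # L2 norm, equivalent to np.linalg.norm(e)

print(l1, l2)                   # 3.5 2.2912878...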

The L1 norm corresponds to assuming that the data in d is drawn from a two-sided exponential (Laplace) distribution, while the L2 norm corresponds to assuming that the data is drawn from a Gaussian distribution.

As shown in the figure below, the exponential distribution is longer tailed than the Gaussian distribution. This means that the probability of a data point lying far from the mean is higher under the exponential distribution than under the Gaussian distribution. In turn, this implies that the L1 norm, which assumes the exponential distribution, is more robust to outlier data points located far away from the mean than the L2 norm.

The exponential distribution is longer tailed than the Gaussian distribution, making it more robust to outliers. Image created by author.

When determining which norm to use, it is important to understand which probability distribution is governing d, as this will greatly affect the estimated values of m. Thankfully however, due to the central limit theorem, Gaussian distributions arise very frequently, and for the most part using the L2 norm should lead to a reasonably good estimation of the model parameters m.

On a side note, more complex loss functions can be constructed by using combinations of both the L1 and L2 norms, such as the Huber loss, or by introducing regularization terms to make the optimal solution unique. The possibilities are endless!
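
For instance, a rough sketch of the Huber loss (with a hand-picked threshold delta, purely for illustration) combines the quadratic and absolute behaviours like so:

import numpy as np

def huber_loss(e, delta=1.0):
    # Quadratic (L2-like) for small errors, linear (L1-like) for large ones, elementwise.
    quadratic = 0.5 * e ** 2
    linear = delta * (np.abs(e) - 0.5 * delta)
    return np.where(np.abs(e) <= delta, quadratic, linear)

print(huber_loss(np.array([0.5, -3.0])))  # 0.125 for the small error, 2.5 for the outlier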

Over Determined Ordinary Least Squares Solution

Assuming that the data in d is governed by a Gaussian distribution, we want to minimize the loss function L(m) associated with the L2 norm of the prediction error e = d - Gm:

L(m) = Σᵤeᵤ² = eᵀe = (d - Gm)ᵀ(d - Gm) = (dᵀ - mᵀGᵀ)(d - Gm),
L(m) = dᵀd - dᵀGm - mᵀGᵀd + mᵀGᵀGm.

As L(m) is composed of the squares of the error, this problem is also known as the ordinary least squares problem. We minimize the loss function L(m) with respect to the model parameters m by differentiating L(m) with respect to m and equating the result to zero:

∂L(m)/∂m = 0 - Gᵀd - Gᵀd + 2GᵀGm = 0,
GᵀGm - Gᵀd = 0,
GᵀGm = Gᵀd,
m = [GᵀG]⁻¹Gᵀd.

(If you are not very familiar with differentiation with respect to vectors, you might want to refer to this set of notes by Yang, Cao, Chung, and Morris.)

Hence, the over determined ordinary least squares solution to d = Gm is simply: m = [GᵀG]⁻¹Gᵀd!

Ordinary Least Squares Python Solver

Enough of mathematics! Let us now implement the ordinary least squares solution derived above in Python!

import numpy as np

def least_squares(G, d):
    """
    Over determined ordinary least squares solver.

    Inputs
    G: np.array of n x m kernel matrix of explanatory variables.
    d: np.array of n observations.

    Returns
    M: np.array of m estimated model parameters.
    """
    # m = [GᵀG]⁻¹Gᵀd
    M = np.dot(np.linalg.inv(np.dot(G.T, G)), G.T)
    M = np.dot(M, d)
    return M
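
As a side note, explicitly inverting GᵀG is fine for small, well-conditioned problems like the one below, but solving the normal equations GᵀGm = Gᵀd directly is generally more numerically stable. A quick sketch of an equivalent solver:

def least_squares_solve(G, d):
    # Solve (GᵀG)m = Gᵀd directly instead of forming the inverse explicitly.
    return np.linalg.solve(np.dot(G.T, G), np.dot(G.T, d))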

Let us also create a one dimensional linear system to solve using our code. We assume that we have 10 noiseless measurements generated by a very simple linear system with 2 model parameters:

d = c₀ + c₁g,

where we define c₁ = π and c₀ = exp(1).

g = np.arange(0, 10)
d = np.exp(1) + np.pi * g

We now need to prepare the matrix of explanatory variables G. To do that, we note that the equation above is equivalent to c₀ being the coefficient for an explanatory variable that always has the value of 1:

d = c₀g⁰ + c₁g¹ = 1·c₀ + c₁g.

The matrix G will therefore have 2 columns corresponding to the 2 model parameters c₁ and c₀. The first column contains the values of g, while the second column contains 1s only. This matrix G can be created as a second order Vandermonde matrix using np.vander:

# The 2 here means: create a 2nd order Vandermonde matrix from g.
G = np.vander(g, 2)
print(G)
[[0 1]
[1 1]
[2 1]
[3 1]
[4 1]
[5 1]
[6 1]
[7 1]
[8 1]
[9 1]]
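
If you prefer not to rely on np.vander, the same matrix can also be assembled explicitly; a quick equivalent sketch:

G_alt = np.column_stack([g, np.ones_like(g)])  # first column g, second column 1s
print(np.array_equal(G, G_alt))                # True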

All we need to do now is to pass both G and d to the least squares solver to solve for the model parameters:

print(least_squares(G, d))
[3.14159265 2.71828183]

which correctly returns the values for c₁ = π and c₀ = exp(1)!
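
As a sanity check, numpy's built-in least squares routine np.linalg.lstsq returns the same estimates:

m_lstsq, residuals, rank, sv = np.linalg.lstsq(G, d, rcond=None)
print(m_lstsq)
[3.14159265 2.71828183]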

While this might be a very simple example (with data for which we already know the model parameters beforehand), it shows exactly how linear regression works, and the same procedure carries over to more complex problems.

More Advanced Linear Regression

We have done a lot of mathematics and some coding today to finally reach this point! However by no means are we fully done with linear regression! There are other, more sophisticated ways to obtain the least squares solution derived above, for example by using the singular value decomposition of the matrix G, or through maximum likelihood methods! Each method has its own peculiarities suited to different scenarios.

In addition, we have not yet explored the under determined case, which requires slightly more complicated mathematics to obtain the solution! Neither have we attempted to create a solver for the L1 norm loss function, which requires more advanced numerical methods such as linear programming or stochastic gradient descent.

Linear regression might be the simplest regression model, but it is also extremely deep and filled with complex details! Do take the time to explore these more advanced themes to get an even better understanding of how linear regression works!

Summary

Today we explored the mathematics behind linear regression and created an ordinary least squares solver using Python. The concepts explored today such as minimizing the loss function might be fundamental, but apply to all other more advanced models such as neural networks! I hope you were able to gain a better understanding of how regression works in general from this article. Thank you for reading!

References

[1] W. M. Menke (2012), Geophysical Data Analysis: Discrete Inverse Theory MATLAB Edition, Elsevier.
[2] Yang, Cao, Chung and Morris (2007), Applied Numerical Methods Using MATLAB, Wiley.
