
Linear Regression: From math to code

A guide through the math and implementation using Python.

Image by Author

Introduction

If you have a background in statistics, perhaps even at high-school level, you are probably familiar with Linear Regression (LR). Briefly, it is one of the core techniques used in statistics and machine learning (ML) to define a model that best describes a set of observed data, D, generated by an unknown process. While you can find plenty of examples online that explain how to do LR along with example code, this post attempts to provide a stepwise breakdown of the mathematics behind Bayesian LR and then shows how to implement it in Python. Please note that my understanding and explanation of this concept, including notation, is based entirely on chapter 3 of [1]; the references are provided at the end for you to have a look. For those who are familiar with the algorithm, this post may serve as a revision of the concepts behind it. For those who are not, I hope it acts as a guide as you dive through the concepts provided by other authors. Some knowledge of linear algebra, probability theory and calculus is essential to get the most out of this post. Okay, let’s begin!

Math

Say you have a set, D, of observed data pairs, (xᵢ, tᵢ), obtained from an unknown process, where xᵢ and tᵢ are inputs and outputs respectively (bold x means it’s a vector). The relationship between an output tᵢ of this unknown process and its input xᵢ can be described by a function tᵢ = f(xᵢ, w). From a linear algebra point of view, once we have the right set of basis vectors, we can represent any vector/point in the vector space. The idea here is that our input lives in a D-dimensional space, ℝᴰ, and our output is represented in an M-dimensional space, ℝᴹ, which I like to call the output space. Our modeling problem now becomes finding a suitable value of M, i.e. a suitable number of basis elements, and the parameter values, w, that combine them. These basis elements are themselves functions of x, referred to as basis functions, which gives the model its flexibility. Many functions can serve as basis functions, such as polynomials, Gaussians and sigmoids. In the code section of this post we are going to use polynomials.
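In the notation of [1], the model sketched above can be written as a weighted sum of M basis functions (with φ₀(x) = 1 playing the role of a bias term):

```latex
y(\mathbf{x}, \mathbf{w})
= \sum_{j=0}^{M-1} w_j \,\phi_j(\mathbf{x})
= \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}),
\qquad
\boldsymbol{\phi}(\mathbf{x}) = \left(\phi_0(\mathbf{x}), \ldots, \phi_{M-1}(\mathbf{x})\right)^\top
```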

Image by Author based on [1] and [2]

This function describes a pattern in the data: if there is indeed a pattern, the next time we have an input xᵢ we expect to measure the same output tᵢ, which our function can also predict. In practice, however, this will not be entirely accurate, as there is always some noise in our measurements, so our function should account for this noise. Observations will tend to center around our function’s prediction, and by the central limit theorem the distribution of these observations is well approximated by a Gaussian.
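Writing the noise as additive zero-mean Gaussian with precision β (i.e. variance β⁻¹), as in [1], this gives:

```latex
t = y(\mathbf{x}, \mathbf{w}) + \epsilon,
\qquad
p(t \mid \mathbf{x}, \mathbf{w}, \beta)
= \mathcal{N}\!\left(t \mid y(\mathbf{x}, \mathbf{w}),\, \beta^{-1}\right)
```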

Image by Author based on [1]

The joint probability of all the observations, assuming they are independent and identically distributed (i.i.d.), is the product of the probability distributions of the individual observations, as expressed below.
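In the notation of [1], with N observations:

```latex
p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)
= \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n),\, \beta^{-1}\right)
```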

Image by Author based on [1]

This expression is called the likelihood, and if we take its logarithm, so that we can work with sums instead of products, we obtain the expression below.
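Expanding the log of the Gaussian product gives (equation 3.11 of [1]):

```latex
\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)
= \frac{N}{2} \ln \beta - \frac{N}{2} \ln (2\pi)
- \frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \right)^2
```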

Image by Author based on [1] and [2]

The Greek letter Φ denotes the design matrix; we will see one way of building it in the code section. t is a column vector of the output observations tᵢ, and X is a matrix whose rows are the observations xᵢ. Assuming we fix a value for M, the parameter values that maximize the likelihood have a closed-form solution, referred to as the maximum likelihood estimate.
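Setting the gradient of the log-likelihood with respect to w to zero and solving yields the normal equations:

```latex
\mathbf{w}_{\mathrm{ML}}
= \left( \boldsymbol{\Phi}^\top \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^\top \mathbf{t}
```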

Image by Author

To get some intuition for maximum likelihood, let’s do a simple thought experiment. Say we flip a coin once and get a tail. We flip a second time, and we get a tail again. Okay, maybe this coin is flawed. We flip one more time, and this time we still get a tail. Based on these observations alone we might conclude that the chance of getting a tail is 100%. But we know that our coin is fair, i.e. we have prior knowledge. Using Bayes’ theorem, we can combine this prior knowledge with the likelihood to come up with a better expression for computing our parameters w.

Image by Author based on [1]

Applying the log to both sides of the equation and solving for the optimal parameter w, as was done before, we obtain the expression below and the parameter estimate below it. This is called the maximum a posteriori (MAP) estimate.
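With a zero-mean Gaussian prior of precision α on w, the MAP estimate works out to a regularized form of the normal equations:

```latex
\mathbf{w}_{\mathrm{MAP}}
= \left( \frac{\alpha}{\beta} \mathbf{I} + \boldsymbol{\Phi}^\top \boldsymbol{\Phi} \right)^{-1}
\boldsymbol{\Phi}^\top \mathbf{t}
```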

Image by Author

Both maximum likelihood and maximum a posteriori give us only a point estimate of the model parameters, which we can later use to make predictions. However, when making predictions it’s important to have a measure of uncertainty so that we can make sound decisions based on them. Clearly, stopping here is not sufficient for building a robust model.

The Bayesian approach (BLR) takes maximum a posteriori one step further to form a predictive distribution. From this distribution, each future input has a corresponding target mean and variance/covariance. Below is a high-level view of the math.

Image by Author
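Concretely, the posterior over w is again Gaussian (equations 3.53–3.54 of [1]):

```latex
p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N),
\qquad
\mathbf{m}_N = \beta\, \mathbf{S}_N \boldsymbol{\Phi}^\top \mathbf{t},
\qquad
\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta\, \boldsymbol{\Phi}^\top \boldsymbol{\Phi}
```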

Now the predictive distribution can be obtained using the equation below.

Image by Author

Both the posterior parameter distribution and the predictive distribution can be obtained using the conditional and marginal distribution properties of the Gaussian distribution. See chapter 2 of [1] for more details.
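The resulting predictive distribution for a new input x is (equations 3.58–3.59 of [1]):

```latex
p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta)
= \mathcal{N}\!\left(t \mid \mathbf{m}_N^\top \boldsymbol{\phi}(\mathbf{x}),\, \sigma_N^2(\mathbf{x})\right),
\qquad
\sigma_N^2(\mathbf{x}) = \frac{1}{\beta}
+ \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_N\, \boldsymbol{\phi}(\mathbf{x})
```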

In all of the methods above, we have assumed that the variances of our data and parameters (i.e. of w), β⁻¹ and α⁻¹ respectively, are known. In practice these values are unknown and must be estimated from the dataset. A full Bayesian treatment, which I’m not going to cover in this post, offers a technique called evidence approximation that can be used to estimate them. However, a couple of things are worth pointing out here. If the prior variance α⁻¹ is very high, w approaches the maximum likelihood estimate. If the dataset is very small, the posterior over w stays close to the prior. And as the dataset size grows towards infinity, the variance of our predictions shrinks, approaching the limit β⁻¹.

Code

Using Python code, let’s begin by constructing the design matrix. As stated earlier, we are going to use polynomial basis functions in this case.
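A minimal sketch of such a helper using NumPy (the function name and the use of `np.vander` are my choices, not necessarily those of the original code):

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Build the design matrix Phi for polynomial basis functions.

    Column j holds x**j, so for inputs of length n the result has
    shape (n, degree + 1); the first column of ones is the bias term.
    """
    x = np.asarray(x, dtype=float).ravel()
    return np.vander(x, N=degree + 1, increasing=True)
```

Each row of the returned matrix is φ(xᵢ)ᵀ, so the whole model prediction for a batch of inputs is simply `Phi @ w`.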

Let’s first create a class for doing regression using the maximum likelihood estimate (ML):
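A possible sketch of such a class (the class and method names are my own; the closed-form solution is the one derived in the math section, computed here via the pseudo-inverse for numerical stability):

```python
import numpy as np

class MLRegression:
    """Polynomial regression fit by maximum likelihood (least squares)."""

    def __init__(self, degree):
        self.degree = degree
        self.w = None

    def _phi(self, x):
        # Design matrix: column j holds x**j.
        return np.vander(np.asarray(x, dtype=float).ravel(),
                         N=self.degree + 1, increasing=True)

    def fit(self, x, t):
        # w_ML = (Phi^T Phi)^-1 Phi^T t, computed via the pseudo-inverse
        Phi = self._phi(x)
        self.w = np.linalg.pinv(Phi) @ np.asarray(t, dtype=float).ravel()
        return self

    def predict(self, x):
        return self._phi(x) @ self.w
```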

Now let’s create a class for doing prediction using Maximum a Posteriori (MAP):
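A sketch of a MAP version, assuming the prior precision α and noise precision β are passed in at construction (parameter names and defaults here are my own choices):

```python
import numpy as np

class MAPRegression:
    """Polynomial regression with a zero-mean Gaussian prior on w (MAP)."""

    def __init__(self, degree, alpha=1e-2, beta=25.0):
        self.degree = degree
        self.alpha = alpha   # prior precision on the weights
        self.beta = beta     # noise precision of the observations
        self.w = None

    def _phi(self, x):
        return np.vander(np.asarray(x, dtype=float).ravel(),
                         N=self.degree + 1, increasing=True)

    def fit(self, x, t):
        # w_MAP = (alpha/beta * I + Phi^T Phi)^-1 Phi^T t
        Phi = self._phi(x)
        A = (self.alpha / self.beta) * np.eye(Phi.shape[1]) + Phi.T @ Phi
        self.w = np.linalg.solve(A, Phi.T @ np.asarray(t, dtype=float).ravel())
        return self

    def predict(self, x):
        return self._phi(x) @ self.w
```

Note that this is exactly ridge regression with regularization strength λ = α/β.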

Finally, let’s create a class for doing Bayesian Linear Regression. Note that this is not fully Bayesian, as α⁻¹ and β⁻¹ are assumed known and should be passed in as standard deviations during class instantiation.
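A sketch of the idea; for simplicity this version takes the precisions α and β directly rather than standard deviations (converting is just `precision = 1 / std**2`), and the names are my own:

```python
import numpy as np

class BayesianLinearRegression:
    """Bayesian polynomial regression with known precisions alpha and beta."""

    def __init__(self, degree, alpha=1e-2, beta=25.0):
        self.degree = degree
        self.alpha = alpha   # prior precision on the weights
        self.beta = beta     # noise precision of the observations
        self.mean = None     # posterior mean m_N
        self.cov = None      # posterior covariance S_N

    def _phi(self, x):
        return np.vander(np.asarray(x, dtype=float).ravel(),
                         N=self.degree + 1, increasing=True)

    def fit(self, x, t):
        # S_N^-1 = alpha*I + beta * Phi^T Phi;  m_N = beta * S_N Phi^T t
        Phi = self._phi(x)
        S_inv = self.alpha * np.eye(Phi.shape[1]) + self.beta * Phi.T @ Phi
        self.cov = np.linalg.inv(S_inv)
        self.mean = self.beta * self.cov @ Phi.T @ np.asarray(t, dtype=float).ravel()
        return self

    def predict(self, x):
        """Return the predictive mean and standard deviation per input."""
        Phi = self._phi(x)
        mu = Phi @ self.mean
        # Predictive variance: 1/beta + phi(x)^T S_N phi(x), row by row
        var = 1.0 / self.beta + np.sum(Phi @ self.cov * Phi, axis=1)
        return mu, np.sqrt(var)
```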

Testing

Now let’s test our models. We are going to use synthetic data, so first we create a simple function to generate synthetic data and then set some parameters and prepare training and testing data.
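A sketch of such a generator, sampling noisy observations of sin(2πx) on [0, 1] (the function name, noise level and use of a fixed seed are my choices):

```python
import numpy as np

def make_data(n, noise_std=0.2, seed=0):
    """Sample n noisy observations of sin(2*pi*x) with x uniform on [0, 1]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, n)
    return x, t
```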

Now we create the models, train them, predict and plot, for each of ML, MAP and BLR:
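A self-contained sketch of this step, computing the three fits directly with NumPy (all parameter values here are my own choices except M = 7, which the post mentions; the plot shading shows ±1 predictive standard deviation for BLR):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen so this also runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Training data: noisy samples of sin(2*pi*x); test inputs on a dense grid
n, degree, alpha, beta = 20, 7, 1e-2, 25.0
x_train = rng.uniform(0.0, 1.0, n)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, n)
x_test = np.linspace(0.0, 1.0, 200)

def phi(x):
    # Polynomial design matrix: column j holds x**j
    return np.vander(np.asarray(x, dtype=float), N=degree + 1, increasing=True)

Phi = phi(x_train)
Phi_test = phi(x_test)

# Maximum likelihood: w_ML = pinv(Phi) @ t
w_ml = np.linalg.pinv(Phi) @ t_train

# MAP: w_MAP = (alpha/beta * I + Phi^T Phi)^-1 Phi^T t
A = (alpha / beta) * np.eye(degree + 1) + Phi.T @ Phi
w_map = np.linalg.solve(A, Phi.T @ t_train)

# Bayesian: posterior mean/covariance, then predictive mean and std
S = np.linalg.inv(alpha * np.eye(degree + 1) + beta * Phi.T @ Phi)
m = beta * S @ Phi.T @ t_train
mu = Phi_test @ m
std = np.sqrt(1.0 / beta + np.sum(Phi_test @ S * Phi_test, axis=1))

plt.scatter(x_train, t_train, label="data")
plt.plot(x_test, Phi_test @ w_ml, label="ML")
plt.plot(x_test, Phi_test @ w_map, label="MAP")
plt.plot(x_test, mu, label="BLR mean")
plt.fill_between(x_test, mu - std, mu + std, alpha=0.3, label="BLR ±1 std")
plt.legend()
plt.savefig("regression.png")
```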

And finally our graph:

Image by Author

Conclusion

In this post I have tried to cover the core mathematics behind linear regression and to implement it in Python. However, as you can see, the model is not the best fit for the observed data. The primary reason is that the model is not complex enough to fit the data (i.e. it is underfitting). Model complexity here is the number of parameters, M, of the model, for which in this example I used an arbitrary value of 7. The Bayesian framework also offers an approach that can approximate the optimal value for model complexity, which I may cover in another post.

References

[1]: Bishop, C. M. (2006). Pattern recognition and Machine Learning. springer.

[2]: Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for machine learning. Cambridge University Press.

