
Derivation of Least Squares Regressor and Classifier

A basic but powerful classifier and regressor, their derivations and why they work

Basic Machine Learning Derivations


In this article, I derive the pseudo-inverse solutions for the least-squares regression and classification algorithms.

Although not very complex, least squares remains a very powerful tool for many problems and is still used today at the core of other machine learning models, such as ensemble methods and neural networks (where the perceptron follows a very similar algorithm). If you are just getting started in the world of machine learning, this is by far one of the most important topics to get your head around!

Linear Regression

Let’s first derive the least-squares solution for a regression problem.

We are trying to estimate a target vector y from a data matrix X. To do so, we optimize the weights vector w that minimizes the sum of squared errors, as shown below:

where E is the sum of squared errors, y is the target vector, X is the data matrix, and w is the weights vector
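
In symbols, with N data points, one common way of writing this objective is:

$$E(\mathbf{w}) = \sum_{n=1}^{N}\left(y_n - \mathbf{x}_n^\top\mathbf{w}\right)^2 = (\mathbf{y} - X\mathbf{w})^\top(\mathbf{y} - X\mathbf{w})$$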

The least-squares problem has an analytical solution. Differentiating the error with respect to w and setting the derivative to zero yields the pseudo-inverse solution:
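
Step by step (assuming XᵀX is invertible, i.e. X has full column rank), setting the gradient to zero gives:

$$\frac{\partial E}{\partial \mathbf{w}} = -2X^\top(\mathbf{y} - X\mathbf{w}) = 0 \;\Longrightarrow\; X^\top X\,\mathbf{w} = X^\top\mathbf{y} \;\Longrightarrow\; \mathbf{w} = (X^\top X)^{-1}X^\top\mathbf{y}$$

The matrix (XᵀX)⁻¹Xᵀ is the Moore–Penrose pseudo-inverse of X, which is where the solution gets its name.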

Least Squares Classifier

The least-squares solution can also be used to solve classification problems by attempting to find the optimal decision boundary.

To classify points in a two-dimensional problem, one could adapt the least-squares algorithm in the following way:

First, note that the target is no longer an N×1 vector but rather an N×c matrix T, where c is the number of categories we are attempting to classify. Moreover, the weights vector needs an extra dimension to represent the intercept of the decision boundary:

where w0 is the intercept of the decision boundary and w determines its gradient.
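
One way to write this (the notation here is mine): each class k gets an augmented weight vector, and stacking these as columns gives a (d + 1) × c weight matrix:

$$\tilde{\mathbf{w}}_k = \begin{bmatrix} w_{k0} \\ \mathbf{w}_k \end{bmatrix}, \qquad \tilde{W} = \begin{bmatrix} \tilde{\mathbf{w}}_1 & \cdots & \tilde{\mathbf{w}}_c \end{bmatrix}$$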

The data matrix X must also be adapted for the dimensionalities to be compatible. Concatenating a column of ones to the data matrix does the trick.
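
Concretely, for N points of dimension d (again in my notation), the augmented data matrix is:

$$\tilde{X} = \begin{bmatrix} \mathbf{1} & X \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1d} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \cdots & x_{Nd} \end{bmatrix}$$

so that the product of the augmented data matrix and the weight matrix is an N × c matrix of class scores.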

The new data matrix and the target matrix T can be used to derive the analytical solution in exactly the same way as before. The result is the least-squares classifier and its pseudo-inverse solution.
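
Following exactly the same steps as in the regression case, the solution (in the notation above) is:

$$\tilde{W} = (\tilde{X}^\top\tilde{X})^{-1}\tilde{X}^\top T$$

A new point, augmented with a leading 1, is then assigned to the class whose column of the weight matrix gives the largest score.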

Here is a little example of a bivariate Gaussian classification problem solved with the method shown above, compared against a default scikit-learn classifier.

Image by Author
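
Below is a minimal sketch of such a comparison. The synthetic Gaussian data, and the choice of LogisticRegression with default settings as the scikit-learn baseline, are my assumptions; the plotting code is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # assumed baseline; the article only says "default scikit-learn classifier"

# Two classes drawn from bivariate Gaussians (means and covariances are illustrative choices)
rng = np.random.default_rng(0)
N = 200
X0 = rng.multivariate_normal(mean=[-1.0, -1.0], cov=np.eye(2), size=N)
X1 = rng.multivariate_normal(mean=[1.0, 1.0], cov=np.eye(2), size=N)
X = np.vstack([X0, X1])
labels = np.concatenate([np.zeros(N, dtype=int), np.ones(N, dtype=int)])

# One-hot target matrix T (N x c) and data matrix augmented with a column of ones
T = np.eye(2)[labels]
X_aug = np.hstack([np.ones((len(X), 1)), X])

# Pseudo-inverse solution: W = (X^T X)^{-1} X^T T
W = np.linalg.pinv(X_aug) @ T

# Classify by taking the argmax over the c linear outputs
preds_ls = np.argmax(X_aug @ W, axis=1)

# scikit-learn baseline with default settings, for comparison
preds_sk = LogisticRegression().fit(X, labels).predict(X)

print("least-squares accuracy:", np.mean(preds_ls == labels))
print("scikit-learn accuracy :", np.mean(preds_sk == labels))
```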

The equation of the decision boundary is simply ax + by + c = 0, and the weights vector is [a, b, c]. Isolating y, we find that the gradient of the classifier is -a/b and its intercept is -c/b.
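
Rearranging explicitly (assuming b ≠ 0):

$$ax + by + c = 0 \;\Longrightarrow\; y = -\frac{a}{b}x - \frac{c}{b}$$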

It is important to note that this is the maximum a posteriori estimator for a bivariate Gaussian problem, and with infinite data it would tend toward the perfect result.

Conclusion

During my master's degree, linear regression was a topic that kept coming up over and over again. Its simplicity and its relationship to the maximum a posteriori solution are the reasons for its success in the machine learning world. I hope this article helped you understand how it can be adapted for classification problems.

Support me 👏

Hopefully this helped you; if you enjoyed it, you can follow me!

You can also become a Medium member using my referral link and get access to all my articles and more: https://diegounzuetaruedas.medium.com/membership

Other articles you might enjoy

Differentiable Generator Networks: an Introduction

Fourier Transforms: An Intuitive Visualisation

