Acknowledgements
Thanks to my CS7641 class at Georgia Tech in my MS Analytics program, where I discovered this concept and was inspired to write about it. Thanks to Brandon for letting me use his beautiful Bayesian inference video in this article.
Question: Why do you square the error in a regression machine learning task?
Ans: "Why, of course, it turns out all the errors (residuals) into positive quantities!"
Question: "OK, why not use a simpler absolute value function |x| to make all the errors positive?"
Ans: "Aha, you are trying to trick me. Absolute value function is not differentiable everywhere!"
Question: "That should not matter much for numerical algorithms. LASSO regression uses a term with absolute value and it can be handled. Also, why not 4th-power of x or log(1+_x_²)? What’s so special about squaring the error?
Ans: Hmm…
A Bayesian argument
Remember that, for all tricky questions in machine learning, you can whip up a serious-sounding answer if you mix the word "Bayesian" in your argument.
OK, I was kidding there.
But yes, we should definitely have an argument ready for where popular loss functions like least-square and cross-entropy come from – at least when we try to find the most likely hypothesis for a supervised learning problem using a Bayesian argument.
Read on…
The Basics: Bayes Theorem and the ‘Most Probable Hypothesis’
Bayes’ theorem is probably the most influential identity of probability theory for modern machine learning and Artificial Intelligence systems. For a super intuitive introduction to the topic, please see this great tutorial by Brandon Rohrer. I will just concentrate on the equation.

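In its familiar form (with the prior, likelihood, and posterior labels used in the next paragraph), the theorem reads,

$$
\underbrace{P(\text{belief} \mid \text{evidence})}_{\text{posterior}} \;=\; \frac{\overbrace{P(\text{evidence} \mid \text{belief})}^{\text{likelihood}} \times \overbrace{P(\text{belief})}^{\text{prior}}}{P(\text{evidence})}
$$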
This essentially says that you update your belief (the prior probability) after seeing the data/evidence (the likelihood) and assign the updated degree of belief to the posterior probability. You can start with a belief, but each data point will either strengthen or weaken that belief, and you keep updating your hypothesis accordingly.
Let us now recast Bayes' theorem in different symbols – symbols pertaining to data science. Let us denote the data by D and a hypothesis by h. This means we apply Bayes' formula to determine which hypothesis is most likely to have generated the observed data. We rewrite the theorem as,

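In the symbols just defined, that is,

$$
P(h \mid D) \;=\; \frac{P(D \mid h)\, P(h)}{P(D)}
$$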
Now, in general, we have a large (often infinite) hypothesis space, i.e. many hypotheses to choose from. The essence of Bayesian inference is that we want to examine the data and find the hypothesis that is most likely to have given rise to the observed data. We basically want to determine the argmax of P(h|D), **i.e. we want to know which _h_ is most probable, given the observed _D_**.
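In equation form (writing ℋ for the hypothesis space), the quantity we are after is,

$$
\underset{h \in \mathcal{H}}{\operatorname{argmax}}\; P(h \mid D) \;=\; \underset{h \in \mathcal{H}}{\operatorname{argmax}}\; \frac{P(D \mid h)\, P(h)}{P(D)}
$$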
A shortcut trick: Maximum Likelihood
The equation above looks simple, but it is notoriously tricky to compute in practice – because of the extremely large hypothesis space and the complexity of evaluating integrals over complicated probability distribution functions.
However, in our quest for the 'most probable hypothesis given the data,' we can simplify it further.
- We can drop the term in the denominator because it does not contain h, i.e. the hypothesis. We can think of it as a normalizer that makes the total probability sum up to 1.
- Uniform prior assumption – this essentially relaxes any assumption on the nature of P(h) by making it uniform, i.e. all hypotheses are equally probable. Then P(h) is the constant 1/|Vsd|, where |Vsd| is the size of the version space, i.e. the set of all hypotheses consistent with the training data. Being a constant, it does not figure in the determination of the maximally probable hypothesis.
After these two simplifying assumptions, the maximum likelihood (ML) hypothesis can be given by,

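In symbols (again with ℋ denoting the hypothesis space),

$$
h_{ML} \;=\; \underset{h \in \mathcal{H}}{\operatorname{argmax}}\; P(D \mid h)
$$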
This simply means the most likely hypothesis is the one for which the conditional probability of the observed data (given the hypothesis) reaches its maximum.
Next piece in the puzzle: Noise in the data
We generally start using the least-square error while learning about simple linear regression back in Stats 101, but this simple-looking loss function sits firmly inside pretty much every supervised machine learning algorithm, viz. linear models, splines, decision trees, or deep learning networks.
So, what’s special about it? Is it related to the Bayesian inference in any way?
It turns out that the key connection between the least-square error and Bayesian inference is through the assumed nature of the error, or residuals.
Measured/observed data is never error-free; there is always random noise associated with the data, riding on top of the signal of interest. The task of a machine learning algorithm is to estimate/approximate the function that could have generated the data by separating the signal from the noise.
But what can we say about the nature of this noise? It turns out that noise can be modeled as a random variable. Therefore, we can associate a probability distribution of our choice with this random variable.
One of the key assumptions of least-square optimization is that the probability distribution over the residuals is our trusted old friend – the Gaussian (Normal) distribution.
This means that every data point (d) in a supervised learning training data set can be written as the sum of the unknown function f(x) (which the learning algorithm is trying to approximate) and an error term drawn from a Normal distribution with zero mean (μ) and unknown variance σ². This is what I mean,

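That is, for the i-th training point (with the index i introduced here just to label the points),

$$
d_i \;=\; f(x_i) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)
$$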
And from this assertion, we can easily derive that the maximum likelihood hypothesis is the one which minimizes the least-square error.
Math warning: There is no way around a bit of math to formally derive the least-square optimization criterion from the ML hypothesis. Feel free to skip this section; I will summarize the key conclusion in the next section.
Derivation of least-square from Maximum Likelihood hypothesis


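A sketch of the standard derivation, assuming the N training points are independent so that the joint likelihood factorizes (the index i and the count N are notation introduced for this sketch):

$$
\begin{aligned}
h_{ML} &= \underset{h \in \mathcal{H}}{\operatorname{argmax}}\; P(D \mid h)
       \;=\; \underset{h \in \mathcal{H}}{\operatorname{argmax}}\; \prod_{i=1}^{N} p(d_i \mid h) \\
      &= \underset{h \in \mathcal{H}}{\operatorname{argmax}}\; \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\frac{\big(d_i - h(x_i)\big)^2}{2\sigma^2}\right) \\
      &= \underset{h \in \mathcal{H}}{\operatorname{argmax}}\; \sum_{i=1}^{N} \left(-\frac{\big(d_i - h(x_i)\big)^2}{2\sigma^2}\right) \qquad \text{(taking the log, a monotonic function, and dropping constants)} \\
      &= \underset{h \in \mathcal{H}}{\operatorname{argmin}}\; \sum_{i=1}^{N} \big(d_i - h(x_i)\big)^2
\end{aligned}
$$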
Voila! The last term is nothing but simple least-square minimization.
So, what did all this math show?
It showed that, starting from the assumption that the error in a supervised training dataset is Normally (Gaussian) distributed, the maximum likelihood hypothesis for that training data is the one which minimizes the least-square error loss function.
There is no assumption about the type of the learning algorithm. This applies equally to everything from simple linear regression to deep neural networks.
Such is the power and unifying nature of Bayesian inference.
Here is a typical scenario for a linear regression fit. The Bayesian inference argument applies to this model and lends credibility to the choice of the squared error as the loss function.

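As a minimal sketch of such a scenario (the slope, intercept, and noise level below are arbitrary choices for illustration, not values from any figure in the article), we can simulate Normally-noisy data and fit it by least squares in Python:

```python
import numpy as np

# Synthetic data: a true linear function plus Gaussian (Normal) noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
true_slope, true_intercept, noise_sigma = 2.0, 1.0, 2.5   # arbitrary illustrative values
y = true_slope * x + true_intercept + rng.normal(0.0, noise_sigma, size=x.shape)

# Least-square fit: minimize ||X @ coef - y||^2 over coef = [slope, intercept]
X = np.column_stack([x, np.ones_like(x)])        # design matrix with a bias column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"fitted slope = {coef[0]:.2f}, fitted intercept = {coef[1]:.2f}")
```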
Is the assumption of Normality sound enough?
You can question the validity of the assumption about the Normally distributed error terms. But in most cases, it works. This follows from the Central Limit Theorem (CLT), in the sense that error or noise is never generated by a single underlying process but arises from the combined impact of multiple sub-processes. And when a large number of random sub-processes combine, their averaged value follows a Normal distribution (from the CLT). Therefore, it is not a stretch to assume such a distribution for most of the machine learning tasks we undertake.
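As a quick illustration of this point (a toy simulation with arbitrary choices, not part of the original argument), averaging many independent, decidedly non-Gaussian sub-processes already produces a bell-shaped noise distribution:

```python
import numpy as np

# Model the "noise" as the average of many independent sub-processes,
# each drawn from a (non-Gaussian) uniform distribution.
rng = np.random.default_rng(0)
n_subprocesses, n_samples = 100, 10_000
noise = rng.uniform(-1.0, 1.0, size=(n_samples, n_subprocesses)).mean(axis=1)

# By the CLT the averaged values cluster in a bell shape around zero;
# a histogram of `noise` (e.g. via matplotlib) shows the familiar curve.
print(f"mean ≈ {noise.mean():.4f}, std ≈ {noise.std():.4f}")
```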
Another (stronger) argument through Gauss-Markov theorem
It turns out that even when no information about the error distribution is available, the least-square method still yields the best linear unbiased estimator among all linear estimators (by the Gauss-Markov theorem). The only requirements are that the errors have an expectation value of zero, are uncorrelated, and have equal variances.
Does a similar argument hold for classification problems?
The least-square loss function is used in regression tasks. But what about classification problems, where we deal with classes and probabilities and not with arbitrary real numbers?
Amazingly, it turns out that a similar derivation can be done using the ML hypothesis and a simple encoding of the classes to arrive at,

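For a binary target encoded as d ∈ {0, 1}, with h(x) interpreted as the predicted probability of the positive class (an encoding assumed here for concreteness),

$$
h_{ML} \;=\; \underset{h \in \mathcal{H}}{\operatorname{argmin}}\; \left[ -\sum_{i=1}^{N} \Big( d_i \ln h(x_i) + (1 - d_i)\, \ln\big(1 - h(x_i)\big) \Big) \right]
$$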
… which is nothing but the cross-entropy loss function.
So, the same Bayesian inference yields cross-entropy as the preferred loss function for obtaining the maximum likelihood hypothesis in classification problems.
Summary and Conclusions
We can summarize and extend our discussions and arguments in the article through the following points,
- Maximum likelihood estimation (MLE) is a powerful technique for arriving at the most probable hypothesis for a given set of data if we can make a uniform prior assumption, i.e. that at the start, all hypotheses are equally likely.
- If we can assume that each data point in a machine learning task is the sum of the true function value and a random noise term which is Normally distributed, then we can derive that the maximally probable hypothesis is the one which minimizes the squared loss function.
- This conclusion holds true independent of the nature of the machine learning algorithm.
- However, another implicit assumption is the mutual independence of the data points, which enables us to write the joint probability as a simple product of individual probabilities. This also underscores the importance of removing collinearity among training samples before a machine learning model is built.
At the end of the day, you can say that the least-square minimization is really special as it is, in fact, intimately related to the most celebrated distribution function in this universe 🙂
If you have any questions or ideas to share, please contact the author at tirthajyoti[AT]gmail.com. Also, you can check the author's GitHub repositories for other fun code snippets in Python, R, or MATLAB and machine learning resources. If you are, like me, passionate about machine learning/data science, please feel free to add me on LinkedIn or follow me on Twitter.