Foundation of machine learning
A beautiful idea, which binds together concepts from statistics, information theory, and philosophy to lay down the foundation of machine learning.

Acknowledgments
Thanks to my CS7641 class at Georgia Tech in my MS Analytics program, where I discovered this concept and was inspired to write about it. Thanks to Matthew Mayo for editing and re-publishing this in KDnuggets.
Introduction
It is somewhat surprising that among all the high-flying buzzwords of machine learning, we don’t hear much about the one phrase which fuses some of the core concepts of statistical learning, information theory, and natural philosophy into a single three-word combo.
Moreover, it is not just an obscure and pedantic phrase meant for machine learning (ML) Ph.D.s and theoreticians. It has a precise and easily accessible meaning for anyone interested in exploring it, and a practical pay-off for practitioners of ML and data science.
I am talking about Minimum Description Length. And you may be thinking: what the heck is that…
Let’s peel the layers off and see how useful it is…
Bayes and his Theorem
Who was Thomas Bayes?
We start (not chronologically) with Reverend Thomas Bayes, who, by the way, never published his idea about how to do statistical inference but was later immortalized by the eponymous theorem.

It was the second half of the 18th century, and there was no branch of mathematical sciences called "Probability Theory". It was known simply by the rather odd-sounding "_Doctrine of Chances_" – named after a book by Abraham de Moivre. An article called "An Essay towards solving a Problem in the Doctrine of Chances", first formulated by Bayes but edited and amended by his friend Richard Price, was read to the Royal Society and published in the Philosophical Transactions of the Royal Society of London in 1763. In this essay, Bayes described – in a rather frequentist manner – the simple theorem concerning joint probability which gives rise to the calculation of inverse probability, i.e. Bayes’ theorem.
A tale of two camps
Many battles have been fought since then between the two warring factions of statistical science – Bayesians and Frequentists. But for the purpose of the present article, let us ignore the history for a moment and focus on the simple explanation of the mechanics of the Bayesian inference. For a super intuitive introduction to the topic, please see this great tutorial by Brandon Rohrer. I will just concentrate on the equation.

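Bayes’ theorem, stated in terms of the prior, the likelihood, and the posterior, reads:

$$
P(\text{belief} \mid \text{data}) = \frac{P(\text{data} \mid \text{belief})\; P(\text{belief})}{P(\text{data})}
$$

that is, posterior = (likelihood × prior) / evidence.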
This essentially says that you update your belief (the prior probability) after seeing the data/evidence (the likelihood) and assign the updated degree of belief to the posterior probability. You can start with a belief, but each data point will either strengthen or weaken that belief, and you update your hypothesis all the time.
Let’s talk about ‘Hypothesis’
Sounds simple and intuitive? Great.
I did a trick in the last sentence of the paragraph though. Did you notice? I slipped in the word "Hypothesis". That is not normal English. That is formal stuff 🙂
In the world of statistical inference, a hypothesis is a belief. It is a belief about the true nature of the process (which we can never observe) that lies behind the generation of a random variable (which we can observe or measure, albeit not without noise). In statistics, a hypothesis is generally defined as a probability distribution. But in the context of machine learning, it can be thought of as any set of rules (or logic, or process) which, we believe, can give rise to the examples or training data we are given for learning the hidden nature of this mysterious process.
So, let us try to recast Bayes’ theorem in different symbols – symbols pertaining to data science. Let us denote the data by D and the hypothesis by h. This means we apply Bayes’ formula to try to determine which hypothesis the given data most likely came from. We rewrite the theorem as,

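$$
P(h \mid D) = \frac{P(D \mid h)\; P(h)}{P(D)}
$$

Here P(h) is the prior probability of the hypothesis, P(D|h) is the likelihood of the data under that hypothesis, and P(h|D) is the posterior probability we are after.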
Now, in general, we have a large (often infinite) hypothesis space, i.e. many hypotheses to choose from. The essence of Bayesian inference is that we want to examine the data to find the one hypothesis which is most likely to have given rise to the observed data. We basically want to determine the argmax of P(h|D), i.e. the hypothesis _h_ that is most probable given the observed _D_. To that end, we can safely drop the denominator P(D) because it does not depend on the hypothesis. This scheme is known by the rather tongue-twisting name of maximum a posteriori (MAP).
Now, we apply the following mathematical tricks,
- Maximization works the same way for the logarithm as for the original function, i.e. taking the logarithm does not change the maximization problem.
- The logarithm of a product is the sum of the individual logarithms.
- Maximizing a quantity is equivalent to minimizing its negative.

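Putting these together, the MAP hypothesis becomes:

$$
\begin{aligned}
h_{MAP} &= \underset{h}{\arg\max}\; P(D \mid h)\, P(h) \\
        &= \underset{h}{\arg\max}\; \big[\log_2 P(D \mid h) + \log_2 P(h)\big] \\
        &= \underset{h}{\arg\min}\; \big[-\log_2 P(D \mid h) - \log_2 P(h)\big]
\end{aligned}
$$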
Curiouser and _curiouser_… those terms with the negative base-2 logarithms look familiar… from information theory!
Enter Claude Shannon. The genius who invented the Information Age.
Shannon and the information theory
It will take many volumes to describe the genius and strange life of Claude Shannon, who almost single-handedly laid the foundation of information theory and ushered us into the age of modern high-speed communication and information exchange.
Here is a great book about his life and work,
A Mind at Play: How Claude Shannon Invented the Information Age
Shannon’s M.I.T. master’s thesis in electrical engineering has been called the most important MS thesis of the 20th century: in it the 22-year-old Shannon showed how the logical algebra of 19th-century mathematician George Boole could be implemented using electronic circuits of relays and switches. This most fundamental feature of digital computers’ design – the representation of "true" and "false" and "0" and "1" as open or closed switches, and the use of electronic logic gates to make decisions and to carry out arithmetic – can be traced back to the insights in Shannon’s thesis.
But even this was not his greatest achievement.
In 1941, Shannon went to Bell Labs, where he worked on war matters, including cryptography. He was also working on an original theory of information and communication. In 1948, this work emerged in a widely celebrated paper published in Bell Labs’ research journal.
Entropy and the minimum length
Shannon defined the quantity of information produced by a source – for example, the quantity in a message – by a formula similar to the equation that defines thermodynamic entropy in physics. In its most basic terms, Shannon’s informational entropy is the number of binary digits required to encode a message. And for a message or event with probability p, the most efficient (i.e. compact) encoding of that message will require –log2(p) bits.
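As a quick numerical illustration of that formula (a minimal sketch of my own, not part of the original article), here is the optimal code length for a few probabilities:

```python
import math

def optimal_code_length_bits(p):
    """Bits required by the most compact encoding of an event with probability p."""
    return -math.log2(p)

print(optimal_code_length_bits(0.5))    # a fair coin flip: 1.0 bit
print(optimal_code_length_bits(0.125))  # a 1-in-8 event: 3.0 bits
print(optimal_code_length_bits(0.001))  # a rare event: ~10 bits
```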
And that is precisely the nature of those terms appearing in the maximum a posteriori expression derived from the Bayes’ theorem!
Therefore, we can say that in the world of Bayesian inference, the most probable hypothesis depends on two terms which evoke a sense of length – rather, of minimum length.

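That is, identifying each negative log-probability with a code length,

$$
h_{MAP} = \underset{h}{\arg\min}\; \big[\operatorname{Length}(h) + \operatorname{Length}(D \mid h)\big],
\qquad \operatorname{Length}(h) = -\log_2 P(h), \quad \operatorname{Length}(D \mid h) = -\log_2 P(D \mid h).
$$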
But what could be the notion of the length in those terms?
Length(h): Occam’s Razor
William of Ockham (circa 1287–1347) was an English Franciscan friar and theologian, and an influential medieval philosopher. His popular fame as a great logician rests chiefly on the maxim attributed to him and known as Occam’s razor. The term razor refers to distinguishing between two hypotheses either by "shaving away" unnecessary assumptions or cutting apart two similar conclusions.
The precise words attributed to him are: entia non sunt multiplicanda praeter necessitatem (entities must not be multiplied beyond necessity). In statistical parlance, that means we must strive to work with the simplest hypothesis which can explain all the data satisfactorily.
Similar principles have been echoed by other luminaries.
Sir Isaac Newton: "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances."
Bertrand Russell: "Whenever possible, substitute constructions out of known entities for inferences to unknown entities."
Always prefer the shorter hypothesis.
Need an example of what the length of a hypothesis is?
Which of the following decision trees has a smaller length, A or B?

Even without a precise definition of the ‘length’ of a hypothesis, I am sure you would think that the tree on the left (A) looks smaller or shorter. And you would be right, of course. Therefore, a shorter hypothesis is one that has fewer free parameters, or a less complex decision boundary (for a classification problem), or some combination of these properties which represents its brevity.
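To make this concrete, here is a minimal sketch (my own illustration, assuming scikit-learn and a synthetic dataset): the node count of a fitted decision tree is one crude proxy for Length(h).

```python
# Crude proxy for Length(h): the number of nodes in a fitted decision tree.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

tree_a = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)     # a short hypothesis
tree_b = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)  # a long hypothesis

print("Tree A (shallow):", tree_a.tree_.node_count, "nodes")
print("Tree B (deep):   ", tree_b.tree_.node_count, "nodes")
```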
In fact, the mathematical formalism of Occam’s Razor culminates in something called "Solomonoff’s theory of inductive inference", which is considered a foundational stepping stone towards Artificial General Intelligence (AGI).
The famous MIT physicist and AI researcher Max Tegmark recently used the Occam’s Razor principle to build what is touted as the "AI Physicist", an unsupervised learning agent, centered around the learning and manipulation of theories, and capable of parsimoniously predicting both aspects of the future (from past observations) and the domain in which these predictions are accurate.
What about the ‘Length(D|h)’?
It is the length of the data given the hypothesis. What does that mean?
Intuitively, it is related to the correctness or representational power of the hypothesis. It governs, among other things, how well the data can be ‘inferred’ given a hypothesis. If the hypothesis generates the data really well, and we can measure the data error-free, then we don’t need the data at all.
Think of Newton’s laws of motion.
When they first appeared in the _Principia_, they did not have any rigorous mathematical proof behind them. They were not theorems. They were much like hypotheses, based on observations of the motion of natural bodies. But they described the data really, really well. And, consequently, they became physical laws.
And that’s why you do not need to maintain and memorize a table of all possible acceleration numbers as a function of the force applied to a body. You just trust the compact hypothesis, a.k.a. the law F = ma, and believe that all the numbers you need can be calculated from it when necessary. It makes Length(D|h) really small.
But if the data deviates from the compact hypothesis a lot, then you need to have a long description of what these deviations are, a possible explanation for them, etc.
Therefore, Length(D|h) is succinctly capturing the notion of "how well the data fits the given hypothesis".
In essence, it is the notion of misclassification or error rate. For a perfect hypothesis, it is short, perhaps zero in the limiting case. A hypothesis, which does not fit the data perfectly, tends to be long.
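To make the F = ma example a bit more tangible, here is a hypothetical sketch (the mass, forces, and noise level are made up, and NumPy is assumed): a hypothesis that fits the data well leaves only tiny residuals to describe, while a poor one leaves large ones.

```python
# Length(D|h) grows with the residuals that the hypothesis fails to explain.
import numpy as np

rng = np.random.default_rng(42)
m_true = 2.0                                        # "true" mass (kg), for illustration only
F = rng.uniform(1.0, 10.0, size=100)                # applied forces (N)
a = F / m_true + rng.normal(scale=0.01, size=100)   # noisy measured accelerations

for m_hyp in (2.0, 3.5):                            # a good vs. a poor hypothesis for the mass
    residuals = a - F / m_hyp                       # deviations left unexplained by "a = F/m"
    print(f"m = {m_hyp}: mean |residual| = {np.abs(residuals).mean():.3f}")
```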
And, there lies the trade-off.
If you shave off your hypothesis with a big Occam’s razor, you will likely be left with a simple model, one which cannot fit all the data. Consequently, you have to supply more data to have better confidence. On the other hand, if you create a complex (and long) hypothesis, you may be able to fit your training data really well, but this actually may not be the right hypothesis, as it runs against the MAP principle of having a hypothesis with small entropy.
Sounds like a bias-variance trade-off? Yes, also that 🙂
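Here is a minimal sketch of that trade-off (my own illustration, assuming scikit-learn and a noisy sine curve as the data-generating process): as the polynomial degree grows, the training fit keeps improving while the held-out score typically plateaus or degrades.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))                   # a small, noisy dataset
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):                              # short, moderate, and long hypotheses
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree {degree:2d}: train R^2 = {model.score(X_tr, y_tr):.3f}, "
          f"test R^2 = {model.score(X_te, y_te):.3f}")
```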

Putting it all together
Therefore, Bayesian inference tells us that the best hypothesis is the one that minimizes the sum of the two terms: the length of the hypothesis and the error rate.
This one profound sentence pretty much captures all of (supervised) machine learning.
Think of its ramifications,
- Model complexity of a linear model – what degree of polynomial to choose, how to reduce the sum-of-squares residuals.
- Choice of the architecture of a neural network – how not to overfit the training data while achieving good validation accuracy and a low classification error.
- Support vector machine regularization and kernel choice – the balance between a soft vs. a hard margin, i.e. trading off accuracy against decision-boundary nonlinearity (a schematic objective follows this list).
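One way to read all three bullets (a heuristic mapping, not an exact identity; the symbols Ω and λ below are my own notation): the regularized objectives we routinely minimize have the same two-term shape as the MDL expression,

$$
\hat{h} = \underset{h}{\arg\min}\; \Big[ \underbrace{\operatorname{Error}(D, h)}_{\approx\, \operatorname{Length}(D \mid h)} \;+\; \lambda\, \underbrace{\Omega(h)}_{\approx\, \operatorname{Length}(h)} \Big]
$$

where Ω(h) is a complexity penalty (polynomial degree, network size, inverse margin) and λ sets the exchange rate between fit and brevity.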
What shall we really conclude?
What shall we conclude from this analysis of the Minimum Description Length (MDL) principle?
Does this prove once and for all that short hypotheses are best?
No.
What MDL shows is that if a representation of hypotheses is chosen so that the size of hypothesis h is –log2 P(h), and if a representation for exceptions (errors) is chosen so that the encoding length of D given h is equal to –log2 P(D|h), then the MDL principle produces MAP hypotheses.
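In symbols (the subscripts C1 and C2 below are my shorthand for the two chosen encodings):

$$
h_{MDL} = \underset{h}{\arg\min}\; \big[ L_{C_1}(h) + L_{C_2}(D \mid h) \big],
\qquad h_{MDL} = h_{MAP} \;\text{ when }\; L_{C_1}(h) = -\log_2 P(h) \;\text{ and }\; L_{C_2}(D \mid h) = -\log_2 P(D \mid h).
$$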
However, to show that we have such a representation we must know all the prior probabilities P(h), as well as the P(D|h). There is no reason to believe that the MDL hypothesis relative to arbitrary encodings of hypothesis and error/misclassification should be preferred.
For practical machine learning, it might sometimes be easier for a human designer to specify a representation that captures knowledge about the relative probabilities of hypotheses than it is to fully specify the probability of each hypothesis.
This is where the matter of knowledge representation and domain expertise becomes critically important. It short-circuits the (often) infinitely large hypothesis space and leads us towards a highly probable set of hypotheses that we can optimally encode and work towards finding the set of MAP hypotheses among them.
…best hypothesis is the one that minimizes the sum of the two terms: the length of the hypothesis and the error rate.
Summary and after-thought
It is a wonderful fact that such a simple set of mathematical manipulations over a basic identity of probability theory can result in such a profound and succinct description of the fundamental limitation and goal of supervised machine learning. For a concise treatment of these issues, readers can refer to this Ph.D. thesis, called "Why Machine Learning Works", from Carnegie Mellon University. It is also worthwhile to ponder over how all of these connect to No-Free-Lunch theorems.
If you are interested in deeper reading in this area:
- "No-Free-Lunch and the Minimum Description Length"
- "No Free Lunch versus Occam’s Razor in Supervised Learning"
- "The No Free Lunch and Problem Description Length"
If you have any questions or ideas to share, please contact the author at tirthajyoti[AT]gmail.com. Also, you can check the author’s GitHub repositories for other fun code snippets in Python, R, or MATLAB and machine learning resources. If you are, like me, passionate about machine learning/data science, please feel free to add me on LinkedIn or follow me on Twitter.