A very Bayesian interpretation of decision trees and other machine learning algorithms

Introduction

Shashank Kumar
Towards Data Science


I remember enrolling in a course where my professor spent two lectures chewing over the math behind decision trees before admitting, “Class, decision tree algorithms do not use any of this.” Those lectures, mind you, were not about the Gini index or entropy gain; he dispensed with those in minutes. The two lectures were 180 minutes of wrestling with Bayes’ theorem and the Beta distribution. So why were we made to drink up all that math? Well, the usual ways of growing decision trees are an approximation of that Bayesian model. But that’s not all. The model also contains the ideas from which ensemble methods sprout. With that, let us plunge into some math and uncover the reach of Bayes’ theorem. (Note: I assume you are familiar with probability concepts such as random variables, Bayes’ theorem, and conditional probability.)


Building the challenge

I assume you are familiar with decision trees and how they are grown using the Gini index or entropy gain. So we’ll gloss over that and instead use Bayes’ theorem. Consider a boolean classification problem you are required to solve using decision trees. Because I’m a good egg, I do the drudgery on your behalf and build a garden of all possible decision trees from the training data. Why all of them? Well, your challenge is to classify a new data instance x while including all the trees in the decision process. How will you proceed?

As discussed, you have to approach the problem with a Bayesian mindset, which seeks the probability of x belonging to a particular class Y (y1 or y2). Using that probability, you can decide the appropriate class. Note that from now on we’ll treat X and Y as random variables (RVs). But are they the only RVs you need? No, the estimate P(Y|X=x) depends on two other things.

Let’s think through the conundrum of involving all possible decision trees. Not all trees are equally fit to solve the problem. Why so? For any problem, one generally uses the Gini index or entropy gain to unearth the tree that best segregates the training data. This suggests that any particular dataset d has a tree that befits it best. Thereby, if you consider the tree and the dataset as RVs, then, for a specific tree T=t and training dataset D=d, you can find a probabilistic estimate, P(T=t|D=d), of how well t works on d. The ideal tree will have the maximum value of P(T=t|D=d). Furthermore, each tree will also classify data instances differently. Essentially, the probability of a new data instance belonging to any class, P(Y|X=x, T=t, D=d), will differ across trees. Now, do you recognize that to conquer my bizarre challenge, you’ll need both of these probabilities for each tree? Observe the equation below. What do you make of it?
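P(Y=y1 | X=x, D=d) = Σ_t P(Y=y1 | X=x, T=t, D=d) · P(T=t | D=d)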

The L.H.S. is the final probabilistic estimate of x belonging to y1. It depends on the training dataset because, for a different dataset, the trees will change. The R.H.S. says that to include all trees in the decision process, we should take the probability of x belonging to y1 under a tree t, multiply it by the probability of that tree being the ideal candidate, and sum all the products. In other words, the final decision you take should be a weighted sum of the classification probabilities from all trees. Thus, if a tree separates the training data well, i.e. P(T=t|D=d) is high, it will have more say in the final decision.
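To make the weighted sum concrete, here is a minimal sketch in Python. The per-tree probabilities and posteriors are made-up numbers, purely for illustration:

```python
import numpy as np

# Hypothetical values for three candidate trees (illustration only).
# p_y1_given_tree[i] = P(Y=y1 | X=x, T=t_i, D=d): tree t_i's estimate
# that the new instance x belongs to class y1.
p_y1_given_tree = np.array([0.9, 0.6, 0.2])

# tree_posterior[i] = P(T=t_i | D=d): how well tree t_i fits the
# training data d; these weights sum to 1.
tree_posterior = np.array([0.7, 0.2, 0.1])

# The equation above: a posterior-weighted sum over all trees.
p_y1 = np.sum(p_y1_given_tree * tree_posterior)

print(p_y1)                               # 0.77
print("y1" if p_y1 >= 0.5 else "y2")      # pick the more probable class
```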

Possible extensions of the equation

Advanced ensemble methods, which are amongst the most reliable predictive models, also function according to the above equation. They use a weighted sum of predictions from numerous small trees to classify data instances. Do note that, unlike my challenge, ensemble methods don’t gather predictions from all possible trees; that would fritter away computational power. With procedures like the Gini index or entropy gain, they implicitly approximate P(T|D) and disregard the shoddy trees. Hence, Gini and entropy are merely computationally efficient approximations of what is otherwise a Bayesian solution.
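As a rough illustration of this idea (not how any particular library implements it), the sketch below fits a handful of shallow scikit-learn trees on bootstrap samples and weights their predictions by held-out accuracy, a crude stand-in for P(T|D). The toy dataset and the weighting scheme are assumptions made only for the example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A toy dataset standing in for the training data d.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees, weights = [], []
for _ in range(10):
    # Fit a small tree on a bootstrap sample of the training data.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)
    # Held-out accuracy acts as a crude proxy for P(T = t | D = d).
    weights.append(tree.score(X_val, y_val))

weights = np.array(weights) / np.sum(weights)  # normalise to sum to 1

# Weighted sum of each tree's class probabilities, as in the equation above.
proba = sum(w * t.predict_proba(X_val) for w, t in zip(weights, trees))
pred = proba.argmax(axis=1)
print("weighted-ensemble accuracy:", (pred == y_val).mean())
```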

But we don’t have to confine the equation to trees. Instead of several trees, you may use different classification or regression models (ML algorithms) and compute the weighted average of their probabilistic predictions to make a final decision. Here, you only have to replace the random variable T with another random variable M that ranges over an array of different models (algorithms).
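Here is a sketch of that extension with heterogeneous models; as before, the dataset and the accuracy-based weights are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# A toy dataset, purely for illustration.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# The random variable M now ranges over different kinds of models.
models = [
    DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train),
    LogisticRegression(max_iter=1000).fit(X_train, y_train),
    KNeighborsClassifier().fit(X_train, y_train),
]

# Held-out accuracy once more plays the role of P(M = m | D = d).
weights = np.array([m.score(X_val, y_val) for m in models])
weights /= weights.sum()

# Weighted average of each model's probabilistic predictions.
proba = sum(w * m.predict_proba(X_val) for w, m in zip(weights, models))
print("combined accuracy:", (proba.argmax(axis=1) == y_val).mean())
```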

Summarizing

This article was an exceedingly brief outline of how Bayes’ theorem forms the crux of ensemble methods. The aim was to understand decision trees from a Bayesian perspective and to highlight how Bayesian statistics is always quietly at work in the background of ML algorithms. I have purposely not discussed how each term in the equation is calculated; that would run long and involve other mathematical concepts, such as the Beta distribution. It is, however, essential for fully appreciating decision trees, and I hope to cover it sometime later.
