Now that we’ve fully explored Bayes’ Theorem, let’s check out a classification algorithm that utilizes it – the naive Bayes classifier.
Classification, the process of quantitatively figuring out which class (a.k.a. group) a given observation should be assigned to, is an important task in Data Science. Some example applications of classification include figuring out whether a patient is particularly at risk of a disease, identifying customers who are likely to churn, or flagging emails as spam.
There are many classification algorithms out there, each with their own pros and cons. For example, if you just want predictive power, you might consider XGBoost or a neural network. If you are more interested in understanding which factors drive differentiation among your classes, and how, you might consider logistic regression.
Naive Bayes is an algorithm that has been around for a while (since the 1960s, according to Wikipedia). While it might lack some of the hype and firepower of more recently developed algorithms, it’s a robust and versatile tool that has withstood the test of time (its shortcomings are also very well understood, which is often just as useful, or even more so, when building models) – so it’s definitely worth learning more about.
What Does It Do?
Let’s step back first and frame our classification problem in Bayesian terms – where we have a set of prior beliefs and update our beliefs as we observe and collect evidence.
In statistics, everything revolves around hypotheses. We make a hypothesis (an informed guess) about how the world works, and then we go about collecting evidence to test that hypothesis (if you would like to know the details, I wrote a post about hypothesis testing here).
Classification models can be framed as a hypothesis as well. Let’s first write out the objective and variables of our classification problem:
- The objective is to predict what class a given observation belongs to given its features. For example, given a description we might be trying to predict whether an animal in question is a cat, dog, or hamster.
- Y = the class label that we are trying to predict. In our case, the class labels would be cat, dog, and hamster.
- X1, X2, X3, etc. = the features of our observation with which we attempt to make a prediction. Some features we might use to differentiate between cats, dogs, and hamsters are size and agility (a meow, bark, silence category would make this too easy).
OK, so that’s classification – now let’s examine classification through a Bayesian lens. Most classification algorithms make predictions by estimating (for each class) the probability that the observation belongs to that class. Then the class with the highest estimated probability is our prediction:
P(Y = Cat) = 0.20
P(Y = Dog) = 0.60
P(Y = Hamster) = 0.20
Predict that the observed animal is a Dog!
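To make the prediction step concrete, here is a minimal Python sketch of picking the class with the highest estimated probability (the numbers are just the illustrative values above, not the output of a real model):

```python
# Estimated class probabilities for a single observation
# (illustrative numbers only -- a real model would compute these from the features)
class_probs = {"Cat": 0.20, "Dog": 0.60, "Hamster": 0.20}

# Predict the class with the highest estimated probability
prediction = max(class_probs, key=class_probs.get)
print(prediction)  # Dog
```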
But how do we calculate these class probabilities? Conditional probabilities to the rescue! The conditional probability P(H|E) (where H is our hypothesis and E is our evidence) is the probability that our hypothesis is true given the evidence. For the cat example, our hypothesis H is that the observation is a cat. And our evidence E is that it is medium sized and not agile. We can write the conditional probability that it is a cat given our evidence as (the equation is read "the probability that the observation is a cat given that it is of medium size and is not agile"):
P(Y=Cat | Size=Medium, Agile=No)
OK, that’s cool and all, but how do we actually solve for it? That’s where Bayes’ Theorem comes in – using it, we can write:

P(IsCat | Medium, NotAgile) = P(Medium, NotAgile | IsCat) * P(IsCat) / P(Medium, NotAgile)

where P(IsCat) is the prior, P(Medium, NotAgile | IsCat) is the scaler, and P(Medium, NotAgile) is the normalizer.
I used some of the terminology described in my previous post again here, but let me give a quick refresher:
- The prior is the unscaled probability that a randomly chosen observation is a cat. For our model, a reasonable way of estimating P(IsCat) is through calculating the percentage of our dataset that is made up of cats (provided that our sample is reasonably representative of the population).
- The job of the scaler is to scale up or down our prior based on the evidence. It tries to quantitatively answer the question – "how should I adjust my beliefs now that I have observed this evidence?" In our case, the scaler adjusts our prior to account for the fact that the animal in question is medium sized and not agile. It does so by figuring out what proportion of cats are medium sized and not agile. If it’s a lot then it helps to raise our belief that our observation is a cat. If only a small proportion are, then it helps lower our belief that we are dealing with a cat.
- The normalizer is used to adjust the numerator – this adjusts the calculated probability for the rarity of the evidence.
- The ratio between the scaler and the normalizer is pretty important to understand. When the scaler is greater than the normalizer, we increase the prior – the evidence has increased our belief that it is a cat. In essence, the ratio performs the following chain of analysis:
- What proportion of animals are medium sized and not agile?
- What proportion of cats are medium sized and not agile?
- If the proportion of cats that are medium sized and not agile is greater than the proportion of all animals that are medium sized and not agile, then there is reason to believe that we are dealing with a cat, and we should revise our prior up accordingly (we will work through these calculations in a small sketch right after this list).
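To make the prior, scaler, and normalizer concrete, here is a minimal sketch that estimates each one by counting over a tiny, invented set of labeled animals (the data and the resulting numbers are purely illustrative):

```python
# Tiny, invented training set: (class, size, agile)
animals = [
    ("Cat", "Medium", "No"), ("Cat", "Medium", "Yes"), ("Cat", "Small", "Yes"),
    ("Dog", "Medium", "No"), ("Dog", "Large", "No"), ("Dog", "Medium", "No"),
    ("Hamster", "Tiny", "No"), ("Hamster", "Tiny", "No"),
]

n = len(animals)
cats = [a for a in animals if a[0] == "Cat"]

# Prior: what proportion of all observations are cats?
prior = len(cats) / n

# Scaler: what proportion of cats are medium sized and not agile?
scaler = sum(a[1] == "Medium" and a[2] == "No" for a in cats) / len(cats)

# Normalizer: what proportion of all animals are medium sized and not agile?
normalizer = sum(a[1] == "Medium" and a[2] == "No" for a in animals) / n

# Bayes' Theorem: P(IsCat | Medium, NotAgile) = scaler * prior / normalizer
posterior = scaler * prior / normalizer
print(round(prior, 3), round(scaler, 3), round(normalizer, 3), round(posterior, 3))
```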
The issue with Bayes’ Theorem is that it forces us to calculate a lot of probabilities, including some pretty difficult to estimate ones. But in the next section, we will see how naive Bayes, by making some clever assumptions, helps us greatly simplify things.
Moving to Naive Bayes’ Simplified World
Let’s start with the normalizer. We don’t need it for classification. To see why, let me write out all three class probability formulas:

P(IsCat | Medium, NotAgile) = P(Medium, NotAgile | IsCat) * P(IsCat) / P(Medium, NotAgile)

P(IsDog | Medium, NotAgile) = P(Medium, NotAgile | IsDog) * P(IsDog) / P(Medium, NotAgile)

P(IsHamster | Medium, NotAgile) = P(Medium, NotAgile | IsHamster) * P(IsHamster) / P(Medium, NotAgile)

Notice that the normalizer term, P(Medium, NotAgile), is the same across all three class probability equations. So it’s effectively a constant that we can safely omit. Here’s how we do that – in classification, we care more about relative values than absolute ones. Recall that we make predictions by finding the most likely class (the class with the maximum likelihood). To do this, we make comparisons like this one:

P(Medium, NotAgile | IsCat) * P(IsCat) / P(Medium, NotAgile)   vs.   P(Medium, NotAgile | IsDog) * P(IsDog) / P(Medium, NotAgile)

And because the denominator, P(Medium, NotAgile), is the same on both sides of our comparison, we can simplify it to:

P(Medium, NotAgile | IsCat) * P(IsCat)   vs.   P(Medium, NotAgile | IsDog) * P(IsDog)
So (forgetting about hamsters for a minute), if the value on the left is bigger, we predict cat; otherwise, we predict dog.
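As a quick sanity check that dropping the normalizer is safe, here is a small sketch (with made-up numbers) showing that dividing both sides by the same positive constant never changes which side wins the comparison:

```python
# Made-up unnormalized scores: scaler * prior for each class
cat_score = 0.33 * 0.375   # P(Medium, NotAgile | IsCat) * P(IsCat)
dog_score = 0.67 * 0.375   # P(Medium, NotAgile | IsDog) * P(IsDog)

normalizer = 0.375         # P(Medium, NotAgile), shared by both classes

# Comparing the full posteriors...
with_normalizer = (cat_score / normalizer) > (dog_score / normalizer)
# ...gives the same answer as comparing the numerators alone
without_normalizer = cat_score > dog_score

print(with_normalizer == without_normalizer)  # True
```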
The Naive (But Clever) Key Assumption
Now comes the naive part of the algorithm, which I would argue is actually really clever and not naive at all. Naive Bayes makes a key simplifying assumption: for a given class, all of our features (X variables such as size and agility) are independent of each other. In probability, independence means that the probability of event A occurring is the same whether or not B occurs – or, if you are more familiar with statistics lingo like I am, we could loosely say that A and B are uncorrelated (strictly speaking, independence is a stronger condition than zero correlation). If A and B are independent, then their conditional probabilities simplify to:
P(A|B) = P(A) and P(B|A) = P(B)
Warning: Math Incoming
Let’s see how this assumption helps us out. But before we can do that, we need to introduce some helpful math first (apologies for all the equations).
A quick note on notation – P(A,B,C) means the probability of A and B and C all occurring at the same time (a.k.a. joint probability). P(A|B,C) is the probability of A occurring given that B and C have already occurred.
From Bayes’ Theorem and general probability we know that:
P(A,B) = P(A|B) P(B) = P(B|A) P(A)
So the numerator of Bayes’ Theorem (the product of the scaler and prior) is a joint probability. And because we canceled out the normalizer, the numerator is all that we care about. Going back to our animals example, recall that we simplified the cat part of our likelihood calculation to:

P(IsCat) * P(Medium, NotAgile | IsCat)

The first probability, P(IsCat), is the prior and the second probability is the scaler – and as we just learned, the product of prior and scaler is a joint probability:

P(IsCat) * P(Medium, NotAgile | IsCat) = P(Medium, NotAgile, IsCat)
And from the chain rule of probability, we know that:
P(A,B,C) = P(A|B,C) P(B|C) P(C)
So we can rewrite our joint probability as:

P(Medium, NotAgile, IsCat) = P(Medium | NotAgile, IsCat) * P(NotAgile | IsCat) * P(IsCat)
Almost there! This is where naive Bayes’ simplifying assumption comes to save the day. Since we can assume that the features, size and agility, are independent (within a class), we know that:
P(Medium|NotAgile, IsCat) = P(Medium|IsCat)
And our equation ultimately simplifies to:

P(Medium, NotAgile, IsCat) = P(Medium | IsCat) * P(NotAgile | IsCat) * P(IsCat)
That’s pretty cool. This means that we can estimate the likelihood of an observation belonging to a particular class, C, by scaling the prior by as many scalers as there are features:
Likelihood that Y is Class C = P(X1|C) P(X2|C) … P(Xn|C) P(C)
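Here is what that formula looks like as a small Python sketch, assuming the per-feature conditional probabilities P(Xi|C) and the prior P(C) have already been estimated (the numbers below are invented for illustration):

```python
import math

def naive_bayes_score(feature_probs, prior):
    """Unnormalized likelihood: P(X1|C) * P(X2|C) * ... * P(Xn|C) * P(C)."""
    return math.prod(feature_probs) * prior

# Invented conditional probabilities for a medium sized, not agile observation
cat_score = naive_bayes_score([0.50, 0.30], prior=0.375)  # P(Medium|Cat), P(NotAgile|Cat)
dog_score = naive_bayes_score([0.60, 0.80], prior=0.375)  # P(Medium|Dog), P(NotAgile|Dog)

print("Cat" if cat_score > dog_score else "Dog")  # Dog
```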
In English Please
The plain English interpretation of all this is:
- We start with a prior, P(IsCat) – the probability that a randomly chosen observation will belong to the cat class. We can estimate this prior based on our training data, or even assume that all priors are equal if we believe our training data to be biased.
- For every feature/class combination of a given observation, we calculate a scaler (or look one up, as the scalers have usually already been calculated during training). This is used to adjust the prior based on the informational signal in the observation’s features. For example, let’s say our observation is tiny. Consulting our training data, we find that P(Tiny|Hamster) is really high – in other words, a large proportion of hamsters are tiny. Meanwhile, P(Tiny|Dog) is really small – very few dogs are tiny. In this case, the scaler for the hamster class’ size feature shifts the model in favor of hamster relative to dog (which gets punished by its scaler).
- The likelihood of each class is just the product of that class’ prior and scalers (in the example below, let’s assume the priors are roughly equal, so the comparison comes down to the scalers). So continuing the earlier example, let’s say the observation is tiny and also clumsy. Our training data tells us that a high proportion of hamsters are tiny and a medium proportion of hamsters are clumsy. Meanwhile, very few dogs are tiny and a medium proportion of dogs are clumsy. Finally, a medium proportion of cats are tiny and almost no cats are clumsy.
Likelihood(Hamster) = P(Tiny|Hamster) * P(Clumsy|Hamster)
Likelihood(Hamster) = High * Medium = Somewhat Likely
Likelihood(Dog) = P(Tiny|Dog) * P(Clumsy|Dog)
Likelihood(Dog) = Low * Medium = Somewhat Low
Likelihood(Cat) = P(Tiny|Cat) * P(Clumsy|Cat)
Likelihood(Cat) = Medium * Almost Zero = Not Likely
So in this case, naive Bayes would predict hamster because it had the highest likelihood. That’s naive Bayes in a nutshell – at a high level, naive Bayes is just applying a simplified version of Bayes’ Theorem to every observation based on its features (and for each potential class). It’s not rocket science, but in my opinion, it is powerful in its own simple way.
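Putting the whole walkthrough together, here is a small end-to-end sketch of a categorical naive Bayes classifier built from scratch on an invented version of the animals data (no smoothing or edge-case handling – purely illustrative):

```python
from collections import Counter, defaultdict

# Invented training data: ({feature: value, ...}, class label)
train = [
    ({"size": "Tiny",   "clumsy": "Yes"}, "Hamster"),
    ({"size": "Tiny",   "clumsy": "No"},  "Hamster"),
    ({"size": "Medium", "clumsy": "No"},  "Cat"),
    ({"size": "Tiny",   "clumsy": "No"},  "Cat"),
    ({"size": "Large",  "clumsy": "Yes"}, "Dog"),
    ({"size": "Medium", "clumsy": "Yes"}, "Dog"),
]

# "Training" is just counting: class counts and feature-value counts within each class
class_counts = Counter(label for _, label in train)
feature_counts = defaultdict(Counter)  # (label, feature name) -> Counter of values
for features, label in train:
    for name, value in features.items():
        feature_counts[(label, name)][value] += 1

def predict(features):
    scores = {}
    for label, count in class_counts.items():
        score = count / len(train)  # prior P(C)
        for name, value in features.items():
            score *= feature_counts[(label, name)][value] / count  # scaler P(Xi|C)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict({"size": "Tiny", "clumsy": "Yes"}))  # Hamster
```

In practice, real implementations multiply many small probabilities, so they usually work with sums of log-probabilities to avoid numerical underflow, and apply a little smoothing so that an unseen feature value doesn’t zero out an entire class.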

Conclusion
A topic for further exploration is whether (and how) the naive Bayes classifier’s assumption of feature independence hurts its performance relative to other algorithms. But the independence assumption is also one of its key advantages as it allows for quick training and predictions even on very large datasets. Also, naive Bayes has almost no hyperparameters to tune, so it usually generalizes well.
One thing to note is that due to the feature independence assumption, the class probabilities output by naive Bayes can be pretty inaccurate. So if your end application requires precise estimates of probabilities, you will want to go with another algorithm.
On the other hand, despite its naiveté, naive Bayes often does a reasonably good job of picking the right class – it may not be that good at estimating absolute probabilities, but it is pretty good at measuring relative likelihoods.
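If you want to experiment without writing the classifier yourself, scikit-learn ships several naive Bayes variants. Here is a minimal sketch using CategoricalNB on integer-encoded features (the data is invented, and the predicted probabilities should be treated with the calibration caveat above in mind):

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Invented toy data: columns are integer-encoded size and agility
X = np.array([[0, 0], [0, 1], [1, 1], [1, 0], [2, 0], [2, 0]])
y = np.array(["Hamster", "Hamster", "Cat", "Cat", "Dog", "Dog"])

model = CategoricalNB()  # default Laplace smoothing (alpha=1.0)
model.fit(X, y)

# The predicted class is usually reasonable...
print(model.predict([[0, 0]]))        # ['Hamster']
# ...but the probabilities themselves can be poorly calibrated
# because of the independence assumption
print(model.predict_proba([[0, 0]]))
```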
Thanks for reading and cheers!