Naive Bayes in Machine Learning

Prashant Gupta
Towards Data Science
6 min read · Nov 6, 2017


Bayes’ theorem finds many uses in probability theory and statistics. Chances are slim that you have never heard of it. It has also found its way into machine learning, where it forms the basis of one of the most widely used algorithms. In this article, we will learn all about the Naive Bayes algorithm, along with its variations for different purposes in machine learning.

As you might have guessed, this requires us to view things from a probabilistic point of view. As usual in machine learning, we have attributes, response variables, and predictions or classifications. With this algorithm, we deal with the probability distributions of the variables in the dataset and predict the probability of the response variable taking a particular value, given the attributes of a new instance. Let’s start by reviewing Bayes’ theorem.

Bayes’ Theorem

Bayes’ theorem lets us compute the probability of an event based on prior knowledge of another event related to it. For example, the probability that the price of a house is high can be assessed much better if we know the facilities around it, compared to an assessment made without knowing the house’s location.

P(A|B) = P(B|A) · P(A) / P(B)

(Equation image taken from Wikipedia)

The above equation gives the basic form of Bayes’ theorem. Here A and B are two events, and:

P(A|B): the conditional probability that event A occurs, given that B has occurred. This is also known as the posterior probability.

P(A) and P(B): the probabilities of A and B without regard to each other.

P(B|A): the conditional probability that event B occurs, given that A has occurred.
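
To make the pieces concrete, here is a tiny numeric sketch in Python; the events mirror the house example above and every number is made up purely for illustration.

```python
# A minimal numeric sketch of Bayes' theorem (all numbers are illustrative).
# Event A: the house price is high; event B: the house has good facilities nearby.
p_a = 0.3          # P(A): prior probability that the price is high
p_b_given_a = 0.8  # P(B|A): probability of good facilities given a high price
p_b = 0.4          # P(B): overall probability of good facilities (the evidence)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # ~0.6: knowing B raises our belief in A from 0.3 to 0.6
```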

Now, let’s see how this fits into machine learning.

Take a simple machine learning problem, where we need to learn our model from a given set of attributes (in the training examples) and then form a hypothesis, or a relation, to the response variable. We then use this relation to predict a response, given the attributes of a new instance. Using Bayes’ theorem, it’s possible to build a learner that predicts the probability of the response variable belonging to some class, given a new set of attributes.

Consider the previous equation again. Now, assume that A is the response variable and B is the input attribute. So according to the equation, we have

P(A|B): the conditional probability of the response variable taking a particular value, given the input attributes. This is also known as the posterior probability.

P(A): the prior probability of the response variable.

P(B): the probability of the training data, or the evidence.

P(B|A): the likelihood of the training data.

Therefore, the above equation can be rewritten as

P(Y|X) = P(X|Y) · P(Y) / P(X)

(Equation image taken from Wikipedia; here Y is the response variable and X represents the attributes)

Let’s take a problem where the number of attributes is equal to n and the response is a boolean value, i.e. it can be in one of two classes. Also, the attributes are categorical (2 categories in our case). Now, to train the classifier, we will need to calculate P(B|A) for all the values in the instance and response space. This means we will need to estimate 2(2^n - 1) parameters to learn this model. This is clearly unrealistic in most practical learning domains. For example, if there are 30 boolean attributes, then we will need to estimate more than 2 billion parameters.
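
To see how quickly this blows up, here is a one-line sketch of the parameter count under the same assumptions (n boolean attributes, a boolean response):

```python
# Parameters needed to model P(B|A) exactly: for each of the 2 classes,
# a distribution over the 2**n attribute combinations has 2**n - 1 free parameters.
def full_bayes_params(n):
    return 2 * (2 ** n - 1)

print(full_bayes_params(5))   # 62
print(full_bayes_params(30))  # 2147483646 -- over 2 billion parameters
```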

Naive Bayes Algorithm

The complexity of the above Bayesian classifier needs to be reduced for it to be practical. The naive Bayes algorithm does that by making an assumption of conditional independence over the training dataset. This drastically reduces the number of parameters in the above problem to just 2n.

The assumption of conditional independence states that, given random variables X, Y and Z, we say X is conditionally independent of Y given Z, if and only if the probability distribution governing X is independent of the value of Y given Z.

In other words, X and Y are conditionally independent given Z if and only if, given knowledge that Z occurs, knowledge of whether X occurs provides no information on the likelihood of Y occurring, and knowledge of whether Y occurs provides no information on the likelihood of X occurring. For example, thunder is conditionally independent of rain given lightning: once we know there was lightning, learning that it rained tells us nothing more about whether there was thunder.

This assumption is what makes the Bayes algorithm naive.

Given n different attribute values, the likelihood can now be written as

P(X1, X2, ..., Xn | Y) = P(X1 | Y) · P(X2 | Y) · ... · P(Xn | Y)

Here, X represents the attributes or features, and Y is the response variable. P(X|Y) is now equal to the product of the probability distributions of the individual attributes, each conditioned on Y.
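
As a sketch of what this buys us, the code below evaluates the factorized likelihood for a new instance with boolean features; the per-feature probability table is made up for illustration.

```python
import numpy as np

# Hypothetical table of P(X_i = 1 | Y = y) for 3 boolean features (rows)
# and 2 classes (columns); the values are made up for illustration.
p_xi_given_y = np.array([
    [0.9, 0.2],   # feature 1
    [0.4, 0.7],   # feature 2
    [0.1, 0.8],   # feature 3
])

x_new = np.array([1, 0, 1])  # observed attribute values of the new instance

# Naive Bayes likelihood: P(X | Y = y) = product over i of P(X_i = x_i | Y = y)
per_feature = np.where(x_new[:, None] == 1, p_xi_given_y, 1 - p_xi_given_y)
likelihood = per_feature.prod(axis=0)
print(likelihood)  # one likelihood value per class: [0.054 0.048]
```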

Maximum a Posteriori

What we are interested in is finding the posterior probability, P(Y|X). For multiple values of Y, we will need to calculate this expression for each of them.

Given a new instance Xnew, we need to calculate the probability that Y will take on any given value, given the observed attribute values of Xnew and given the distributions P(Y) and P(X|Y) estimated from the training data.

So how do we predict the class of the response variable, based on the different values we obtain for P(Y|X)? We simply take the most probable, i.e. the maximum, of these values. This procedure is therefore known as maximum a posteriori (MAP) estimation.
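
Putting the pieces together, here is a compact from-scratch sketch of this MAP rule for boolean features. It estimates P(Y) and P(X_i|Y) by simple counting (with add-one smoothing, which the article does not cover but which avoids zero probabilities), and the tiny training set is made up for illustration.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate P(Y) and P(X_i = 1 | Y) from boolean training data."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    # Add-one (Laplace) smoothing keeps every estimate away from 0 and 1.
    cond = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                     for c in classes])
    return classes, priors, cond

def predict_map(x_new, classes, priors, cond):
    """Pick the class with the highest posterior (maximum a posteriori)."""
    # log P(Y=c) + sum_i log P(X_i = x_i | Y=c); the evidence P(X) is the same
    # for every class, so it can be ignored.
    log_post = np.log(priors) + np.where(x_new == 1,
                                         np.log(cond),
                                         np.log(1 - cond)).sum(axis=1)
    return classes[np.argmax(log_post)]

# Tiny made-up training set: 4 instances, 3 boolean attributes.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])

classes, priors, cond = fit_naive_bayes(X, y)
print(predict_map(np.array([1, 0, 1]), classes, priors, cond))  # 1
```

Working in log space is a common trick here: products of many small probabilities underflow quickly, whereas sums of logs stay well behaved.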

Maximizing Likelihood

If we assume that the response variable is uniformly distributed, that is, every response is equally likely, then we can simplify the algorithm further. With this assumption the prior, P(Y), becomes a constant equal to 1 divided by the number of categories of the response.

Since the prior and the evidence are now constant with respect to the response variable, they can be removed from the equation. Maximizing the posterior therefore reduces to maximizing the likelihood.
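
A minimal sketch of that simplification, assuming boolean features and a made-up table of conditional probabilities: with a uniform prior there is no prior term left to add, and we simply compare likelihoods.

```python
import numpy as np

# Made-up estimates of P(X_i = 1 | Y = c) for 2 classes (rows), 3 features (columns).
cond = np.array([[0.25, 0.50, 0.50],   # class 0
                 [0.75, 0.50, 0.75]])  # class 1
x_new = np.array([1, 0, 1])

# With a uniform prior, P(Y) and the evidence P(X) are the same for every class,
# so maximizing the posterior reduces to maximizing the likelihood alone.
log_lik = np.where(x_new == 1, np.log(cond), np.log(1 - cond)).sum(axis=1)
print(np.argmax(log_lik))  # 1
```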

Feature Distribution

As seen above, we need to estimate the distribution of the response variable from the training set or assume a uniform distribution. Similarly, to estimate the parameters of a feature’s distribution, one must assume a distribution or generate a nonparametric model for the features from the training set. Such assumptions are known as event models. Varying these assumptions yields different algorithms for different purposes. For continuous features, Gaussian naive Bayes is the algorithm of choice. For discrete features, the multinomial and Bernoulli distributions are popular. A detailed discussion of these variations is out of the scope of this article.
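
In practice these event models come ready-made: scikit-learn, for example, ships GaussianNB, MultinomialNB and BernoulliNB. Below is a minimal Gaussian naive Bayes sketch on made-up continuous features.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB  # MultinomialNB and BernoulliNB also exist

# Made-up continuous features (two measurements per instance) and class labels.
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 0.4], [4.1, 0.3]])
y = np.array([0, 0, 1, 1])

model = GaussianNB()  # assumes each feature is Gaussian within each class
model.fit(X, y)
print(model.predict([[1.1, 2.0], [4.0, 0.5]]))  # [0 1]
```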

Naive Bayes classifiers work surprisingly well in complex situations, despite their simplified assumptions and naivety. One advantage of these classifiers is that they require only a small amount of training data to estimate the parameters necessary for classification. They are also a popular choice for text categorization. This is the basic idea behind naive Bayes classifiers, and it is all you need to start experimenting with the algorithm.
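
As a concrete starting point, a bag-of-words text classifier along these lines takes only a few lines with scikit-learn; the tiny corpus and labels below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny made-up corpus: 1 = sports, 0 = finance.
texts = [
    "great goal in the final match",
    "the team won the league",
    "stock prices fell sharply today",
    "the market rallied after earnings",
]
labels = [1, 1, 0, 0]

# Word counts feed a multinomial event model.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["the team scored a late goal"]))  # [1]
```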

If you liked this article, be sure to show your support by clapping for it below, and if you have any questions, leave a comment and I will do my best to answer.

To keep up with the world of machine learning, follow me. It’s the best way to find out when I write more articles like this.

You can also follow me on Twitter, email me directly, or find me on LinkedIn. I’d love to hear from you.

That’s all, folks. Have a nice day :)
