
Ace your Machine Learning Interview — Part 3

Dive into Naive Bayes Classifier using Python

Marcello Politi
Towards Data Science
5 min read · Oct 27, 2022


This is the third article in the series I have called “Ace your Machine Learning Interview”, in which I go over the foundations of Machine Learning. If you missed the first two articles, you can find them here:

Introduction

Naive Bayes is a Machine Learning algorithm used to solve classification problems; it is so called because it is based on Bayes’ theorem.

An algorithm referred to as a classifier assigns a class to each instance of data: for example, classifying whether an email is spam or not spam.

Bayes Theorem

Bayes’ Theorem is used to calculate the probability of a cause, given that its effect has been observed. The formula, which we have all studied in probability courses, is the following.

Bayes Theorem (Image By Author)
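In plain notation, the theorem reads:

P(A|B) = P(B|A) · P(A) / P(B)

where P(A) and P(B) are the probabilities of the two events on their own, and P(B|A) is the probability of observing B given that A has occurred.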

So this theorem answers the question: ‘What is the probability that event A will occur, given that event B has occurred?’ The interesting thing is that this formula turns the question around: we can calculate this probability by looking at how often B actually occurred whenever event A had occurred. That is, we can answer the original question by looking at the past (the data).

Naive Bayes Classifier

But how then do we apply this theorem to create a Machine Learning classifier? Suppose we have a dataset consisting of n features and a target.

Dataset (Image By Author)

Therefore, our question now is ‘What is the probability of having a certain label y given that those features occurred?’

For example, if y = spam/not-spam, x1 = len(email) and x2 = number_of_attachments, we might ask:

‘What is the probability that y is spam given that x1 = 100 chars and x2 = 2 attachments?’

To answer this question we need only apply Bayes’ theorem, where A = {y} and B = {x1,x2,…,xn}.

Apply Bayes Theorem (Image By Author)
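Written out with the events of our question, that is:

P(y | x1, x2, …, xn) = P(x1, x2, …, xn | y) · P(y) / P(x1, x2, …, xn)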

But the classifier is not called the Bayes Classifier but the Naive Bayes Classifier. This is because a naive assumption is made to simplify the calculations: the features are assumed to be independent of each other given the class. This allows us to simplify the formula.

Naive Assumption (Image By Author)
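Under the independence assumption, the likelihood factorizes into one term per feature:

P(y | x1, …, xn) = P(y) · P(x1|y) · P(x2|y) · … · P(xn|y) / (P(x1) · P(x2) · … · P(xn))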

In this way, we can calculate the probability that y = spam. Next, we calculate the probability that y = not_spam and see which one is more likely. But if you think about it, between the two labels, the one with the higher probability will be the one with the larger numerator, since the denominator is always the same: P(x1) · P(x2) · …

Then, for simplicity, we can also drop the denominator, since it does not matter for the purpose of the comparison.

Simplify the formula (Image By Author)
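In other words, for the purpose of comparing the labels:

P(y | x1, …, xn) ∝ P(y) · P(x1|y) · P(x2|y) · … · P(xn|y)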

Now we choose the class that maximizes this probability; to do this, we only need argmax.

Argmax for classification (Image By Author)
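The predicted class is then:

ŷ = argmax_y P(y) · P(x1|y) · P(x2|y) · … · P(xn|y)

In practice, a product of many small probabilities can underflow, which is why implementations usually maximize the equivalent sum of log-probabilities instead.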

Naive Bayes Classifier for Text Data

This algorithm is often used in the field of NLP for textual data. This is because we can treat the individual words that appear in a text as features, and the naive assumption is then that these words are independent of each other (which of course is not actually true).

Suppose we have a dataset in which each row contains a single sentence, and each column tells us whether or not a given word appears in that sentence. Unnecessary words such as articles (stop words) have been eliminated.

Now we can calculate the probability that a new sentence is good or bad in the following way.

Example on Text (Image By Author)
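As a minimal sketch of this idea in code (the sentences and labels below are toy data, made up for illustration), we can binarize the words of each sentence and fit a Bernoulli Naive Bayes model:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy corpus: each sentence is labeled good or bad (hypothetical data)
sentences = [
    "great movie loved it",
    "terrible plot awful acting",
    "wonderful cast great story",
    "boring awful waste of time",
]
labels = ["good", "bad", "good", "bad"]

# Binary bag-of-words: each column says whether a word appears in the sentence;
# stop words (articles, etc.) are removed, as in the example above
vectorizer = CountVectorizer(binary=True, stop_words="english")
X = vectorizer.fit_transform(sentences)

model = BernoulliNB()  # alpha=1.0 by default, i.e. Laplace smoothing
model.fit(X, labels)

print(model.predict(vectorizer.transform(["great story but awful acting"])))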

Let’s code!

Implementing the Naive Bayes algorithm in sklearn is very simple: just a few lines of code. We will use the well-known Iris dataset, which consists of the following features.

Iris Dataset (Image By Author)
Naive Bayes in sklearn (Image By Author)
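Since the code is shown as an image above, here is a minimal sketch of what it likely looks like. We use GaussianNB, assuming Gaussian likelihoods for the four continuous Iris features:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Iris dataset (4 continuous features, 3 classes)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# GaussianNB estimates P(xi | y) as a Gaussian for each feature and class
clf = GaussianNB()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")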

Advantages

On the benefits side, the main strength of the Naive Bayes algorithm is its simplicity of use. Although it is a basic and dated algorithm, it still solves some classification problems excellently and with fair efficiency, even though its application is limited to a few specific cases. Summarizing:

  • Works well with many features
  • Works well with large training datasets
  • Converges fast during training
  • Also performs well on categorical features
  • Robust to outliers

Disadvantages

On the drawbacks side, the following should be mentioned in particular. The algorithm requires knowledge of all the probabilities involved in the problem, especially the prior and conditional probabilities, and this is often difficult and expensive information to obtain. Moreover, the algorithm provides a “naive” approximation of the problem, because it does not consider the correlation between the features of an instance.

If a probability is zero because the corresponding value was never observed in the data, you have to apply Laplace smoothing.
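With add-one (Laplace) smoothing, a conditional probability is estimated as:

P(xi | y) = (count(xi, y) + 1) / (count(y) + k)

where count(xi, y) is the number of training examples of class y in which the feature takes the value xi, count(y) is the number of training examples of class y, and k is the number of possible values of the feature. In sklearn, this is controlled by the alpha parameter of MultinomialNB and BernoulliNB (alpha=1 corresponds exactly to Laplace smoothing).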

Handle Missing Values

You can simply skip missing values. Suppose we throw a coin 3 times, but we forgot the result of the second throw. We can sum up all the possibilities for that second throw.

Skip missing values (Image By Author)
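Since the throws are independent and P(H) + P(T) = 1, summing over the unknown second throw gives:

P(H, ?, T) = P(H, H, T) + P(H, T, T) = P(H) · (P(H) + P(T)) · P(T) = P(H) · P(T)

The same trick applies to Naive Bayes: because each feature contributes an independent factor P(xi | y) whose values sum to 1, a missing feature simply drops its factor from the product.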

Final Thoughts

Naive Bayes is one of the main algorithms to know when approaching Machine Learning. It has been used heavily, especially in problems with text data, such as spam email recognition. As we have seen, it still has its advantages and disadvantages, but certainly, when you are asked about basic Machine Learning, expect a question about it!

The End

Marcello Politi

Linkedin, Twitter, CV
