The Naive Bayes classifier

Ritwick Roy
Towards Data Science
Dec 29, 2021 · 11 min read


The Naive Bayes algorithm is explained through simple examples.

Image by author

Contents:

Introduction

1. Bayes’ theorem

2. Naïve Bayes classifier

3. A simple binary classification problem

3.1 Prior probability computation

3.2 Class conditional probability computation

3.3 Predicting posterior probability

3.4 Treating Features with continuous data

3.5 Treating incomplete datasets

4. Naïve Bayes using Scikit Learn

4.1 Handling mixed features

5. Conclusion

6. References

Introduction:

Classification algorithms try to predict the class or label of a categorical target variable. A categorical variable typically represents qualitative data with discrete values, such as pass/fail or low/medium/high. Among the many classification algorithms, the Naïve Bayes classifier is one of the simplest. It is often used with large text datasets, among other applications. The aim of this article is to explain how the Naive Bayes algorithm works. The Naïve Bayes classifier is based on Bayes’ theorem, which is discussed next.

1.0 Bayes’ Theorem:

Assume that a customer survey on the purchase of ultra-high-definition TV was conducted. The results from the survey are presented below in the form of a contingency table:

Table 1: contingency table

                           Actually purchased   Didn't actually purchase   Total
Planned to purchase                300                     50               350
Didn't plan to purchase            150                    600               750
Total                              450                    650              1100

For simplicity the categories in the contingency table are represented with letters as follows:

A: Planned to purchase

B: Actually purchased

A′: Didn’t plan to purchase

B′: Didn’t actually purchase

Based on the above notation the information in the contingency table can also be represented in the form of a Venn diagram.

Image by author

P(A), the probability of planned to purchase, is 350/1100.

P(B), the probability of actually purchased, is 450/1100.

P(A∩B), the probability of planned to purchase and actually purchased, is 300/1100.

Note that while computing P(B) we didn’t consider whether the customer planned to purchase or not. The prior knowledge that a customer planned to purchase changes the customer’s probability of actually purchasing. This is also called the conditional probability of actually purchasing given that a customer planned to purchase. It is represented as P(B|A).

P(B|A) = 𝑃(𝐴∩𝐵)/𝑃(𝐴) = 300/350, i.e. out of all the customers who planned to purchase, the fraction who actually purchased.

This equation can be rewritten as

𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐵|𝐴)∗𝑃(𝐴)

We can also write the conditional probability of planned to purchase given that a customer actually purchased P(A|B) = 𝑃(𝐵∩𝐴)/ 𝑃(𝐵)= 300/450.

We can rewrite this as:

𝑃(𝐵 ∩ 𝐴) = P(A|B)*P(B)

Since P(A∩B) = P(B∩A) we can equate the right-hand sides of the above two equations to obtain:

𝑃(𝐵|𝐴)∗𝑃(𝐴)=𝑃(A|B)∗𝑃(B)

This equation can be rewritten to yield Bayes’ theorem:

P(B|A) = P(A|B)∗P(B)/P(A)

If A and B are independent events then 𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴) ∗ 𝑃(𝐵).

Therefore when A and B are independent events P(A|B) = P(A)
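Before moving on, the survey numbers can be used to verify Bayes’ theorem numerically. A minimal Python sketch using the counts from the contingency table above:

```python
# Counts from the TV purchase survey (Table 1)
total = 1100
planned = 350                 # A: planned to purchase
purchased = 450               # B: actually purchased
planned_and_purchased = 300   # A ∩ B

p_a = planned / total
p_b = purchased / total
p_a_and_b = planned_and_purchased / total

p_b_given_a = p_a_and_b / p_a   # direct definition: 300/350
p_a_given_b = p_a_and_b / p_b   # direct definition: 300/450

# Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A)
bayes_rhs = p_a_given_b * p_b / p_a
print(p_b_given_a, bayes_rhs)
```

Both expressions evaluate to the same value, 300/350 ≈ 0.857.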

2.0 Naïve Bayes classifier:

Let the feature vectors in our dataset be represented by 𝑨 = (𝐴₁,𝐴₂,…, 𝐴ₙ). The target vector is categorical with values 𝐵ᵢ, i= 1,2,…k where k is the total number of classes/labels for the target vector.

Given a new feature vector 𝐴₁,𝐴₂,…, 𝐴ₙ, the Naïve Bayes classifier predicts the probability P(𝐵ᵢ| (𝐴₁,𝐴₂,…, 𝐴ₙ)), i=1,2,…,k, i.e. the conditional probability of each class/label of the target vector for the given feature vector. This conditional probability is also called the posterior probability and can be written in a compact form as 𝑃(𝐵ᵢ| 𝑨), i=1,2,…,k. The Naïve Bayes classifier then predicts the class/label i with the highest posterior probability as the most likely outcome.

The posterior probability for the classes is computed using Bayes’ theorem:

P(𝐵ᵢ | 𝐴₁,𝐴₂,…, 𝐴ₙ) = P(𝐴₁,𝐴₂,…, 𝐴ₙ | 𝐵ᵢ)∗P(𝐵ᵢ)/P(𝐴₁,𝐴₂,…, 𝐴ₙ)

In the above equation, the denominator P(𝐴₁,𝐴₂,…, 𝐴ₙ) is the same for all classes 𝐵ᵢ, i= 1,2,…k. Therefore for the purpose of computing the posterior probability of the classes we can ignore it and simply compute the terms in the numerator P(A | 𝐵ᵢ) and P(𝐵ᵢ), i=1,2…k. The computation of each of these terms is explained next.

The term P(𝐵ᵢ) is also known as the prior probability of class i. Let m be the total number of data points in the dataset, and 𝑁Bⱼ be the number of instances of class 𝐵ⱼ in our dataset; then for the jth class:

P(𝐵ⱼ) = 𝑁Bⱼ/m

Thus to compute the prior probabilities we need to count the number of instances of each of the classes in our dataset and divide it by the total number of data points in our dataset.

To compute the term P(A | 𝐵ⱼ), the Naïve Bayes’ classifier assumes that the features are independent of each other, i.e. the occurrence of feature 𝐴ₛ is not influenced by the occurrence of feature 𝐴ᵣ. This assumption is not true for all situations, and therefore this classifier is named the Naïve Bayes’ classifier.

The assumption of feature independence yields:

P(𝐴₁,𝐴₂,…, 𝐴ₙ | 𝐵ᵢ) = P(𝐴₁|𝐵ᵢ)∗P(𝐴₂|𝐵ᵢ)∗…∗P(𝐴ₙ|𝐵ᵢ)

P(𝐴ₛ|𝐵ᵢ) is also known as the class conditional probability. Computing and storing the class conditional probabilities is one of the critical tasks performed by any Naïve Bayes classifier.

Let us say that there are 3 categorical features 𝐴₁, 𝐴₂, and A₃ in a dataset. Feature 𝐴₁ has 2 classes/labels, feature 𝐴₂ has 3 classes/labels, and feature A₃ also has 2 classes/labels. The total number of feature labels is 2+3+2=7. The target variable in the dataset is a binary categorical variable, i.e. it has only 2 classes. For such a dataset we need to compute and store 7∗2=14 class conditional probabilities. These probabilities are looked up during the prediction phase. In the next section, a simple example is used to explain the computation of the prior and class conditional probabilities as well as the posterior probabilities.

3.0 A simple binary classification problem:

This made-up example dataset contains examples of the different conditions that are associated with accidents. The target variable accident is a binary categorical variable with yes/no values. There are 4 categorical features: weather condition, road condition, traffic condition, and engine problem.

The classes/labels associated with each of the features are listed below:

Table 2: feature classes
Table 3: Example data

3.1 Prior probability computation:

There are 10 data points (m = 10). There are 5 instances of class/label ‘yes’ ( 𝑁Accidentᵧₑₛ = 5), and 5 instances of class/label ‘no’ ( 𝑁Accidentₙₒ = 5). The prior probabilities can be computed using the equation for prior probability in section 2.0:

𝑃(Accidentᵧₑₛ) = 5/10

𝑃(Accidentₙₒ) = 5/10
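The same counting can be written as a couple of lines of Python; the label list below simply mirrors the 5 ‘yes’ / 5 ‘no’ split in the example data:

```python
from collections import Counter

# Target labels from the example dataset: 5 'yes' and 5 'no'
accident = ['yes'] * 5 + ['no'] * 5

counts = Counter(accident)
m = len(accident)

# Prior probability = class count / total number of data points
priors = {label: count / m for label, count in counts.items()}
print(priors)  # {'yes': 0.5, 'no': 0.5}
```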

3.2 Class conditional probability computation:

The dataset is split based on the target labels (yes/no) first. Since there are 2 classes for the target variable we get 2 sub-tables. If the target variable had 3 classes we would get 3 sub-tables, one for each of the classes.

The following 2 tables show the dataset for target class/label ‘no’ and ‘yes’ respectively:

Table 4: sub-table for target label ‘no’
Table 5: sub-table for target label ‘yes’

The class conditional probabilities 𝑃(𝑨|𝐵ₙₒ) and 𝑃(𝑨|𝐵ᵧₑₛ) can be computed using Table 4 and Table 5, as shown below:

Table 6: Class conditional probabilities for Weather condition
Table 7: Class conditional probabilities for Road condition
Table 8: Class conditional probabilities for Traffic condition
Table 9: Class conditional probabilities for Engine problem

While computing the class conditional probabilities is simple, they should be organized systematically because they need to be looked up during the prediction phase.
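A sketch of how such lookup tables could be built with pandas is shown below. The weather values are made up for illustration (the full Table 3 is not reproduced here); the point is the counting recipe: cross-tabulate a feature against the target and normalize within each target class.

```python
import pandas as pd

# Hypothetical rows standing in for Table 3 (illustrative values only)
df = pd.DataFrame({
    'weather':  ['rain', 'sunny', 'rain', 'snow', 'sunny',
                 'rain', 'sunny', 'snow', 'sunny', 'rain'],
    'accident': ['yes', 'no', 'yes', 'yes', 'no',
                 'no', 'no', 'yes', 'no', 'yes'],
})

# P(weather = value | accident = label):
# count (value, label) pairs, then divide by the counts of each label
cond_prob = pd.crosstab(df['weather'], df['accident'], normalize='columns')
print(cond_prob)
# Each column sums to 1; the (rain, yes) entry is N(rain & yes) / N(yes)
```

Repeating the same recipe for the road, traffic, and engine-problem features fills in the remaining class conditional tables.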

3.3 Predicting posterior probability:

Suppose we are now given a new feature vector:

Weather condition: rain

Road condition: good

Traffic condition: normal

Engine problem: no

The task is to predict whether an accident will happen.

The posterior probability for each of the target classes is computed using the equation for posterior probability in section 2.0.

Note that the calculation of the denominator is omitted, as explained in section 2.0. Plugging in values for the prior probabilities and the class conditional probabilities from Tables 6, 7, 8, and 9 above we get:

Since 𝑃(Accidentₙₒ|𝐴ₙₑ𝓌) > 𝑃( Accidentᵧₑₛ|𝐴ₙₑ𝓌) the prediction is Accident=‘no’.

The actual probabilities can be obtained by normalizing the two posterior scores so that they sum to 1.
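The prediction mechanics of this section can be written out in a short Python sketch. The class conditional values below are placeholders standing in for the entries of Tables 6–9; only the structure of the computation matters: prior times the product of the class conditionals, followed by normalization.

```python
# Placeholder class conditional probabilities for the new feature vector
# (rain, good road, normal traffic, no engine problem).
# The real values would be looked up in Tables 6-9.
cond = {
    'yes': {'weather=rain': 0.4, 'road=good': 0.4,
            'traffic=normal': 0.4, 'engine=no': 0.6},
    'no':  {'weather=rain': 0.2, 'road=good': 0.6,
            'traffic=normal': 0.6, 'engine=no': 0.8},
}
priors = {'yes': 0.5, 'no': 0.5}

# Unnormalized posterior score: prior * product of class conditionals
scores = {}
for label in priors:
    score = priors[label]
    for p in cond[label].values():
        score *= p
    scores[label] = score

# Normalize so the posteriors sum to 1
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}

prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```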

3.4 Treating Features with continuous data:

The dataset in this toy example has only categorical variables. What if the dataset had continuously varying features? For example, say we have a feature Temperature that recorded the temperature at the time of the accident. The class conditional probability for such a feature can no longer be computed with the counting method explained above. It is common to assume that the continuous feature variable is normally distributed. For a normal distribution, the probability that the random variable X lies between x and x+dx is given by f(x)dx, where

f(x) = 1/(𝜎√(2𝜋)) ∗ exp(−(x − 𝜇)²/(2𝜎²))

𝜇 and 𝜎 in the above equation are the mean and standard deviation for the given normal distribution.

The creation of sub-tables based on the target variable classes is similar to what has been explained above. After computing the sub-tables (Table 4 and Table 5), we compute and store the mean temperatures 𝜇ₜₑₘₚₑᵣₐₜᵤᵣₑ|Accidentₙₒ, 𝜇ₜₑₘₚₑᵣₐₜᵤᵣₑ|Accidentᵧₑₛ, and standard deviation 𝜎ₜₑₘₚₑᵣₐₜᵤᵣₑ|Accidentₙₒ and 𝜎ₜₑₘₚₑᵣₐₜᵤᵣₑ|Accidentᵧₑₛ for the Temperature feature variable from the 2 sub-tables.

During the prediction phase, given a new value for temperature tₙₑ𝓌, the class conditional probability is computed from the analytical form of the normal distribution shown above, using the mean and standard deviation stored for each target class, e.g. 𝑃(Temperature = tₙₑ𝓌 | Accidentᵧₑₛ) is f(tₙₑ𝓌) evaluated with 𝜇ₜₑₘₚₑᵣₐₜᵤᵣₑ|Accidentᵧₑₛ and 𝜎ₜₑₘₚₑᵣₐₜᵤᵣₑ|Accidentᵧₑₛ.
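A minimal sketch of this Gaussian class conditional computation is shown below; the per-class means and standard deviations are made-up numbers standing in for the statistics computed from the two sub-tables.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density used as the class conditional for a continuous feature."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Hypothetical per-class statistics for Temperature (illustrative numbers only)
mu_yes, sigma_yes = 5.0, 3.0    # mean/std from the 'yes' sub-table
mu_no,  sigma_no  = 15.0, 4.0   # mean/std from the 'no' sub-table

t_new = 8.0  # temperature value in the new feature vector
p_temp_given_yes = gaussian_pdf(t_new, mu_yes, sigma_yes)
p_temp_given_no  = gaussian_pdf(t_new, mu_no, sigma_no)
print(p_temp_given_yes, p_temp_given_no)
```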

3.5 Treating incomplete datasets:

In the above example dataset, sufficient data is present to compute all the class conditional probabilities. What if a particular feature label is absent for a target class in the training dataset? For example, say in our example 𝑃(Weather conditionᵣₐᵢₙ|Accidentᵧₑₛ) = 0. This would result in the posterior probability 𝑃(Accidentᵧₑₛ| 𝐴ₙₑ𝓌) = 0 even when the other class conditional probabilities are non-zero. The Laplace correction is used to handle this situation. The general form of the class conditional probability with the Laplace correction is:

P(𝐴ₛ|𝐵ⱼ) = (n + 𝛼)/(𝑁Bⱼ + 𝛼∗NFeature)

For the example discussed in section 3.0, n = 1 is the feature class count (i.e. rain) for Accident = yes. 𝑁Accidentᵧₑₛ= 5.

NFeature = 4 (there are 4 features in the dataset).

α is the Laplace correction factor.

The Laplace correction is applied to all class conditional probability computations. It can be seen from the above equation that even if n = 0 for a particular feature class, the class conditional probability remains non-zero.

The form of the Laplace correction to the prior probability is:

P(𝐵ⱼ) = (𝑁Bⱼ + 𝛼)/(m + 𝛼∗k)

Again for the example discussed in section 3.0

m = 10 (the total number of data points).

k = 2 (the number of target variable classes).

To understand the influence of the correction factor α, a hypothetical case is considered where: n = 1, NFeature = 4, 𝑁Accidentᵧₑₛ= 60, m = 100 and k = 2.

Figure 1
Figure 2

Figures 1 and 2 show the variation of class conditional probability and prior probability respectively for the hypothetical case described above. From Figure 1, we can see that as α increases the class conditional probability tends towards a value of 1/4 = 0.25. Similarly from Figure 2, we can see that as α increases the prior probability tends towards a value of 1/2 = 0.5.

These limiting values can also be seen from the Laplace correction equations by letting 𝛼 tend towards infinity: as 𝛼 → ∞, (n + 𝛼)/(𝑁Bⱼ + 𝛼∗NFeature) → 1/NFeature and (𝑁Bⱼ + 𝛼)/(m + 𝛼∗k) → 1/k.

Thus as the value of the correction factor increases, the class conditional probabilities tend towards a uniform probability distribution, with each feature label having the same class conditional probability of 1/NFeature.

Similarly, the prior probability for each target class tends towards a uniform probability with each class having the same prior probability of 1/k. The value α is typically selected as 1 for most problems.
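The Laplace-corrected formulas are easy to experiment with in code. The sketch below uses the hypothetical case above (n = 1, NFeature = 4, 𝑁Accidentᵧₑₛ = 60, m = 100, k = 2) and prints the smoothed probabilities for a few values of 𝛼:

```python
def smoothed_conditional(n, n_class, n_feature, alpha):
    """Class conditional probability with the Laplace correction."""
    return (n + alpha) / (n_class + alpha * n_feature)

def smoothed_prior(n_class, m, k, alpha):
    """Prior probability with the Laplace correction."""
    return (n_class + alpha) / (m + alpha * k)

# Hypothetical case from the text: n=1, NFeature=4, N_yes=60, m=100, k=2
for alpha in [0, 1, 10, 100, 10_000]:
    cond = smoothed_conditional(n=1, n_class=60, n_feature=4, alpha=alpha)
    prior = smoothed_prior(n_class=60, m=100, k=2, alpha=alpha)
    print(f"alpha={alpha:>6}: conditional={cond:.4f}, prior={prior:.4f}")
# As alpha grows, the conditional tends to 1/4 = 0.25 and the prior to 1/2 = 0.5
```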

4.0 Naïve Bayes using Scikit Learn:

The naïve_bayes module in sklearn supports different versions of Naïve Bayes classification, such as Gaussian Naïve Bayes (discussed in section 3.4), Multinomial Naïve Bayes for discrete features, and other variants.

The iris dataset is used in this section to illustrate the usage of the Gaussian Naive Bayes classifier that is available in Scikit learn. The dataset can be found here: https://www.kaggle.com/uciml/iris?select=Iris.csv

The iris dataset is a tiny dataset consisting of 4 continuous features that describe different characteristics of the Iris flower family. There are 3 target classes referring to the three Iris flower species. The aim is to correctly predict the flower species for a new feature vector.

For this simple dataset, the Gaussian Naive Bayes classifier achieves an accuracy score of 0.96 in predicting the flower species.
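A minimal sketch of this workflow is shown below, using the copy of the iris dataset that ships with sklearn rather than the Kaggle CSV; the exact accuracy score depends on the train/test split and random seed.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the iris dataset (4 continuous features, 3 species)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit the Gaussian Naive Bayes classifier and score it on the held-out data
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
```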

4.1 Handling mixed features:

If a dataset has both continuous and categorical features, a simple approach with sklearn is to transform the continuous variables into categorical variables using binning. For example, a feature such as Temperature could be converted to a categorical variable by defining temperature ranges for the cool, mild, and hot categories. With all features converted to categorical features, the MultinomialNB algorithm in the naïve_bayes module in sklearn can be used to fit and predict, as sketched below.
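One possible sketch of this approach is shown below; the data values and bin edges are made up for illustration, and pd.cut plus pd.get_dummies is just one convenient way to do the binning and encoding.

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mixed dataset: one continuous and one categorical feature
df = pd.DataFrame({
    'temperature': [2, 8, 15, 22, 30, 5, 18, 27, 12, 33],
    'road':        ['good', 'bad', 'good', 'good', 'bad',
                    'bad', 'good', 'bad', 'good', 'bad'],
    'accident':    ['yes', 'yes', 'no', 'no', 'yes',
                    'yes', 'no', 'no', 'no', 'yes'],
})

# Bin the continuous Temperature feature into cool / mild / hot categories
df['temp_band'] = pd.cut(
    df['temperature'],
    bins=[-float('inf'), 10, 25, float('inf')],
    labels=['cool', 'mild', 'hot'],
)

# One-hot encode all categorical features so MultinomialNB sees count-like input
X = pd.get_dummies(df[['temp_band', 'road']])
y = df['accident']

model = MultinomialNB(alpha=1.0)  # alpha is the Laplace smoothing parameter
model.fit(X, y)
print(model.predict(X[:3]))
```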

5.0 Conclusion:

The Naïve Bayes classifier is a simple and versatile classifier. Since the computations are cheap, it works very efficiently on large datasets. In terms of predictive performance, it often compares well with more sophisticated classifiers. One of its main drawbacks is the inherent assumption of independence between features; in practice, features in real datasets are rarely independent. Despite this drawback, the Naïve Bayes classifier is very useful for gaining a preliminary understanding of the data.

6.0 References:

  1. Basic Business Statistics: Concepts and Applications, M.L. Berenson, D.M. Levine, T.C. Krehbiel.
  2. https://scikit-learn.org/stable/modules/naive_bayes.html
