
Naive Bayes clearly explained

Solving the iris dataset with a Gaussian approach in scikit-learn.

In this post, we’ll delve into a particular kind of classifier called naive Bayes classifiers. These are methods that rely on Bayes’ theorem and the naive assumption that every pair of features is conditionally independent given a class label. If this doesn’t make sense to you, keep reading!

As a toy example, we’ll use the well-known iris dataset (CC BY 4.0 license) and a specific kind of naive Bayes classifier called Gaussian Naive Bayes classifier. Remember that the iris dataset is composed of 4 numerical features and the target can be any of 3 types of iris flower (setosa, versicolor, virginica).

We’ll decompose the method into the following steps:

All images by author.
  1. Reviewing Bayes’ theorem: this theorem provides the mathematical formula that allows us to estimate the probability that a given sample belongs to any class.
  2. We can create a classifier, a tool that returns a predicted class for an input sample, by comparing the probability that this sample belongs to a class, for all classes.
  3. Using the chain rule and the conditional independence hypothesis, we can simplify the probability formula.
  4. Then to be able to compute the probabilities, we use another assumption: that the feature distributions are Gaussian.
  5. Using a training set, we can estimate the parameters of those Gaussian distributions.
  6. Finally, we have all the tools we need to predict a class for a new sample.

I have plenty of new posts like this one incoming; remember to subscribe!


Step 1: Bayes’ theorem

Bayes’ theorem is a probability theorem that states the following:
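
P(A|B) = P(B|A) · P(A) / P(B)

where: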

  • P(A|B) is the conditional probability that A is true (or A happens) given (or knowing) that B is true (or B happened) – also called the posterior probability of A given B (posterior: we updated the probability that A is true after we know B is true).
  • P(B|A) is the conditional probability that B is true knowing A is true – also called the likelihood.
  • P(A) and P(B) are the probabilities of A and B being true without knowing anything else – also called the prior probability (of A) and the marginal probability (of B).

In other words: the posterior probability equals the likelihood times the prior, divided by the marginal probability.

Let’s just write the same equation using different letters:
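
p(y=y_i | x) = p(x | y=y_i) · p(y=y_i) / p(x)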

with:

  • Y = (y=y_i), the event that the target y is of class y_i
  • X = (x=(x_1, x_2, …, x_n)), the event that the feature vector x is equal to the considered sample (x_1, x_2, …, x_n)

Hence, this formula lets us compute the probability that a sample x belongs to class y_i, provided we know each of p(y=y_i), p(x|y=y_i), and p(x).

Step 2: Using Bayes’ theorem to predict a class

As we’ll see in greater detail, naive Bayes classifiers work simply by computing the probability p(y=y_i | x) for all known classes y_i and returning the class with the highest probability.

Say we are working with the iris dataset, where a flower can belong to one of 3 classes (y_1=setosa, y_2=versicolor, y_3=virginica). For a sample x=(x_1, x_2, x_3, x_4), we must compute and compare:
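
p(y=setosa | x) = p(x | y=setosa) · p(y=setosa) / p(x)
p(y=versicolor | x) = p(x | y=versicolor) · p(y=versicolor) / p(x)
p(y=virginica | x) = p(x | y=virginica) · p(y=virginica) / p(x)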

Step 3: Simplifying the probability formula

In this step, we’ll first remove the p(x) term, then compute the class priors. Finally, we’ll simplify the remaining likelihood terms with the conditional independence hypothesis.

Getting rid of p(x)

First, we can notice that the denominator is the same for all classes: p(x). Since this is the same for all, and we are only interested in comparing the conditional probability for each class, we can discard this term.

Another way to put this is in the context of a 2-class problem: to compare the conditional probabilities for a given sample x, we can compute the ratio between the 2 possible classes:
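
p(y=y_1 | x) / p(y=y_2 | x) = [ p(x | y=y_1) · p(y=y_1) ] / [ p(x | y=y_2) · p(y=y_2) ]

(the p(x) terms cancel out)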

then the prediction rule is simply:
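
predict y_1 if p(y=y_1 | x) / p(y=y_2 | x) > 1, otherwise predict y_2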

Since we can discard the denominator, our iris classification problem becomes to compute and compare:
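
p(x | y=setosa) · p(y=setosa)
p(x | y=versicolor) · p(y=versicolor)
p(x | y=virginica) · p(y=virginica)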

Computing the classes’ priors p(y_i)

The simplest way to set the prior p(y=y_i) for each class is to assume equiprobable classes: with 3 classes, p(y_1)=p(y_2)=p(y_3)=1/3.

But this approach doesn’t bring any value to our problem since this term would disappear (or cancel out) when comparing the conditional probabilities.

Instead, we can use the data in the training set to estimate the probabilities using:
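
p(y=y_i) = (number of training samples of class y_i) / (total number of training samples)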

For the iris problem:
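
p(y=setosa) = p(y=versicolor) = p(y=virginica) = 50/150 = 1/3

since the full iris dataset contains 50 samples of each of the 3 classes, and a stratified train/test split preserves these proportions.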

Conditional probability and naiveness to the rescue

We finally need to compute the likelihood terms p(x|y=y_i), for all y_i.

Here x is a multivariate variable, and we can use the chain rule (with no particular assumption) recursively to write:
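
p(x | y=y_i) = p(x_1, x_2, …, x_n | y=y_i)
             = p(x_1 | x_2, …, x_n, y=y_i) · p(x_2 | x_3, …, x_n, y=y_i) · … · p(x_n | y=y_i)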

Then, using the conditional independence hypothesis, each of the terms in the product is simplified to:
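
p(x_j | x_(j+1), …, x_n, y=y_i) = p(x_j | y=y_i)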

So remember that using only the conditional independence assumption, we get:
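
p(x | y=y_i) = p(x_1 | y=y_i) · p(x_2 | y=y_i) · … · p(x_n | y=y_i)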

Step 4: Computing the likelihoods with the Gaussian assumption

So, in order to compare the probabilities that a given sample x belongs to each of the 3 classes, we need to compute all the terms p(x_j|y_i), so we can then compute the product:
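
p(x | y=y_i) · p(y=y_i) = p(y=y_i) · p(x_1 | y=y_i) · p(x_2 | y=y_i) · … · p(x_n | y=y_i)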

Hence the question is: how do we compute, for all features j and all classes i, the terms p(x_j | y=y_i)?

These terms can be translated into plain English as: what is the probability that feature j of a sample is equal to x_j, given that the sample is of class y_i?

For example, p(sepal length=1.2|setosa) is the probability that a sepal length of a setosa sample is equal to 1.2.

In order to compute such probabilities numerically, we are going to rely on an additional hypothesis: we’ll assume that, for each class, each feature follows a Gaussian distribution.

For example, in our iris classification problem with 4 features, we’ll assume that each feature for each class follows a Gaussian distribution. That is, we have 3 Gaussian distributions just for the sepal length feature.

And similarly for all 3 other features.
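
Concretely, under this assumption each likelihood term is given by the Gaussian probability density function:

p(x_j | y=y_i) = 1 / (σ_ij · √(2π)) · exp( -(x_j - μ_ij)² / (2 · σ_ij²) )

where μ_ij and σ_ij are the (unknown, to be estimated) mean and standard deviation of feature j for class y_i.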

Step 5: Estimating the Gaussian parameters

Roughly speaking, the process is two-fold, using the training set to "learn" the distributions:

  1. Split the training set by class into 3 training subsets (the setosa training samples, the versicolor training samples, and the virginica training samples).
  2. Estimate the Gaussian distribution parameters: for each of those subsets, and for each feature, compute the mean and standard deviation. This is done to estimate the underlying true mean and standard deviation of the assumed Gaussian distribution of each feature, for each class.
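
As a minimal sketch of this step (my own illustration, independent of the full pipeline shown further down, using pandas directly on the raw iris data), the per-class, per-feature means and standard deviations could be computed like this:

import pandas as pd
from sklearn.datasets import load_iris

# Load the iris features and target as pandas objects
X, y = load_iris(return_X_y=True, as_frame=True)
df = pd.concat([X, y], axis=1)

# One (mean, std) pair per feature and per class: these are the estimated
# parameters of the assumed Gaussian distributions.
# Note: pandas' std() uses ddof=1 by default, so the values can differ very
# slightly from the maximum-likelihood estimates scikit-learn uses internally.
params = df.groupby("target").agg(["mean", "std"])
print(params)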

Step 6: Predicting the class for a new sample

We finally have everything we need to predict the class for a new sample x. Indeed, we can compute the conditional probabilities using the sample’s values (x_1, x_2, …) and the assumed probability density function. In other words, in our context of Gaussian Naive Bayes, we compute:
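
p(x_j | y=y_i) = 1 / (σ_ij · √(2π)) · exp( -(x_j - μ_ij)² / (2 · σ_ij²) )

where μ_ij and σ_ij are now the mean and standard deviation estimated in step 5 for feature j and class y_i, and x_j is the j-th feature value of the new sample.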

And since we already know the prior probabilities, we can compute, compare, and select the highest probability: we just made a prediction using a Gaussian Naive Bayes classifier.


Let’s python!

Let’s use Python to try to visualize the process seen above.

First let’s just load the dataset, split it into a training and test set, and plot the training data with seaborn:

%matplotlib qt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
target_map = {0:"setosa",1:"versicolor",2:"virginica"}
y = y.map(target_map).astype('category')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=3, stratify=y)

df_train = pd.concat([X_train, y_train], axis=1)
g = sns.pairplot(df_train, hue='target')

We can then visualize the Gaussian distribution fitted for each class and each feature. This way we can check that the distributions actually look Gaussian, and estimate their means and variances. For now we do this ourselves, but with scikit-learn’s Gaussian Naive Bayes model, the means and standard deviations are computed under the hood and stored as fitted attributes of the model.

df_melt = df_train.melt(
    id_vars=["target"], value_vars=['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)'],
    value_name='value', var_name='feature')

g = sns.FacetGrid(df_melt, col="feature", hue="target", sharex=False, sharey=False)
g.map(sns.kdeplot, 'value', linestyle="--", legend=True)

from scipy.stats import norm

def fit_gaussian(data, color, **kwargs):
    mu, std = norm.fit(data)
    xmin, xmax = data.min(), data.max()
    x = np.linspace(xmin, xmax, 100)
    p = norm.pdf(x, mu, std)
    plt.plot(x, p, 'k', linewidth=2, color=color)
    x_max = x[np.argmax(p)]
    y_max = norm.pdf(x_max, mu, std)
    plt.text(x_max, y_max, f"({mu:.1f};{std**2:.1f})", color=color, size=8, ha='center', va='bottom')

g.map(fit_gaussian, 'value')

for ax in g.axes.flatten():
    ax.set_xlabel('')
    ax.set_ylabel('')
g.add_legend()

Let’s now use scikit-learn’s Gaussian Naive Bayes model to fit and predict. First, we simply import and fit the model. We can then review:

  • the classes’ prior in .class_prior_
  • the variance for each (feature/class) in .var_
  • the mean for each (feature/class) in .theta_

Remember that in scikit-learn, learned parameters are stored as attributes that end with an underscore (_).

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
print(gnb.class_prior_)
print(gnb.var_)
print(gnb.theta_)
[0.33333333 0.33333333 0.33333333]
[[0.12406498 0.14367347 0.03012912 0.01106206]
 [0.26634736 0.09020408 0.21961683 0.03754269]
 [0.37859226 0.1033736  0.27714286 0.07541858]]
[[5.00408163 3.42857143 1.46122449 0.24693878]
 [5.93469388 2.75714286 4.25510204 1.32040816]
 [6.56530612 2.97755102 5.52857143 2.02653061]]

You can compare those values with the ones displayed on the fitted distributions above.
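
As a quick sanity check (a small sketch of my own, not required for the rest of the post, reusing the gnb model and the X_test split from above), we can recompute the posterior probabilities for one test sample by hand from these fitted attributes and compare them with the model’s predict_proba output:

from scipy.stats import norm

# Take one test sample and compute, for each class,
# prior * product over features of the per-feature Gaussian densities
x_new = X_test.iloc[0].to_numpy()
joint = gnb.class_prior_ * np.prod(
    norm.pdf(x_new, loc=gnb.theta_, scale=np.sqrt(gnb.var_)), axis=1
)
posterior = joint / joint.sum()  # normalize so the 3 probabilities sum to 1

print(posterior)                            # manual computation
print(gnb.predict_proba(X_test.iloc[[0]]))  # scikit-learn's computation

The two lines should print (almost) identical values, since this is exactly the computation described in step 6.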

For completeness, let’s estimate the performance of the Gaussian Naive Bayes model using cross-validation (here with the balanced accuracy score). Note that the dataset doesn’t need much preprocessing, since the features are already numerical, contain no missing values or outliers, and look roughly Gaussian.

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate, StratifiedKFold

X, y = load_iris(return_X_y=True, as_frame=True)

cv_results = cross_validate(
    GaussianNB(),
    X,
    y,
    scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=10, shuffle=True),
)

print(cv_results['test_score'])
[0.73333333 1.         0.93333333 0.93333333 1.         1.
 1.         0.93333333 1.         1.        ]

Those are pretty good performances for such a simple pipeline! Note that Naive Bayes models expose very few hyperparameters. For other datasets, remember to add a preprocessing step if necessary.

Wrapup

Please give this post a clap if you liked what you read! =)

Here’s what you should remember from this post:

  • Naive Bayes classifiers use Bayes’ theorem to lay the mathematical groundwork, and the naive assumption simplifies it so we can compute the probabilities numerically.
  • For any new sample, the model computes the probability that it belongs to all possible classes and returns the class with the highest probability (this is known as the Maximum A Posteriori decision rule).
  • A particular flavor of those models is the Gaussian Naive Bayes model, where we assume that each feature/class distribution is Gaussian. This allows us to easily learn and use those distributions for later prediction.

Bonus

In this post, we used the Gaussian version of the Naive Bayes classifier, but others exist, like the multinomial, Bernoulli, and categorical Naive Bayes classifiers. They mostly differ in the kind of features they model (continuous numerical, discrete counts, binary, or categorical). Head to scikit-learn’s documentation for more: https://scikit-learn.org/stable/modules/naive_bayes.html.


Check out my other posts if you liked this one:

Sklearn tutorial

Scientific/numerical python

Fourier-transforms for time-series

Data science and Machine Learning

