![Classifying Penguins [Photo by Martin Wettstein on Unsplash]](https://towardsdatascience.com/wp-content/uploads/2021/12/10FCFeF4X_xZQkbP0w3IAng-scaled.jpeg)
Naive Bayes and Logistic Regression are both quite commonly used classifiers, and in this post we will try to find and understand the connection between them. We will also go through an example using the Palmer Penguin dataset, which is available under the CC-0 license. What can you expect to learn from this post?
- Basics of Probability and Likelihood.
- How the Naive Bayes Classifier works.
- Discriminative and Generative Classifiers.
- Connection Between Naive Bayes and Logistic Regression.
All the formulas used here are from my notebook and the link is given under the References section along with other useful references. Let’s begin!
Probability & Likelihood:
Each probability distribution has a probability density function (PDF) if it is continuous, e.g., the Gaussian distribution (or a probability mass function, PMF, if it is discrete, e.g., the binomial distribution), which indicates how probable it is for a sample (some point) to take a particular value. This function is usually denoted by P(y|θ), where y is the value of the sample and θ is the parameter that describes the PDF/PMF. When more than one sample is drawn independently of the others, we can write –
P(y_1, y_2, …, y_n | θ) = ∏_{i=1}^{n} P(y_i | θ)
We use the PDF/PMF in calculations when we know the distribution (and the corresponding parameter θ) and want to deduce the y's. Here we think of θ as fixed (known) across samples and we want to deduce the different y's. The likelihood function is the same quantity but with a twist: here the y's are known and θ is the variable we want to determine. At this point, we can briefly introduce the idea of Maximum Likelihood Estimation (MLE).
Maximum Likelihood Estimation: MLE is closely related to modelling the data. If we observe data points y_1, y_2, …, y_n and assume that they come from a distribution parametrized by θ, then the likelihood is given by L(y_1, y_2, …, y_n | θ); the task of MLE is to estimate the θ for which this likelihood is maximal. For independent observations, we can write the expression for the best θ as below –
θ_MLE = argmax_θ ∏_{i=1}^{n} P(y_i | θ) = argmax_θ ∑_{i=1}^{n} ln P(y_i | θ)
The convention in optimization is to minimize a function; thus maximizing the likelihood boils down to minimizing the negative log-likelihood.
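As a quick illustration (a minimal sketch, not taken from the notebook), we can check numerically that for i.i.d. Gaussian samples with a known σ, the μ that minimizes the negative log-likelihood is essentially the sample mean –
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(loc=5.0, scale=2.0, size=1000)   # "observed" samples
sigma = 2.0                                     # assume sigma is known

# negative log-likelihood of the samples evaluated on a grid of candidate mu values
mu_grid = np.linspace(3.0, 7.0, 2001)
neg_log_lik = [0.5 * np.sum((y - mu) ** 2) / sigma ** 2
               + len(y) * np.log(sigma * np.sqrt(2 * np.pi)) for mu in mu_grid]
print(mu_grid[np.argmin(neg_log_lik)], y.mean())   # both are close to 5.0
These are the concepts we will be using later; now let's move on to the Naive Bayes Classifier.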
Naive Bayes Classifier:
The Naive Bayes Classifier is an example of a generative classifier, while Logistic Regression is an example of a discriminative classifier. But what do we mean by generative and discriminative?
Discriminative Classifier: In general, a classification problem can be thought of as modelling the conditional probability p(C_k | x) of a class label C_k given an input vector x. In the discriminative model, we assume some functional form for p(C_k | x) and estimate its parameters directly from the training data.
Generative Classifier: In the generative model we estimate the parameters of p(x | C_k), i.e., the probability distribution of the inputs for each class, along with the class priors p(C_k). Both are then used in Bayes' theorem to calculate p(C_k | x).
We have denoted the class labels by C_k; let's write Bayes' theorem in this notation –
p(C_k | x) = p(x | C_k) p(C_k) / p(x)
p(x) can be thought of as a normalizing constant: p(x) = ∑_k p(x | C_k) p(C_k), where the sum runs over the class labels k.
Let's consider a generalized scenario where our data have d features and there are K classes; then the equation above can be written as –
p(Y = c_k | X_1, …, X_d) = p(X_1, …, X_d | Y = c_k) p(Y = c_k) / p(X_1, …, X_d)
The class-conditional probability term can be greatly simplified by assuming 'naively' that the data features X_i are conditionally independent of each other given the class Y. We can now rewrite the equation above as follows –
p(Y = c_k | X_1, …, X_d) = p(Y = c_k) ∏_{i=1}^{d} p(X_i | Y = c_k) / p(X_1, …, X_d)
This is the fundamental equation for the Naive Bayes Classifier. Once the class prior distribution and the class-conditional densities are estimated, the Naive Bayes classifier can make a class prediction ŷ for a new data input –
ŷ = argmax_{k} p(Y = c_k) ∏_{i=1}^{d} p(X_i | Y = c_k)
Since the denominator p(X_1, X_2, …, X_d) is constant for a given input, we can use Maximum A Posteriori (MAP) estimation to estimate p(Y = c_k) and p(X_i | Y = c_k); the former is then simply the relative frequency of each class in the training set. This will become clearer when we get to the coding part later.
The different naive Bayes classifiers differ mainly in the assumptions they make about the distribution p(X_i | Y = c_k). This choice matters for the estimation: for example, if we assume the class conditionals are univariate Gaussians, the parameters to be determined are the means and standard deviations. The number of parameters to estimate grows with the number of features and classes; with d features and K classes there are d·K means and up to d·K standard deviations, plus K−1 independent prior probabilities.
Connecting Naive Bayes and Logistic Regression:
Instead of the generalized case above with K classes, we now consider just 2 classes, i.e., Y is boolean (0/1, True/False). Since Logistic Regression (LogReg) is a discriminative algorithm, it starts by assuming a functional form for P(Y|X) –
P(Y=1|X) = 1 / (1 + exp(w_0 + ∑_{i=1}^{d} w_i X_i))
As expected, the functional form of the conditional class distribution for Logistic Regression is basically the sigmoid function. In a separate post, I discussed in detail how you can reach Logistic Regression starting from Linear Regression. The parameters (weights w) are learned from the training data. We can expand Eq. 7 for both values of the boolean class label Y –
P(Y=1|X) = 1 / (1 + exp(w_0 + ∑_i w_i X_i)) ,  P(Y=0|X) = exp(w_0 + ∑_i w_i X_i) / (1 + exp(w_0 + ∑_i w_i X_i))
To assign Y=0 for a given X, we then impose a simple condition as below –
P(Y=0|X) / P(Y=1|X) > 1 ⟹ exp(w_0 + ∑_i w_i X_i) > 1 ⟹ w_0 + ∑_i w_i X_i > 0
We will now use the Gaussian Naive Bayes (GNB) classifier to recover the form of P(Y|X) and compare it with the Logistic Regression result.
Gaussian Naive Bayes as Binary Classifier:
In the Gaussian Naive Bayes (GNB) classifier, we will assume that the class-conditional distributions p(X_i | Y=c_k) are univariate Gaussians. Let's write the assumptions explicitly –
- Y has a boolean form (i.e., 0/1, True/False) and is governed by a Bernoulli distribution.
- Since it is a GNB, for the class conditionals P(X_i | Y=c_k) we assume univariate Gaussians.
- For all i and j≠i, X_i and X_j are conditionally independent given Y.
Let’s write P(Y=1|X) using Bayes’ Rule –
P(Y=1|X) = P(Y=1) P(X|Y=1) / [ P(Y=1) P(X|Y=1) + P(Y=0) P(X|Y=0) ]
We can rewrite this equation by introducing the exponential and the natural logarithm as below –
P(Y=1|X) = 1 / ( 1 + exp( ln [ P(Y=0) P(X|Y=0) / ( P(Y=1) P(X|Y=1) ) ] ) )
Let's denote the class prior P(Y=1) = π, so that P(Y=0) = 1−π, and also use the conditional independence of the features for a given class label to rewrite the equation above as below –
P(Y=1|X) = 1 / ( 1 + exp( ln((1−π)/π) + ∑_i ln [ P(X_i|Y=0) / P(X_i|Y=1) ] ) )
For the class conditionals P(X_i | Y=c_k) we assume univariate Gaussians N(μ_ik, σ_i), i.e., the standard deviations are independent of the class label Y. We will use this to simplify the sum in the denominator of the equation above.
∑_i ln [ P(X_i|Y=0) / P(X_i|Y=1) ] = ∑_i [ ((μ_i0 − μ_i1)/σ_i²) X_i + (μ_i1² − μ_i0²)/(2σ_i²) ]
We can substitute this expression back into the previous equation (Eq. 11) to write the posterior distribution for Y=1 in a more compact form –
P(Y=1|X) = 1 / ( 1 + exp( w_0 + ∑_i w_i X_i ) )
Writing P(Y=1|X) in this form lets us directly compare it with Eq. 8, i.e., the functional form of the conditional class distribution for Logistic Regression, and we can read off the parameters (weights and bias) in terms of the Gaussian means and standard deviations –
w_i = (μ_i0 − μ_i1) / σ_i² ,  w_0 = ln((1−π)/π) + ∑_i (μ_i1² − μ_i0²) / (2σ_i²)
Here we have arrived at the functional form of Logistic Regression starting from the generative Gaussian Naive Bayes model, and we have also found the relation between the parameters of the two models for a given binary classification problem.
Implementing Gaussian Naive Bayes, Step by Step:
The best way to understand the steps above is to implement them, and for this purpose I will be using the Palmer Penguin dataset, which is very similar to the Iris dataset. Let's start by importing the necessary libraries and loading the dataset.

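A minimal sketch of this setup is shown below; as an assumption, the data is loaded through seaborn's bundled copy of the Palmer Penguins dataset into a DataFrame named penguin (the original notebook may load it from the palmerpenguins package or a CSV instead) –
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

# load the Palmer Penguins data into a DataFrame called `penguin`
penguin = sns.load_dataset('penguins')
print(penguin.head())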
The dataset contains 3 different species of penguins, 'Adelie', 'Gentoo' and 'Chinstrap', along with features such as bill length, flipper length, etc. We can visualize the class distribution over two of the features, 'bill length' and 'bill depth', with a few lines of code –
penguin_nonan_df = penguin.dropna(how='any', axis=0, inplace=False)
sns_fgrid=sns.FacetGrid(penguin_nonan_df, hue="species", height=6).map(plt.scatter, "bill_length_mm", "bill_depth_mm").add_legend()
plt.xlabel('Bill Length (mm)', fontsize=12)
plt.ylabel('Bill Depth (mm)', fontsize=12)

Instead of working with all the available features, we will select only these two features to make things simple.
penguin_nonan_selected_df = penguin_nonan_df[['species', 'bill_length_mm', 'bill_depth_mm']]
X=penguin_nonan_selected_df.drop(['species'], axis=1)
Y=penguin_nonan_selected_df['species']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=30, stratify=Y)
print (X_train.shape, y_train.shape, X_test.shape)
>>> (266, 2) (266,) (67, 2)
X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()
To build a Naive Bayes classifier from scratch, we first need to define a prior distribution over the classes; in this case we count the number of samples in each class and divide by the total number of samples to get the class prior distribution. We can define it as below –
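The notebook's helper isn't reproduced verbatim here; a minimal sketch consistent with the printed output below, using relative class frequencies wrapped in a TensorFlow Probability Categorical distribution, could look like this –
def get_prior(y):
    """Class prior: relative frequency of each class, wrapped in a Categorical."""
    classes, counts = np.unique(y, return_counts=True)
    probs = counts / len(y)
    print(probs)                                  # matches the array printed below
    return tfd.Categorical(probs=tf.constant(probs))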
Let’s print the prior distribution based on the training samples –
prior = get_prior(y_train)
print (prior)
print ('check prior probs: ', prior.probs,)
>>> [0.43984962 0.20300752 0.35714286]
tfp.distributions.Categorical("Categorical", batch_shape=[], event_shape=[], dtype=int32)
>>> check prior probs: tf.Tensor([0.43984962 0.20300752 0.35714286], shape=(3,), dtype=float64)
After defining the prior distribution, we define the class-conditional distribution; as described before, for Gaussian NB we assume univariate Gaussian distributions, as shown in the equation below. As a reminder, for our problem we have 2 features and 3 classes.
P(X_i | Y = c_k) = (1 / √(2π σ_ik²)) exp( −(X_i − μ_ik)² / (2σ_ik²) )
To define the class conditionals we need the means and variances, and for normal distributions their maximum-likelihood estimates are straightforward to compute –
μ_ik = (1/N_k) ∑_{n: y_n = c_k} x_ni ,  σ_ik² = (1/N_k) ∑_{n: y_n = c_k} (x_ni − μ_ik)² , where N_k is the number of training samples in class c_k.
Let’s put them all together and define the class conditional function.
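The notebook's implementation isn't shown verbatim here either; a minimal sketch consistent with the printed shapes below, computing per-class MLE means and variances and packing them into a batched MultivariateNormalDiag, is –
def class_conditionals_MLE(x, y):
    """MLE of per-class means and variances; returns a MultivariateNormalDiag
    with batch shape = number of classes and event shape = number of features."""
    classes = np.unique(y)
    mu = [list(np.mean(x[y == c], axis=0)) for c in classes]
    sigma2 = [list(np.var(x[y == c], axis=0)) for c in classes]
    class_conditionals = tfd.MultivariateNormalDiag(
        loc=np.array(mu), scale_diag=np.sqrt(np.array(sigma2)))
    return class_conditionals, mu, sigma2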
Check the code block again to confirm that it indeed gives us back Eq. 15 using the means and variances from Eq. 16. We can also print the means and variances for the training data –
class_conditionals, mu, sigma2 = class_conditionals_MLE(X_train, y_train)
print (class_conditionals)
print ('check mu and variance: ', '\n')
print ('mu: ', mu, )
print ('sigma2: ', sigma2, )
# batch shape : 3 classes, event shape : 2 features
>>> tfp.distributions.MultivariateNormalDiag("MultivariateNormalDiag", batch_shape=[3], event_shape=[2], dtype=float64)
>>> check mu and variance:
>>> mu: [ListWrapper([39.017094017093996, 18.389743589743592]), ListWrapper([49.12592592592592, 18.479629629629624]), ListWrapper([48.063157894736854, 15.058947368421052])]
>>> sigma2: [[7.205861640733441, 1.5171597633136102], [10.038216735253773, 1.1042146776406032], [9.476642659279785, 0.9687357340720222]]
Finally, we make predictions on the test set: we go sample by sample and combine the prior probabilities with the class conditionals. Here we are building a function that implements the Naive Bayes posterior and the argmax prediction rule (Eq. 5 and Eq. 6) that we wrote near the beginning of the post.
def predict_class(prior_dist, class_cond_dist, x):
    """
    Use the prior distribution P(Y), the class-conditional distribution P(X|Y),
    and a test data-set with shape (batch_shape, 2) to predict class labels.
    """
    y = np.zeros((x.shape[0]), dtype=int)
    for i, test_point in enumerate(x):
        likelihood = tf.cast(class_cond_dist.prob(test_point), dtype=tf.float32)  # class_cond_dist.prob has dtype float64
        prior_prob = tf.cast(prior_dist.probs, dtype=tf.float32)
        numerator = likelihood * prior_prob
        denominator = tf.reduce_sum(numerator)
        P = tf.math.divide(numerator, denominator)  # posterior over the classes (Eq. 5)
        #print ('check posterior shape: ', P.shape)
        Y = tf.argmax(P)  # same as np.argmax [pick the most probable class] (Eq. 6)
        y[i] = int(Y)
    return y
We can predict the classes of the test data-points and compare them with the original labels –
predictions = predict_class(prior, class_conditionals, X_test)
Plotting the decision boundary (using a contour plot) for the test data points results in the figure below –

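One way to produce such a plot (a sketch, not the notebook's exact plotting code) is to evaluate the scratch classifier on a grid of bill-length/bill-depth values and draw filled contours –
# build a grid over the two features and predict the class at every grid point
xx, yy = np.meshgrid(np.linspace(X_test[:, 0].min() - 1, X_test[:, 0].max() + 1, 100),
                     np.linspace(X_test[:, 1].min() - 1, X_test[:, 1].max() + 1, 100))
grid = np.c_[xx.ravel(), yy.ravel()]
zz = predict_class(prior, class_conditionals, grid).reshape(xx.shape)  # slow point-by-point loop, but simple
plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X_test[:, 0], X_test[:, 1], c=predictions, edgecolor='k')
plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.show()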
All of the above tasks can be done using Sklearn and with only a few lines of code –
from sklearn.naive_bayes import GaussianNB
sklearn_GNB = GaussianNB()
sklearn_GNB.fit(X_train, y_train)
predictions_sklearn = sklearn_GNB.predict(X_test)
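As an optional sanity check (not part of the original snippet), we can also compare the test accuracies of the two classifiers; the scratch classifier returns integer labels, so we map the string labels first –
from sklearn.metrics import accuracy_score
class_to_int = {'Adelie': 0, 'Chinstrap': 1, 'Gentoo': 2}   # same ordering as np.unique
y_test_int = np.vectorize(class_to_int.get)(y_test)
print('scratch GNB accuracy:', accuracy_score(y_test_int, predictions))
print('sklearn GNB accuracy:', accuracy_score(y_test, predictions_sklearn))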
It's always good to know how these basic algorithms work behind the scenes. We can compare the fitted parameters, which are very close to the parameters we obtained before –
print ('variance:', sklearn_GNB.var_)
print ('mean: ', sklearn_GNB.theta_)
>>> variance: [[ 7.20586167 1.51715979]
[10.03821677 1.10421471]
[ 9.47664269 0.96873576]]
mean: [[39.01709402 18.38974359]
[49.12592593 18.47962963]
[48.06315789 15.05894737]]
Let's plot the decision regions for a comparison with our hand-coded GNB classifier –

They look very comparable, as expected. Let's move on to explicitly connecting the GNB and Logistic Regression classifiers for a binary classification problem.
Logistic Regression Parameters from GNB:
As discussed before, to connect Naive Bayes and Logistic Regression we consider binary classification. Since there are 3 classes in the Penguin dataset, we first turn the problem into a one-vs-rest classification and then determine the Logistic Regression parameters. Also, as in the derivation of GNB as a binary classifier, we will use the same σ_i for the two classes (no k dependence). So here we need to find 6 parameters: 4 for the μ_ik and 2 for the σ_i. This rules out reusing the class-conditional function from before, because there the variance estimates use the class-dependent means μ_ik, so the resulting standard deviations depend on the class as well as the feature. Instead, we have to write a function that uses gradient-descent-type optimization to learn the class-independent standard deviations. This is the main coding task.
Before all of this, we need to binarize the labels; to do that we relabel the samples with label 2 (Gentoo) as label 1, so that Chinstrap and Gentoo together form class 1.
class_types_dict = {'Adelie':0, 'Chinstrap':1, 'Gentoo':2}
y_train_cat = np.vectorize(class_types_dict.get)(y_train)
y_train_cat_binary = y_train_cat
y_train_cat_binary[np.where(y_train_cat_binary == 2)] = 1
# Gentoo (label 2) is merged with Chinstrap (label 1)
y_test_cat = np.vectorize(class_types_dict.get)(y_test)
y_test_cat_binary = np.array(y_test_cat)
y_test_cat_binary[np.where(y_test_cat_binary == 2)] = 1
print ('check shapes: ', y_train_cat_binary.shape, y_test_cat_binary.shape, y_train_cat_binary.dtype, '\n', y_train_cat.shape)
>>> check shapes: (266,) (67,) int64
>>> (266,)
We can plot the data distribution and it looks as below –

We can use the prior distribution function to get the class label priors for these 2 classes
prior_binary = get_prior(y_train_cat_binary)
print (prior_binary.probs, print (type(prior_binary)))
>>> [0.43984962 0.56015038]
<class 'tensorflow_probability.python.distributions.categorical.Categorical'>
tf.Tensor([0.43984962 0.56015038], shape=(2,), dtype=float64) None
Compared with the previous prior distribution, we see that the Adelie class has the same prior (≈ 0.44) and, as expected, the Gentoo and Chinstrap priors are added together (≈ 0.20 + 0.36).
For learning the standard deviations we will minimize the negative log-likelihood using gradient descent given the data and labels.
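The notebook's training loop isn't reproduced here; a minimal sketch under these assumptions (an illustrative function name, Adam as the optimizer, and the shared scale parameterized through its logarithm to keep it positive) could look like this –
def learn_class_conditionals_binary(x, y, n_iters=2000, lr=0.05):
    """Fit per-class means and ONE class-independent scale per feature by
    minimizing the negative log-likelihood with gradient descent."""
    n_classes = len(np.unique(y))
    mu = tf.Variable(np.tile(x.mean(axis=0), (n_classes, 1)))   # init at the overall feature means
    log_scale = tf.Variable(np.log(x.std(axis=0)))              # one scale per feature, shared across classes
    x = tf.constant(x)
    y_onehot = tf.one_hot(y, n_classes, dtype=tf.float64)
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    for _ in range(n_iters):
        with tf.GradientTape() as tape:
            dist = tfd.MultivariateNormalDiag(loc=mu, scale_diag=tf.exp(log_scale))
            log_probs = dist.log_prob(x[:, tf.newaxis, :])       # shape (N, n_classes)
            nll = -tf.reduce_mean(tf.reduce_sum(y_onehot * log_probs, axis=-1))  # keep only observed-class terms
        grads = tape.gradient(nll, [mu, log_scale])
        opt.apply_gradients(zip(grads, [mu, log_scale]))
    return tfd.MultivariateNormalDiag(loc=mu, scale_diag=tf.exp(log_scale))

class_conditionals_binary = learn_class_conditionals_binary(X_train, y_train_cat_binary)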
Once the training finishes, we can retrieve the parameters as –
print(class_conditionals_binary.loc.numpy())
print (class_conditionals_binary.covariance().numpy())
Using the class conditionals we can plot the contours as below –
![Fig. 5: Contours of class conditional for 2 classes [0, 1]. (Source: Author's Notebook).](https://towardsdatascience.com/wp-content/uploads/2021/12/1C5Wq-XffTVUtJD_maoztOg.png)
We can also plot the decision boundary for the binary GNB classifier –

Once we have the means and the diagonal covariance matrix, we are ready to find the Logistic Regression parameters. The weights and bias are derived from the means and covariance via Eq. 14; let's rewrite it one more time –
w_i = (μ_i0 − μ_i1) / σ_i² ,  w_0 = ln((1−π)/π) + ∑_i (μ_i1² − μ_i0²) / (2σ_i²)
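A minimal sketch of this last step (variable names are illustrative; the means and the shared per-feature variances come from the learned class conditionals, and π is the prior P(Y=1)) –
mu_b = class_conditionals_binary.loc.numpy()                            # shape (2, 2): class x feature means
sigma2_b = np.diag(class_conditionals_binary.covariance().numpy()[0])   # shared per-feature variances
pi = prior_binary.probs.numpy()[1]                                      # prior P(Y=1)

w = (mu_b[0] - mu_b[1]) / sigma2_b                                      # weights w_i
w0 = np.log((1 - pi) / pi) + np.sum((mu_b[1] ** 2 - mu_b[0] ** 2) / (2 * sigma2_b))  # bias w_0
print('weights:', w, 'bias:', w0)
Note that these weights follow the convention P(Y=1|X) = 1/(1 + exp(w_0 + ∑ w_i X_i)) used in the derivation above.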
With this, we've come to the end of the post! Lots of important and fundamental concepts were covered, and I hope you find them helpful.
References:
[1] Penguin Dataset, Licensed under CC-0 and free to adapt.
[2] Machine Learning book by Tom Mitchell, Carnegie Mellon University; new chapters.
[3] Full Notebook used for this post. GitHub Link.
Stay strong !!