
Generative Classification Algorithms from Scratch

LDA, QDA, and Naive Bayes

Machine Learning from Scratch

Quadratic Discriminant Analysis Decision Boundaries. (Image by author. Source.)

Probabilistic generative algorithms – such as Naive Bayes, linear discriminant analysis, and quadratic discriminant analysis – have become popular tools for classification. These methods can be easily implemented in Python through scikit-learn or in R through packages such as MASS and e1071. But how do the methods actually work? This article derives them from scratch.
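As a quick point of reference before the derivations, the off-the-shelf scikit-learn versions can be fit in a few lines. This is a minimal sketch using the classic iris data; any feature matrix X and label vector y would do. The rest of the article is about what these calls do under the hood.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis(), GaussianNB()):
    model.fit(X, y)                                  # estimate priors and class-conditional densities
    print(type(model).__name__, model.score(X, y))   # in-sample accuracy
```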

(Note that this article is adapted from a chapter of my book Machine Learning from Scratch, which is available online for free).

Notation and Vocabulary

We’ll use the following conventions for this article.

  • Let v[i] be the ith entry in a vector v.
  • The target is the variable we are trying to model. The predictors are the variables we use to model the target.
  • The target is a scalar and it is written as y. The predictors are combined into a vector and they are written as x. We also assume that the first entry in x is a 1, corresponding to the intercept term.
  • p(x) refers to the distribution of x, but P(y = k) refers to the probability that y equals k.

1. Generative Classification

Most classification algorithms fall into one of two categories: discriminative and generative classifiers. Discriminative classifiers model the target variable, y, as a direct function of the predictor variables, x. For instance, logistic regression uses the following model, where 𝜷 is a length-D vector of coefficients and x is a length-D vector of predictors:
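In symbols, with the intercept folded into x as noted above, this model is

$$
P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-\boldsymbol{\beta}^\top \mathbf{x})}.
$$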

The logistic regression model

Generative classifiers instead view the predictors as being generated according to their class – i.e., they see x as a function of y, rather than the other way around. They then use Bayes’ rule to get from p(x|y = k) to P(y = k|x), as explained below.

Generative models can be broken down into the following three steps. Suppose we have a classification task with K unordered classes, represented by k = 1, 2, …, K.

  1. Estimate the prior probability that a target belongs to any given class. I.e., estimate P(y = k) for k = 1, 2, …, K.
  2. Estimate the density of the predictors conditional on the target belonging to each class. I.e., estimate p(x|y = k) for k = 1, 2, …, K.
  3. Calculate the posterior probability that the target belongs to any given class. I.e., calculate P(y = k|x), which is proportional to p(x|y = k)P(y = k) by Bayes’ rule.

We then classify an observation as belonging to the class k for which the following expression is greatest:
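That is, we choose

$$
\hat{y} = \underset{k}{\operatorname{arg\,max}} \; p(\mathbf{x} \mid y = k)\, P(y = k).
$$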

Note that we do not need p(x), which would be the denominator in the Bayes’ rule formula, since it would be equal across classes.

2. Model Structure

A generative classifier models two sources of randomness. First, we assume that out of the 𝐾 possible classes, each observation belongs to class 𝑘 independently with probability given by the kth entry in the vector 𝝅. I.e., 𝝅[k] gives P(y = k).

Second, we assume some distribution of x conditional on y. We typically assume x comes from the same family of distributions regardless of y, though its parameters depend on the class. For instance, we might assume
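say, a multivariate Normal whose parameters vary by class,

$$
\mathbf{x} \mid y = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
$$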

though we wouldn’t assume x is distributed MVN if y = 1 but distributed Multivariate-t otherwise. Note that it is possible, however, for the individual variables within the vector x to follow different distributions. For instance, we might assume the ith and jth variables in x to be distributed as follows
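One possible choice (mirroring the Naive Bayes example later in this article) would be a Normal variable alongside a Bernoulli one, each with class-specific parameters:

$$
\mathbf{x}[i] \mid y = k \sim \mathcal{N}(\mu_k, \sigma_k^2), \qquad \mathbf{x}[j] \mid y = k \sim \text{Bernoulli}(p_k).
$$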

The machine learning task is then to estimate the parameters of these distributions – 𝝅 for the target variable y, and whatever parameters index the assumed distribution of x|y = k (in the first case above, 𝝁_k and 𝚺_k for k = 1, 2, …, K). Once that’s done, we can calculate P(y = k) and p(x|y = k) for each class. Then, through Bayes’ rule, we choose the class k that maximizes P(y = k|x).

3. Parameter Estimation

Now let’s get to estimating the model’s parameters. Recall that we calculate P(y = k|x) with
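Bayes’ rule, dropping the denominator since it is constant across classes:

$$
P(y = k \mid \mathbf{x}) \propto p(\mathbf{x} \mid y = k)\, P(y = k).
$$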

To calculate this probability, we first need to estimate 𝝅 (which tells us P(y = k)) and then estimate the parameters of the distribution p(x|y = k). These are referred to as the class priors and the data likelihood.

Note: Since we’ll talk about the data across observations, let y_n and x_n be the target and predictors for the nth observation, respectively. (The math below is a little neater in the original book.)

3.1 Class Priors

Let’s start by deriving the estimates for 𝝅, the class priors. Let I_nk be an indicator which equals 1 if y_n = k and 0 otherwise. We want to find an expression for the likelihood of 𝝅 given the data. We can write the probability that the first observation has the target value it does as follows:
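In indicator notation,

$$
P(y_1) = \prod_{k=1}^{K} \pi_k^{I_{1k}}.
$$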

This is equivalent to the likelihood of 𝝅 given a single target variable. To find the likelihood across all our observations, we simply take the product:
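Assuming the N observations are independent, this is

$$
L(\boldsymbol{\pi}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{I_{nk}}.
$$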

This gives us the class prior likelihood. To estimate 𝝅 through maximum likelihood, let’s first take the log. This gives
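the log-likelihood

$$
\log L(\boldsymbol{\pi}) = \sum_{n=1}^{N}\sum_{k=1}^{K} I_{nk} \log \pi_k = \sum_{k=1}^{K} N_k \log \pi_k,
$$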

where the number of observations in class k is given by
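the sum of the class-k indicators over all observations,

$$
N_k = \sum_{n=1}^{N} I_{nk}.
$$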

Now we are ready to find the MLE of 𝝅 by optimizing the log likelihood. To do this, we’ll need to use a Lagrangian since we have the constraint that the sum of the entries in 𝝅 must equal 1. The Lagrangian for this optimization problem looks as follows:
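With a multiplier λ on the sum-to-one constraint, this is

$$
\mathcal{L}(\boldsymbol{\pi}, \lambda) = \sum_{k=1}^{K} N_k \log \pi_k + \lambda\left(1 - \sum_{k=1}^{K}\pi_k\right).
$$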

The Lagrangian optimization. The first expression represents the log likelihood and the second represents the constraint.

More on the Lagrangian can be found in the original book. Next, we take the derivative of the Lagrangian with respect to 𝜆 and each entry in 𝝅:
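Setting each of these to zero gives the system

$$
\frac{\partial \mathcal{L}}{\partial \pi_k} = \frac{N_k}{\pi_k} - \lambda = 0 \;\; (k = 1, \dots, K), \qquad
\frac{\partial \mathcal{L}}{\partial \lambda} = 1 - \sum_{k=1}^{K} \pi_k = 0.
$$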

This system of equations gives the intuitive solution below, which says that our estimate of P(y = k) is just the sample fraction of the observations from class k.
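The first set of equations gives π_k = N_k/λ; summing over k and using the constraint gives λ = N, the total number of observations, so

$$
\hat{\pi}_k = \frac{N_k}{N}.
$$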

3.2 Data Likelihood

The next step is to model the conditional distribution of x given y so that we can estimate this distribution’s parameters. This of course depends on the family of distributions we choose to model x. Three common approaches are detailed below.

3.2.1 Linear Discriminant Analysis (LDA)

In LDA, we assume the following distribution for x
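namely, a multivariate Normal with a class-specific mean vector and a shared covariance matrix,

$$
\mathbf{x} \mid y = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}),
$$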

for k = 1, 2, …, K. Note that each class has the same covariance matrix but a unique mean vector.

Let’s derive the parameter estimates in this case. First, let’s find the likelihood and log likelihood. Note that we can write the joint likelihood of all the observations as
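a product over the individual observations,

$$
L(\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} p(\mathbf{x}_n \mid y_n),
$$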

since the observations are assumed to be independent.

Then, we plug in the Multivariate Normal PDF (dropping multiplicative constants) and take the log:
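Writing 𝝁_{y_n} for the mean vector of observation n’s class, this gives (up to an additive constant)

$$
\log L(\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}) = \sum_{n=1}^{N}\left( -\tfrac{1}{2}\log|\boldsymbol{\Sigma}| - \tfrac{1}{2}(\mathbf{x}_n - \boldsymbol{\mu}_{y_n})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_{y_n}) \right).
$$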

Finally, we have our data likelihood. Now we estimate the parameters by maximizing this expression.

Let’s start with 𝚺. First, simplify the log-likelihood to make the gradient with respect to 𝚺 more apparent.
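One way to do this is to collect the quadratic terms with a trace. Writing S for the total within-class scatter matrix (a shorthand introduced here for convenience), the log-likelihood becomes

$$
\log L = -\frac{N}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}\operatorname{tr}\!\left(\boldsymbol{\Sigma}^{-1}\mathbf{S}\right), \qquad \mathbf{S} = \sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol{\mu}_{y_n})(\mathbf{x}_n - \boldsymbol{\mu}_{y_n})^\top.
$$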

Then, we take the derivative. Note that this uses matrix derivatives (2) and (3) introduced in the "math note" here.
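With those identities, the gradient works out to

$$
\nabla_{\boldsymbol{\Sigma}} \log L = -\frac{N}{2}\boldsymbol{\Sigma}^{-1} + \frac{1}{2}\boldsymbol{\Sigma}^{-1}\mathbf{S}\,\boldsymbol{\Sigma}^{-1}.
$$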

Then we set this gradient equal to 0 and solve for 𝚺.
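This gives the pooled covariance estimate

$$
\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol{\mu}_{y_n})(\mathbf{x}_n - \boldsymbol{\mu}_{y_n})^\top,
$$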

where N is the total number of observations.

Halfway there! Now to estimate 𝝁_k (the kth class’s mean vector), let’s look at each class individually. Let C_k be the set of observations in class k. Looking only at the terms involving 𝝁_k, we get
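the following piece of the log-likelihood:

$$
-\frac{1}{2}\sum_{n \in C_k} (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k).
$$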

Using equation (4) from the "math note" here, we get the gradient to be
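In our notation,

$$
\nabla_{\boldsymbol{\mu}_k} = \sum_{n \in C_k} \boldsymbol{\Sigma}^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k).
$$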

Finally, we set this gradient equal to 0 and find our estimate of the mean vector:
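With N_k again denoting the number of observations in class k, this gives

$$
\hat{\boldsymbol{\mu}}_k = \frac{1}{N_k}\sum_{n \in C_k} \mathbf{x}_n = \bar{\mathbf{x}}_k,
$$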

where the last term gives the sample mean of the x in class k.

3.2.2 Quadratic Discriminant Analysis (QDA)

QDA looks very similar to LDA but assumes each class has its own covariance matrix. I.e.,
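for each class k, the class-conditional density is now

$$
\mathbf{x} \mid y = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).
$$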

The log-likelihood is the same as in LDA except we replace 𝚺 with 𝚺_k:
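Up to an additive constant,

$$
\log L = \sum_{n=1}^{N}\left( -\tfrac{1}{2}\log|\boldsymbol{\Sigma}_{y_n}| - \tfrac{1}{2}(\mathbf{x}_n - \boldsymbol{\mu}_{y_n})^\top \boldsymbol{\Sigma}_{y_n}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_{y_n}) \right).
$$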

Again, let’s look at the parameters for the kth class individually. The log-likelihood for class k is given by
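keeping only the terms for observations in C_k (call this piece ℓ_k),

$$
\ell_k = -\frac{N_k}{2}\log|\boldsymbol{\Sigma}_k| - \frac{1}{2}\sum_{n\in C_k}(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k).
$$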

We could take the gradient of this log-likelihood with respect to 𝝁_k and set it equal to 0 to solve for our estimate of 𝝁_k. However, we can also note that the LDA estimate still holds, since that derivation didn’t depend on the covariance term (which is the only thing we’ve changed). Therefore, we again get
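the class-wise sample mean,

$$
\hat{\boldsymbol{\mu}}_k = \frac{1}{N_k}\sum_{n\in C_k}\mathbf{x}_n = \bar{\mathbf{x}}_k.
$$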

To estimate 𝚺_k, we take the gradient of the log-likelihood for class k.
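Following the same matrix-calculus steps as in the LDA case, this gradient can be written as

$$
\nabla_{\boldsymbol{\Sigma}_k}\,\ell_k = -\frac{N_k}{2}\boldsymbol{\Sigma}_k^{-1} + \frac{1}{2}\boldsymbol{\Sigma}_k^{-1}\left(\sum_{n\in C_k}(\mathbf{x}_n - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_n - \hat{\boldsymbol{\mu}}_k)^\top\right)\boldsymbol{\Sigma}_k^{-1}.
$$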

Then we set this equal to 0 to get our estimate:
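the class-specific sample covariance,

$$
\hat{\boldsymbol{\Sigma}}_k = \frac{1}{N_k}\sum_{n\in C_k}(\mathbf{x}_n - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_n - \hat{\boldsymbol{\mu}}_k)^\top,
$$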

where N_k is the number of observations in class k.

3.2.3 Naive Bayes

Naive Bayes assumes the random variables within x are independent conditional on the class of the observation. That is, if x is D-dimensional,
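the class-conditional density factors coordinate by coordinate:

$$
p(\mathbf{x} \mid y = k) = \prod_{j=1}^{D} p(\mathbf{x}[j] \mid y = k).
$$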

This makes calculating p(x|y = k) very easy – to estimate the parameters of p(x[j]|y), we can ignore all variables in x other than the jth.

As an example, suppose x is two-dimensional and we use the following model, where for simplicity σ is known.
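Writing the Normal variance as σ_k² to match the parameter list below (the derivation treats it as known), one such model is

$$
x_{n1} \mid y_n = k \sim \mathcal{N}(\mu_k, \sigma_k^2), \qquad x_{n2} \mid y_n = k \sim \text{Bernoulli}(p_k).
$$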

As before, we estimate the parameters in each class by looking only at the terms in that class. Let θ_k = (μ_k, σ_k, p_k) contain the relevant parameters for class k. The likelihood for class k is given by the following,
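with C_k again denoting the observations in class k:

$$
L(\theta_k) = \prod_{n\in C_k} p(\mathbf{x}_n \mid y_n = k) = \prod_{n\in C_k} p(x_{n1} \mid y_n = k)\, p(x_{n2} \mid y_n = k),
$$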

where the two are equal due to the assumed independence between the entries in x. Subbing in the Normal and Bernoulli densities for x_n1 and x_n2, respectively, we get
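written out in full,

$$
L(\theta_k) = \prod_{n\in C_k} \frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\!\left(-\frac{(x_{n1}-\mu_k)^2}{2\sigma_k^2}\right) p_k^{x_{n2}}(1-p_k)^{1-x_{n2}}.
$$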

Then we can take the log likelihood as follows
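Dropping additive constants,

$$
\log L(\theta_k) = \sum_{n\in C_k}\left(-\frac{(x_{n1}-\mu_k)^2}{2\sigma_k^2} - \tfrac{1}{2}\log\sigma_k^2 + x_{n2}\log p_k + (1-x_{n2})\log(1-p_k)\right).
$$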

Finally we’re ready to find our estimates. Taking the derivative with respect to p_k, we’re left with
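only the Bernoulli terms, since the Normal terms do not involve p_k:

$$
\frac{\partial \log L(\theta_k)}{\partial p_k} = \sum_{n\in C_k}\left(\frac{x_{n2}}{p_k} - \frac{1-x_{n2}}{1-p_k}\right),
$$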

which will give us the sensible result that
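after setting the derivative to zero and solving,

$$
\hat{p}_k = \frac{1}{N_k}\sum_{n\in C_k} x_{n2}.
$$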

Notice this is just the average of the x_n2 values within class k. The same process will give us the typical results for μ_k and σ_k.

4. Making Classifications

Regardless of our modeling choices for p(x|y = k), classifying new observations is easy. Consider a test observation x_0. For k = 1, 2, …, K, we use Bayes’ rule to calculate
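the posterior, using the estimated priors and class-conditional densities,

$$
P(y_0 = k \mid \mathbf{x}_0) \propto \hat{p}(\mathbf{x}_0 \mid y_0 = k)\, \hat{\pi}_k,
$$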

where p̂ gives the estimated density of x_0 conditional on y_0. We then predict y_0 = k for whichever value of k maximizes the above expression.
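To tie the pieces together, here is a minimal NumPy sketch of the LDA version of this pipeline (the function names and toy data are just for illustration, not code from the book). It estimates the class priors, class means, and shared covariance exactly as derived above, then classifies by maximizing log p̂(x|y = k) + log π̂_k.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate the LDA parameters derived above: class priors (class
    fractions), class mean vectors, and a single pooled covariance matrix."""
    classes = np.unique(y)
    N, D = X.shape
    priors, means = {}, {}
    pooled = np.zeros((D, D))
    for k in classes:
        Xk = X[y == k]
        priors[k] = len(Xk) / N           # pi_hat_k = N_k / N
        means[k] = Xk.mean(axis=0)        # mu_hat_k = class sample mean
        centered = Xk - means[k]
        pooled += centered.T @ centered   # within-class scatter
    return classes, priors, means, pooled / N   # Sigma_hat = scatter / N

def predict_lda(X, classes, priors, means, cov):
    """Classify each row of X by maximizing log p(x | y = k) + log P(y = k)."""
    cov_inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    scores = []
    for k in classes:
        diff = X - means[k]
        quad = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # (x - mu)' Sigma^-1 (x - mu)
        scores.append(-0.5 * logdet - 0.5 * quad + np.log(priors[k]))
    return classes[np.argmax(np.column_stack(scores), axis=1)]

# Toy usage on two simulated Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
preds = predict_lda(X, *fit_lda(X, y))
print("training accuracy:", (preds == y).mean())
```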

Conclusion

Generative models like LDA, QDA, and Naive Bayes are among the most common methods for classification. However, the (admittedly arduous) details of their fitting process are often swept under the rug. The aim of this article is to make those details clear.

While the low-level details of estimating parameters for a generative model can be quite complex, the high-level intuition is quite straightforward. Let’s recap this intuition in a few simple steps.

  1. Estimate the prior probability that an observation is from any given class k. In math, estimate P(y = k) for each value of k.
  2. Estimate the density of the predictors conditional on the observation’s class. I.e., estimate p(x|y = k) for each value of k.
  3. Use Bayes’ rule to obtain the probability that an observation is from any class given its predictors (up to a constant of proportionality): P(y = k|x).
  4. Choose which value k maximizes the probability in step 3 (call it k*) and estimate that y = k*.

And that’s it! To see more derivations from scratch like this one, please check out my free online book! I promise that most of them have less math.

