
In the previous article, we described the Bayesian framework for linear regression and how we can use latent variables to reduce model complexity.
In this post, we will explain how latent variables can also be used to frame a clustering problem, through the Gaussian Mixture model (or GMM for short), which allows us to perform soft probabilistic clustering.
This model is classically trained with an optimization procedure named Expectation-Maximization (or EM for short), which we will review thoroughly. At the end of this article, we will also see why we do not use traditional optimization methods.
This article contains a few mathematical notations and derivations. We are not trying to scare anybody. We believe that once the intuition is given, it is important to dive into the math to understand things for real.
This post was inspired by this excellent course on Coursera: Bayesian Methods for Machine Learning. If you are into machine learning, I definitely recommend this course.
Gaussian Mixture Model
This model is a soft probabilistic clustering model that describes the membership of points to a set of clusters using a mixture of Gaussian densities. It is a soft classification (in contrast to a hard one) because it assigns to each observation a probability of belonging to each class instead of a definitive choice. In essence, each observation will belong to every class, but with different probabilities.
We take the famous Iris classification problem as an example. So we have 150 Iris flowers divided between 3 classes. For each of them, we have the sepal length and width, the petal length and width, and the class.
Let’s have a quick view of the data using the pair plot of the seaborn package:

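If you want to reproduce this kind of plot, here is a minimal sketch using the copy of the Iris dataset that ships with seaborn (the exact styling of the figure above may differ):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset bundled with seaborn (150 flowers, 4 measurements, 3 species)
iris = sns.load_dataset("iris")

# Pair plot of every pair of features, colored by species
sns.pairplot(iris, hue="species")
plt.show()
```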
The Gaussian Mixture model tries to describe the data as if it originated from a mixture of Gaussian distributions. So first, if we only take one dimension, say the petal width, and try to fit 3 different Gaussians, we would end up with something like this:

The algorithm found that the mixture that is most likely to represent the data generation process is made of the three following normal distributions:

The setosa petal widths are much more concentrated, with a mean of 0.26 and a variance of 0.04. The other two classes are comparatively more spread out but with different locations. Now let's see the result with two dimensions, say petal width and petal length.

Now the constituents of the mixture are the following:

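As a side note, you don't have to implement any of this yourself to experiment with it. Here is a minimal sketch using scikit-learn's GaussianMixture (the figures above were not necessarily produced this way); it fits a 3-component mixture on petal width and petal length and exposes the weights, means and covariances discussed above:

```python
import seaborn as sns
from sklearn.mixture import GaussianMixture

iris = sns.load_dataset("iris")

# Fit a 3-component mixture on two features: petal width and petal length
X = iris[["petal_width", "petal_length"]].values
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

print(gmm.weights_)      # mixture weights alpha_k
print(gmm.means_)        # one 2-D mean per component
print(gmm.covariances_)  # one 2x2 covariance matrix per component

# Soft clustering: probability of each observation belonging to each component
responsibilities = gmm.predict_proba(X)
print(responsibilities[:5])
```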
Note that the GMM is a pretty flexible model. It can be shown that, for a large enough number of mixture components and an appropriate choice of their parameters, one can approximate any continuous pdf arbitrarily closely (with the extra computational cost that this entails).
Formalization – MLE
So how does the algorithm find the best set of parameters to describe the mixture?
Well, we start by defining the probability model. The probability density of any given observation is a weighted sum of K Gaussian densities (as pictured in the previous section):
p(x_i | Θ) = Σ_{k=1}^{K} α_k · N(x_i | μ_k, Σ_k)
Each point is modeled as coming from a mixture of K weighted Gaussians, each parameterized by a mean and a covariance matrix. So overall, we can describe the probability of observing a specific observation as a mixture. To be sure that you fully understand: in the one-dimensional example above using the petal width, the probability of observing a petal width in the region [0.2, 0.3] is the highest. We also have a strong probability of observing a petal width in the regions [1.2, 1.5] and [1.8, 2.2].
Note that in the case of 3 clusters, the set of parameters will be:
Θ = {α_1, α_2, α_3, μ_1, μ_2, μ_3, Σ_1, Σ_2, Σ_3}
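To make the formula concrete, here is a small sketch of how such a mixture density could be evaluated with SciPy; the parameter values below are made up for illustration, not the ones fitted on Iris:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Weighted sum of K Gaussian densities evaluated at point x."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Illustrative parameters for K = 3 clusters in 2 dimensions (petal width, petal length)
weights = [1/3, 1/3, 1/3]
means   = [np.array([0.25, 1.5]), np.array([1.3, 4.2]), np.array([2.0, 5.6])]
covs    = [0.05 * np.eye(2), 0.10 * np.eye(2), 0.10 * np.eye(2)]

print(gmm_density(np.array([0.3, 1.4]), weights, means, covs))
```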
Then we want to find the parameter values that maximize the likelihood of the dataset, i.e., the maximum likelihood estimates of the parameters. That is, we want the parameters that maximize the probability of observing all the data points together (the joint probability), solving the following optimization problem:
Θ* = argmax_Θ p(X | Θ) = argmax_Θ Π_{i=1}^{N} p(x_i | Θ)
Note that the full joint probability of the dataset can be factorized (i.e., decomposed as a product of individual probabilities, with the Π operator) only under the assumption that the observations are drawn i.i.d. (independent and identically distributed). When they are, the observations are independent and the probability of observing one data point is not influenced by the others.
Instead of the likelihood, we usually maximize the log-likelihood, in part because it turns the product of probabilities into a sum (simpler to work with). We can do this because the natural logarithm is a monotonically increasing function, so it does not change the location of the maximum (the location where the derivative is null remains the same).
Θ* = argmax_Θ log p(X | Θ) = argmax_Θ Σ_{i=1}^{N} log p(x_i | Θ)
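In code, the log-likelihood of the whole dataset could be evaluated as follows (a sketch using SciPy's logsumexp for numerical stability; the parameters are assumed to be in the same format as in the previous snippet):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, weights, means, covs):
    """Sum over observations of the log of the weighted sum of Gaussian densities."""
    # log_probs[i, k] = log(alpha_k) + log N(x_i | mu_k, Sigma_k)
    log_probs = np.column_stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])
    # logsumexp over the components, then sum over the observations
    return logsumexp(log_probs, axis=1).sum()
```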
Now maximum likelihood estimation can be done in a number of different ways: by direct optimization (finding the points where the partial derivatives are null) or by numerical optimization like gradient descent. The MLE of GMMs is not done using those methods, for reasons that I will explain at the end of the article (I want to get to the most relevant material first). Instead, the MLE of GMMs is done using the Expectation-Maximization algorithm.
Expectation-Maximization
GMM training intuition
First, we are going to visually describe what happens during the training of a GMM model, because it will really help build the necessary intuition for EM. So let's say we are back in the one-dimensional example, but without labels this time. Try to imagine how we could assign cluster labels to the observations below:

Well, if we already knew where the Gaussians are in the above plot, we could compute the cluster probabilities for each observation. Let's draw this picture so you have it in mind. We would be assigning a color to the points below:

Intuitively, for one selected observation and one selected Gaussian, the probability of the observation belonging to that cluster would be the ratio between that Gaussian's density at the observation and the sum of all the Gaussian densities at that point. Something like:
p(x_i belongs to cluster c) = N(x_i | μ_c, σ_c) / (N(x_i | μ_1, σ_1) + N(x_i | μ_2, σ_2) + N(x_i | μ_3, σ_3))
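Here is a tiny sketch of that ratio in one dimension; the three Gaussians below are illustrative, not the fitted ones:

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1-D Gaussians (petal width), one per cluster
means, stds = np.array([0.25, 1.3, 2.0]), np.array([0.2, 0.3, 0.3])

def cluster_probabilities(x):
    """Ratio between each Gaussian's density at x and the sum of all densities."""
    densities = norm.pdf(x, loc=means, scale=stds)
    return densities / densities.sum()

print(cluster_probabilities(0.4))  # mostly the first cluster
print(cluster_probabilities(1.7))  # shared between the last two clusters
```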
Ok, but how do we know where the Gaussians are located in the above plot (i.e., how do we find the Gaussian parameters)? Well, let's say we are already in possession of the observations' labels, like so:

Now we could easily find the parameter values and draw the Gaussians. It suffices to consider each group of points independently, say the red points, and find the maximum likelihood estimate. For a Gaussian distribution, one can demonstrate the following results:
μ̂ = (1/N) Σ_{i=1}^{N} x_i        Σ̂ = (1/N) Σ_{i=1}^{N} (x_i − μ̂)(x_i − μ̂)^T
Applying the above formulas to the red points, then the blue points, and then the yellow points, we get the following normal distributions:

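In code, this "fit one Gaussian per labeled group" step could look like the following sketch (the data and label names are made up for illustration):

```python
import numpy as np

def fit_gaussians(X, labels):
    """Maximum likelihood mean and covariance for each group of labeled points."""
    params = {}
    for label in np.unique(labels):
        points = X[labels == label]
        mu = points.mean(axis=0)                         # sample mean
        sigma = np.cov(points, rowvar=False, bias=True)  # MLE covariance (divides by N)
        params[label] = (mu, sigma)
    return params

# Example with made-up 1-D data and labels
X = np.array([[0.2], [0.3], [1.4], [1.5], [2.0], [2.2]])
labels = np.array(["red", "red", "blue", "blue", "yellow", "yellow"])
print(fit_gaussians(X, labels))
```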
Ok, so given the normal distribution parameters, we can find the observation labels; and given the observation labels, we can find the normal distribution parameters. So it seems like we have a kind of chicken-and-egg problem, right?
Well, in fact, solving it is not that hard. We just have to start somewhere. So we can set the Gaussian parameters to random values. Then we perform the optimization by iterating through the two following steps until convergence:
- We assign labels to the observations using the current Gaussian parameters
- We update the Gaussian parameters so that the fit is more likely
This would give the following result:

As you can see, as soon as we reach step 4 we are already at the best possible fit.
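Putting the two alternating steps together, a minimal one-dimensional version of this procedure could look like the sketch below (initialization, number of iterations and variable names are arbitrary choices, and no care is taken of degenerate clusters):

```python
import numpy as np
from scipy.stats import norm

def fit_gmm_1d(x, K=3, n_iter=50, seed=0):
    """Alternate between soft label assignment and Gaussian parameter updates."""
    rng = np.random.default_rng(seed)
    weights = np.full(K, 1.0 / K)
    means = rng.choice(x, size=K, replace=False)   # random initial means
    stds = np.full(K, x.std())

    for _ in range(n_iter):
        # Step 1: assign soft labels (responsibilities) given the current Gaussians
        densities = weights * norm.pdf(x[:, None], loc=means, scale=stds)
        resp = densities / densities.sum(axis=1, keepdims=True)

        # Step 2: update the Gaussian parameters so that the fit is more likely
        Nk = resp.sum(axis=0)
        weights = Nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / Nk
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / Nk)

    return weights, means, stds
```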
EM intuition
The Expectation-Maximization algorithm works in exactly the same way. In fact, the optimization procedure we described above for GMMs is a specific implementation of the EM algorithm. The EM algorithm is just defined more generally and formally (as it can be applied to many other optimization problems).
So the general idea is that we are trying to maximize a likelihood (and more frequently a log-likelihood), that is, we are trying to solve the following optimization problem:
Θ* = argmax_Θ log P(X | Θ) = argmax_Θ Σ_{i=1}^{N} log P(x_i | Θ)
This time we are not saying that the likelihood P(x_i|Θ) is a mixture of Gaussians. It can be anything.
Now, let’s visualize things! Imagine that the log-likelihood (log P(X|Θ)) is the following one-dimensional curve (as a function of Θ):

The major trick that makes this algorithm work lies in the definition and use of a specific function. This function is defined in such a way that, at any given point in the parameter space, we know for sure that its value is lower than or equal to the log-likelihood. It is called a lower bound. We will call it L (pictured in red below):

Now, in fact, we don’t use a single lower bound but a family of lower bounds, parameterized by the vector of parameters Θ and a variational distribution q. So L(Θ, q) can be located anywhere, as long as it remains a lower bound for the log-likelihood:

The EM algorithm starts by assigning random parameters. So let’s say we start with the following lower bound:

The algorithm will now perform two successive steps:
- Fix Θ and adjust q so that the lower bound gets as close as possible to the log-likelihood. For example, during the first step, we compute q1:

- Fix q and adjust Θ so that the lower bound gets maximized. For example, during the second step, we compute Θ1:

So, to sum up this intuition: the EM algorithm breaks up the difficulty of finding the maximum of the likelihood (or at least a local maximum) into a series of successive steps that are much easier to deal with. In order to do so, it introduces a lower bound that is parameterized by the vector Θ, for which we want to find the optimum, and by a variational distribution q that we can also modify at will.
Jensen’s inequality
This inequality is, in some way, just a rewording of the definition of a concave function. Recall that for any concave function f, any weight α in [0, 1] and any two points x and y:
f(α x + (1 − α) y) ≥ α f(x) + (1 − α) f(y)
In fact, this definition can be generalized to more than two points, as long as the weights sum up to 1:
f(α_1 x_1 + α_2 x_2 + … + α_n x_n) ≥ α_1 f(x_1) + α_2 f(x_2) + … + α_n f(x_n)
If the weights sum up to 1 (and are non-negative), then we can say that they represent a probability distribution, and this gives us Jensen’s inequality:
f(E[x]) ≥ E[f(x)]
For a concave function, the value of the function at the expected value is always greater than or equal to the expected value of the function.
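You can convince yourself numerically with the concave natural logarithm; the distribution sampled below is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, size=100_000)   # arbitrary positive samples

# For the concave log: log(E[x]) >= E[log(x)]
print(np.log(x.mean()))   # log of the expected value
print(np.log(x).mean())   # expected value of the log (smaller)
```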
EM Formalization
The Expectation-Maximization algorithm is used with models that make use of latent variables. In general, we define a latent variable t that explains an observation x. There is one instance of the latent variable per observation. So we can draw the following diagram:

Also, because t explains the observation x, it defines the probability that the observation belongs to each of the clusters. So we can write:

Now the full likelihood of the observation can be written as a marginal likelihood (i.e., by marginalizing out t):
p(x_i | Θ) = Σ_{c=1}^{K} p(x_i | t_i = c, Θ) · p(t_i = c | Θ)
Now recall that we are trying to solve the following optimization problem:
max_Θ Σ_{i=1}^{N} log p(x_i | Θ)
And we just introduced Jensen’s inequality which, for the concave logarithm function, can be written as:
log(Σ_c α_c y_c) ≥ Σ_c α_c log(y_c), for any non-negative weights α_c that sum to 1
We want to use this inequality to help us define the lower bound, and we want this lower bound to depend on Θ and a variational distribution q. Well, the trick is to introduce q like so:
log p(x_i | Θ) = log Σ_{c=1}^{K} p(x_i, t_i = c | Θ) = log Σ_{c=1}^{K} q(t_i = c) · [ p(x_i, t_i = c | Θ) / q(t_i = c) ]
By multiplying and dividing by q, we don’t change anything. And now we can use Jensen’s inequality:
log p(X | Θ) = Σ_{i=1}^{N} log p(x_i | Θ) ≥ Σ_{i=1}^{N} Σ_{c=1}^{K} q(t_i = c) · log [ p(x_i, t_i = c | Θ) / q(t_i = c) ] = L(Θ, q)
And we succeeded: we built a lower bound for the marginal log-likelihood that depends both on Θ and q. So we are now able to maximize the full lower bound by alternating the two following steps:
Expectation step:

We fix Θ, and we try to get the lower bound as close as possible to the log-likelihood; that is, we try to minimize the gap:
GAP(Θ, q) = log p(X | Θ) − L(Θ, q)
With a few extra steps (that I will spare you here), we can show that this gap is exactly the Kullback–Leibler divergence between q and the posterior p(t|x, Θ), which is minimized (and equal to zero) when the two distributions are equal. Hence:
q^{k+1}(t_i = c) = p(t_i = c | x_i, Θ^k)
So, in order to find the next value (k+1) of the variational distribution q, we consider every observation x_i independently and, for every class, we compute the probability of that observation belonging to the class, p(t_i|x_i, Θ). Recall that, by Bayes' rule:
p(t_i = c | x_i, Θ) = p(x_i | t_i = c, Θ) · p(t_i = c | Θ) / Σ_{k=1}^{K} p(x_i | t_i = k, Θ) · p(t_i = k | Θ)
Now we can make the link with Gaussian Mixture Models. We find in the above formula what we intuitively derived, i.e.:
p(t_i = c | x_i, Θ) = α_c · N(x_i | μ_c, Σ_c) / Σ_{k=1}^{K} α_k · N(x_i | μ_k, Σ_k)
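As a sketch, the E-step for a multivariate GMM could then be written as follows (the parameters are assumed to be plain Python lists of weights, means and covariance matrices, as in the earlier snippets):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Responsibilities: alpha_c * N(x_i | mu_c, Sigma_c), normalized over the components."""
    weighted = np.column_stack([
        w * multivariate_normal.pdf(X, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)
```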
Maximization step:

We fix q, and we maximize the lower bound, which is defined as:
L(Θ, q) = Σ_i Σ_c q(t_i = c) · log [ p(x_i, t_i = c | Θ) / q(t_i = c) ] = Σ_i Σ_c q(t_i = c) · log p(x_i, t_i = c | Θ) − Σ_i Σ_c q(t_i = c) · log q(t_i = c)
Now the second term in the subtraction is not dependent on Θ, so we can write:
Θ^{k+1} = argmax_Θ Σ_i Σ_c q(t_i = c) · log p(x_i, t_i = c | Θ) = argmax_Θ E_q[ log p(X, T | Θ) ]
This expected complete-data log-likelihood is usually much easier to optimize than the original log-likelihood. In the case of Gaussian Mixture Models, it leads to closed-form updates that mirror the MLE formulas of the Gaussian distributions we used above.
If you want to see the full derivations that allow us to get the closed-form expressions for the updates of the parameters in the m-step, I wrote them up in a dedicated article (in order not to overload this one).
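For reference, here is a sketch of what those closed-form M-step updates look like in code, taking the responsibilities computed by the E-step above as input (again, a simplified illustration rather than a production implementation):

```python
import numpy as np

def m_step(X, resp):
    """Closed-form updates of weights, means and covariances given the responsibilities."""
    N, _ = X.shape
    Nk = resp.sum(axis=0)                  # effective number of points per component
    weights = Nk / N
    means = (resp.T @ X) / Nk[:, None]
    covs = []
    for k in range(resp.shape[1]):
        diff = X - means[k]
        covs.append((resp[:, k, None] * diff).T @ diff / Nk[k])
    return weights, means, covs
```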
If you read up to this point, congratulations! You should now have a very good grasp of GMM and EM. Optionally, if you want to understand why we don't use traditional methods to find the best set of parameters, read on.
So, if we were to start from scratch, how would we perform maximum likelihood estimation in the case of Gaussian Mixture Models?
Direct optimization: A first approach
A way to find the maximum likelihood estimate is to set the partial derivatives of the log-likelihood with respect to the parameters to 0 and solve the equations. We would call that approach direct optimization.
∂/∂Θ [ Σ_{i=1}^{N} log Σ_{k=1}^{K} α_k · N(x_i | μ_k, Σ_k) ] = 0
As you can see, this approach is impractical because the sum over the Gaussian components appears inside the log, tying all the parameters together. For example, if K=2 and we try to solve the equation for α_1, we have:

We got through the first step using the chain rule… but we are not capable of expressing the parameter α_1 in terms of the other parameters. So we are stuck! We cannot solve this system of equations analytically. This means that we cannot find the global optimum, or at least a local one, in one step using analytical expressions.
Numerical optimization: A second approach
Ok, what else can we do? Well, we have to rely on a numerical optimization method. We could, for example, try to use our favorite stochastic gradient descent algorithm. How does it work?
Well, we initialize the vector of parameters Θ randomly and we iterate through the observations of the dataset. At step k, we compute the gradient of the negative log-likelihood for one selected observation. Then we update the parameter values by taking one step in the opposite direction of the gradient (scaled by a learning rate η), that is:
Θ^{k+1} = Θ^k − η · ∇_Θ [ −log p(x_i | Θ^k) ]
Using the chain rule, we can compute the partial derivatives for the remaining parameters (like we did above for α_1). For example, we would get the following result for μ_1:

Now you are going to tell me that the parameters are still dependent on each other. Well, yes, but we are not solving equations anymore. To take a step, we simply evaluate the partial derivatives at the current location, with the current set of parameters. It is an iterative algorithm.
So let’s say we are computing Θ¹. We have at our disposal Θ⁰=(α_0, α_1, μ_0, μ_1, Σ_0, Σ_1), initialized randomly, and we have the first selected observation x_0. Given the formulas of the partial derivatives, we have all we need to compute Θ¹. And we proceed through the successive steps until convergence (that is, until the steps become too small).

For example, I generated the above animation by simulating the optimization of the Beale function B with two parameters (x and y). We start at a random point, here around (-3.5, -3.5), and at each step, we update the parameters towards the minimum.
NB: Note that, in reality, the differentiation of the loss function performed by numerical frameworks like TensorFlow or PyTorch is not done like we did above. We don't use a set of hard-coded mathematical rules like we used to do in high school (what is called manual differentiation). There might not even be convenient formulas for the derivatives. It is done using automatic differentiation.
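As an illustration of automatic differentiation (not of how the animation above was generated), here is a sketch that minimizes the Beale function with PyTorch; the optimizer, learning rate and number of steps are arbitrary choices:

```python
import torch

# Beale function with two parameters, as in the animation above
def beale(x, y):
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)

# Parameters with gradient tracking; the starting point matches the one mentioned above
params = torch.tensor([-3.5, -3.5], requires_grad=True)
optimizer = torch.optim.Adam([params], lr=0.01)

for step in range(5000):
    optimizer.zero_grad()
    loss = beale(params[0], params[1])
    loss.backward()          # gradients obtained by automatic differentiation
    optimizer.step()

print(params.detach())       # current estimate; the true minimum of the Beale function is at (3, 0.5)
```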
Ok, this looks good! So we are done? Well, not so fast! We forgot an important thing: we have to fulfill two constraints while solving this optimization problem. We are performing constrained optimization.
The first one is that the mixture weights α must be non-negative and sum up to 1. This ensures that the probability of observing a data point is a proper probability density function; that is:
α_k ≥ 0 for all k,    and    Σ_{k=1}^{K} α_k = 1
To handle this constraint, we can incorporate it into the optimization problem with the use of Lagrange multipliers, i.e., reformulating the problem like:
ℒ(Θ, λ) = Σ_{i=1}^{N} log Σ_{k=1}^{K} α_k · N(x_i | μ_k, Σ_k) − λ · ( Σ_{k=1}^{K} α_k − 1 )
λ is added as an additional parameter to the vector Θ. Then we run our stochastic optimization procedure again, and we are done! Well, not so fast! The real problem lies in the second constraint. Recall that the multivariate normal probability density function is written as:
N(x | μ, Σ) = (2π)^{−d/2} · |Σ|^{−1/2} · exp( −(1/2) (x − μ)^T Σ^{−1} (x − μ) )
But the matrix Σ cannot be arbitrary! It is a covariance matrix and therefore must satisfy a certain number of properties:
- The diagonal terms are variances and so must be positive
- The matrix must be symmetric
- For two distinct predictors, the square of their covariance must be less than or equal to the product of their variances
- The matrix must be invertible
- The determinant must be positive
Those conditions are fulfilled when the matrix is symmetric positive definite (a particular kind of positive semidefinite matrix).
And this is a much harder constraint to fulfill, and it is still an active research area. In fact, there is a whole subfield of convex optimization dedicated to it, called semidefinite programming. It really emerged as a discipline starting in the 90s, with methods like interior-point or augmented Lagrangian approaches. But mixture models appeared before that and led to the Expectation-Maximization algorithm.
Don’t worry if you did not understand all of the above (especially the part regarding the Lagrangian formulation). All you need to understand is that traditional methods for finding the maximum likelihood estimates of the parameters are not well suited to Gaussian Mixture Models.
Conclusion
In this article, we have had a thorough review of Gaussian Mixture Models and Expectation-Maximization, both visually (to give some insight) and more formally, giving the full mathematical derivations.
This article has been pretty heavy on the math, but I think that if you managed to take the time, it is really worth it. Having a deeper insight into this kind of model helps you understand a lot of the different techniques that are spread out all over the fields of machine learning and statistics. And next time you try to understand another model of the same caliber, it will be much easier. I promise you that!
In the meantime, take care of yourself and your loved ones!