Demystifying Gaussian Mixture Models and Expectation Maximization

An explanation of Gaussian Mixture Models and their underlying Expectation-Maximization algorithm in a simplified manner

Prasad Pai
Towards Data Science


After learning how to cluster unlabelled data points with the simplest clustering algorithm, k-means, we begin to see several of its shortcomings once we apply the technique to a real dataset. The next step an ML engineer typically takes is to apply a more sophisticated algorithm to understand the various groupings (clusters) in the data samples, and most likely that algorithm will be Gaussian Mixture Modeling (GMM). Thanks to open-source software and ML frameworks like Scikit-learn, it takes just a couple of lines of code to use this algorithm. However, the underlying maths has been difficult for many to grasp. In this article, we will look into the inner workings of this algorithm more intuitively.
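For instance, with Scikit-learn those couple of lines look roughly like the sketch below. The two-blob dataset X here is made up purely for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 2-D dataset: 300 samples drawn around two different centres
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(150, 2)),
               rng.normal([5, 5], 1.5, size=(150, 2))])

# Fit a GMM with 2 mixture components and read off the hard cluster labels
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)
print(gmm.means_)   # estimated cluster centres (mean vectors)
```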

Advantages of GMM over k-means

Let us start building our understanding of Gaussian Mixture Modeling by looking straight at the shortcomings of k-means.

  • Disproportionate sizes of the clusters: k-means works on a simple principle: assign each data sample to the nearest assumed cluster center, which partitions the feature space into a Voronoi diagram. As a consequence, if a cluster's spread extends far from its center, many of its samples end up closer to a neighbouring center and get misclassified. With GMM, the spread of the data points in a cluster need not be uniform around its cluster center.
Clusters with different spreads in GMM with Voronoi partition overlapped (Image by Author)
  • Different shapes of clusters: Another way of seeing the above problem is that k-means may break up the shape of a cluster even though its data samples are tightly packed. As the orientation of a cluster changes, the results of k-means can change significantly.
Clusters formed in GMM with cluster centers and corresponding Voronoi partition overlapped (Image by Author)
  • Overlapping clusters: Sometimes, data points belonging to different clusters are packed very close together, yet k-means will not hesitate to split them: it draws a hard boundary between the clusters even when samples lie infinitesimally close to that boundary. With GMM, you formulate your model with the postulate that there are no hard boundaries between clusters, which leads us to the next main advantage of GMM.
Overlapped clusters in GMM (Image by Author)
  • Soft assignment: With the standard algorithm of k-means, you allocate a data point to a cluster with 100% certainty. In GMM, however, every data sample in the dataset is assigned a set of probabilities describing how likely it is to belong to each cluster present in the system. For example, in the distribution below comprising 4 clusters, GMM assigns an equal probability (25% each) to a data sample sitting at the center (see the sketch after this list).
GMM will make soft assignments (Image by Author)
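To see those soft assignments in practice, here is a hedged sketch using Scikit-learn's predict_proba on a hypothetical, symmetric four-blob dataset; the exact numbers will vary with the fitted model:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical toy data: four symmetric blobs around the origin
rng = np.random.default_rng(1)
centres = np.array([[3, 3], [3, -3], [-3, 3], [-3, -3]])
X = np.vstack([rng.normal(c, 1.0, size=(200, 2)) for c in centres])

gmm = GaussianMixture(n_components=4, random_state=0).fit(X)

# A point at the exact centre is (roughly) equally likely under all four clusters
print(gmm.predict_proba([[0.0, 0.0]]))   # approximately [[0.25, 0.25, 0.25, 0.25]]
```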

Gaussian distribution

So mathematically, how do we represent these clusters, which may have unequal sizes, shapes, and orientations, and may possibly overlap each other? As the name of the technique suggests, we use a Gaussian distribution to represent each cluster in the system separately. Before we look into how this is possible, let us do a brief recap of the Gaussian distribution in 1D.

PDF of various Gaussian distribution (Image source: Wikipedia)

Here we show a few probability density functions (PDFs) of the Gaussian distribution. The peak of this unimodal function lies at the mean, which is also the center of the distribution. The spread of the distribution around the mean is specified by the standard deviation, and the square of the standard deviation is known as the variance.

But as our dataset will live in a higher number of dimensions (say n), we will have to use an n-dimensional Gaussian distribution to represent each cluster. Since multiple features (represented by the axes in the image) influence the Gaussian, we move up from a simple variance to a covariance matrix, which captures both the spread (through the variances) and the orientation (through the correlations).
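To make the jump to higher dimensions concrete, here is a brief sketch, with an arbitrarily chosen mean vector and covariance matrix, of evaluating the density of a 2-D Gaussian with SciPy:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrarily chosen parameters of a 2-D Gaussian cluster
mean = np.array([1.0, 2.0])                  # n-dimensional mean vector
cov = np.array([[2.0, 0.8],                  # variances on the diagonal,
                [0.8, 1.0]])                 # correlation in the off-diagonals

# Density of the distribution at a single query point
print(multivariate_normal(mean=mean, cov=cov).pdf([1.5, 2.5]))
```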

Creating clusters with Gaussians

Let us next create clusters of varying positions, sizes, shapes, and orientations with Gaussian distribution functions. Consider a 2-dimensional feature set, which results in a 2×2 covariance matrix. To create a simple circular cluster, here is our covariance matrix.

Covariance matrix with equal variance (Image by Author)

The diagonal terms represent the variance of the Gaussian distribution along each axis, and they must be equal to obtain a circular shape. Now, if we make the two variances unequal, the circular shape turns into an ellipse, as shown.

Elliptical cluster with unequal variances in the covariance matrix (Image by Author)

The above cluster is axis-aligned in a regular Cartesian X-Y coordinate system. To rotate the cluster away from the axes, we add correlation terms in the off-diagonal positions.

Elliptical cluster with orientation (Image by Author)
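For concreteness, here is a small sketch (with arbitrarily chosen numbers) of the three kinds of covariance matrices discussed above, written as NumPy arrays; the eigenvectors of the rotated one reveal its tilted principal axes:

```python
import numpy as np

# Circular cluster: equal variances, no correlation
cov_circle = np.array([[1.0, 0.0],
                       [0.0, 1.0]])

# Axis-aligned ellipse: unequal variances, still no correlation
cov_ellipse = np.array([[4.0, 0.0],
                        [0.0, 1.0]])

# Rotated ellipse: off-diagonal correlation terms tilt the orientation
cov_rotated = np.array([[4.0, 1.5],
                        [1.5, 1.0]])

# The eigenvectors of the covariance matrix give the cluster's principal axes
eigvals, eigvecs = np.linalg.eigh(cov_rotated)
print(eigvals, eigvecs)
```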

And how do we change the position of these clusters? We use the n-dimensional mean vector (denoted by μ) to achieve this. For example, below is a collection of four Gaussian-modeled clusters of data samples, each with its own mean vector and covariance matrix.

Four separate Gaussian distributions
Four separate clusters on the four distributions (Image by Author)
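A figure like the one above could be reproduced with a sketch along these lines, assuming four made-up mean vectors and covariance matrices:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical parameters of four Gaussian-modeled clusters
means = [np.array([0, 0]), np.array([6, 0]),
         np.array([0, 6]), np.array([6, 6])]
covs  = [np.array([[1.0, 0.0], [0.0, 1.0]]),    # circular
         np.array([[3.0, 0.0], [0.0, 0.5]]),    # axis-aligned ellipse
         np.array([[1.0, 0.8], [0.8, 1.0]]),    # tilted ellipse
         np.array([[0.5, 0.0], [0.0, 3.0]])]    # tall ellipse

# Draw 200 samples from each cluster's Gaussian
clusters = [rng.multivariate_normal(m, c, size=200) for m, c in zip(means, covs)]
X = np.vstack(clusters)
print(X.shape)   # (800, 2)
```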

How do we combine these Gaussian modeled clusters?

Though we have represented the four clusters above, how do we combine them? Shouldn't there be a relationship that captures the importance of each cluster relative to the others? This leads us to another parameter, denoted by π, that quantifies the relative proportion (weight) of each cluster in the system. As this is a relative scale, we ensure that the sum of all π in the system equals 1.

Proportion weight parameter

So to represent each cluster (or mixture) component, we use a total of three parameters (mean μ, covariance ∑, and proportion weight π), which are collectively known as the cluster parameters.
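Putting the three cluster parameters together, here is a small sketch of the generative view of a mixture, with made-up values for π, μ and ∑: first pick a cluster according to the weights π, then draw the sample from that cluster's Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up cluster parameters: the weights pi sum to 1
pis   = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
covs  = np.array([np.eye(2), 2 * np.eye(2),
                  np.array([[1.0, 0.5], [0.5, 1.0]])])

def sample_from_mixture(n):
    # Step 1: choose a cluster index for each sample according to pi
    ks = rng.choice(len(pis), size=n, p=pis)
    # Step 2: draw each sample from the chosen cluster's Gaussian
    return np.array([rng.multivariate_normal(means[k], covs[k]) for k in ks])

print(sample_from_mixture(5))
```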

How do we track the soft assignments?

We will have to add one more piece of terminology. We need to express the probability of assigning every data sample to each of the existing clusters in the system. We capture this soft assignment with a responsibility vector, denoted by rᵢₖ, which translates to the amount of responsibility cluster k takes for data sample i. As these are probabilities, the sum of the responsibilities associated with an individual data point equals 1. So, we have,

Soft assignment parameter

Formulation of model

Now that we have our Gaussian distributions ready to represent the clusters, let us extract some useful information out of the model. The first question we will try to answer is: if you randomly select a data sample (say sample i), with what probability does it belong to each of these clusters (say a total of k clusters)? The answer is simply the weight proportion π of each cluster, as it translates to the importance/dominance of that cluster in the system. If we denote the (unknown) cluster assignment of sample i by zᵢ, then mathematically,

This equation, p(zᵢ = k) = πₖ, in other words corresponds to the prior probability of assigning a data sample to cluster k.

Now let us go a step further. Imagine we got to know that our previously seen iᵗʰ sample indeed belongs to cluster k (that is, zᵢ = k). Let us then compute the probability of observing a particular configuration of its input vector (denoted by xᵢ). The answer is simply the Gaussian distribution function of cluster k, since that distribution is, in other words, a description of the collection of samples present inside the cluster, and the samples here are made up of input vectors. So mathematically, we have,

This equation, p(xᵢ | zᵢ = k) = N(xᵢ | μₖ, ∑ₖ), corresponds to the likelihood of occurrence of xᵢ given that sample i belongs to cluster k.

Now that we have defined our model, comprising the cluster parameters and the soft assignments, we need to devise an algorithm that establishes the relations between the various cluster parameters and finally produces soft assignments for all the individual data points. Let us approach this problem by leveraging the mathematical power of assumptions, in two steps.

Step 1: Assume, cluster parameters are known

Though in reality we are unaware of them, let us assume for a while that we know the values of the cluster parameters {μ, ∑ and π} of all clusters, and consequently that the Gaussian distribution functions are known. We now have to compute the soft assignment probabilities. The entry of the responsibility vector for a particular data sample i and cluster k translates to the likelihood of observing xᵢ under cluster k, amplified by the weighing factor (prior probability) of that cluster. So mathematically, this is put as rᵢₖ ∝ πₖ · N(xᵢ | μₖ, ∑ₖ).

Since this is a probability value, we have to normalize this expression over all possible cluster assignments for every element of the responsibility vector, giving rᵢₖ = πₖ · N(xᵢ | μₖ, ∑ₖ) divided by the sum of πⱼ · N(xᵢ | μⱼ, ∑ⱼ) over all clusters j.
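Here is a hedged sketch of this E-step in NumPy/SciPy, assuming hypothetical arrays pis, means and covs holding the cluster parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, means, covs):
    """Compute the responsibility matrix r[i, k] = p(z_i = k | x_i)."""
    n_samples, n_clusters = X.shape[0], len(pis)
    resp = np.zeros((n_samples, n_clusters))
    for k in range(n_clusters):
        # Unnormalized responsibility: prior weight times Gaussian likelihood
        resp[:, k] = pis[k] * multivariate_normal(means[k], covs[k]).pdf(X)
    # Normalize so that each row (one data sample) sums to 1
    return resp / resp.sum(axis=1, keepdims=True)
```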

Step 2: Assume, soft assignments are known

Previously, we assumed the cluster parameters were known and found a way to calculate the soft assignments. In this step, let us go the reverse way: find a way to calculate the cluster parameters by assuming the soft assignments are known.

To simplify the calculation procedure, we will temporarily assume that every data sample's assignment to a cluster is absolute (hard), just like in k-means. Now, we need to calculate {μ, ∑ and π} for every cluster using only the data samples belonging to that cluster. This is a direct application of maximum likelihood estimation, in which we find the parameter values that best fit the objective function.

However, we will take a shortcut: the maximum likelihood estimates of the mean vector and covariance matrix of a Gaussian distribution are already well known (the sample mean and sample covariance of the points assigned to the cluster), so we will not re-derive them using the standard MLE procedure with an optimizer. The estimated values of the mean and covariance are:

Estimated cluster parameters using hard assignments

Nₖ denotes the number of samples in cluster k and N is the total number of samples in the dataset. The estimated proportion weight πₖ is simply the probability of a data sample belonging to cluster k, i.e. the fraction Nₖ/N. So,

Estimated proportion weight using hard assignments

Now, with the estimated cluster parameter equations ready, let us solve the original problem, in which the cluster assignments are soft rather than hard (absolute). For this, we can build on top of the equations we already derived for hard assignments. With hard assignments, each xᵢ contributed fully to exactly one cluster k; now we simply downscale that absolute (100% or 0%) contribution to its share of relative responsibility (0.0 to 1.0). The updated cluster parameter estimates with soft assignments are:

Estimated cluster parameters using soft assignments

As a curiosity exercise, you can reverse engineer these equations back into the hard-assignment versions by making the responsibility values absolute (1.0 or 0.0).
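And here is a matching hedged sketch of this M-step in NumPy, assuming the responsibility matrix resp comes from the e_step sketch above:

```python
import numpy as np

def m_step(X, resp):
    """Re-estimate (pi, mean, covariance) of every cluster from soft assignments."""
    n_samples, n_features = X.shape
    # Effective number of samples "owned" by each cluster
    n_k = resp.sum(axis=0)                               # shape (K,)
    pis = n_k / n_samples                                # proportion weights
    means = (resp.T @ X) / n_k[:, None]                  # responsibility-weighted means
    covs = []
    for k in range(len(n_k)):
        diff = X - means[k]
        # Responsibility-weighted covariance of cluster k
        covs.append((resp[:, k, None] * diff).T @ diff / n_k[k])
    return pis, means, np.array(covs)
```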

Expectation-Maximization algorithm

We have observed that if we fix the cluster parameters to some values, we can compute the soft assignments at that time step. Similarly, when we fix the soft assignments, we can obtain the cluster parameter values at that time step. These form the two fundamental steps, which we run iteratively for a finite number of times to create the algorithm of Expectation-Maximization (EM). The only open question is which values to start with at time step 0. Just as at the start of the k-means clustering procedure, we can randomly initialize the cluster parameters and then get going with the EM algorithm.

Let us formally define the two steps we have seen.

E-step: This is described in step 1 of our explanation. In this step, we estimate the cluster responsibility vectors for the given cluster parameter estimates at that time step.

M-step: This is described in step 2 of our explanation. In this step, we maximize the likelihood over cluster parameters for the given responsibility vectors at that time step.

Lastly, we keep repeating the above steps until the likelihood of the model under the current cluster parameters and soft assignments converges to a local mode. In simpler terms, we stop when the changes in the values of the soft assignments and cluster parameters between successive E-steps and M-steps fall below a threshold level.
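Putting the two steps together, here is a hedged sketch of the full EM loop, reusing the hypothetical e_step and m_step functions from the sketches above and monitoring the log-likelihood to detect convergence:

```python
import numpy as np
from scipy.stats import multivariate_normal

# e_step and m_step are the functions sketched earlier in this article

def log_likelihood(X, pis, means, covs):
    # log p(X) = sum_i log( sum_k pi_k * N(x_i | mu_k, Sigma_k) )
    density = sum(pis[k] * multivariate_normal(means[k], covs[k]).pdf(X)
                  for k in range(len(pis)))
    return np.log(density).sum()

def fit_gmm(X, n_clusters, n_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Crude random initialization, similar in spirit to k-means initialization
    pis = np.full(n_clusters, 1.0 / n_clusters)
    means = X[rng.choice(len(X), n_clusters, replace=False)]
    covs = np.array([np.cov(X.T) for _ in range(n_clusters)])

    prev_ll = -np.inf
    for _ in range(n_iter):
        resp = e_step(X, pis, means, covs)          # E-step
        pis, means, covs = m_step(X, resp)          # M-step
        ll = log_likelihood(X, pis, means, covs)
        if ll - prev_ll < tol:                      # converged to a local mode
            break
        prev_ll = ll
    return pis, means, covs, resp
```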

In this article, we have explored the unsupervised learning method of clustering with Gaussian Mixture Modeling and its underlying algorithm of Expectation-Maximization. Do share your views about the contents of this article.
