Getting Started

I like the kind of math that can be explained to me like I’m five years old.
In this Article
My take on explaining, like I’m five years old, the math behind a key component of Gaussian Mixture Models (GMM) known as Expectation-Maximization (EM) and how to translate the concepts to Python. The focus of this story is on the M of EM, or the M-Step.
Note: This is not a comprehensive explanation of the end-to-end GMM algorithm. For a deeper dive, check out this article from Towards Data Science, another one on GMM, the documentation from scikit-learn, or Wikipedia.
Source: Based on my notes from studying Machine Learning; the source materials are derived from and credited to this university class.
Expectation Maximization
There is a series of steps in GMM that is often referred to as Expectation Maximization, or "EM" for short. To explain how to understand the EM math, first consider a mental model of what you might be dealing with.

- There are samples, represented by points on a graph.
- The points form a few distinct blobs.
- Each blob has a center and each point is some distance away from each blob’s center.
- Given a GMM fit to the data, the goal is generally to label new samples by their points based on the closest center.
- Some points are nearly equally far from two or more centers; as a result, we want to label points based on some kind of probability (a short sketch of this kind of data follows this list).
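To make this mental model concrete, here is a small sketch of how data like this could be generated; it assumes scikit-learn’s make_blobs is available, and the sample count and random seed are arbitrary choices.
# a minimal sketch of the mental model above (assumes scikit-learn)
import numpy as np
from sklearn.datasets import make_blobs

# three distinct blobs of (x, y) points, plus the label of the blob
# each point was drawn from
points, labels = make_blobs(n_samples=100, centers=3, random_state=42)

# stack them into one array: columns 0 and 1 are the (x, y) coordinates,
# column 2 is the cluster label - the same shape as the samples shown later
samples = np.column_stack([points, labels])
print(samples[:4])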
The Language of EM
To learn how to learn about machine learning algorithms, you need some Greek in your life. No, not college fraternity "Greek Life" – I am talking about the meaning of Greek symbols in algorithms. Although it may be tempting to gloss over the basics, establishing a simple grasp of the individual Greek letters can help you understand important concepts in algorithms.
The Greeks
Algorithms can be intimidating and downright confusing. For example, at first glance, a high concentration of Greek symbols is sometimes strong enough to make a person black out. However, instead of passing out, consider taking in the Greek one symbol at a time.

The English
Not to be left out, we also have some English letters that carry meaning during EM for GMM. Usually, English letters surround Greek letters like little pilot fish swimming around big sharks. Like little fish, English letters serve an important purpose and provide a guide for how to interpret the algorithm.

The Math of M-Step
Now that we’ve isolated each component of the equation, let’s combine them into some common mathy phrases that are important for conversing in the language of EM, by examining the M-Step.
Clusters, Gaussians, the Letter J or K, and sometimes C: These are all generally the same thing – if we have 3 clusters, then you might hear "for every Gaussian", "for every j", "for every Gaussian j", or "for each of the K components" – these are all different ways to talk about the same 3 clusters. In terms of data, we could plot an array of (x, y) samples/points and see how they form clusters.
# a 2D array of samples [features and targets]
# the last column, targets [0,1,2], represent three clusters
# the first two columns are the points that make up our features
# each feature is just a set of points (x,y) in 2D space
# each row is a sample and cluster label
[[-7.72642091 -8.39495682 2. ]
[ 5.45339605 0.74230537 1. ]
[-2.97867201 9.55684617 0. ]
[ 6.04267315 0.57131862 1. ] ...]
Soft Assignments, Probability, Responsibility: The big idea with clustering is that we want to find a number for each sample that tells us which cluster the sample belongs to. In GMM, for every sample we evaluate, we might return values that represent the "responsibility of each Gaussian j", the "soft assignment" of each, or "the probability" of each.
These phrases are all generally about the same thing, but with a key distinction between responsibility and probability.
# an array of assignment data about the 2D array of samples
# each column represents a cluster
# each row represents data about each sample
# in each row, we have the probability that a sample belongs to one of three clusters - it adds up to 1 (as it should)
# but the sum of each column is a big number (not 1)
print(assignments)
# sample output: an array of assignment data
[[1.00000000e+000 2.82033618e-118 1.13001412e-070]
[9.21706438e-074 1.00000000e+000 3.98146031e-029]
[4.40884339e-099 5.66602768e-053 1.00000000e+000]...]
print(np.sum(assignments[0]))
# sample output: the sum across each row is 1
1
print(np.sum(assignments[:, 0]))
# sample output: the sum in each col is a big number that varies
# Little Gamma: the really small numbers in each column
# Big Gamma: the sum of each column, or 33.0 in this sample
33.0
Big Gamma, Little Gamma, J, N, x, and i: The core set of tasks in EM is to optimize three sets of parameters for every cluster, or "for every j, optimize the w (𝓌), the mew (𝜇), and the variance (𝜎)." In other words, what is the cluster’s weight (𝓌), the cluster’s center point (𝜇), and the cluster’s variance (𝜎)?
- For weight (𝓌), we have Big Gamma divided by the total number of samples, N (the number of rows in the features array). From earlier, we know that Big Gamma for every cluster j is just the result of adding up every sample’s assignment value for that cluster (this is the number that does not add up to 1). Figure 3. A small sketch of this computation appears below.
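As a small sketch of that computation, reusing the assignments and features arrays from above, the weight of cluster 0 might be computed like this:
# for figure 3 - weight (w)
# a sketch of the weight of cluster 0
# Big Gamma is the sum of the column of little gammas for cluster 0
big_gamma = np.sum(assignments[:, 0])
# N is the total number of samples (the number of rows in features)
n_samples = features.shape[0]
weight = big_gamma / n_samples
# returns a single number between 0 and 1
# the weights of all clusters add up to 1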

- For mew (𝜇), instead of adding up all the Little Gammas into a single Big Gamma as we did earlier, do matrix multiplication of the Little Gammas by the features x for each cluster j and each sample i, then divide by Big Gamma. Figure 4.
- Remember, the mew is just the center point of each cluster – if we have 3 clusters and our samples are all (x, y) coordinates, then the mew is going to be an array of 3 (x, y) coordinates, one for each cluster.

# for figure 4 - mew (mu)
# same array of assignment data as before
# each column is a cluster of little gammas
print(assignments)
[[1.00000000e+000 2.82033618e-118 1.13001412e-070]
[9.21706438e-074 1.00000000e+000 3.98146031e-029]
[4.40884339e-099 5.66602768e-053 1.00000000e+000]...]
# the little gammas of cluster 0 is just column 0
[[1.00000000e+000 ]
[9.21706438e-074 ]
[4.40884339e-099 ]...]
# same array of sample data as before
# the first two columns are the x,y coordinates
# the last column is the cluster label of the sample
print(features)
[[-7.72642091 -8.39495682 2. ]
[ 5.45339605 0.74230537 1. ]
[-2.97867201 9.55684617 0. ]
[ 6.04267315 0.57131862 1. ] ...]
# for features, we just need its points
[[-7.72642091 -8.39495682 ]
[ 5.45339605 0.74230537 ]
[-2.97867201 9.55684617 ]
[ 6.04267315 0.57131862 ] ...]
# if using numpy (np) for matrix multiplication
# note: features here means just the (x, y) points, without the label column
# for cluster 0 ...
big_gamma = np.sum(assignments[:, 0])
mew = np.matmul(assignments[:, 0], features) / big_gamma
# repeating this for every cluster j returns an array of mew,
# one (x, y) center point per cluster
[[-2.66780392 8.93576069]
[-6.95170962 -6.67621669]
[ 4.49951001 1.93892013]]
- For variance (𝜎), consider that by now, we have points and center points – with variance, we are basically evaluating the distance from each sample’s points (x for every i) to each cluster’s center point (mew for every j), weighted by the Little Gammas. In the language of EM, some might say "Little Gamma times x_i minus mew_j squared, over Big Gamma j."

# for figure 5 - variance
# a sampling of variance for cluster 0 of n clusters
# given arrays for features and assignments...
# x_i is just the (x, y) points of each sample, without the label column
x_i = features
big_gamma = np.sum(assignments[:, 0])
mew = np.matmul(assignments[:, 0], x_i) / big_gamma
numerator = np.matmul(assignments[:, 0], (x_i - mew) ** 2)
variance = numerator / big_gamma
# repeating this for every cluster j returns an array of variance
[[0.6422345 1.06006186]
[0.65254746 0.9274831 ]
[0.95031461 0.92519751]]
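Putting the three pieces together, here is a rough sketch of what a full M-Step might look like as one function that loops over every cluster j. The function name m_step and the variable names are my own; it assumes assignments holds the little gammas and points holds only the (x, y) coordinates, without the label column.
# a sketch of the full M-Step: "for every j, optimize w, mew, and variance"
import numpy as np

def m_step(points, assignments):
    n_samples, n_clusters = assignments.shape
    weights, mews, variances = [], [], []
    for j in range(n_clusters):
        little_gammas = assignments[:, j]
        big_gamma = np.sum(little_gammas)
        # weight: Big Gamma over the total number of samples
        weights.append(big_gamma / n_samples)
        # mew: little gammas times the points, over Big Gamma
        mew = np.matmul(little_gammas, points) / big_gamma
        mews.append(mew)
        # variance: little gammas times the squared distances, over Big Gamma
        variances.append(np.matmul(little_gammas, (points - mew) ** 2) / big_gamma)
    return np.array(weights), np.array(mews), np.array(variances)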
Assignments (the E-Step)
The preceding steps are all about the M-Step in EM, or Maximization – everything about weights, mew, and variance is about optimization. However, what about the initial assignments array? How do we get an array of probabilities for each sample? This is the E-Step in EM, or Expectation.
Although the E-Step is not covered in detail in this story, I can leave you with a good reference to Wikipedia and the following intuition.
In the E-Step, we try to guess the assignments of each point with Bayes’ Rule – this produces the array of values that indicate the responsibility, or probability, of each point for each Gaussian. At first, the guessed values in assignments (which are the posteriors) are far off, but after cycling through the E-Step and M-Step, the guesses get better and move closer to a good fit of the data.
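To make that intuition a bit more concrete, here is a rough sketch of what such an E-Step might look like for this 2D setup; the function name e_step and the use of scipy.stats.norm are my own choices, and weights, mews, and variances are the per-cluster parameters from the M-Step above.
# a sketch of the E-Step with Bayes' Rule: for each sample, take
# prior (weight) times likelihood (Gaussian pdf), then normalize per row
import numpy as np
from scipy.stats import norm

def e_step(points, weights, mews, variances):
    n_samples = points.shape[0]
    n_clusters = len(weights)
    likelihoods = np.zeros((n_samples, n_clusters))
    for j in range(n_clusters):
        # per-dimension Gaussian pdfs, multiplied across x and y
        pdf = norm.pdf(points, loc=mews[j], scale=np.sqrt(variances[j]))
        likelihoods[:, j] = weights[j] * np.prod(pdf, axis=1)
    # each row of soft assignments now adds up to 1
    return likelihoods / likelihoods.sum(axis=1, keepdims=True)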
The GMM algorithm repeats the E-Step and M-Step until convergence. Convergence, for instance, might mean reaching a maximum number of iterations or seeing the differences between each round of guesses become really small. The result, hopefully, is a set of soft-assignment labels for each sample in the data.
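Tying it together, a rough sketch of that loop might look like the following; it reuses the e_step and m_step sketches above and assumes some initial guesses for the parameters, an arbitrary iteration cap, and a small tolerance.
# a sketch of the EM loop: alternate E-Step and M-Step until the guesses
# barely change, or a maximum number of iterations is reached
import numpy as np

def fit_gmm(points, weights, mews, variances, max_iter=100, tol=1e-6):
    assignments = e_step(points, weights, mews, variances)
    for _ in range(max_iter):
        weights, mews, variances = m_step(points, assignments)
        new_assignments = e_step(points, weights, mews, variances)
        # stop when the guesses barely change between rounds
        change = np.max(np.abs(new_assignments - assignments))
        assignments = new_assignments
        if change < tol:
            break
    return assignments, weights, mews, variances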
Conclusion
In this article, I present a beginner’s approach to navigating the part of the Expectation-Maximization phase of the Gaussian Mixture Model algorithm known as the M-Step. Although the math seems too hot to handle on the surface, we can manage the complexity by understanding its individual parts. For instance, steps as simple as learning to pronounce the Greek symbols and applying their operations with NumPy go a long way toward grasping the overarching concepts.
Thanks for reading, hope these explanations help you learn and understand GMM. Let me know if I can make any improvements or cover new topics.