
A Primer on the EM Algorithm

Christian Zuniga, PhD

Figure 1. Example EM model used for mixture models [1]

The Expectation-Maximization (EM) algorithm is one of the main algorithms in machine learning for estimating model parameters [2][3][4]. For example, it is used to estimate the mixing coefficients, means, and covariances in mixture models, as shown in Figure 1. Its objective is to maximize the likelihood p(X|θ), where X is a matrix of observed data and θ is a vector of model parameters. This is maximum likelihood estimation, and in practice it is the log-likelihood ln p(X|θ) that is maximized (see the expression after the list below). The model parameters that maximize this function are taken to be the correct model parameters. EM is a useful alternative to gradient descent and may have several advantages.

  1. It enforces constraints more naturally, such as keeping a covariance matrix positive definite.
  2. It may converge faster in many situations, such as for mixture models.
  3. It provides a way to deal with missing data.
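
In symbols, maximum likelihood estimation seeks

$$
\hat{\theta} = \arg\max_{\theta} \; \ln p(X \mid \theta)
$$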

Unfortunately, the Em Algorithm may not be as easy to understand to the beginner. The first step in the EM algorithm is to assume there are hidden variables Z that help generate the data. Only the data X is directly observed. The hidden variables help break the problem into two, sometimes simpler steps, Expectation and Maximization. The likelihood can be written in terms of the hidden variables Z by marginalizing over the joint distribution p(X, Z | θ ). The sum is over all possible values of Z.
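
Written out, the marginalized log-likelihood is

$$
\ln p(X \mid \theta) = \ln \sum_{Z} p(X, Z \mid \theta)
$$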

There are three main concepts to understand to derive the EM algorithm: Bayes’ Theorem for more than two variables, Jensen’s inequality for concave functions, and the Kullback-Leibler divergence to compare probability distributions.

Bayes’ Theorem plays an important role in the EM algorithm. Many are familiar with the theorem for two variables, but what happens with three? Although the form is more complex, the definition of conditional probability still applies, and everything can be expressed in terms of the joint distribution p(X, Z, θ), as shown below.
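
The joint distribution can be factored with the chain rule in any order, for example

$$
p(X, Z, \theta) = p(X \mid Z, \theta)\, p(Z \mid \theta)\, p(\theta) = p(Z \mid X, \theta)\, p(X \mid \theta)\, p(\theta)
$$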

The joint distribution is the same regardless of the order of the variables, and that is the basis for manipulating all the conditional probabilities. In the EM algorithm, Z is going to be estimated using X and θ, so the conditional probability p(Z|X, θ) is of interest. Using the joint distribution from the previous equation, the posterior distribution of Z can be expressed as shown below.
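
From the definition of conditional probability,

$$
p(Z \mid X, \theta) = \frac{p(X, Z, \theta)}{p(X, \theta)} = \frac{p(X \mid Z, \theta)\, p(Z \mid \theta)\, p(\theta)}{p(X, \theta)}
$$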

The equation can be simplified by using p(X | θ) = p(X, θ)/p(θ).
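
This gives the familiar form of Bayes’ theorem, now conditioned on θ:

$$
p(Z \mid X, \theta) = \frac{p(X \mid Z, \theta)\, p(Z \mid \theta)}{p(X \mid \theta)}
$$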

Jensen’s inequality is the second result needed to obtain the EM algorithm. For a concave function f(x), the expectation of f(x), E[f(x)], is less than or equal to the function applied to the expectation, f(E[x]). Succinctly, E[f(x)] ≤ f(E[x]).

Concave functions, like the logarithm, are those that ‘bulge’ outward as in Figure 2. The function is always greater than or equal to the chord. Mathematically, for any point p between x and y, f(p) ≥ L(p), where L(p) is the value of the chord evaluated at p. The point p (or any point between x and y) can be expressed as a weighted combination of x and y, p = w1 x + w2 y, where the weights are non-negative and sum to 1. Because the chord is a straight line through (x, f(x)) and (y, f(y)), its value at p is the same weighted combination of the endpoint values, L(p) = w1 f(x) + w2 f(y). So for a concave function, f(w1 x + w2 y) ≥ w1 f(x) + w2 f(y).

Figure 2. A concave function lies above any chord.

The next step in showing Jensen’s inequality follows since an expectation can be thought of as a weighted sum, with the probability distribution giving the weights. In general, the distribution will give more than two weights, but the previous result can be generalized by induction to many weights.
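
For a discrete distribution p(x) and a concave function f, this reads

$$
E[f(x)] = \sum_{i} p(x_i)\, f(x_i) \;\le\; f\!\left(\sum_{i} p(x_i)\, x_i\right) = f(E[x])
$$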

The final concept to understand is the Kullback-Leibler (KL) divergence, which is a measure of the dissimilarity between two probability distributions q(z) and p(z). The KL divergence is always greater than or equal to zero, with equality when the two distributions are equal, q(z) = p(z). It is not symmetric, KL(p||q) ≠ KL(q||p), so it is not a distance measure. KL(p||q) can be interpreted as the number of extra bits needed to encode data drawn from the true distribution p(z) when the code is built from q(z) instead.
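
For discrete distributions the KL divergence is defined as

$$
\mathrm{KL}(q \,\|\, p) = \sum_{z} q(z) \ln \frac{q(z)}{p(z)} \;\ge\; 0
$$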

With these three concepts mastered, the EM algorithm is more easily understood. The hidden variables Z are also given a distribution q(Z). Introducing q(Z) into the log-likelihood as q(Z)/q(Z) = 1 and using the fact that the log function is concave, Jensen’s inequality gives an expectation with respect to the distribution q(Z).
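
Applying Jensen’s inequality to the marginalized log-likelihood gives a lower bound, written L(q, θ):

$$
\ln p(X \mid \theta) = \ln \sum_{Z} q(Z)\, \frac{p(X, Z \mid \theta)}{q(Z)} \;\ge\; \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)} = L(q, \theta)
$$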

The inequality gives a lower bound on the log-likelihood. The difference between the log-likelihood and L(q, θ) can be seen to be the KL divergence between q(Z) and p(Z|X, θ). The expansion p(X, Z|θ) = p(Z|X, θ) p(X|θ) is used to prove this.
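
Substituting that expansion into L(q, θ) gives the standard decomposition

$$
\ln p(X \mid \theta) = L(q, \theta) + \mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X, \theta)\big)
$$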

There is exact equality when the KL divergence is zero, that is, when q(Z) = p(Z|X, θ).

Maximizing the log-likelihood can be achieved indirectly by maximizing L(q, θ). There are two options for maximizing L: by varying q, or by varying the parameters θ. It is simpler to vary each independently. As the name implies, the EM algorithm has two steps, Expectation and Maximization, and proceeds iteratively. First the parameters are initialized. The distribution q(Z) is updated in the Expectation step, holding the parameters θ fixed at θ_old. This simply involves setting q(Z) = p(Z|X, θ_old). Then the parameters θ are updated in the Maximization step to maximize L(q, θ), holding q(Z) fixed.

Expectation

Set q(Z) = p(Z|X, θ_old) and substitute into L.

Maximization

Maximize L(p(Z|X, θ_old), θ) by holding q(Z) fixed and varying θ.
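
With q(Z) fixed at p(Z|X, θ_old), the only part of L that depends on θ is the expected complete-data log-likelihood, often written Q(θ, θ_old), so the Maximization step amounts to

$$
\theta_{\text{new}} = \arg\max_{\theta}\; Q(\theta, \theta_{\text{old}}), \qquad Q(\theta, \theta_{\text{old}}) = \sum_{Z} p(Z \mid X, \theta_{\text{old}})\, \ln p(X, Z \mid \theta)
$$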

The two steps are guaranteed never to decrease the likelihood, and the iterations converge to at least a local maximum (more precisely, a stationary point) of the likelihood. The maximization step may proceed in different ways depending on the form of the distributions. For example, in Gaussian mixture models, a modification of the usual maximum likelihood estimates gives analytical results. In other situations there may not be an analytical solution, but solving the Maximization step may still be simpler than the original problem.
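
As a rough illustration of the two steps, here is a minimal sketch of EM for a one-dimensional mixture of two Gaussians (NumPy only; the synthetic data and all variable names are made up for illustration, and details such as initialization and convergence checking are kept deliberately simple):

```python
import numpy as np

def em_gmm_1d(x, n_iter=100, tol=1e-6, seed=0):
    """Minimal EM for a 1-D mixture of two Gaussians (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initialize theta_old: mixing coefficients, means, and variances.
    pi = np.array([0.5, 0.5])
    mu = rng.choice(x, size=2, replace=False)
    var = np.array([x.var(), x.var()])
    prev_ll = -np.inf

    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] = p(z_n = k | x_n, theta_old).
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        weighted = pi * dens                      # shape (N, 2)
        r = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: update theta to maximize the expected complete-data log-likelihood.
        nk = r.sum(axis=0)                        # effective number of points per component
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

        # The log-likelihood never decreases; stop when it stops improving.
        ll = np.log(weighted.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    return pi, mu, var

# Synthetic data drawn from two Gaussians (for illustration only).
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(data))
```

The E-step fills in the responsibilities, playing the role of q(Z) = p(Z|X, θ_old), and the M-step updates are the weighted versions of the usual Gaussian maximum likelihood estimates.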

References

[1] Wikipedia, "Expectation–maximization algorithm", https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm

[2] Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B, 39(1): 1–38. JSTOR 2984875.

[3] Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.

[4] Murphy, K.P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

