Introduction
The goal of this post is to explain a powerful algorithm in statistical analysis: the Expectation-Maximization (EM) algorithm. It is powerful in the sense that it can handle missing data and unobserved (latent) features, situations that come up frequently in many real-world applications.
Upon finishing the post, you will be able to understand many research papers that solve interesting problems using Probabilistic Graphical Models, and also acquire the ability to develop and train such models yourself.
Pre-requisites
- Patience (since this post is detailed and dives deep into mathematical concepts).
- Basic Probability concepts.
- Logistic Regression.
- Iterative optimization techniques like Gradient Ascent.
The case of observed features
Before delving into complex use-cases, it would be helpful to first understand how parameters are estimated in simple use-cases. Thus, we first consider the case where our data does not contain any missing values and our model does not have any latent features. To that end, we will use the logistic regression model as our running example.
Suppose we have a dataset where each data point consists of a d-dimensional feature vector X and an associated target variable Y ∈ {0,1}. A graphical representation of this model is shown below:

As we know, the prediction probability of the target variable in logistic regression is given by the sigmoid function:

P(Y=1 | X=x) = σ(w.x) = 1 / (1 + exp(-w.x)),   so that   P(Y=0 | X=x) = 1 - σ(w.x) = σ(-w.x)   (1)
where w is the weight vector to be estimated. Once we have estimated w, we can produce outputs for unseen data points depending upon whether Y=0 or Y=1 gets more probability as per equation 1.
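To make the prediction rule concrete, here is a minimal NumPy sketch; the weight vector and data point below are made-up values for illustration, not estimates learned from any dataset.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# Made-up weight vector and a new data point, purely for illustration.
w = np.array([0.5, -1.2, 2.0])
x_new = np.array([1.0, 0.3, 0.7])

p_y1 = sigmoid(w @ x_new)        # P(Y=1 | X=x_new), as in equation 1
prediction = int(p_y1 >= 0.5)    # pick whichever class gets more probability
print(f"P(Y=1 | x) = {p_y1:.3f} -> predicted class {prediction}")
```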
However, the central question here is: how can we estimate w such that we produce accurate predictions? To that end, we will use the Maximum Likelihood approach.
The premise of the Maximum Likelihood approach is simple: find parameters w that maximize the likelihood (or probability) of the observed data.
For mathematical convenience, we’ll consider maximizing the log of the likelihood. Parameterized by w, the log-likelihood can be written as:

L(w) = ln Π_i P(Y=y_i | X=x_i) = Σ_i ln P(Y=y_i | X=x_i)   (2)
Here the likelihood of the observed data is written as the product of the likelihoods of the individual data points, under the assumption that the data samples are independently and identically distributed (the so-called i.i.d. assumption). We also used the property that the log of a product of terms is equal to the sum of the logs of the individual terms. Substituting equation 1 in equation 2 leads to:

L(w) = Σ_{i: y_i=1} ln σ(w.x_i) + Σ_{i: y_i=0} ln σ(-w.x_i)   (3)
where we used the fact that P(Y=y_i | X=x_i) evaluates to σ(w.x_i) when y_i=1, and to σ(-w.x_i) when y_i=0 (this can be verified from equation 1).
At this point, we should note that the log-likelihood L(w) conveniently breaks down into a per-instance form, with no summation inside the log coupling the parameters, which greatly eases the optimization. We will later see that this might not be achievable in many realistic scenarios.
Since L(w) is a non-linear function of w, there is no closed-form solution to equation 3. Thus, we have to use iterative optimization methods like Gradient Ascent to find w. A gradient ascent update would look like:

w ← w + η ∇_w L(w)   (4)
where η is the learning rate. We repeat the update in equation 4 until convergence, and the w obtained at the end is called the maximum likelihood estimate.
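To ground the whole procedure (equations 2 to 4), here is a minimal NumPy sketch of maximum-likelihood estimation for logistic regression via gradient ascent on synthetic data; the data-generating weights, learning rate, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic i.i.d. data: d-dimensional features X and binary targets Y.
n, d = 500, 3
true_w = np.array([1.5, -2.0, 0.5])       # illustrative data-generating weights
X = rng.normal(size=(n, d))
Y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

def log_likelihood(w, X, Y):
    # Equation 3, written equivalently as sum_i [ y_i ln σ(w.x_i) + (1 - y_i) ln σ(-w.x_i) ]
    a = X @ w
    return np.sum(Y * np.log(1.0 / (1.0 + np.exp(-a))) +
                  (1 - Y) * np.log(1.0 / (1.0 + np.exp(a))))

w = np.zeros(d)             # initial guess for the weights
eta = 0.005                 # learning rate
for step in range(5000):    # repeat the update of equation 4 until (approximate) convergence
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # σ(w.x_i) for every data point
    grad = X.T @ (Y - p)                 # gradient of the log-likelihood w.r.t. w
    w = w + eta * grad                   # gradient ascent update (equation 4)

print("maximum likelihood estimate:", np.round(w, 2))
print("log-likelihood at the estimate:", round(log_likelihood(w, X, Y), 2))
```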
The case of latent features
Now we are ready to dive deep into more complex and realistic use-cases. In most realistic scenarios, it is common to have missing values in the dataset or to choose sophisticated models that have latent (unobserved) features which relate to the observed features in the data.
One such model that uses latent features is the Hidden Markov model. It is widely used and has an array of real-world applications like speech recognition, handwriting recognition, gesture recognition, part-of-speech tagging, musical score following, time series analysis and so on.
However, the question is: does having latent features make any difference in the estimation of parameters? It turns out, yes. Estimating model parameters does get a little tricky when latent features (or missing data) are involved. Let’s see why.
The issue
Let V be the set of observed variables*, Z be the set of latent variables, and θ be the set of model parameters. If we consider the maximum likelihood approach for parameter estimation, our objective function would be:

L(θ) = ln p(V | θ) = ln Σ_Z p(V, Z | θ)   (5)
Comparing equation 5 with equation 2, we can see that in equation 5 the parameters are coupled because of the summation inside the log. This makes direct optimization using Gradient Ascent (or any other iterative optimization technique) considerably harder. This means that many realistic scenarios need a more powerful technique to infer the parameters.
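To see this coupling concretely, here is a short worked illustration for a generic mixture with a discrete latent variable Z taking K values (no such specific model is introduced in the post; θ_k' below denotes the parameters that appear only in component k').

```latex
% Marginal log-likelihood of one observed point v, with Z summed out:
\ln p(v \mid \theta) = \ln \sum_{k=1}^{K} p(Z=k \mid \theta)\, p(v \mid Z=k, \theta)

% Gradient with respect to theta_{k'}, the parameters appearing only in component k'.
% The full sum over k sits in the denominator, so every component's parameters
% enter every gradient: the parameters are coupled.
\frac{\partial \ln p(v \mid \theta)}{\partial \theta_{k'}}
  = \frac{ p(Z=k' \mid \theta)\, \frac{\partial}{\partial \theta_{k'}} p(v \mid Z=k', \theta_{k'}) }
         { \sum_{k=1}^{K} p(Z=k \mid \theta)\, p(v \mid Z=k, \theta) }
```

Contrast this with equation 3, where the log acts directly on each instance’s probability and no such normalising sum appears.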
EM Algorithm to the Rescue
Thankfully, researchers already came up with such a powerful technique, and it is known as the Expectation-Maximization (EM) algorithm. It uses the fact that optimizing the complete-data log-likelihood ln P(V, Z | θ) is much easier when we know the value of Z (which removes the summation from inside the log).
* Going forward, we consider Y as part of V for notational simplicity.
However, since the only way to know the value of Z is through the posterior P(Z | V, θ), we instead consider the expected value of the complete-data log-likelihood under the posterior distribution of the latent variables. This step of finding the expectation is called the E-step. In the subsequent M-step, we maximize this expectation with respect to θ.
Formally, the EM algorithm can be written as:
1. Choose an initial setting for the parameters θ^old.
2. E-step: Evaluate P(Z | V, θ^old).
3. M-step: Evaluate θ^new given by

θ^new = argmax_θ Q(θ, θ^old),   where   Q(θ, θ^old) = Σ_Z P(Z | V, θ^old) ln P(V, Z | θ)
4. Check for convergence of the log-likelihood or the parameter values. If not converged, set θ^old = θ^new and return to the E-step.
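As a concrete (if simplified) instance of these four steps, here is a minimal NumPy sketch of EM for a one-dimensional mixture of two Gaussians, a model the post has not introduced but which makes the E and M steps easy to see; the synthetic data and initial parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 1-D data drawn from two Gaussians; the component labels Z are hidden.
V = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

def gaussian_pdf(v, mu, var):
    return np.exp(-0.5 * (v - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Step 1: choose an initial setting for the parameters theta_old.
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

prev_ll = -np.inf
for iteration in range(200):
    # Step 2 (E-step): evaluate the posterior P(Z | V, theta_old), a.k.a. the responsibilities.
    joint = pi * np.stack([gaussian_pdf(V, mu[k], var[k]) for k in range(2)], axis=1)
    resp = joint / joint.sum(axis=1, keepdims=True)

    # Step 3 (M-step): maximize the expected complete-data log-likelihood over theta.
    Nk = resp.sum(axis=0)
    mu = (resp * V[:, None]).sum(axis=0) / Nk
    var = (resp * (V[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(V)

    # Step 4: check convergence of the (incomplete-data) log-likelihood.
    ll = np.log(joint.sum(axis=1)).sum()
    if ll - prev_ll < 1e-6:
        break
    prev_ll = ll

print(f"weights={np.round(pi, 2)}, means={np.round(mu, 2)}, variances={np.round(var, 2)}")
```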
Diving Deeper into EM
Before diving deeper, we will first derive a property that will come in handy while explaining the E and M steps.
Let us consider a distribution q(Z) over the latent variables. Independent of the choice of q(Z), we can decompose the log-likelihood of the observed data in the following fashion:

ln p(V | θ) = Σ_Z q(Z) ln p(V | θ) = Σ_Z q(Z) ln[ p(V, Z | θ) / p(Z | V, θ) ]
            = Σ_Z q(Z) ln[ p(V, Z | θ) / q(Z) ] - Σ_Z q(Z) ln[ p(Z | V, θ) / q(Z) ]
            = L(q, θ) + KL(q || p)   (6)
- In the first step, we use the fact that q(Z) sums to one and apply Bayes’ Theorem to write p(V | θ) = p(V, Z | θ) / p(Z | V, θ).
- In the second step, we multiply and divide by q(Z) inside the log and apply the “log of a product equals the sum of logs” property, which yields the two terms L(q, θ) and KL(q || p).
The second term in equation 6 is a well-known measure of dissimilarity between two distributions, known as the Kullback-Leibler (KL) divergence.
At this point, we should carefully study the form of equation 6. The first term contains the joint distribution over V and Z, whereas the second term contains the conditional distribution of Z given V. One of the properties of the KL divergence is that it is always non-negative. Using this property in equation 6, we can deduce that L(q, θ) ≤ ln p(V | θ).
This means that L(q, θ) acts as a lower bound on the log-likelihood of the observed data.
This observation will shortly help us demonstrate that the EM algorithm does indeed maximize the log-likelihood.
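Since this decomposition is the backbone of everything that follows, here is a small numerical sanity check for a toy discrete model; the probability tables below are arbitrary made-up numbers. For any choice of q(Z), the lower bound plus the KL divergence recovers ln p(V | θ) exactly, and the bound is tight only when q(Z) equals the posterior.

```python
import numpy as np

# A made-up joint distribution p(V, Z | theta) over a binary V and a 3-valued Z.
# Rows index V, columns index Z; the entries sum to one.
p_vz = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])

v = 0                              # the observed value of V
p_v = p_vz[v].sum()                # p(V=v | theta), marginalising out Z
posterior = p_vz[v] / p_v          # p(Z | V=v, theta)

def lower_bound(q):
    # L(q, theta) = sum_Z q(Z) ln[ p(V, Z | theta) / q(Z) ]
    return np.sum(q * np.log(p_vz[v] / q))

def kl(q):
    # KL(q || p) = -sum_Z q(Z) ln[ p(Z | V, theta) / q(Z) ]
    return -np.sum(q * np.log(posterior / q))

for q in [np.array([0.6, 0.3, 0.1]), np.array([1/3, 1/3, 1/3]), posterior]:
    print(f"L={lower_bound(q):+.4f}  KL={kl(q):.4f}  L+KL={lower_bound(q) + kl(q):+.4f}  "
          f"ln p(V|theta)={np.log(p_v):+.4f}")
# The KL term is zero only for q equal to the posterior, where the bound is tight.
```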
E-step
Suppose the initial value of the parameter vector θ is θ^old – (Step 1). Keeping in mind the relation given by equation 6, the E-step tries to maximize the lower bound L(q, θ^old) with respect to q while holding θ^old fixed. The solution to this maximization problem is easily seen by noting that the value of ln p(V | θ^old) does not depend on q(Z), and so the largest value of L(q, θ^old) will occur when the KL divergence vanishes, in other words when q(Z) is equal to the posterior distribution p(Z | V, θ^old) (since ln 1 evaluates to 0).
Thus, the E-step involves evaluating p(Z | V, θ^old) – (Step 2)
![Illustration of E-step. [1]](https://towardsdatascience.com/wp-content/uploads/2019/02/0jamMOiu7pyeu7Myk.jpg)
M-step
In this step, the distribution q(Z) is held fixed. If we substitute q(Z) = p(Z | V, θ^old) from the E-step into the expression for L(q, θ) (equation 6), we see that the lower bound takes the following form:

L(q, θ) = Σ_Z p(Z | V, θ^old) ln p(V, Z | θ) - Σ_Z p(Z | V, θ^old) ln p(Z | V, θ^old) = Q(θ, θ^old) + const
where the constant is simply the negative entropy of the q distribution and is therefore independent of θ.
So, in the M-step, we maximize the lower bound L(q, θ) with respect to θ to give some new value θ^new. This will cause the lower bound L(q, θ) to increase, which will necessarily cause the corresponding log-likelihood function to increase. — (Step 3)
![Illustration of M-step. [1]](https://towardsdatascience.com/wp-content/uploads/2019/02/0ig-dt5tIKgN0FQ4C.jpg)
Since the distribution q is held fixed during the M-step, it will not equal the new posterior distribution p(Z | V, θ^new), and hence there will be a non-zero KL divergence. So, we repeat the E and M steps again until convergence. — (Step 4)
Putting it all together
I know it’s a lot to digest at once, so I’ll summarize the discussion with the help of the following figure, which should help in connecting the dots.
![How E and M step maximize the likelihood. [1]](https://towardsdatascience.com/wp-content/uploads/2019/02/1OziyuPrVfZ_qBc33lNaeFA@2x.jpeg)
The red curve depicts the incomplete data log likelihood, ln p(V | θ), which we want to maximize. We start with some initial parameter value θ^old.
In the first E-step, we evaluate the posterior distribution over the latent variables, p(Z | V, θ^old), which gives rise to a lower bound L(q, θ^old) whose value equals the log-likelihood at θ^old, as shown by the blue curve. In the M-step, the bound is maximized, giving the value θ^new, which gives a larger value of the log-likelihood than θ^old. The subsequent E-step then constructs a bound that is tangential at θ^new, as shown by the green curve.
At each step, we see that the obtained parameters increase the log-likelihood and the process continues until convergence.
Concluding remarks
This concludes a rather intense mathematical post on the EM algorithm. However, understanding this algorithm in-depth is a really good investment of time, and it proves helpful if you want to develop your own graphical models to solve challenging problems. If you are looking for an example problem and code, you can refer to this GitHub repo of mine, where we tackled the challenge of jointly modeling aspects, ratings, and sentiment with temporal dynamics for the rating prediction tasks.
References
[1] Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
Originally posted @ https://rishabhmisra.github.io/Maximum-Likelihood-Estimates-Motivation-For-EM-Algorithm/
Feedback is a Gift
If you found the content informative and would like to learn more about Machine Learning and AI, please follow the profile to get notified about my future posts. Also, let me know in the comments if you have any questions or topic requests.
Feel free to visit my website to get in touch http://rishabhmisra.github.io/ – I frequently consult on real-world AI applications.
Citation
If you use the work presented here in your research, please cite it using one of the following formats:
Text format:
Misra, Rishabh. "Inference using EM algorithm." Medium, Towards Data Science (2019).
BibTex format:
@misc{misrainference2019, author={Misra, Rishabh}, title={Inference using EM algorithm}, url={https://medium.com/towards-data-science/inference-using-em-algorithm-d71cccb647bc}, journal={Medium}, publisher={Towards Data Science}, year={2019}, month={Feb} }