Parameter Inference — Maximum Likelihood

Rahul Bohare
Towards Data Science
10 min read · Apr 3, 2017


The never-ending debate between Frequentists and Bayesians. (Image source: xkcd)

This post takes an in-depth tour of one of the most important concepts in theoretical Machine Learning, viz., Parameter Inference. I will try to focus on an intuitive understanding of the concept while embedding mathematical formulae as and when I feel the need for them. If you have ever taken an undergraduate/graduate level Machine Learning course at a university, you must have come across Parameter Inference. When I encountered this for the first time, I kept scratching my head for quite a while trying to grasp the concept of PI (not a well known acronym for Parameter Inference), and that’s why I felt the need to write a post to try to make it easier for others to understand. I hope you take something valuable out of this post by the time we are done with it. Let’s begin.

PI can loosely be described as the process of determining the parameters that govern a dataset generated by an experiment. For example, if we flip a coin ten times and get the following dataset,

Img. 1: 10 flips of a coin

what is the parameter responsible for generating these ten flips? As it turns out, the process of parameter inference is not completely trivial and requires quite a bit of mathematical savvy to wrap our heads around. If you understand the basics of logarithms, calculus (differentiation, to be precise) and basic probability, this post should not be hard to follow. Anyway, let’s get down to business. PI can be considered a step-wise process with three major steps:

Img. 2: Three steps to Parameter Inference.

We start at the lowest step and will try to reach the topmost one. Each successive step is slightly more involved than the preceding one, but at the same time, it provides us with a more robust model to describe the data. This post tackles the lowest step, Maximum Likelihood Estimation, in detail, and the follow-up posts will deal with the other two steps. (A side tip: use pen and paper to easily follow the terms and formulae, as Medium doesn’t support math symbols yet.) Okay, so what is MLE?

Maximum Likelihood Estimation

As the picture says, the entire process of MLE revolves around determining the parameters which maximize the probability of the data. Going back to the coin toss example, what do you think the probability of the 11th flip being Tails is? There are two logical answers:

  1. If you say 0.5, that’s a reasonable answer as the flips are independent of each other and the next flip being Tails has a 1/2 chance, i.e., 0.5.
  2. The other answer could be 0.3. The rationale behind this answer would be that we can expect the coin flips to continue the way they have occurred until now (until the 10th flip).

I am not gonna tell you the correct answer just yet. Let’s work together to find out which one is correct. Okay, so we need to find out P_11(F_11 = Tails), where P_11 denotes the probability for the 11th flip and F_11 is the 11th flip itself. This gives rise to the general case of the ith flip being Tails:

Img. 3: P_i(F_i = Tails)

where F_i denotes the ith flip and theta_i is the parameter governing the ith flip. To denote that the probability distribution depends on theta_i, we can write:

Img. 4: P_i(F_i = Tails; theta_i)

Do you see the problem here? We have data that depends on the parameters theta_1 to theta_10 (i = 1 to 10). And that poses a problem: how can we extrapolate to theta_11? For the time being, let’s stick to what we have, i.e., theta_1 to theta_10.

All the randomness of a sequence of flips is governed (modeled) by the parameters theta_1 to theta_10:

Img. 5: P(F_1, ..., F_10; theta_1, ..., theta_10)

What do we currently know about theta_1 .. theta_10? And are they, in some way, connected to theta_11? At first glance, there seems to be no connection. It is now that we can start the process of MLE: we need to find theta_i’s such that

Img. 6: P(F_1, ..., F_10; theta_1, ..., theta_10)

is as high as possible. This is the principle behind MLE:

Img. 7: The MLE principle: choose theta_1, ..., theta_10 that maximize P(F_1, ..., F_10; theta_1, ..., theta_10)

MLE looks at the probability of the data (the so-called likelihood; Img. 5 & 6) and tries to find those parameters theta_1 through theta_10 that maximize the likelihood/probability of this sequence. To reiterate one last time: we want to choose those parameters under which our observations become most likely. That’s why the process is called MLE.

Now let’s try to actually model the probability depicted in Img. 5 & 6. We need to make two critical assumptions in order to be able to work out the ML estimate of the data:

I. The coin flips do not affect each other, i.e., they are independent (the outcome of the first flip does not affect the outcome of the second flip, and so on):

Img. 8: The independence assumption allows us to simplify the complex likelihood term into ten simpler factors: P(F_1, ..., F_10; theta_1, ..., theta_10) = P(F_1; theta_1) * P(F_2; theta_2) * ... * P(F_10; theta_10), i.e., the product over i = 1, ..., 10 of P(F_i; theta_i).

The third statement in Img. 8 is just a short-hand notation describing the product of the ten terms from the second statement.

The first assumption allows us to simplify the likelihood term considerably, but notice that theta_i is still present in the shorthand notation. This means that we still don’t have theta_11 in the equation.
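
To make this concrete, here is a minimal Python sketch of the factorized likelihood. Both the sequence (encoded as 1 = Tails, 0 = Heads) and the ten per-flip parameters theta_i below are hypothetical placeholders, since at this stage every flip still has its own parameter:

```python
# The joint likelihood of the ten flips factorizes into a product of
# per-flip Bernoulli terms. The sequence and the per-flip parameters
# theta_i are hypothetical placeholders, not the post's actual data.
flips = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1]                        # 1 = Tails, 0 = Heads
thetas = [0.2, 0.3, 0.4, 0.5, 0.6, 0.1, 0.7, 0.3, 0.5, 0.4]   # theta_i = P(flip i is Tails)

likelihood = 1.0
for f, theta in zip(flips, thetas):
    likelihood *= theta if f == 1 else (1 - theta)            # P(F_i; theta_i)

print(likelihood)
```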

There is another assumption that can simplify the equation further. It arises from the fact that the coin (the experimental setup, in general) does not change significantly:

II. The flips are qualitatively the same, i.e., they are identically distributed:

Img. 9: Since the flips take place under similar circumstances, we can assume that the parameter governing the flips is one and the same: theta_1 = theta_2 = ... = theta_10 = theta.

It is the second assumption of the flips being identically distributed that allows us to drop the subscript i from theta. This means that all 10 flips we observed are essentially governed by the same parameter theta; instead of having ten parameters governing ten different flips, we now have just one parameter governing the entire sequence of coin flips, and that includes the 11th flip as well. In a sense, we are connecting the first 10 coin flips to the 11th coin flip. As we’re soon going to see, this is the key to inference.

NOTE: The two assumptions we made are used so often in Machine Learning that together they have a special name: the i.i.d. assumption:

In total, the 10 flips are independent and identically distributed.

This allows us to explicitly write down the likelihood that we are trying to optimize. Here, theta is defined as the probability of a flip coming up Tails; the probability of our sequence w.r.t. theta can now be formulated as:

Img. 10: (1 - theta) * theta * (1 - theta) * (1 - theta) * theta * ... : the first term is the probability of Heads, the second of Tails, the third and fourth of Heads, the fifth of Tails, and so on. Note that this product corresponds to the exact sequence of our coin flips.

So, with this very broad and very intuitive i.i.d. assumption, we have simplified our likelihood function to a fairly simple polynomial; to a point where we can start optimizing the function for the parameter theta.
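
As a quick numerical sanity check (a sketch, using the counts |T| = 3 Tails and |H| = 7 Heads that the example sequence turns out to have), we can already compare the two candidate answers from the beginning of the post:

```python
# Under i.i.d., the likelihood collapses to theta^|T| * (1 - theta)^|H|.
# |T| = 3 and |H| = 7 are the counts from the post's example sequence.
T, H = 3, 7

def likelihood(theta):
    return theta**T * (1 - theta)**H

print(likelihood(0.5))   # ~0.000977
print(likelihood(0.3))   # ~0.002224 -- theta = 0.3 makes the observed data more likely
```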

Under our model assumptions:

Img. 11: The likelihood under the i.i.d. assumption: theta^3 * (1 - theta)^7

The simplified polynomial expression above can be interpreted as a function of theta, i.e., f(theta). Now we want to find the maximum (the maximum likelihood) of this function. Any ideas on how to proceed?

That’s right! It’s all basic high school math from here. We take the derivative df/d(theta), set it to zero, and solve for theta. We then verify the critical points (points where the slope of the function is zero: maxima, minima and saddle points; in this case, we are only concerned with the maximum) by inserting them into the second derivative of f(theta).

In principle, this is fairly straightforward, except for one caveat: the second derivative of f(theta) is already ugly, since we need to apply the product rule twice to get there. It is doable, but it requires a lot of technical drudgery. Is there any way we can simplify our computations?

The beauty of mathematics is that it provides us with numerous pathways to arrive at a particular solution, some of them orders of magnitude easier than others. We are gonna apply one such simpler pathway to optimize our likelihood function:

Let’s take a little detour into high school calculus and analyze monotonic functions (if you feel comfortable with this concept already, feel free to skip ahead to the end of this section). We already know how to find the maxima or minima of a function, and that the points at which the derivative vanishes (maxima, minima, and saddle points) are called the critical points of the function. There is a rule in mathematics that if we apply a strictly increasing function to the function we are optimizing, the critical points of the original function are preserved (the Wikipedia article is an excellent source to learn more about monotonic functions).

If x1 < x2 => f(x1) < f(x2) : this is a monotonically increasing function.

If x1 < x2 => f(x1) > f(x2) : this is a monotonically decreasing function.

The log function is an example of a monotonically increasing function:

Img. 12: A monotonically increasing function: log(x)

Hence, we can conclude that log f(theta) attains its maximum at the same value of theta as f(theta).

Therefore, if we apply the log to our likelihood function, the maximizing theta will be the same as if we had optimized the likelihood directly.
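
Here is a small numerical check of this claim (a sketch; the grid is nudged away from 0 and 1 to avoid log(0)): the theta that maximizes f also maximizes log f.

```python
import numpy as np

# Check numerically that taking the log does not move the maximizer.
T, H = 3, 7
grid = np.linspace(1e-6, 1 - 1e-6, 100001)

f = grid**T * (1 - grid)**H                       # likelihood
log_f = T * np.log(grid) + H * np.log(1 - grid)   # log-likelihood

print(grid[np.argmax(f)])        # ~0.3
print(grid[np.argmax(log_f)])    # same theta, ~0.3
```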

So we want to find out:

Img. 13: arg max f(theta)

Arg max means that we want to know the value of theta at which this function is maximized, rather than the maximum value of the function itself. In other words, we are not concerned with the actual maximum value of the function; we are interested in the value of theta at which the function attains that maximum (let that sink in for a moment before proceeding).

Taking the log of this function:

Img. 14: arg max log f(theta)

We denote this function by g(theta): the log-likelihood.

Taking the derivative of g(theta),

Img. 15: The derivative of g(theta): g'(theta) = |T|/theta - |H|/(1 - theta)

where |T| is the number of Tails (3 in our example) and |H| is the number of Heads (7 in our example). I introduced |T| and |H| to make the solution generalizable.
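
If you want to double-check this derivative without doing the calculus by hand, here is a quick finite-difference sketch (theta = 0.4 is just an arbitrary test point):

```python
import numpy as np

# Compare the analytic derivative g'(theta) = |T|/theta - |H|/(1 - theta)
# against a central finite difference of g(theta) at an arbitrary point.
T, H = 3, 7
theta0, eps = 0.4, 1e-6

def g(theta):
    return T * np.log(theta) + H * np.log(1 - theta)

numeric = (g(theta0 + eps) - g(theta0 - eps)) / (2 * eps)
analytic = T / theta0 - H / (1 - theta0)
print(numeric, analytic)   # both approximately -4.1667
```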

Setting the value of g’(theta) equal to zero (to find the critical value for theta):

Img. 16: Critical value of theta, step I
Img. 17: Critical value of theta, step II

Therefore, the Maximum Likelihood Estimate (MLE) for any coin sequence is:

Img. 18: theta_MLE = |T| / (|T| + |H|)

We have arrived at our MLE estimate for the parameter theta which governs our coin flip dataset. Going back to the original question, what is the probability of the 11th flip being Tails?

Putting |T| = 3 and |H| = 7, we get 3 / (3 + 7) = 0.3. This justifies 0.3, i.e., 30%, as a reasonable answer to our initial question.
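
In code, the whole estimator boils down to one line (the function name theta_mle is just an illustrative choice):

```python
def theta_mle(n_tails, n_heads):
    """MLE of the probability of Tails for a coin, given observed counts."""
    return n_tails / (n_tails + n_heads)

print(theta_mle(3, 7))   # 0.3 -- the MLE probability of the 11th flip being Tails
```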

Now we need to substitute this critical value of theta into the second derivative to verify that it is indeed a maximum. I am skipping the full verification here for brevity’s sake, but please go ahead and convince yourself that this critical point is indeed a maximum.
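
For those who would like to see that check carried out, here is a sketch using SymPy (an assumed dependency, not something the original post relies on) that re-derives the critical point symbolically and confirms that the second derivative is negative there:

```python
import sympy as sp

theta = sp.symbols('theta', positive=True)
T, H = sp.symbols('T H', positive=True)         # |T| and |H| as positive symbols

g = T * sp.log(theta) + H * sp.log(1 - theta)   # log-likelihood g(theta)
g1 = sp.diff(g, theta)                          # T/theta - H/(1 - theta)
g2 = sp.diff(g, theta, 2)                       # -T/theta**2 - H/(1 - theta)**2

critical = sp.solve(sp.Eq(g1, 0), theta)
print(critical)                                 # [T/(H + T)]

# The second derivative at the critical point simplifies to a strictly
# negative expression, so the critical point is indeed a maximum.
print(sp.simplify(g2.subs(theta, critical[0])))
```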

This maximizing theta is called the MLE for theta; it is the value that makes our observed sequence most likely.

Intuitively, this 30% value is also justified by the fact that we expect the coin flips to continue the way they started. So we have found that the MLE matches our intuition when we look at the data, which is very nice.

MLE’s shortcomings?

Clearly, finding the MLE is not the end of Parameter Inference, as we have already seen in the staircase figure. We are still left with the Maximum A Posteriori (MAP) estimate and Fully Bayesian Analysis. So what motivates not stopping at MLE and being done with the process? Can you think of a reason?

What if the sequence looked like the following:

Img. 19: A sequence of two Heads

What do you think the probability of the third flip being Tails is?

In the upcoming posts, we’ll explore the shortcomings of MLE and the intuitive rationale behind the Maximum A Posteriori (MAP) estimate and Fully Bayesian Analysis, and we’ll also build up a few key concepts in ML, viz., conjugate priors, inductive priors, and a technique that allows us to approximate otherwise intractable integrals.

If you have any concepts in mind that don’t make intuitive sense to you and you would like to see similar posts about them, let me know in the comments and I’ll try to come up with them.

Resource:

  1. The brilliant ML course taught at my grad school, TU Munich. You can watch all the lectures at this YouTube channel.

If you liked the post, feel free to recommend and share it so that others can benefit from it as well.


“I .. a universe of atoms.. an atom in the universe.” — Richard Feynman