Maximum Likelihood Estimation in Real Life: Optimizing Study Time

Carolina Bento
Towards Data Science
6 min read · Jan 2, 2019

Maximum likelihood estimation is a statistical technique widely used in Machine Learning. It is used to pick the parameters of a model.

Exam season is here and this time around you want to be more efficient with your study time. You planned ahead, and made sure to track how much you've been studying for each exam in the last couple of rounds, and what grades you got. You ended up with this dataset

Beautiful dummy data 😁

Plotting the data makes it easier to see that there's some correlation between the amount of time you spent studying for an exam and its final grade.

Your biggest challenge, as with the previous rounds, is that you have multiple exams scheduled a few days apart from each other. You want to create a study plan that will allow you to maximize your grades, while still guaranteeing that you have a good amount of time to dedicate to each exam.

So, what do you do?

Linear Model to the rescue!

Thinking about a way to maximize your grades based on how much time you have to study for each exam, you remember the correlation in the scatter plot above. You can use Linear Regression to help figure out what grade you'll get, given the amount of time you can dedicate to studying for the exam.

This is the model that best describes the problem at hand: grade = beta0 + beta1x + error.

You're predicting the exam grade based on how much time you study. You go to the statistical software of your choice, and fit a linear model to the dataset.

Now you can plug in how long you plan to study and check what grade you might obtain, based on the model's equation.

Here's a summary of our model, obtained using Python's statsmodels module.

We can see that the Least Squares method was used to fit the model, the pink line, to the dataset. The parameters, beta0 and beta1, also called the coefficients of the model, correspond to const and time, respectively.
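If you want to reproduce this step, here is a minimal sketch of fitting such a model with statsmodels. The study-time and grade values below are placeholders, since the article's dataset is only shown as a figure; with the real data, the summary would display the coefficients from the table above.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical study-time (hours) vs. grade data -- placeholder values,
    # not the article's actual dataset
    df = pd.DataFrame({
        "time":  [1.0, 2.0, 3.0, 3.5, 4.0, 5.0],
        "grade": [49, 60, 74, 80, 86, 97],
    })

    # grade = beta0 + beta1 * time; add_constant adds the intercept column
    X = sm.add_constant(df[["time"]])
    model = sm.OLS(df["grade"], X).fit()

    # Coefficients appear in the summary as 'const' (beta0) and 'time' (beta1)
    print(model.summary())

    # Predict the grade for a planned amount of study time (e.g. 4.5 hours)
    plan = sm.add_constant(pd.DataFrame({"time": [4.5]}), has_constant="add")
    print(model.predict(plan))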

So, we have the model and calculated the parameters using Python, but the question remains: how did we actually estimate the parameters?

The Math behind the scenes

It's great that we can use a statistical software to do all the heavy lifting and fit a linear model to our dataset.

But how did the parameters get estimated?

Were those values picked at random?

This is where statistician R. A. Fisher had a great idea! He discovered that we could build a model and estimate the parameters such that they maximize the likelihood of obtaining the values observed in the dataset.

In other words, we're estimating parameters such that the probability, i.e., likelihood, of observing the values seen in the dataset is as high as possible.

But before we start diving into the Math, here are a few assumptions about our dataset:

  • Each data point is independent
  • Our dataset follows a Normal distribution
  • The error in our model also follows a Normal distribution
  • Our output is continuous

These assumptions come in very handy when it comes to calculating the parameters. They facilitate the use of certain mathematical properties that end up simplifying the calculations!

1. Decoding the Likelihood Function

So far we know that parameters must maximize the likelihood function

The likelihood function is, in fact, a conditional probability. It is dependent on the parameter, because we'll only pick the value for the parameter that maximizes the probability of observing the data.

Let's use theta to represent the parameter.
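Written out (the original expression is shown as a figure, so this is a reconstruction), the likelihood of theta given the observed grades y1, ..., yn is the probability of observing those grades given theta:

    L(\theta \mid y_1, \dots, y_n) = P(y_1, \dots, y_n \mid \theta)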

Our Linear Model has two unknown parameters: beta0 and beta1.

So we can rewrite the likelihood function as
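In symbols (again a reconstruction, since the original equation is an image):

    L(\beta_0, \beta_1 \mid y_1, \dots, y_n) = P(y_1, \dots, y_n \mid \beta_0, \beta_1)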

So far we

  • decoded what likelihood means
  • wrote down the likelihood expression for our linear model as a conditional probability

2. Probability Density Function

Now that we know the likelihood is a conditional probability, it's time to dive deeper into the math.

According to our assumptions, our dataset follows a Normal distribution and we're dealing with continuous data. Therefore, we're going to use the Normal distribution's probability density function to define the likelihood.

Because each data point is independent of the others, the likelihood of all points in the dataset is expressed as the product of their individual probability density functions, written using Pi (product) notation.
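A sketch of that product (the original equation is shown as an image), with x_i as study time, y_i as the grade, and sigma^2 as the variance of the error:

    L(\beta_0, \beta_1) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right)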

To simplify the calculations that are coming up, we can transform the likelihood into a log-likelihood.
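Taking the logarithm turns the product into a sum; roughly, it looks like this:

    \log L(\beta_0, \beta_1) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2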

When picking the value of each parameter, this is what we want to maximize!

But we can make this expression even simpler. Since we're maximizing the likelihood with respect to parameters beta0 and beta1, we can actually ignore any term that does not contain beta0 or beta1.

The likelihood expression then becomes
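In symbols (a reconstruction, since the original figure isn't reproduced here), maximizing the log-likelihood reduces to maximizing

    -\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2

or, equivalently, minimizing the summation

    \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2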

Does this summation look familiar?

If you recall, our linear model is defined as y = beta0 + beta1x + error. If we solve this equation for the error, we have error = y - beta0 - beta1x.

What we have above is the sum of squared errors!

And, because we also assumed that the error in our model follows a Normal distribution, using the Maximum Likelihood for parameter estimation in this case is exactly the same as calculating the Ordinary Least Squares!

In practice, under these assumptions, maximizing the likelihood is the same as minimizing the sum of squared errors.

That’s why most of the time we see that the Ordinary Least Squares method is used to fit a linear model to a dataset.

3. (Finally) Estimate the parameters

Here's where we left off: we want the values of beta0 and beta1 that maximize the simplified log-likelihood, i.e., that minimize the sum of squared errors.

To get the values of the parameters, we'll calculate the partial derivatives with respect to beta0 and beta1, set them to zero, and solve.

Starting with the partial derivative with respect to beta0.
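In outline (the original derivation is shown as an image), setting this derivative to zero gives

    \frac{\partial}{\partial \beta_0}\left[-\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\right] = 2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) = 0
    \;\Longrightarrow\; \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

where \bar{x} and \bar{y} are the sample means of study time and grade.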

One down, one to go!

Calculating the partial derivative with respect to beta1, we get
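Again in outline, setting the derivative to zero and substituting the expression for beta0 gives the familiar closed-form estimate:

    \frac{\partial}{\partial \beta_1}\left[-\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\right] = 2\sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i) = 0
    \;\Longrightarrow\; \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}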

These are the calculations that occur under the covers every time we use some statistical software to fit a linear model to our dataset.

If we calculate each expression for our dataset, we'll confirm that beta0 = 37.4571 and beta1 = 12.0495, the exact values shown in the model summary.
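Here is a quick sketch of how you could check this yourself with NumPy. The arrays below are placeholders standing in for the article's study-time and grade columns; only with the real data would the printed estimates match 37.4571 and 12.0495.

    import numpy as np

    # Placeholder arrays standing in for study time (hours) and grades
    time = np.array([1.0, 2.0, 3.0, 3.5, 4.0, 5.0])
    grade = np.array([49.0, 60.0, 74.0, 80.0, 86.0, 97.0])

    # Closed-form maximum likelihood / least squares estimates
    beta1 = ((time - time.mean()) * (grade - grade.mean())).sum() / ((time - time.mean()) ** 2).sum()
    beta0 = grade.mean() - beta1 * time.mean()

    print(beta0, beta1)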

Thanks for reading!
