The Likelihood-Ratio Test

An intuitive explanation of the likelihood-ratio test through a worked example in R

Clarke Patrone
Towards Data Science



All images used in this article were created by the author unless otherwise noted

The Likelihood-Ratio Test (LRT) is a statistical test used to compare the goodness of fit of two models based on the ratio of their likelihoods. This article will use the LRT to compare two models which aim to predict a sequence of coin flips in order to develop an intuitive understanding of what the LRT is and why it works. I will first review the concept of likelihood and how we can find the value of a parameter, in this case the probability of flipping heads, that makes observing our data the most likely. I will then show how adding independent parameters expands our parameter space and how, under certain circumstances, a simpler model may constitute a subspace of a more complex model. Finally, I will discuss how to use Wilks’ Theorem to assess whether a more complex model fits data significantly better than a simpler model.

I have embedded the R code used to generate all of the figures in this article.

Flipping Coins

Let’s start by randomly flipping a quarter with an unknown probability θ of landing heads:

P(heads)= θ

Let’s flip the quarter ten times:

We flip it ten times and get 7 heads (represented as 1) and 3 tails (represented as 0).
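
For concreteness, here is one such sequence in R. The particular order of heads and tails below is illustrative (my own choice); only the counts of 7 heads and 3 tails matter for what follows.

# One possible ordering of the 10 quarter flips: 7 heads (1) and 3 tails (0)
quarter_flips <- c(1, 1, 0, 1, 1, 1, 0, 0, 1, 1)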

We want to know what parameter θ makes our data, the sequence above, most likely. To find the value of θ, the probability of flipping heads, we can calculate the likelihood of observing this data given a particular value of θ. Put mathematically, we express the likelihood of observing our data d given θ as L(d|θ). We want to find the value of θ which maximizes L(d|θ).

Intuitively, you might guess that since we have 7 heads and 3 tails our best guess for θ is 7/10 = .7.

Let’s write a function to check that intuition by calculating how likely it is we see a particular sequence of heads and tails for some possible values in the parameter space θ. Since each coin flip is independent, the probability of observing a particular sequence of coin flips is the product of the probability of observing each individual coin flip. In the function below we start with a likelihood of 1 and each time we encounter a heads we multiply our likelihood by the probability of landing heads. Each time we encounter a tails we multiply by 1 minus the probability of flipping heads.
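
A sketch of such a function, following the description above (the function and argument names here are my own):

# Likelihood of observing a sequence of flips given theta = P(heads):
# start at 1, multiply by theta for each heads and by (1 - theta) for each tails
likelihood <- function(sequence, theta) {
  l <- 1
  for (flip in sequence) {
    if (flip == 1) {
      l <- l * theta
    } else {
      l <- l * (1 - theta)
    }
  }
  l
}

likelihood(quarter_flips, 0.7)  # ~0.00222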

Now that we have a function to calculate the likelihood of observing a sequence of coin flips given a θ, the probability of heads, let’s graph the likelihood for a couple of different values of θ.
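
One way to do this, as a sketch, is to evaluate the function above over a grid of candidate values of θ and plot the result:

# Evaluate the likelihood of the quarter flips over a grid of theta values
thetas <- seq(0, 1, by = 0.01)
quarter_likelihoods <- sapply(thetas, function(t) likelihood(quarter_flips, t))

plot(thetas, quarter_likelihoods, type = "l",
     xlab = "theta", ylab = "Likelihood of observing the sequence")
abline(v = 0.7, lty = 2)  # the curve peaks at theta = 0.7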

In this graph, we can see that we maximize the likelihood of observing our data when θ equals .7. We’ve confirmed our intuition: we are most likely to see that sequence of data when θ = .7.

Now let’s do the same experiment flipping a new coin, a penny for example, again with an unknown probability of landing on heads. We can combine the flips we did with the quarter and those we did with the penny to make a single sequence of 20 flips.

If we didn’t know that the coins were different and we followed our procedure we might update our guess and say that since we have 9 heads out of 20 our maximum likelihood would occur when we let the probability of heads be .45. We graph that below to confirm our intuition.
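
A sketch of that check, assuming a hypothetical penny sequence with 2 heads out of 10 (so the combined 20 flips contain 9 heads, as above; the exact order is illustrative):

# A possible sequence of 10 penny flips with 2 heads; the order is illustrative
penny_flips <- c(0, 1, 0, 0, 0, 0, 1, 0, 0, 0)
all_flips <- c(quarter_flips, penny_flips)  # 20 flips, 9 heads in total

combined_likelihoods <- sapply(thetas, function(t) likelihood(all_flips, t))
plot(thetas, combined_likelihoods, type = "l",
     xlab = "theta", ylab = "Likelihood of observing all 20 flips")
thetas[which.max(combined_likelihoods)]  # 0.45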

In the above scenario we have modeled the flipping of two coins using a single θ. Maybe we can improve our model by adding an additional parameter. What if we know that there are two coins and we know when we are flipping each of them? We can then try to model this sequence of flips using two parameters, one for each coin. Adding a parameter also means adding a dimension to our parameter space. Let’s visualize our new parameter space:
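
One way to sketch this surface, reusing the likelihood() function and the illustrative sequences above, is to evaluate the joint likelihood over a grid of (quarter_θ, penny_θ) pairs (the grid resolution here is arbitrary):

# Joint likelihood of both sequences as a function of (quarter_theta, penny_theta)
grid <- seq(0, 1, by = 0.02)
surface <- outer(grid, grid, Vectorize(function(q, p) {
  likelihood(quarter_flips, q) * likelihood(penny_flips, p)
}))
persp(grid, grid, surface, theta = 30, phi = 30,
      xlab = "quarter_theta", ylab = "penny_theta", zlab = "Likelihood")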

The graph above shows the likelihood of observing our data given the different values of each of our two parameters. Observe that using one parameter is equivalent to saying that quarter_θ and penny_θ have the same value. In the graph above, quarter_θ and penny_θ are equal along the diagonal, so we can say that the one parameter model constitutes a subspace of our two parameter model. In this case, the subspace occurs along the diagonal. If we slice the above graph down the diagonal we will recreate our original 2-d graph.

The above graph is the same as the graph we generated when we assumed that the quarter and the penny had the same probability of landing heads. Now we can think of ourselves as comparing two models where the base model (flipping one coin) is a subspace of a more complex full model (flipping two coins).

To visualize how much more likely we are to observe the data when we add a parameter, let’s graph the maximum likelihood in the two parameter model on the graph above.

In this scenario adding a second parameter makes observing our sequence of 20 coin flips much more likely.

From Likelihood to the Likelihood-Ratio Test

We can see in the graph above that the likelihood of observing the data is much higher in the two-parameter model than in the one parameter model. However, what if each of the coins we flipped had the same probability of landing heads? Then there might be no advantage to adding a second parameter. So how can we quantifiably determine if adding a parameter makes our model fit the data significantly better? A natural first step is to take the Likelihood Ratio, which is defined as the ratio of the Maximum Likelihood of the complex model over the Maximum Likelihood of the simple model: ML_complex/ML_simple

Let’s also define a null and alternative hypothesis for our example of flipping a quarter and then a penny:

Null Hypothesis: Probability of Heads (Quarter) = Probability of Heads (Penny)

Alternative Hypothesis: Probability of Heads (Quarter) != Probability of Heads (Penny)

The Likelihood Ratio of the ML of the two parameter model to the ML of the one parameter model is: LR = 14.15558
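
Using the illustrative sequences defined earlier (only the head counts matter here), this number can be reproduced directly:

# Maximum likelihood under the one parameter (null) and two parameter models
ml_null <- likelihood(all_flips, 9 / 20)
ml_alternative <- likelihood(quarter_flips, 7 / 10) * likelihood(penny_flips, 2 / 10)
ml_alternative / ml_null  # 14.15558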

Based on this number, we might think the complex model is better and we should reject our null hypothesis. But we are still using eyeball intuition. To quantify this further we need the help of Wilks’ Theorem, which states that −2log(LR) is chi-square distributed, with degrees of freedom equal to the difference in the number of parameters between the two models, as the sample size (in this case the number of flips) approaches infinity when the null hypothesis is true. (Read about the limitations of Wilks’ Theorem here). By Wilks’ Theorem we define the Likelihood-Ratio Test Statistic as: λ_LR = −2[log(ML_null) − log(ML_alternative)]

Why is it true that the Likelihood-Ratio Test Statistic is chi-square distributed? First recall that the chi-square distribution with k degrees of freedom is the distribution of the sum of the squares of k independent standard normal random variables. Below is a graph of the chi-square distribution at different degrees of freedom (values of k).
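
A quick way to draw curves like these, as a sketch, is with R’s built-in dchisq:

# Chi-square densities for degrees of freedom 1 through 5
x <- seq(0.05, 15, by = 0.05)
plot(x, dchisq(x, df = 1), type = "l", ylim = c(0, 1),
     xlab = "x", ylab = "Density")
for (k in 2:5) {
  lines(x, dchisq(x, df = k), col = k)
}
legend("topright", legend = paste("df =", 1:5), col = 1:5, lty = 1)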

How can we transform our likelihood ratio so that it follows the chi-square distribution? First observe in the bar graphs above that the distribution of each of our parameter estimates is approximately normal, so we have (approximately) normal random variables to work with. We can turn a ratio into a difference by taking the log.

Doing so gives us log(ML_alternative)−log(ML_null).

But we don’t want normal random variables; we want squared normal random variables. So instead we look at:

log[ML_alternative^2]−log[ML_null^2]

and by the rule of logarithms we get:

2∗[log(ML_alternative)−log(ML_null)]

Or, in the form used in Wilks’ Theorem:

-2[log(ML_null)-log(ML_alternative)]

Wilks’ Theorem tells us that the above statistic will asymptotically be chi-square distributed.

Let’s put this into practice using our coin-flipping example.

First let’s write a function to flip a coin with probability p of landing heads. Let’s also create a variable called flips which simulates flipping this coin 1000 times in each of 1000 independent experiments, giving us 1000 sequences of 1000 flips.
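
A sketch of this setup, assuming a fair coin (p = 0.5, consistent with the null hypothesis) and storing one experiment per column (the function name, the seed, and the layout are my own choices):

# Flip a coin n times with probability p of landing heads (1 = heads, 0 = tails)
flip_coin <- function(n, p) {
  rbinom(n, size = 1, prob = p)
}

# 1000 independent experiments of 1000 flips each, one experiment per column
set.seed(123)  # seed chosen arbitrarily for reproducibility
flips <- replicate(1000, flip_coin(1000, 0.5))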

Now let’s write a function which calculates the maximum likelihood estimates for a given number of parameters. This function works by dividing the data into even chunks (think of each chunk as representing its own coin) and then calculating the maximum likelihood of observing the data in each chunk. For example, if this function is given the sequence of ten flips 1,1,1,0,0,0,1,0,1,0 and told to use two parameters, it will return the vector (.6, .4) corresponding to the maximum likelihood estimates for the first five flips (three heads out of five = .6) and the last five flips (two heads out of five = .4). If we pass the same data but tell the model to only use one parameter, it will return the vector (.5) since we have five heads out of ten flips.
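
One way to write this (the names here are my own) is to split the sequence into equal chunks and take the proportion of heads in each chunk, which is the maximum likelihood estimate for that chunk:

# Maximum likelihood estimates when the data is split into n_params equal chunks
max_likelihood_params <- function(sequence, n_params) {
  chunk_id <- ceiling(seq_along(sequence) * n_params / length(sequence))
  sapply(split(sequence, chunk_id), mean)  # proportion of heads in each chunk
}

max_likelihood_params(c(1, 1, 1, 0, 0, 0, 1, 0, 1, 0), 2)  # 0.6 0.4
max_likelihood_params(c(1, 1, 1, 0, 0, 0, 1, 0, 1, 0), 1)  # 0.5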

Now we need a function to calculate the likelihood of observing our data given n parameters. This function works by dividing the data into even chunks based on the number of parameters and then calculating the likelihood of observing each chunk given the value of its parameter. For example, if we pass the sequence 1,1,0,1 and the parameters (.9, .5) to this function, it will return a likelihood of .2025, which is found by calculating that the likelihood of observing two heads given a .9 probability of landing heads is .81 and the likelihood of landing a tails followed by a heads given a probability of .5 for landing heads is .25. Since these are independent we multiply the likelihoods together to get a final likelihood of observing the data given our two parameters of .81 x .25 = .2025.
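
A sketch of this function, reusing likelihood() from earlier and the same chunking scheme:

# Likelihood of a sequence split into one equal chunk per parameter,
# where each chunk uses its own probability of heads
likelihood_n_params <- function(sequence, params) {
  n_params <- length(params)
  chunk_id <- ceiling(seq_along(sequence) * n_params / length(sequence))
  chunks <- split(sequence, chunk_id)
  l <- 1
  for (i in seq_len(n_params)) {
    l <- l * likelihood(chunks[[i]], params[i])
  }
  l
}

likelihood_n_params(c(1, 1, 0, 1), c(0.9, 0.5))  # 0.81 * 0.25 = 0.2025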

Now we write a function to find the likelihood ratio:
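
Combining the two helpers above, a sketch could be:

# Ratio of the maximum likelihood of the complex model to that of the simple model
likelihood_ratio <- function(sequence, n_simple, n_complex) {
  ml_simple  <- likelihood_n_params(sequence, max_likelihood_params(sequence, n_simple))
  ml_complex <- likelihood_n_params(sequence, max_likelihood_params(sequence, n_complex))
  ml_complex / ml_simple
}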

And then finally we can put it all together by writing a function which returns the Likelihood-Ratio Test Statistic based on a set of data (which we call flips in the function below) and the number of parameters in two different models.
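
For example:

# Likelihood-Ratio Test Statistic: 2 * log(LR) = -2 * [log(ML_null) - log(ML_alternative)]
lrt_statistic <- function(flips, n_null, n_alternative) {
  2 * log(likelihood_ratio(flips, n_null, n_alternative))
}

For much longer sequences it would be safer to work with log-likelihoods throughout to avoid numerical underflow, but for sequences of 1000 flips the raw likelihood products stay within floating-point range.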

Now we are ready to show that the Likelihood-Ratio Test Statistic is asymptotically chi-square distributed. Let’s flip a coin 1000 times per experiment for 1000 experiments and then plot a histogram of the values of our Test Statistic when comparing a model with 1 parameter to a model with 2 parameters.
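
A sketch of that simulation, applying the test statistic to each column of flips and shown here as a density plot with the chi-square density (1 degree of freedom) overlaid:

# LRT statistic (1 parameter vs 2 parameters) for each of the 1000 experiments
test_stats <- apply(flips, 2, lrt_statistic, n_null = 1, n_alternative = 2)

plot(density(test_stats), main = "LRT statistic: 1 vs 2 parameters",
     xlab = "Test statistic")
curve(dchisq(x, df = 1), add = TRUE, lty = 2)  # chi-square with 1 degree of freedom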

The density plot below shows convergence to the chi-square distribution with 1 degree of freedom.

If we compare a model that uses 10 parameters with a model that uses 1 parameter, we can see the distribution of the test statistic change to be chi-square distributed with degrees of freedom equal to 9.
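
The same simulation, with the number of parameters in the complex model changed to 10, might look like:

# LRT statistic (1 parameter vs 10 parameters); difference in parameters = 9
test_stats_10 <- apply(flips, 2, lrt_statistic, n_null = 1, n_alternative = 10)

plot(density(test_stats_10), main = "LRT statistic: 1 vs 10 parameters",
     xlab = "Test statistic")
curve(dchisq(x, df = 9), add = TRUE, lty = 2)  # chi-square with 9 degrees of freedom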

The above graphs show that the value of the test statistic is approximately chi-square distributed. So, returning to the example of the quarter and the penny, we are now able to quantify exactly how much better a fit the two parameter model is than the one parameter model. Recall that our likelihood ratio ML_alternative/ML_null was LR = 14.15558. If we take 2·log(14.15558) we get a Test Statistic value of 5.300218.

We can use the chi-square CDF to see that, given that the null hypothesis is true, there is a 2.132276 percent chance of observing a Likelihood-Ratio Test Statistic at least that large. So in this case, at an alpha of .05, we should reject the null hypothesis.
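
That probability comes straight from the chi-square CDF in R:

1 - pchisq(5.300218, df = 1)   # ~0.0213, the chance of a statistic at least this large
qchisq(0.95, df = 1)           # 3.841459, the rejection cutoff at alpha = .05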

The graph above shows that we will only see a Test Statistic of 5.3 or larger about 2.13% of the time given that the null hypothesis is true and each coin has the same probability of landing heads.

Conclusion

This article uses the simple example of modeling the flipping of one or multiple coins to demonstrate how the Likelihood-Ratio Test can be used to compare how well two models fit a set of data. We discussed what it means for a model to be “nested” by considering the case of modeling a set of coin flips under the assumption that there is one coin versus two. Finally, we empirically explored Wilks’ Theorem to show that the LRT statistic is asymptotically chi-square distributed, thereby allowing the LRT to serve as a formal hypothesis test.
