Perplexity Intuition (and its derivation)

Never be perplexed again by perplexity.

You might have seen something like this in an NLP class:

A slide from Dr. Luke Zettlemoyer’s NLP class

Or

A slide from CS 124 at Stanford (Dr. Dan Jurafsky)

During the class, we don’t usually spend time deriving perplexity. Maybe it is a basic concept that you already know? This post is for those who don’t.

In general, perplexity is a measurement of how well a probability model predicts a sample. In the context of Natural Language Processing, perplexity is one way to evaluate language models.

But why is perplexity in NLP defined the way it is?


If you look up the perplexity of a discrete probability distribution on Wikipedia, you will find

Perplexity(p) = 2^H(p) = 2^(−Σ p(x) log₂ p(x))

(from https://en.wikipedia.org/wiki/Perplexity)

where H(p) is the entropy of the distribution p(x) and x is a random variable over all possible events.

In the previous post, we derived H(p) from scratch and showed intuitively why entropy is the average number of bits we need to encode the information. If you are not comfortable with H(p), please read that post before reading further.

Now we agree that H(p) = −Σ p(x) log₂ p(x).

Then, perplexity is just an exponentiation of the entropy!

Yes. Entropy is the average number of bits needed to encode the information contained in a random variable, so exponentiating the entropy gives the effective size of the outcome space, or more precisely, the weighted average number of choices the random variable has.

For example, if the average sentence in the test set could be coded in 100 bits, the model perplexity is 2¹⁰⁰ per sentence.
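To make the “number of choices” reading concrete, here is a minimal sketch (plain Python with NumPy; the function names are mine, not from any particular library) that computes the entropy and the corresponding perplexity of a few small distributions:

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution p (zero-probability outcomes contribute 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def perplexity(p):
    """Perplexity = 2 ** entropy: the weighted average number of choices."""
    return 2 ** entropy(p)

# A fair coin: entropy = 1 bit, so perplexity = 2 choices.
print(perplexity([0.5, 0.5]))                           # 2.0

# A fair six-sided die: entropy = log2(6) ≈ 2.585 bits, so perplexity = 6.
print(perplexity([1/6] * 6))                            # ~6.0

# A heavily skewed die behaves like far fewer than 6 choices.
print(perplexity([0.9, 0.02, 0.02, 0.02, 0.02, 0.02]))  # ~1.6
```

The skewed die is almost always going to land on one face, and its perplexity reflects that: far fewer effective choices than 6.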


Let’s confirm that the definition in Wikipedia matches the one in the slides. When we score a model q on data drawn from the true distribution p, the exponent is the cross-entropy rather than the entropy:

Perplexity(p, q) = 2^H(p, q) = 2^(−Σ p(x) log₂ q(x))

where

p : the probability distribution that we want to model. The training samples are drawn from p, whose true distribution is unknown.

q : our proposed probability model, i.e., our prediction.

We can evaluate our prediction q by testing it against samples drawn from p; doing so is exactly computing the cross-entropy H(p, q). In the example above, we assumed that all words have the same probability (1 / # of words) under p, i.e., p is the empirical distribution of the test set.
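Spelling out that last step (a sketch of the algebra, assuming p puts probability 1/N on each of the N test words and writing q(wᵢ) for the model’s probability of the i-th word given its history):

```latex
\begin{align*}
\mathrm{Perplexity}(p, q)
  &= 2^{-\sum_{x} p(x)\,\log_2 q(x)}
   = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 q(w_i)} \\
  &= \Big(\prod_{i=1}^{N} q(w_i)\Big)^{-1/N}
   = q(w_1 w_2 \cdots w_N)^{-1/N}
\end{align*}
```

which is exactly the slide definition: the inverse probability of the test set, normalized by the number of words.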

Observation

  • When q(x) = 0 for some word x that appears in the test sample, the perplexity becomes infinite (∞). In fact, this is one of the reasons why the concept of smoothing (in NLP) was introduced.
  • If we use a uniform probability model for q (simply 1/N for each of the N words in the vocabulary), the perplexity equals the vocabulary size. Both of these observations are easy to check numerically; see the sketch after this list.
  • A language model is a probability distribution over sentences, and the best language model is the one that best predicts an unseen test set. Perplexity is the inverse probability of the test set, normalized by the number of words.
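A minimal sketch of the first two observations, using a toy unigram model (the vocabulary, test words, and function name below are made up purely for illustration):

```python
import math

def test_set_perplexity(q, test_words):
    """Perplexity of model q (a dict: word -> probability) on a list of test words:
    2 ** cross-entropy, i.e. the inverse probability of the test set
    normalized by the number of words."""
    log_prob = 0.0
    for w in test_words:
        p_w = q.get(w, 0.0)
        if p_w == 0.0:
            return float("inf")   # a zero-probability word blows up the perplexity
        log_prob += math.log2(p_w)
    return 2 ** (-log_prob / len(test_words))

vocab = ["the", "cat", "sat", "on", "mat"]
test = ["the", "cat", "sat", "on", "the", "mat"]

# Uniform model: every word gets 1/|V|, so the perplexity equals the vocabulary size.
uniform = {w: 1 / len(vocab) for w in vocab}
print(test_set_perplexity(uniform, test))   # 5.0

# A model that assigns zero probability to "mat" has infinite perplexity on this test set.
no_mat = {"the": 0.4, "cat": 0.2, "sat": 0.2, "on": 0.2, "mat": 0.0}
print(test_set_perplexity(no_mat, test))    # inf
```

Smoothing reassigns a little probability mass to unseen words precisely so that the second case never returns infinity.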

Takeaway

  • Less entropy (a less disordered system) is favorable over more entropy, since predictable results are preferred over randomness. This is why people say low perplexity is good and high perplexity is bad: perplexity is the exponentiation of the entropy (and you can safely carry your intuition about entropy over to perplexity).
  • The difference between the perplexity of a (language) model and the true perplexity of the language is an indication of the quality of the model.
  • Why do we use perplexity instead of entropy?
    If we think of perplexity as a branching factor (the weighted average number of choices a random variable has), that number is easier to understand than the entropy. I found this surprising because I thought there would be more profound reasons. I asked Dr. Zettlemoyer if there is any reason other than easy interpretability. His answer was “I think that is it! It is largely historical, since lots of other metrics would be reasonable to use also!”

If you like my post, could you please clap? It gives me motivation to write more. :)
