Evaluating Text Output in NLP: BLEU at your own risk

Rachael Tatman
Towards Data Science
17 min read · Jan 15, 2019


One question I get fairly often from folks who are just getting into NLP is how to evaluate systems when the output of that system is text, rather than some sort of classification of the input text. These types of problems, where you put some text into your model and get some other text out of it, are known as sequence to sequence or string transduction problems.

And they’re really interesting problems! The general task of sequence to sequence modelling is at the heart of some of the most difficult tasks in NLP, including:

  • Text summarization
  • Text simplification
  • Question answering
  • Chatbots
  • Machine translation

This sort of technology is right out of science fiction. With such a wide range of exciting applications, it’s easy to see why sequence to sequence modeling is more popular than ever. What’s not easy is actually evaluating these systems.

Unfortunately for folks who are just getting started, there’s no simple answer about what metric you should use to evaluate your model. Even worse, one of the most popular metrics for evaluating sequence to sequence tasks, BLEU, has major drawbacks, especially when applied to tasks that it was never intended to evaluate.

Fortunately for you, you’ve found this in-depth blog post! In it, I’m going to walk through how this popular metric works (don’t worry, I’ll keep the equations to an absolute minimum). We’ll look at some of BLEU’s problems and, finally, how you can minimize those problems in your own work.

Orange painted blue on a blue background. (There, uh, aren’t a lot of eye-catching photos for NLP evaluation metrics as it turns out.)

A very difficult problem

BLEU was originally developed to measure machine translation, so let’s work through a translation example. Here’s a bit of text in Language A (aka “French”):

J’ai mangé trois filberts.

And here are some reference translations in Language B (aka “English”). (Note that some English speakers call hazelnuts “filberts”, so these are both perfectly good translations.)

I have eaten three hazelnuts.

I ate three filberts.

And here is a generated “neural” translation. (The “neural” in this case being “Rachael came up with a possible translation using her brain,” but pretend that this was generated by a network you were training.)

I ate three hazelnuts.

Now, here’s a very difficult problem: How can I assign a single numerical score to this translation that tells us how “good” it is using only the provided reference sentences and the neural output?

Why do you need a single numerical score? Great question! If we want to use machine learning to build a machine translation system we need a single real number score to put into our loss function. If we also know the potential best score, we can calculate the distance between the two. This allows us to give feedback to our system while it’s training―that is, whether a potential change will improve a translation by making the score closer to the ideal score―and to compare trained systems by looking at their scores on the same task.

One thing you might do is look at each word in the output sentence and assign it a score of 1 if it shows up in any of the reference sentences and 0 if it doesn’t. Then, to normalize that count so that it’s always between 0 and 1, you can divide the number of words that showed up in one of the reference translations by the total number of words in the output sentence. This gives us a measure called unigram precision.

So, for our example, “I ate three hazelnuts”, we see all the words in the output sentence in at least one of the reference sentences. Dividing that by the number of words in the output, 4, you end up with a score of 1 for this translation. So far so good! But what about this sentence?

Three three three three.

Using that same metric, we’d also come up with a score of 1. Which isn’t great: We need some way to tell the system we’re training that sentences like the first one are better than sentences like the second one.
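If it’s easier to see in code, here’s a minimal sketch of that unigram precision calculation. The naive lowercase-and-strip-punctuation tokenizer and the function name are just my own illustrative choices, not any standard implementation:

```python
import re

def tokenize(text):
    # Very naive tokenizer: lowercase and keep only runs of word characters.
    return re.findall(r"\w+", text.lower())

def unigram_precision(output, references):
    """Fraction of output words that appear in at least one reference."""
    output_words = tokenize(output)
    reference_vocab = set()
    for ref in references:
        reference_vocab.update(tokenize(ref))
    return sum(w in reference_vocab for w in output_words) / len(output_words)

references = ["I have eaten three hazelnuts.", "I ate three filberts."]
print(unigram_precision("I ate three hazelnuts.", references))   # 1.0
print(unigram_precision("Three three three three.", references)) # also 1.0 -- the problem
```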

You could tweak the score a bit by capping the number of times to count each word based on the highest number of times it appears in any reference sentence. Using that measure, our first sentence would still get a score of 1, while our second sentence would get a score of only .25.
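In code, that clipping might look something like this (again, just a sketch with my own helper names, reusing the same naive tokenizer):

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def clipped_unigram_precision(output, references):
    """Each output word only counts up to the maximum number of times
    it appears in any single reference sentence."""
    output_counts = Counter(tokenize(output))
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(tokenize(ref)).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in output_counts.items())
    return clipped / sum(output_counts.values())

references = ["I have eaten three hazelnuts.", "I ate three filberts."]
print(clipped_unigram_precision("I ate three hazelnuts.", references))    # 1.0
print(clipped_unigram_precision("Three three three three.", references))  # 0.25
```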

This gets us out of the “three three three three” problem, but it doesn’t help us with sentences like this, where for some reason the words have been sorted alphabetically:

Ate hazelnuts I three

Using our current method, this sentence gets a score of 1, the maximum score! We can get around this by counting, not individual words, but words that occur next to each other. These are called n-grams, where n is the number of words per group. Unigrams, bigrams, trigrams and 4-grams are made up of chunks of one, two, three and four words respectively.
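Extracting n-grams is just a matter of sliding a window of length n over the token list; here’s a quick sketch:

```python
def ngrams(tokens, n):
    """All contiguous n-word chunks of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["ate", "hazelnuts", "i", "three"], 2))
# [('ate', 'hazelnuts'), ('hazelnuts', 'i'), ('i', 'three')]
```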

For this example, let’s use bigrams. Generally, BLEU scores are based on an average of unigram, bigram, trigram and 4-gram precision, but we’re sticking with just bigrams here for simplicity. Also for simplicity, we won’t add a “word” that tells us that there’s a sentence boundary at the beginning and end of the sentence. With those guidelines, the bigrams in the words-sorted-alphabetically example are:

[Ate hazelnuts]

[hazelnuts I]

[I three]

If we use the same calculation we did with single words using these bigrams, we now get a score of 0; the worst possible score. Our “three three three three” example also gets a score of 0 rather than .25 now, while the first example “I ate three hazelnuts” has a score of 1. Unfortunately, so does this example:

I ate.

One way of getting around this is by multiplying the score we have so far by a measure that penalizes output sentences that are shorter than the reference translations. We do this by comparing the output’s length to the length of the reference sentence that is closest to it in length. This is the brevity penalty.

If our output is as long as or longer than the closest reference sentence, the penalty is 1. Since we’re multiplying our score by it, that doesn’t change the final output.

On the other hand, if our output is shorter than the closest reference sentence, we take one minus the ratio of that reference’s length to our output’s length, and raise e to the power of that whole shebang. Basically, the longer the closest reference sentence and the shorter our output, the closer the brevity penalty gets to zero.

In our “I ate” example, the output sentence was two words long and the closest reference sentence was four words. This gives us a brevity penalty of about 0.37, which, when multiplied by our bigram precision score of 1, drops our final score down to just 0.37.
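Putting the pieces together, here’s a toy version of the bigram-only score with the brevity penalty bolted on. Real BLEU averages unigram through 4-gram precision and is computed over a whole corpus, so treat this strictly as a sketch of the simplified version we’ve been working through (all the helper names are mine):

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_ngram_precision(output_tokens, reference_token_lists, n):
    output_counts = Counter(ngrams(output_tokens, n))
    if not output_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref_tokens in reference_token_lists:
        for gram, count in Counter(ngrams(ref_tokens, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(c, max_ref_counts[g]) for g, c in output_counts.items())
    return clipped / sum(output_counts.values())

def brevity_penalty(output_tokens, reference_token_lists):
    c = len(output_tokens)
    # Length of the reference sentence closest in length to our output.
    r = min((len(ref) for ref in reference_token_lists),
            key=lambda ref_len: abs(ref_len - c))
    return 1.0 if c >= r else math.exp(1 - r / c)

def toy_bigram_bleu(output, references):
    output_tokens = tokenize(output)
    reference_token_lists = [tokenize(ref) for ref in references]
    return (brevity_penalty(output_tokens, reference_token_lists)
            * clipped_ngram_precision(output_tokens, reference_token_lists, 2))

references = ["I have eaten three hazelnuts.", "I ate three filberts."]
print(round(toy_bigram_bleu("I ate three hazelnuts.", references), 2))  # 1.0
print(round(toy_bigram_bleu("I ate.", references), 2))                  # 0.37
```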

This measure, looking at n-grams overlap between the output and reference translations with a penalty for shorter outputs, is known as BLEU (short for “Bilingual evaluation understudy” which people literally only ever say when explaining the acronym) and was developed by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu at IBM in 2002. It’s a very popular metric in NLP, particularly for tasks where the output of a system is a text string rather than a classification. This includes machine translation and, increasingly, natural language generation. It’s one solution to the very-hard-problem I proposed back at the beginning of this post: developing a way to assign a single numerical score to a translation that tells us how “good” it is.
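In practice you probably won’t implement BLEU yourself: NLTK ships an implementation, and tools like sacreBLEU exist specifically to make reported corpus-level scores comparable across papers. Here’s a quick NLTK sketch; note that the default settings average unigram through 4-gram precision, so tiny toy sentences like these need smoothing to avoid collapsing to zero:

```python
# Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "I have eaten three hazelnuts .".lower().split(),
    "I ate three filberts .".lower().split(),
]
hypothesis = "I ate three hazelnuts .".lower().split()

# The default weights average unigram through 4-gram precision; smoothing
# keeps short sentences from scoring zero when a higher-order n-gram has
# no match at all.
score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```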

It’s also deeply flawed.

The problem(s) with BLEU

At this point you may be wondering, “Rachael, if this metric is so flawed, why did you walk us through how to calculate it?” Mainly to show you how reasonable the metric is. It’s fairly intuitive and the underlying idea, that you can evaluate the output of a machine translation system by comparing it to reference translations, has been extremely influential in NLP (although not without its critics).

BLEU also does have some strengths, of course. The most relevant ones for folks working in NLP have to do with how convenient it is for researchers.

  • It’s fast and easy to calculate, especially compared to having human translators rate model output.
  • It’s ubiquitous. This makes it easy to compare your model to benchmarks on the same task.

Unfortunately, this very convenience has led to folks overapplying it, even for tasks where it’s not the best choice of metric.

Despite my single-sentence examples, BLEU was always intended to be a corpus-level measure. Taking the BLEU score of each sentence in the corpus and then averaging across them will artificially inflate your score and you’ll definitely get dinged by reviewers if you try to publish work where you do it.
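If you’re using NLTK, the practical upshot is to score your whole test set with corpus_bleu (which pools n-gram counts across sentences, the way BLEU is defined) rather than averaging per-sentence scores. A sketch with made-up toy data:

```python
# Requires: pip install nltk
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction

# Hypothetical toy data: one list of reference-sets and one list of outputs.
list_of_references = [
    [["i", "ate", "three", "hazelnuts"], ["i", "have", "eaten", "three", "hazelnuts"]],
    [["her", "village", "is", "large"]],
]
hypotheses = [
    ["i", "ate", "three", "hazelnuts"],
    ["the", "village", "is", "large"],
]
smooth = SmoothingFunction().method1

# What BLEU is defined over: n-gram counts pooled across the whole corpus.
corpus_score = corpus_bleu(list_of_references, hypotheses, smoothing_function=smooth)

# The tempting-but-wrong shortcut: score each sentence, then average.
averaged_score = sum(
    sentence_bleu(refs, hyp, smoothing_function=smooth)
    for refs, hyp in zip(list_of_references, hypotheses)
) / len(hypotheses)

print(f"corpus-level BLEU:            {corpus_score:.3f}")
print(f"averaged sentence-level BLEU: {averaged_score:.3f}  # generally not the same thing")
```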

And even when it’s not overapplied, there are serious limitations to the metric that you should know before you choose to spend your time and compute chasing better BLEU scores. While there have been a lot of discussions of BLEU’s drawbacks, my top four issues with it are:

  • It doesn’t consider meaning
  • It doesn’t directly consider sentence structure
  • It doesn’t handle morphologically rich languages well
  • It doesn’t map well to human judgements

Let’s walk through each of these issues one by one so I can show you why I think these are problems.

BLEU doesn’t consider meaning

For me, this is the single most compelling reason not to rely solely on BLEU for evaluating machine translation (MT) systems. As a human user of MT, my main goal is to accurately understand the underlying meaning of the text in the original language. I’ll happily accept some syntactic or grammatical weirdness in the output sentence as long as it’s true to the meaning of the original.

BLEU does not measure meaning. It only rewards systems for n-grams that have exact matches in the reference translations. That means that a difference in a function word (like “an” or “on”) is penalized as heavily as a difference in a more important content word. It also means that a translation that uses a perfectly valid synonym will be penalized just because that synonym didn’t happen to show up in the reference translation.

Let’s work through an example so you can see why this is a problem.

Original (French): J’ai mangé la pomme.

Reference translation: I ate the apple.

Based on BLEU, these are all “equally bad” output sentences.

I consumed the apple.

I ate an apple.

I ate the potato.

As an end user of a machine translation system, I would actually be fine with the first two sentences. Even though they’re not exactly the same as the reference translation, they get the idea across. The third sentence, however, is completely unacceptable; it entirely changes the meaning of the original.
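To make that concrete, here’s the unigram-precision view of those three outputs, reusing the naive tokenizer from earlier. Each one changes exactly one word of the reference, so each scores exactly the same, even though only one of them wrecks the meaning:

```python
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def unigram_precision(output, reference):
    ref_words = set(tokenize(reference))
    out_words = tokenize(output)
    return sum(w in ref_words for w in out_words) / len(out_words)

reference = "I ate the apple."
for output in ["I consumed the apple.",   # fine: synonym
               "I ate an apple.",         # fine: article changed
               "I ate the potato."]:      # meaning destroyed
    print(f"{unigram_precision(output, reference):.2f}  {output}")
# All three score 0.75 -- exact word matching can't tell them apart.
```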

One of the metrics based on BLEU, NIST, sort of gets around this by weighting the penalties for mis-matched n-grams. So a mismatch on a more common n-gram (like “of the”) will receive a lower penalty, while a mismatch on a rarer n-gram (like “buffalo buffalo”) will be more highly penalized. But while this solves the problem of giving function words too much weight, it actually makes the problem of penalizing synonyms (like “ambled” for “walked”) worse because those synonyms only show up in rarer n-grams and are therefore assigned a higher penalty.

BLEU doesn’t consider sentence structure

Perhaps you’re not convinced by the whole “you can still get pretty good BLEU scores even if you’ve messed up a few key words that entirely change the meaning of the sentence” thing. Perhaps some syntax will convince you?

Syntax is the study of the structure of sentences. It’s the field of study that allows us to formally model sentences like “I saw the dog with the telescope”, which can mean either that I was using the telescope to look at the dog or that the dog had the telescope. The difference between the two meanings can only be captured by modelling the fact that the words in the sentences can have different relationships to each other.

I’m not the world’s greatest syntactician (by a long shot), but even I know that there’s a lot of important internal syntactic structure in natural language, and if you randomly shuffle the order of words in a sentence you either get 1) meaningless word stew or 2) something with a very different meaning.

Fortunately, there’s been a huge amount of work done in developing systems to automate modelling that structure, which is known as parsing.

Unfortunately, BLEU doesn’t build on any of this research. I can understand why you might want to avoid it, since parsing tends to be fairly computationally intensive, and having to parse all your output every time you evaluate does add some overhead. (Although there are metrics, like the STM, or subtree metric, that do directly compare the parses for the reference and output translations.)

However, the result of not looking at syntactic structure means that outputs that have a completely bonkers surface word order can receive the same score as those that are much more coherent.

There’s a nice illustration of this in Callison-Burch et al (2006). For this set of reference sentences:

Orejuela appeared calm as he was led to the American plane which will take him to Miami, Florida.

Orejuela appeared calm while being escorted to the plane that would take him to Miami, Florida.

Orejuela appeared calm as he was being led to the American plane that was to carry him to Miami in Florida.

Orejuela seemed quite calm as he was being led to the American plane that would take him to Miami in Florida.

They generated this machine translation.

Appeared calm when he was taken to the American plane, which will to Miami, Florida.

It’s not a perfect translation — the person’s name is dropped and there’s no verb after “will” in the second half of the sentence — but it’s also not complete nonsense. This example, however, is:

which will he was, when taken appeared calm to the American plane to Miami, Florida.

The kicker? Both the first and second outputs get the exact same BLEU score even though the first is clearly a better English translation.

BLEU doesn’t handle morphologically-rich languages well

If, like the majority of people on Earth, you happen to use a language other than English, you may have already spotted a problem with this metric: it’s based on word-level matches. For languages with a lot of morphological richness that quickly becomes a problem.

Morphemes are the smallest units of meaning in a language, and they combine to form words. One example in English would be the “s” in “cats” that tells us that there’s more than one cat. Some languages, like Turkish, pack a lot of morphemes into a single word, while others, like English, generally have fewer morphemes per word.

Consider the following sentences in Shipibo, a language spoken in Peru. (These examples are from “Evidentiality in Shipibo-Konibo, with a comparative overview of the category in Panoan” by Pilar Valenzuela.)

Jawen jemara ani iki.

Jawen jemaronki ani iki.

These are both perfectly acceptable translations of the English sentence “her village is large.” You may notice that the middle word, that starts with “jemar-,” has a different ending in the two sentences. The different endings are different morphemes that indicate how certain the speaker is about the fact that the village is large; the top one means they’ve actually been there and the bottom that they heard it was large from someone else.

This particular type of morpheme is known as an “evidentiality marker,” and English doesn’t have them. In Shipibo, however, you need one of these two morphemes for a sentence to be grammatical, so our reference translations would definitely have one of the two. But if we didn’t happen to generate the exact form of the word that appears in our reference sentence, BLEU would penalize our output for it… even though both sentences capture the English meaning perfectly well.
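A toy illustration, using the same naive unigram precision from earlier: swap one evidentiality form for the other and exact word matching treats the entire word as wrong, even though both are perfectly good translations:

```python
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def unigram_precision(output, reference):
    ref_words = set(tokenize(reference))
    out_words = tokenize(output)
    return sum(w in ref_words for w in out_words) / len(out_words)

reference = "Jawen jemara ani iki."     # speaker has seen the village herself
output    = "Jawen jemaronki ani iki."  # speaker was told the village is large
print(unigram_precision(output, reference))  # 0.75 -- penalized for a perfectly good form
```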

BLEU doesn’t map well to human judgements

If your eyes started to glaze over when I got into the grammar bits, now’s the point to tune back in.

What’s the final goal of building a machine translation, or chatbot, or question-answering system? You eventually want people to use it, right? And people won’t use a system if it doesn’t give them useful output. So it makes sense that the thing you actually want to be optimizing for is how much the humans using your system like it. Pretty much all the metrics we use are designed to be different ways of approximating that.

When BLEU was first proposed, the authors did actually do some behavioral tests to make sure that the measure correlated to human judgement. (And props to them for doing that!) Unfortunately, as researchers did more experiments comparing BLEU scores and human judgements, they discovered that this correlation isn’t always very strong and that other measures tend to pattern more closely with human judgements depending on the specific task.

Turian et al (2003), for example, found that BLEU had the poorest correlation with human judgments of machine translation out of three measures, with simple F1 having the strongest correlation with human judgements, followed by NIST. Callison-Burch et al (2006) looked at systems developed for a shared task (like a Kaggle competition for academics, but without prize money) and found that the relative ranking of those systems was very different depending on whether you were looking at BLEU scores or human evaluators’ judgements. And Sun (2010) compared three different metrics―BLEU, GTM and TER―and again found that BLEU scores were the least closely correlated with human judgements.

In other words: if you want humans to enjoy using your system, you shouldn’t just be focusing on getting a higher BLEU score.

I’m not the only one with reservations

Maybe you’re still not convinced that BLEU isn’t always the right tool for the job. That’s fine; in fact, I applaud your skepticism! However, I’m far from the only NLP practitioner who’s not the biggest fan of the metric. Here are some quick links to peer reviewed papers with more discussion of some of the other drawbacks of BLEU.

Peer reviewed papers:

  • Reiter (2018) is a meta-review of ACL papers that use both BLEU and human judgments for evaluation, and found that they only patterned together for system-level evaluations of machine translation systems specifically.
  • Sulem et al (2018) recommend not using BLEU for text simplification. They found that BLEU scores don’t reflect either grammaticality or meaning preservation very well.
  • Novikova et al (2017) show that BLEU, as well as some other commonly-used metrics, don’t map well to human judgements in evaluating NLG (natural language generation) tasks.
  • Ananthakrishnan et al (2006) lay out several specific objections to BLEU, and have an in-depth exploration of specific errors in English/Hindi translation that BLEU scores well.

And here are some non-peer-reviewed resources. (While they’re probably not going to be as convincing to peer reviewers looking at a research paper you’ve written, they might be easier to convince your boss with.)

Other resources:

  • Matt Post from Amazon Research has an excellent discussion of the effects of preprocessing on BLEU scores.
  • This blog post by Kirti Vashee, who worked in translation, discusses problems with BLEU from the perspective of translators.
  • Yoav Goldberg gave a really good talk that included a discussion on why you shouldn’t use BLEU for NLG at the International Conference of Natural Language Generation in 2018. You can find the slides here (search for “BLEU can be Misleading” to get the relevant slide). In particular, he and his co-authors found that their sentence simplification model achieved a high BLEU score even though it added, removed or repeated information.
  • Ana Marasović’s blog post “NLP’s generalization problem, and how researchers are tackling it” discusses how individual metrics, including BLEU, don’t capture models’ ability to handle data that differs from what they were exposed to during training.

So what should you use instead?

The main thing I want you to use in evaluating systems that have text as output is caution, especially when you’re building something that might eventually go into production. It’s really important for NLP practitioners to think about how our work will be applied, and what could go wrong. Consider this Palestinian man who was arrested because Facebook translated a post saying “good morning” as “attack them”! I’m not picking on Facebook in particular, I just want to point out that NLP products can be higher-stakes than we sometimes realize.

Carefully picking which metrics we optimize for is an important part of ensuring that the systems we work on are actually usable. For tasks like machine translation, for example, I personally think penalizing large changes in the meaning is very important.

That said, there are a lot of automatic evaluation metrics that can be alternatives to BLEU. Some of them will work better for different tasks, so it’s worth spending some time figuring out what the best choice for your specific project is.

Two popular methods are actually derivations of BLEU that were designed to help address some of its shortcomings.

  • NIST, as I mentioned above, weights n-grams based on their rareness. This means that correctly matching a rare n-gram improves your score more than correctly matching a common n-gram.
  • ROUGE is a modification of BLEU that focuses on recall rather than precision. In other words, it looks at how many n-grams in the reference translation show up in the output, rather than the reverse.
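To see the precision/recall flip concretely, here’s a toy recall-oriented unigram overlap in the spirit of ROUGE-1. (Real ROUGE implementations add stemming, longest-common-subsequence variants like ROUGE-L, and more; the function names here are just mine.)

```python
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def toy_rouge1_recall(output, reference):
    """What fraction of the reference's words show up in the output?"""
    ref_words = tokenize(reference)
    out_words = set(tokenize(output))
    return sum(w in out_words for w in ref_words) / len(ref_words)

def toy_unigram_precision(output, reference):
    """What fraction of the output's words show up in the reference?"""
    ref_words = set(tokenize(reference))
    out_words = tokenize(output)
    return sum(w in ref_words for w in out_words) / len(out_words)

reference = "I ate the apple."
output = "I ate."  # too short, but everything in it is "correct"
print(toy_unigram_precision(output, reference))  # 1.0 (precision can't see the omission)
print(toy_rouge1_recall(output, reference))      # 0.5 (recall does)
```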

There are also a large number of methods you can use to evaluate sequence to sequence models that aren’t based on BLEU. Some of them are measures adopted from other areas of NLP or machine learning.

  • Perplexity is a measure from information theory more often used for language modelling. It measures how well the learned probability distribution of words matches that of the input text.
  • Word error rate, or WER, is a commonly-used metric in speech recognition. It counts the substitutions (“an” for “the”), deletions and insertions needed to turn the output sequence into the reference, normalized by the length of the reference. (There’s a minimal implementation sketched just after this list.)
  • The F-score, also commonly known as F1, is the harmonic mean of precision (what fraction of the predictions were correct) and recall (what fraction of the possible correct predictions were made).
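Here’s the minimal WER sketch promised in the list above: word-level edit distance (substitutions, deletions and insertions) divided by the reference length. The whitespace tokenization is a simplifying assumption:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Standard dynamic-programming edit distance, computed over words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i          # deleting every reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j          # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,              # deletion
                             dist[i][j - 1] + 1,              # insertion
                             dist[i - 1][j - 1] + substitution_cost)
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("i ate the apple", "i ate an apple"))  # 0.25 (one substitution)
```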

Others were developed specifically for sequence to sequence tasks.

  • STM, or the subtree metric (which I mentioned above) compares the parses for the reference and output translations and penalizes outputs with different syntactic structures.
  • METEOR is similar to BLEU but includes additional steps, like considering synonyms and comparing the stems of words (so that “running” and “runs” would be counted as matches). Also unlike BLEU, it is explicitly designed to compare sentences rather than corpora.
  • TER, or the translation error rate, measures the number of edits needed to change the original output translation into an acceptable human reference translation.
  • TERp, or TER-plus, is an extension of TER that also considers paraphrases, stemming, and synonyms.
  • hLEPOR is a metric designed to be better for morphologically complex languages like Turkish or Czech. Among other factors it considers things like part-of-speech (noun, verb, etc.) that can help capture syntactic information.
  • RIBES, like hLEPOR, doesn’t rely on languages having the same qualities as English. It was designed to be more informative for Asian languages―like Japanese and Chinese―and doesn’t rely on word boundaries.
  • MEWR, probably the newest metric on the list, is one that I find particularly exciting: it doesn’t require reference translations! (This is great for low resource languages that may not have a large parallel corpus available.) It uses a combination of word and sentence embeddings (which capture some aspects of meaning) and perplexity to score translations.

Of course, I don’t have space here to cover all the automated metrics that researchers have developed. Do feel free to chime in the comments with some of your favorites and why you like them, though!

So what you’re saying is… it’s complicated?

That’s pretty much the heart of the matter. Language is complex, which means that measuring language automatically is hard. I personally think that developing evaluation metrics for natural language generation might currently be the hardest problem in NLP. (There’s actually an upcoming workshop on that exact thing at NAACL 2019, if you’re as interested as I am.)

That said, there is one pretty good method to make sure that your system is actually getting better at doing things that humans like: You can ask actual people what they think. Human evaluation used to be the standard in machine translation and I think there’s still a place for it. Yes, it’s expensive and, yes, it takes longer. But at least for systems that are going into production, I think you should be doing at least one round of system evaluation with human experts.

Before you get to that round, though, you’ll probably need to use at least one automatic evaluation metric. And I would urge you to use BLEU if and only if:

  1. You’re doing machine translation AND
  2. You’re evaluating across an entire corpus AND
  3. You know the limitations of the metric and you’re prepared to accept them.

Otherwise, I’d invest the time to find a metric that’s a better fit for your specific problem.

About the author:

Rachael is a data scientist at Kaggle (which, fun fact, has never run a competition that used BLEU as an evaluation metric). She has a PhD in linguistics, and a hedgehog named Gustav. If you’re interested in seeing more of her NLP tutorials and projects, you can check them out here.
