BLEU-BERT-y: Comparing sentence scores

Gergely D. Németh
Towards Data Science
Oct 14, 2019


Blueberries — Photo by Joanna Kosinska on Unsplash

The goal of this story is to understand BLEU, a widely used evaluation metric for machine translation (MT) models, and to investigate its relation to BERT.

This is the first story of my project where I try to use BERT contextualised embedding vectors in the Neural Machine Translation (NMT) problem. I am relatively new to MT, so any suggestions are welcome.

Sentence similarity

When it comes to machine translation or other natural language processing (NLP) problems where the output of the process is a text, measuring the correctness of the result is not straightforward.

The question we have to ask when evaluating machine translation algorithms is “How good is this translation?” or “How close is the sentence in the target language to the sentence in the original language?”

To understand this problem, here we will look at a simpler version of it: “How similar are two sentences in the same language?”

In this story, let’s use the following pocket sentences:
s0: James Cook was a very good man and a loving husband.
s1: James Cook was a very nice man and a loving husband.
s2: James Cook was a bad man and a terrible husband.
s3: James Cook was a nice person and a good husband.
s4: The sky is blue today and learning history is important.

I suggest taking a minute to think about how similar these sentences are to the first one!

BLEU

BLEU: Bilingual Evaluation Understudy provides a score to compare sentences [1]. Originally it was developed for translation, to evaluate a predicted translation against reference translations; however, it can be used for sentence similarity as well. Here is a good introduction to BLEU (or read the original paper).

The idea behind BLEU is to count the matching n-grams in the sentences. A unigram is a word (token), a bigram is a pair of two words, and so on. The positions of the matching n-grams within the sentences are irrelevant in this case.

To eliminate inflated scores (e.g. the candidate “the the the the the” gets a relatively good plain precision against the reference “The cat eats the bird”), a modified version of this count is introduced. This modified unigram precision penalizes a candidate for using a word more times than it appears in the reference text.
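
As a minimal sketch (not the exact code behind the scores below), clipped unigram precision can be computed in a few lines of Python:

    from collections import Counter

    def modified_unigram_precision(candidate, reference):
        # Each candidate word is credited at most as many times
        # as it occurs in the reference ("clipping").
        cand_counts = Counter(candidate.lower().split())
        ref_counts = Counter(reference.lower().split())
        clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
        return clipped / sum(cand_counts.values())

    # Plain unigram precision would give 5/5 = 1.0 here; clipping gives 2/5.
    print(modified_unigram_precision("the the the the the",
                                     "the cat eats the bird"))  # 0.4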

BLEU cannot handle word synonyms; however, it is a quick, inexpensive, language-independent algorithm that correlates with human evaluation.

BERT

BERT: Bidirectional Encoder Representations from Transformers provides contextualised word embeddings (and more) [2]. Here is a great summary of BERT (or read the original paper).

Word embeddings are vectors assigned to words to help the computer understand them. While the strings “cat” or “dog” are hard for a computer to handle, their vector representations are much easier to work with. One expectation of an embedding is that similar words should be close to each other.

Contextualised word embeddings assign different vectors to the same word depending on its context. One trick of BERT is that it is trained with the special tokens [CLS] and [SEP], and these tokens also have context-dependent embedding vectors. The original paper suggests that these embeddings can be used as sentence-level representations. Here, we will use the embedding vector of the [CLS] token of each sentence as the sentence’s embedding: [CLS] This is a sentence with separators . [SEP]
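
As a rough sketch of how such a [CLS] vector can be obtained (using the Hugging Face transformers library here, which is an assumption; the attached notebook may rely on a different BERT interface):

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    def cls_embedding(sentence):
        # The tokenizer adds the [CLS] and [SEP] tokens automatically.
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # last_hidden_state has shape (batch, tokens, hidden);
        # position 0 is the [CLS] token.
        return outputs.last_hidden_state[0, 0].numpy()

    v0 = cls_embedding("James Cook was a very good man and a loving husband.")
    print(v0.shape)  # (768,) for bert-base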

Vector distance

Here, we will calculate the similarity between sentences using the Euclidean distance and the Cosine similarity of the corresponding CLS sentence-level embeddings.

Euclidean distance: d(u, v) = ‖u − v‖ = √( Σᵢ (uᵢ − vᵢ)² )
Cosine similarity: cos(u, v) = (u · v) / (‖u‖ ‖v‖)

The Euclidean distance has a range of [0, ∞); therefore, to match the other scores, let’s use the function f(x) = (1/1.2)^x to map it into (0, 1]. This mapping happens to produce scores relatively close to the BLEU ones. The cosine similarity already has the right range; however, its values are not directly comparable with the BLEU scores. To evaluate the results, we have to compare the scores of the sentences relative to each other, not across metrics.
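
A small sketch of these two scores, assuming the [CLS] vectors from above are available as NumPy arrays:

    import numpy as np

    def euclidean_score(u, v):
        # Map the [0, inf) Euclidean distance into (0, 1] with (1/1.2)^x.
        return (1 / 1.2) ** np.linalg.norm(u - v)

    def cosine_similarity(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # e.g. with v0, v1 = cls_embedding(s0), cls_embedding(s1):
    # print(euclidean_score(v0, v1), cosine_similarity(v0, v1))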

Similarity scores

Here is a table of the scores using BLEU. As we can see, BLEU identifies the second sentence as close to the first one (only one word changed), but it cannot handle the synonyms of the fourth sentence. Also, the score of a completely different sentence is relatively high.

BLEU with a smoothing function [3] resolves this latter issue.
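
Scores along these lines can be reproduced with NLTK; the exact smoothing method and tokenisation are assumptions here, so the numbers may differ slightly from the table:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "James Cook was a very good man and a loving husband.".lower().split()
    candidates = [
        "James Cook was a very nice man and a loving husband.",
        "James Cook was a bad man and a terrible husband.",
        "James Cook was a nice person and a good husband.",
        "The sky is blue today and learning history is important.",
    ]

    smooth = SmoothingFunction().method1  # assumed smoothing method
    for cand in candidates:
        score = sentence_bleu([reference], cand.lower().split(),
                              smoothing_function=smooth)
        print(f"{score:.3f}  {cand}")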

BERT with Euclidean distance achieves scores relatively similar to BLEU, but it handles the synonyms as well. The cosine similarity of the BERT vectors produces scores similar to the spaCy similarity scores.

spaCy is an industrial-strength natural language processing tool. spaCy uses word embedding vectors, and a sentence’s vector is the average of its tokens’ vectors. More about spaCy similarity here.
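
A short sketch of the spaCy comparison (assuming a model with word vectors, e.g. en_core_web_md, is installed):

    import spacy

    # Requires a model with word vectors, e.g.:
    #   python -m spacy download en_core_web_md
    nlp = spacy.load("en_core_web_md")

    doc0 = nlp("James Cook was a very good man and a loving husband.")
    doc3 = nlp("James Cook was a nice person and a good husband.")
    # Document vectors are the average of the token vectors.
    print(doc0.similarity(doc3))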

BLEU and BERT scores of the pocket sentences, similarity to the first sentence

BERTScore

(Updated on 06.11.2019)

This is an update, as I recently found an article proposing the use of BERT for evaluating machine translation systems [4]. The authors show that BERTScore correlates better with human judgement than previous metrics such as BLEU.

BERTScore is similar to the approach introduced here; however, BERTScore uses the similarity of token-level BERT embedding vectors, while we used sentence-level embeddings.
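
A minimal sketch with the authors’ bert-score package (pip install bert-score); the model and settings here are assumptions:

    from bert_score import score

    cands = ["James Cook was a nice person and a good husband."]
    refs = ["James Cook was a very good man and a loving husband."]

    # Returns precision, recall and F1 tensors, one entry per candidate.
    P, R, F1 = score(cands, refs, lang="en")
    print(F1[0].item())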

Summary

This story introduced the BLEU score for evaluating sentence similarity and compared it to BERT vector distances. The examples tried to illustrate how hard it is to define what a similar sentence means, and the two methods showed possible ways to quantitatively measure some kind of similarity. The code attached to this story provides basic usage of BLEU and BERT as well as spaCy. For detailed introductions, additional links are provided.

Corresponding codes are available in Google Colab.

References

[1] Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics (pp. 311–318). Association for Computational Linguistics.

[2] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[3] Lin, C. Y., & Och, F. J. (2004, July). Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (p. 605). Association for Computational Linguistics.

[4] Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675.

Learn NMT with BERT stories

  1. BLEU-BERT-y: Comparing sentence scores
  2. Visualisation of embedding relations (word2vec, BERT)
  3. Machine Translation: A Short Overview
  4. Identifying the right meaning of the words using BERT
  5. Machine Translation: Compare to SOTA
  6. Simple BERT using TensorFlow 2.0
