
Natural Language Inference: An Overview

Benchmarks and models

Photo by Valentin Vesa from Pexels

What and why?

Natural Language Inference (NLI) is the task of determining whether a given "hypothesis" logically follows from a "premise". In layman’s terms, you need to decide whether the hypothesis is true, given that the premise is your only knowledge about the subject.

Why should you read this? I assume that you know nothing about NLI and promise to get you up to speed with the latest (April 2022) developments in the field. Quite a bold promise for one article, isn’t it?

Problem Statement

Natural Language Inference, also known as Recognizing Textual Entailment (RTE), is the task of determining whether a given "hypothesis" logically follows from a "premise" (entailment), contradicts it (contradiction), or is neither entailed nor contradicted by it (neutral).

You can view NLI as classifying the hypothesis into three classes based on the premise: entailment, contradiction, or neutral. A closely related problem is fact-checking. It is very similar to NLI; the only difference is that you are not given the premise. Fact-checking therefore decomposes into two sub-problems: retrieving the evidence (a search problem) and NLI. In this article, I will focus on the NLI problem.
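
To make this concrete, here is a small illustrative example in Python; the premise and hypotheses are invented for this article and are not taken from any benchmark:

premise = "A man is playing a guitar on stage."

# One premise, three hypotheses -- one per NLI class
examples = [
    {"hypothesis": "A person is performing music.", "label": "entailment"},
    {"hypothesis": "The man is playing at a sold-out stadium.", "label": "neutral"},
    {"hypothesis": "The man is asleep in bed.", "label": "contradiction"},
]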

Benchmarks

When entering a new game, the first step is to learn the rules to play by. In Machine Learning, benchmarks are the de-facto rules that researchers play by.

SNLI

  • Website
  • Paper
  • Benchmark
  • Examples: 570k
  • Premise Type: sentence.
  • Labels: entailment, neutral, contradiction.

That’s the golden classic of NLI benchmarking: widely used, respected, and, frankly, outdated. The SNLI dataset is based on image captions from the Flickr30k corpus, where the captions are used as premises. The hypotheses were written manually by Mechanical Turk workers following these instructions:

  1. entailment: write one alternate caption that is definitely an accurate description of the photo;

  2. neutral: write one alternate caption that might be an accurate description of the photo;

  3. contradiction: write one alternate caption that is a wrong description of the photo.

SNLI suffers from two significant drawbacks:

  • The premises are limited to the short photo descriptions and thus don’t contain temporal reasoning, beliefs, or modalities.
  • Simple and short premises call for simple and short hypotheses, so the benchmark is not challenging enough: models easily reach human-level accuracy.

MultiNLI

  • Website
  • Paper
  • Benchmark
  • Examples: 433k
  • Premise Type: sentence
  • Labels: entailment, neutral, contradiction.

MultiNLI (MNLI) is modeled after the SNLI dataset schema, so it can be used in conjunction with SNLI, but it covers a much broader range of spoken and written text from ten different genres.

MNLI’s premises are derived from the ten sources or genres (based on the Open American National Corpus):

  1. face-to-face: transcriptions from the Charlotte Narrative and Conversation Collection of two-person conversations;
  2. government: reports, speeches, letters, and press releases from public domain government websites;
  3. letters: letters from the Indiana Center for Intercultural Communication of Philanthropic Fundraising Discourse;

  4. 9/11: the public report from the National Commission on Terrorist Attacks Upon the United States;

  5. OUP: five non-fiction works on the textile industry and child development published by the Oxford University Press;
  6. slate: popular culture articles from the archives of Slate Magazine;
  7. telephone: transcriptions from the University of Pennsylvania’s Linguistic Data Consortium Switchboard corpus of two-sided telephone conversations;
  8. travel: travel guides published by Berlitz Publishing;
  9. verbatim: posts about linguistics for non-specialists from the Verbatim archives;
  10. fiction: several freely available works of contemporary fiction written between 1912 and 2010.

The hypothesis creation process was as follows: a crowdworker was presented with a premise and asked to compose three new sentences:

  1. entailment: one which is necessarily true or appropriate whenever the premise is true;
  2. contradiction: one which is necessarily false or inappropriate whenever the premise is true;
  3. neutral: and one where neither condition applies.

An important feature of the dataset is that only five out of ten genres appear in the training set, making the other five genres unseen for a model. These unseen genres can be used to estimate how well the model can generalize to the unseen sources of text.

SuperGLUE

  • Website
  • Paper
  • Benchmark (RTE)
  • Benchmark (CB)
  • Examples: RTE: 3k, CB: <1k
  • Premise Type: sentence
  • Labels: RTE: entailment, not_entailment; CB: entailment, contradiction, neutral

SuperGLUE is a collection of ten benchmarks that measure the performance of an NLP model on three types of tasks:

  1. Natural Language Inference
  2. Question answering
  3. Coreference resolution

Based on the performance on these tasks, SuperGLUE aims to provide a single score that summarizes the model’s capabilities in natural language understanding. SuperGLUE is an extension of the very popular GLUE benchmark with more complex tasks.

There are two NLI benchmarks in SuperGLUE: RTE and CB.

RTE, or Recognizing Textual Entailment, comes from the annual textual entailment competitions and combines the RTE1, RTE2, RTE3, and RTE5 datasets. The data itself comes from Wikipedia and news articles. A distinctive feature of RTE is that it’s labeled for two-class rather than three-class classification, so there is no neutral label.

CB, or CommitmentBank, is a corpus of short texts in which at least one sentence contains an embedded clause. Each embedded clause is annotated with the degree to which the writer appears committed to the truth of that clause. The resulting task is framed as three-class textual entailment on examples drawn from the Wall Street Journal, fiction from the British National Corpus, and Switchboard. Each example consists of a premise containing an embedded clause, and the corresponding hypothesis is the extraction of that clause.
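
Both NLI tasks can be pulled from the HuggingFace Datasets library. A minimal sketch, assuming the super_glue dataset id with the rte and cb configurations:

from datasets import load_dataset

# RTE is two-class (entailment / not_entailment), CB is three-class
rte = load_dataset("super_glue", "rte")
cb = load_dataset("super_glue", "cb")

print(rte["train"][0])  # fields: premise, hypothesis, idx, label
print(cb["train"][0])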

FEVER

  • Website
  • Paper
  • Benchmark
  • Examples: 185k
  • Premise Type: Wikipedia URL + sentence number
  • Labels: Supported, Refuted, NotEnoughInfo

Note: this dataset calls hypotheses "claims".

This dataset differs from SNLI and MNLI because it supports both NLI and fact-checking. Instead of the premise text, the dataset provides the URL of the Wikipedia page from which the premise can be extracted. It also provides the sentence number of the premise on that page to support pure NLI use cases.

The claims were human-generated, manually verified against the introductory sections of Wikipedia pages, and labeled as Supported, Refuted, or NotEnoughInfo.

The claims were generated by paraphrasing facts from Wikipedia and mutating them in a variety of ways, some of which were meaning-altering. For each claim, and without the knowledge of where the claim was generated from, annotators selected evidence in the form of sentences from Wikipedia to justify the labeling of the claim.

One caveat of the dataset is that it doesn’t provide Wikipedia URLs of premises for the NotEnoughInfo labels. Therefore, you will need to search for premises yourself if you would like to use the dataset for the NLI use case.

WIKI-FACTCHECK

  • Website
  • Paper
  • Benchmark: no benchmark on Papers with Code
  • Examples: 160k
  • Premise Type: evidence document
  • Labels: entailment, contradiction, neutral

Unlike SNLI, MNLI, and FEVER, this dataset consists of real-world claims extracted from Wikipedia citations. The premises, or evidence documents, are the documents cited by those claims. In addition, the dataset provides context for each claim, which can help in ambiguous cases. This emphasis on real-world claims makes for a considerably more challenging NLI task.

The drawback of this dataset is the quality: the claims and premises were extracted from Wikipedia automatically and sometimes made little sense.

ANLI

  • Website
  • Paper
  • Benchmark
  • Examples: Round 1 – 19k, Round 2 – 47k, Round 3 – 103k
  • Premise Type: sentence
  • Labels: entailment, contradiction, neutral

ANLI is the most advanced NLI benchmark to date. ANLI’s collection process is very different from other datasets because it employs a technique called Human-And-Model-in-the-Loop Enabled Training (HAMLET).

HAMLET organizes the data collection process into rounds, where each round has the following steps (a simplified sketch follows the list):

  1. A SOTA model is trained.
  2. A human annotator is given the context (premise) and the desired target label and is asked to write a hypothesis that will fool the model into predicting the wrong label.
  3. If the model misclassifies the example, the example is shown to two human verifiers to make sure it is correct. In case they disagree, a third human verifier breaks the tie.
  4. The example is added to the training set of this round and will be used to train the model for the next round.
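
A heavily simplified pseudocode sketch of one such round; every name here is a placeholder for illustration, not part of any released codebase:

def hamlet_round(training_data, contexts, annotator, verifiers):
    model = train_sota_model(training_data)  # step 1: train a SOTA model
    collected = []
    for context, target_label in contexts:
        # step 2: the annotator writes a hypothesis meant to fool the model
        hypothesis = annotator.write_adversarial_hypothesis(model, context, target_label)
        if model.predict(context, hypothesis) != target_label:
            # step 3: two verifiers check the example; a third breaks ties
            if verifiers.confirm(context, hypothesis, target_label):
                collected.append((context, hypothesis, target_label))
    # step 4: verified examples extend the training data for the next round
    return training_data + collected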

With each round, the complexity of the model and the dataset grew:

  1. Round 1. Model: BERT-Large. Dataset: SNLI + MNLI. Contexts: Wikipedia + HotpotQA.
  2. Round 2. Model: an ensemble of RoBERTa models. Dataset: SNLI + MNLI + FEVER + Round 1 data. Contexts: new from Wikipedia + HotpotQA.
  3. Round 3. Model: an ensemble of RoBERTa models. Dataset: SNLI + MNLI + FEVER + Round 1 data + Round 2 data. Contexts: various sources, including spoken texts and longer contexts.

In each round, the NLI model used was stronger, and the contexts were longer and harder to understand. Thus, the adversarial annotators needed more attempts before they managed to fool the model: in Round 1, annotators needed 3.4 tries on average, while in Round 3, they needed 6.4.

The ANLI dataset is more difficult than the others by design and provides longer, real-world contexts. In addition, ANLI provides a mechanism for extension by adding new rounds, so the benchmark can develop alongside the SOTA models.

SOTA Models

This is the part you all have been waiting for – models! As expected, BERT-derived architectures are leading the list.

DeBERTa

This is a Transformer-based architecture and the current SOTA model on most of the SuperGLUE tasks: NLI (RTE, CommitmentBank), Common Sense Reasoning (ReCoRD), Question Answering (COPA, MultiRC).

DeBERTa:

  1. introduces a disentangled attention mechanism, where each word is represented by two vectors that encode its content and its position; the attention weights between words are computed from disentangled matrices over their contents and relative positions (see the sketch after this list);
  2. uses an enhanced mask decoder that incorporates absolute positions in the decoding layer to predict the masked tokens during pre-training;
  3. applies a new virtual adversarial training method during fine-tuning to improve the model’s generalization.
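
A simplified PyTorch sketch of the disentangled attention score from item 1. It assumes per-pair relative-position embeddings and omits multiple heads, masking, and the exact relative-distance bucketing used in the paper:

import torch

def disentangled_attention(H, P, Wq_c, Wk_c, Wq_r, Wk_r):
    # H: (n, d) token content vectors; P: (n, n, d) relative-position vectors,
    # where P[i, j] embeds the relative distance between positions i and j.
    Qc, Kc = H @ Wq_c, H @ Wk_c                     # content projections
    c2c = Qc @ Kc.T                                 # content-to-content term
    c2p = torch.einsum("id,ijd->ij", Qc, P @ Wk_r)  # content-to-position term
    p2c = torch.einsum("ijd,jd->ij", P @ Wq_r, Kc)  # position-to-content term
    d = H.shape[-1]
    return torch.softmax((c2c + c2p + p2c) / (3 * d) ** 0.5, dim=-1)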

RoBERTa

This is a Transformer-based architecture. The interesting part is that it merely tweaks several hyperparameters of the BERT training procedure and achieves a new SOTA that beats many BERT derivatives. This work questions the source of the performance improvements demonstrated by the models released after BERT.

RoBERTa’s changes to the BERT recipe are:

  1. training the model longer with bigger batches over more data
  2. removing the next sentence prediction objective
  3. training on longer sequences
  4. dynamically changing the masking pattern applied to the training data

Playing with DeBERTa

I decided to reproduce the DeBERTa results on the MNLI dataset. The DeBERTa base model was selected due to computational limitations. MNLI has two test splits: matched and mismatched genres. In a nutshell, the matched genres were present in the training split, and the mismatched genres were not. According to the DeBERTa paper, the DeBERTa base model should reach 88.8% accuracy on the matched (MNLI-m) test set and 88.5% on the mismatched (MNLI-mm) test set:

Performance of RoBERTa and DeBERTa compared.

The easiest way to get the DeBERTa pre-trained weights is by using the HuggingFace library. I have used Microsoft’s deberta-base model, but a more recent deberta-v3-base model is also available. To use these pre-trained weights, you need to load them using the CrossEncoder class from the sentence-transformers library.

The MNLI dataset is available from the HuggingFace Datasets library, and we should use the validation_matched split for MNLI-m and the validation_mismatched split for MNLI-mm.

There are two caveats when reproducing the DeBERTa performance on MNLI:

  1. First, the label encoding in the dataset and the trained model differ. That’s why if you score the model on MNLI-m right away, you will get only around 31% accuracy.
  2. Second, the trained model is sensitive to the sentence order: the premise should go first, and the hypothesis should go second.

The label encoding of the dataset:

  • 0 – entailment
  • 1 – neutral
  • 2 – contradiction

The label encoding of the trained model:

  • 0 – contradiction
  • 1 – entailment
  • 2 – neutral

When the fixes above are applied, the model achieves 88.29% accuracy on MNLI-m. I suppose the reason for the small gap from the paper’s numbers is that the deberta-base weights from HuggingFace were not trained on the MNLI dataset at all. That also explains why the spread between the matched and the mismatched genres is so small in my tests: 88.29% matched versus 88.11% mismatched, while the paper reports 88.8% and 88.5%. A model trained on the matched genres should perform better on them; since this model was not trained on any of the genres, it performs similarly on both, and in both cases the accuracy is slightly lower than reported in the paper.

Let’s go through the workflow of reproducing the results. I assume that the experiment is run in a Colab notebook with GPU.

First, we need to install the required packages:

!pip install transformers sentence-transformers datasets sentencepiece

Next, we load the dataset:

Using the HuggingFace Datasets library to load the MNLI dataset
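
A minimal sketch of this step, assuming the multi_nli dataset id on the HuggingFace Hub:

from datasets import load_dataset

# MNLI ships with two validation splits: matched (seen genres) and mismatched (unseen genres)
mnli = load_dataset("multi_nli")
matched = mnli["validation_matched"]
mismatched = mnli["validation_mismatched"]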

Keeping in mind that the label encodings of the model and the dataset are not compatible, we need to define a mapping function:

Define a function to convert model labels to dataset labels
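
A sketch of such a mapping, following the two encodings listed above (the function name is my own):

# model output:  0 = contradiction, 1 = entailment, 2 = neutral
# dataset label: 0 = entailment,    1 = neutral,    2 = contradiction
MODEL_TO_DATASET = {0: 2, 1: 0, 2: 1}

def model_label_to_dataset_label(model_label: int) -> int:
    return MODEL_TO_DATASET[model_label]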

And load the model using CrossEncoder:

Load pre-trained model weights using the sentence_transformers library
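
A sketch of this step. I assume the NLI-fine-tuned cross-encoder checkpoint built on deberta-base, cross-encoder/nli-deberta-base, which matches the model label encoding given above:

from sentence_transformers import CrossEncoder

# Assumed checkpoint: an NLI cross-encoder built on microsoft/deberta-base
model = CrossEncoder("cross-encoder/nli-deberta-base")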

Finally, we can score the model on the validation_matched split to achieve 88.29% on the matched genres:

Scoring the model on validation_matched
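
A sketch of the scoring step under the assumptions above; note that the premise goes first in each pair:

import numpy as np

def score(split):
    # Premise first, hypothesis second -- the model is sensitive to this order
    pairs = list(zip(split["premise"], split["hypothesis"]))
    logits = model.predict(pairs)  # shape: (n_examples, 3)
    preds = [model_label_to_dataset_label(i) for i in logits.argmax(axis=1)]
    return float(np.mean(np.array(preds) == np.array(split["label"])))

print(f"MNLI-m accuracy: {score(matched):.4f}")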

and score the model on the validation_mismatched split to achieve 88.11% on the mismatched genres:

Scoring the model on validation_mismatched
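
The same helper applied to the mismatched split:

print(f"MNLI-mm accuracy: {score(mismatched):.4f}")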

You can find a complete Colab notebook here.

Conclusion

We have reviewed the datasets and models most commonly used in NLI. We also got our hands dirty by reproducing the DeBERTa base results on MNLI.

If you need to solve an NLI task quickly, use the DeBERTa model. It’s SOTA and is available through HuggingFace. If you need a benchmark, I would go with ANLI or MNLI.

What I especially like about ML these days is that transformers are ubiquitous. Thus, the models we have reviewed are also applicable in many applied tasks beyond NLI. The other way around also works – many transformer-based models are benchmarked on NLI tasks to show the performance gains compared to the previous architectures.

Natural Language Inference is an important task that pushes us to develop models that can actually understand the dependencies between sentences. There is much more to NLI than we have covered here, but you already have enough ground to explore other models and understand the differences between the benchmarks.

I wish you good luck on the ML path, and remember to share your journey with the community!

