Stress-Test Your NLP Models
Dataset artifacts are one of the problems in Natural Language Processing (NLP) that hurt models’ performance in the real world. Even though pre-trained models can perform well on benchmark datasets, they often perform poorly in other settings. These failures are caused by dataset artifacts, or annotation artifacts: spurious correlations that a language model picks up during training [1].

It turns out there is a whole bunch of artifacts your model can absorb during training, thanks to peculiarities of the dataset and biases introduced by annotators. Why are artifacts bad? Essentially, they give your model shortcuts: it memorizes correlations that are actually false instead of learning the correct "reasoning". For example, if a dataset contains many examples where a male character is a doctor, then at inference time the model will assign a higher probability to men than to women being doctors. It is also possible to construct adversarial examples on which the model achieves surprisingly low performance.
Your job as a data scientist is to eliminate as many artifacts as possible and to increase the overall performance of your model at the same time.
Find them all!
First of all, you have to know your enemy, which means finding the artifacts, and to find them we need to define what we are looking for. To summarise, we can list the following artifacts (the list is not exhaustive):
- a model does not understand comparisons
- a model does not understand intensifiers
- a model does not understand taxonomy (synonyms, antonyms, etc.)
- a model is not robust to typos, irrelevant changes, etc.
- a model cannot handle named entities appropriately
- a model demonstrates unfairness to some minorities or genders
- a model does not understand an order of events
- a model cannot handle negations appropriately
- a model does not understand coreference
- a model does not understand roles such as agent, object, etc.
- a model is not robust to certain trigger words (word combinations that "hack" the model into producing unwanted results)
- a model cannot handle adversarial examples (adversarial examples are created by adding distracting sentences to the input paragraph; the distractors neither contradict the correct answer nor confuse humans)
- and others…
Now that we know their faces, we want to find the artifacts. While looking at datasets or annotating them you may spot some of them, but there is no better way to find them all than to train a baseline model and run some tests. Luckily, there is a perfect tool for that: CheckList [2]. It is not a panacea, but it is a huge help, as it covers most of the artifacts listed above.
Check-up for artifacts
Tools and instruments
Thanks to the authors, there is a great open-source tool that is ready to use out of the box for some datasets (such as SQuAD, QQP, and others). Let’s take a closer look.
CheckList is a testing suite inspired by unit testing in software development. The authors created a set of scripts that generate tests from special "templates" such as:
from checklist.editor import Editor

editor = Editor()
ret = editor.template({'question': 'Is this a {adj} movie?',
                       'context': 'This is a {adj} movie.'},
                      labels='Yes, this is {adj}.',
                      adj=['good', 'great', 'awesome', 'excellent'])
print(ret.data[0])
print(ret.labels[0])
print()
print(ret.data[1])
print(ret.labels[1])
This template returns a set of contexts, questions, and answers that can be used to test your model:
{'question': 'Is this a good movie?', 'context': 'This is a good movie.'}
Yes, this is good.
{'question': 'Is this a great movie?', 'context': 'This is a great movie.'}
Yes, this is great.
Using this tool you can generate any number of such examples. The great thing is that you can customize the provided test suites and add or edit any particular template.
Moreover, once you have set up your test templates and prepared your test suite, you run the tests and get a neat summary, which you can even visualize with a widget (it does not work in Colab). The tool is easy to follow and well documented, so spend some time checking it out.
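For illustration, here is a minimal sketch of that workflow. The template, labels, and capability names below are made up for this example (they are not the ready-made suites), and the prediction function is only hinted at, since it depends on your model.

from checklist.editor import Editor
from checklist.test_types import MFT
from checklist.test_suite import TestSuite

editor = Editor()

# A toy Minimum Functionality Test (MFT): negated positive adjectives should map to label 0.
ret = editor.template('This is not a {pos_adj} movie.',
                      pos_adj=['good', 'great', 'awesome'],
                      labels=0, save=True)
test = MFT(ret.data, labels=ret.labels, name='Simple negation',
           capability='Negation', description='Negated positive should be predicted as negative')

suite = TestSuite()
suite.add(test)

# predict_fn should return (predictions, confidences) for a list of texts;
# checklist.pred_wrapper.PredictorWrapper can wrap e.g. a predict_proba function.
# suite.run(predict_fn)
# suite.summary()               # per-test failure rates in text form
# suite.visual_summary_table()  # the interactive widget (not supported in Colab)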
Testing results
So what do the results look like for a given model? The authors themselves ran a whole battery of tests over the most popular state-of-the-art models of 2019 and found that all of them are quite biased and carry lots of artifacts.
The authors summarised the results of the CheckList tests on the following models (ordered as in their figure, from left to right):
- Microsoft’s Text Analytics
- Google Cloud’s Natural Language
- Amazon’s Comprehend
- BERT-base
- RoBERTa-base
According to the paper, the tested models’ failure rates are quite high on at least several tests. They fail badly on most of the tests related to negation processing, and perform poorly on changes of sentiment over time and on comparisons of two statements. Interestingly, the BERT-based models in particular also fail to classify neutral-sentiment sentences. So it turns out that most NLP models are vulnerable to artifacts even though they are trained on huge amounts of data.
As an example, my colleague Derrick Xu and I ran similar tests on a SQuAD-trained ELECTRA-small model. As expected, we got even worse results. Even though the model itself achieved a decent F1 score of 86.3 and an exact match score of 78.5, it had lots of biases, as you can see in Figure 1.

We also ran an adversarial evaluation, testing the baseline QA model on the adversarial AddOneSent dataset (see Figure 2). The F1 score dropped from 86% to 49.6%, and exact match dropped from 78% to 42.1% compared to performance on the SQuAD development set.
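For reference, a minimal sketch of such an evaluation, assuming the Hugging Face squad_adversarial dataset card, the transformers question-answering pipeline, and the squad metric from the evaluate library; the model name is a placeholder for your own SQuAD-trained checkpoint.

from datasets import load_dataset
from transformers import pipeline
import evaluate

# AddOneSent split of adversarial SQuAD (Jia & Liang, 2017).
adv = load_dataset('squad_adversarial', 'AddOneSent', split='validation')

# Placeholder model name: substitute your SQuAD-trained ELECTRA-small checkpoint.
qa = pipeline('question-answering', model='my-electra-small-squad')

metric = evaluate.load('squad')
predictions, references = [], []
for ex in adv:
    pred = qa(question=ex['question'], context=ex['context'])
    predictions.append({'id': ex['id'], 'prediction_text': pred['answer']})
    references.append({'id': ex['id'], 'answers': ex['answers']})

print(metric.compute(predictions=predictions, references=references))  # {'exact_match': ..., 'f1': ...}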

So, besides fixing the general linguistic capability failures found with CheckList, we also sought to improve the model’s performance on adversarial examples.
Now fix them!
There are a few methods to fight annotation artifacts:
- Retraining on hard subsets of the data, or on data where the gold label distribution is ambiguous. Dataset cartography [3], or any other approach that surfaces such examples, is a good way to find them (see the first sketch after this list).
- Ensemble-based debiasing: use a weak model to learn the spurious correlations, then train your main model to fit the residual of that model [4], or otherwise remove the weak model’s contribution from the output distribution. This forces the main model to focus on the hard examples (see the second sketch after this list).
- And other methods…
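Two minimal sketches of these ideas follow, under stated assumptions. The first computes the dataset cartography coordinates (confidence and variability of the gold-label probability across epochs) from an array you would have logged during training; the file name and the "hard" cut-off are hypothetical.

import numpy as np

# Assumption: during training you logged probs[e, i] = probability the model
# assigned to the gold label of training example i at the end of epoch e.
probs = np.load('gold_label_probs.npy')   # hypothetical file, shape (n_epochs, n_examples)

confidence = probs.mean(axis=0)           # dataset cartography: mean gold-label probability
variability = probs.std(axis=0)           # ...and its standard deviation across epochs

# Treat the lowest-confidence third as the "hard-to-learn" subset to retrain on
# (the one-third cut-off is an illustrative choice, not a prescription from [3]).
hard_to_learn = np.argsort(confidence)[: len(confidence) // 3]

The second illustrates one common way to realize residual fitting: a logit-sum (product-of-experts style) ensemble in PyTorch in which gradients only reach the main model. It is a sketch of the idea behind [4], not the authors’ implementation.

import torch
import torch.nn.functional as F

def residual_fit_loss(main_logits, bias_logits, labels):
    # Logit-sum ensemble: the frozen weak ("bias-only") model's logits are added
    # to the main model's logits, so the main model only has to explain what the
    # weak model cannot. Detaching stops gradients from reaching the weak model.
    return F.cross_entropy(main_logits + bias_logits.detach(), labels)

# Toy usage: a batch of 4 examples with 3 classes (e.g. NLI labels).
main_logits = torch.randn(4, 3, requires_grad=True)   # from the main model
bias_logits = torch.randn(4, 3)                       # from a weak, e.g. hypothesis-only, model
labels = torch.tensor([0, 2, 1, 1])
residual_fit_loss(main_logits, bias_logits, labels).backward()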
Basically, all of these methods boil down to retraining the model on data that is hard for it to get right. So the simplest method is to generate such data and retrain your model. That actually works pretty well.
Luckily, you don’t have to write all the extra data by hand. You already have the CheckList generation tools set up, so all you need is to define the templates and you are good to go.
To fix our ELECTRA-small model we used the CheckList tool as well. Here is example code that generates extra data to improve the comparison capability.
import checklist
import spacy
import itertools
import json
import checklist.editor
from checklist.test_types import MFT, INV, DIR
from checklist.expect import Expect
from checklist.test_suite import TestSuite
from checklist.perturb import Perturb
import checklist.text_generation

editor = checklist.editor.Editor()

# Template to generate comparison examples
adj = ['large', 'fat', 'fresh', 'kind', 'deep', 'weird', 'poor', 'clear', 'bold', 'calm',
       'clever', 'firm', 'mean', 'quick', 'quiet', 'strong', 'bright', 'light']
# Strip a trailing 'e' so that '{adj[0]}er' forms the comparative correctly ('large' -> 'larger')
adj = [(x.rstrip('e'), x) for x in adj]

temp1 = editor.template(
    [(
        '{first_name} is {adj[0]}er than {first_name1}.',
        'Who is less {adj[1]}?'
    ), (
        '{first_name} is {adj[0]}er than {first_name1}.',
        'Who is {adj[0]}er?'
    )],
    labels=['{first_name1}', '{first_name}'],
    adj=adj,
    remove_duplicates=True,
    nsamples=1000,
    save=True
)

# Generating a training-set extension from the comparison examples
train_extension_comparison = []
id_n = 0
for string in range(len(temp1['data'])):
    for i in range(len(temp1['data'][string])):
        context = temp1['data'][string][i][0]
        question = temp1['data'][string][i][1]
        answer = temp1['labels'][string][i]
        index_of_answer = context.find(answer)
        train_extension_comparison.append({
            'id': f'aug{id_n}',
            'title': 'aug_comparison',
            'context': context,
            'question': question,
            'answers': {'text': [answer] * 3, 'answer_start': [index_of_answer] * 3}
        })
        id_n += 1
# This will generate 1996 examples with different combinations
# of adjectives and first names according to the template.
Then just dump the data you get and concatenate it with the original training data; in our case that was the SQuAD train set. Do not forget to shuffle the new data together with the original data to avoid skewing the training distribution.
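A minimal sketch of that step, assuming the original SQuAD training examples have already been flattened into a list of per-example dictionaries in the same format as the generated records (the file names here are hypothetical):

import json
import random

# Hypothetical input: the original SQuAD train set, flattened to one dict per example.
with open('squad_train_flat.json') as f:
    squad_train = json.load(f)

combined = squad_train + train_extension_comparison   # from the snippet above
random.seed(42)
random.shuffle(combined)                              # mix augmented and original examples

with open('squad_train_augmented.json', 'w') as f:
    json.dump(combined, f)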
We repeated the generation for several capabilities of this model and ended up extending the original SQuAD train data by about 30%. We also included some adversarial data, which is another way to fight artifacts: Liu et al. (2019) [5] found that models’ (e.g. BiDAF, QANet) performance increases when they are retrained on 500 or more adversarial examples from the challenge dataset. So we added around 750 adversarial examples to our extended dataset as well, shuffled the whole dataset, and retrained the model.
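Sampling adversarial examples can be sketched the same way, assuming the squad_adversarial dataset card on the Hugging Face Hub; the sample size matches the text above, the rest is illustrative.

import random
from datasets import load_dataset

# AddSent-style adversarial SQuAD examples (Jia & Liang, 2017).
adv = load_dataset('squad_adversarial', 'AddSent', split='validation')

random.seed(0)
adv_sample = random.sample([dict(ex) for ex in adv], 750)   # ~750 adversarial examples
# ...then append adv_sample to the augmented training list and reshuffle before retraining.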
We then ran the same CheckList tests as before; the results are presented in Figure 3.

So, as you can see, the results are not perfect. We clearly decreased the model’s performance on some capabilities, but mainly those for which no retraining data was generated. For the capabilities where we did augment the data, we got a significant increase in performance (a decrease in failure rates).
However, some of the results are not that stable. There was probably not enough data to improve performance on the negation capability (even though we did target it), nor on the coreference and temporal capabilities.
At the same time, we managed both to increase performance on the adversarial dataset (Figure 4) and to compensate for the performance drop that adversarial retraining causes on the original dev dataset.
In terms of overall metrics, testing on the original dev dataset gave the following results: exact match grew from 78.5 to 78.7, and the F1 score dropped from 86.3 to 86.2. That is actually a good result: in line with the findings of Liu et al. (2019), retraining on adversarial data alone leads to a significant drop in the retrained QA model’s performance on the original dev set (in our case, exact match fell to 74.2 and F1 to 81.9), so adversarial training on its own does not look very attractive. However, combining it with other artifact-fighting techniques, such as CheckList-generated data augmentation in our case, lets you keep much better overall performance while significantly improving performance on adversarial data.

To conclude, we kept approximately the same overall metrics while improving some of the model’s capabilities and fighting artifacts.
To sum up
CheckList plus adversarial training is just one way to analyze and fix annotation artifacts. To significantly improve a model’s performance alongside artifact elimination, you have to use several approaches at once.
However, I cannot stress enough that you have to think about artifacts and fight them. That is an important step towards NLP models that are more robust, more fair, and less biased!
References
This article is brought to you by me and Derrick Xu.
All images are by the author unless noted otherwise.
[1] Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S. R., & Smith, N. A. (2018). Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324.
[2] Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118
[3] Swayamdipta, S., Schwartz, R., Lourie, N., Wang, Y., Hajishirzi, H., Smith, N. A., & Choi, Y. (2020). Dataset cartography: Mapping and diagnosing datasets with training dynamics. arXiv preprint arXiv:2009.10795.
[4] He, H., Zha, S., & Wang, H. (2019). Unlearn dataset bias in natural language inference by fitting the residual. arXiv preprint arXiv:1908.10763.
[5] Liu, N. F., Schwartz, R., & Smith, N. A. (2019). Inoculation by fine-tuning: A method for analyzing challenge datasets. arXiv preprint arXiv:1904.02668.