Multilingual NLP: Get Started with the PAWS-X Dataset in 5 Minutes or Less

A hands-on tutorial for retrieving, processing, and using the dataset

Yousef Nami
Towards Data Science


Photo by Hannah Wright on Unsplash.

Introduction

PAWS-X [1] is a multilingual Sequence Classification dataset created from the original English Paraphrase Adversaries from Word Scrambling (PAWS) dataset [2]. The dataset consists of 49,401 training pairs per language (plus 2,000 validation and 2,000 test pairs), each with an associated label indicating whether the sentence pair is a paraphrase (y=1) or not (y=0). The training pairs are machine translated from the original English dataset into the following languages: German (de), Spanish (es), French (fr), Japanese (ja), Korean (ko) and Chinese (zh), while the validation and test pairs were translated by professional human translators. As such, the dataset covers 7 languages spanning 4 scripts (Latin, Japanese kanji and kana, Korean Hangul, and Chinese Han characters) and 4 language families (Indo-European (Germanic and Romance), Japonic, Koreanic and Sino-Tibetan). Some dataset statistics are given in Table 1, and Table 2 shows example instances from the English subset of the dataset.

Table 1: Dataset statistics for the different data splits available. Note that the number in brackets represents the number of ‘cleaned’ instances.
Table 2: Two examples from the English subset of PAWS-X. These examples are taken directly from the dataset on HuggingFace.

PAWS-X is typically used as a benchmark for multilingual NLP and appears as part of the XTREME [3] benchmark by DeepMind. Overall, it is a relatively easy dataset, with models achieving up to 89% accuracy [3]. However, there is still a considerable gap to human performance (97.5%). This makes the dataset suitable for a) evaluating new NLP models (in particular, measuring cross-lingual performance) and b) fine-tuning NLP models for use in a plagiarism detection pipeline.

This article provides a short guide for retrieving, processing and using the dataset. If you are not able to use the dataset within 5 minutes (i.e. because of bugs in the code or because my article is not well written) then I have failed you. Please flag any problems so I can make the process smoother for others.

Retrieving the Dataset

The dataset can be easily retrieved from its HuggingFace repository. The function below can be used to retrieve the data.

Do make sure to have the datasets library installed first.

pip install datasets
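
The original gist is not reproduced here; below is a minimal sketch of such a loader (the function name load_pawsx and the per-language dictionary it returns are my own choices, not part of the original article):

from datasets import load_dataset

# Languages available in the PAWS-X configuration on the HuggingFace Hub
PAWSX_LANGUAGES = ["en", "de", "es", "fr", "ja", "ko", "zh"]

def load_pawsx(languages=PAWSX_LANGUAGES):
    """Download PAWS-X from the HuggingFace Hub, one DatasetDict per language."""
    return {lang: load_dataset("paws-x", lang) for lang in languages}

# Example usage: each entry has 'train', 'validation' and 'test' splits
pawsx = load_pawsx()
print(pawsx["en"]["train"][0])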

Processing the Dataset

After retrieving the dataset, I ran some simple validation tests (a code sketch follows the list below). These were:

  • Ensuring that the dataset sizes are the same for each language
  • Ensuring that none of the sentence pairs are empty
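
A sketch of those two checks, assuming the pawsx dictionary returned by the loader above:

def validate_pawsx(pawsx):
    """Run the two sanity checks described above on every split."""
    for split in ["train", "validation", "test"]:
        # Check 1: every language should have the same number of instances per split
        sizes = {lang: len(ds[split]) for lang, ds in pawsx.items()}
        assert len(set(sizes.values())) == 1, f"Split sizes differ for '{split}': {sizes}"
        # Check 2: no sentence in any pair should be an empty string
        for lang, ds in pawsx.items():
            empty = [ex["id"] for ex in ds[split]
                     if not ex["sentence1"].strip() or not ex["sentence2"].strip()]
            if empty:
                print(f"{lang}/{split}: {len(empty)} empty pairs, e.g. ids {empty[:5]}")

validate_pawsx(pawsx)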

To my disappointment, the latter test failed. A systematic error in the translation leaves 272 instances with empty strings. This error is documented in an issue on Google’s PAWS-X GitHub repo, but for whatever reason it has not been fixed. If you are looking to reproduce state-of-the-art (SOTA) results, it may be a good idea NOT to filter these broken instances out of the dataset, since it is unlikely that other researchers did so.

However, if you are interested in keeping the parallel nature of the dataset intact, then you can remove these broken instances (across all languages) using the following function:
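
The original helper is not shown here; the following is my own reconstruction of the same idea, collecting the broken IDs from the German subset into PAWSX_FILTER_IDS and then filtering every language (note that in the HuggingFace version the id field may be stored as an integer rather than a string):

def get_broken_ids(dataset_dict):
    """Collect the ids of instances whose sentence1 or sentence2 is empty."""
    broken = set()
    for split in dataset_dict:
        for ex in dataset_dict[split]:
            if not ex["sentence1"].strip() or not ex["sentence2"].strip():
                broken.add(ex["id"])
    return broken

# Any translated language works because the error is systematic; German is used here
PAWSX_FILTER_IDS = get_broken_ids(pawsx["de"])

def filter_pawsx(pawsx, filter_ids=PAWSX_FILTER_IDS):
    """Drop the broken instances from every language so the dataset stays parallel."""
    return {lang: ds.filter(lambda ex: ex["id"] not in filter_ids)
            for lang, ds in pawsx.items()}

pawsx_clean = filter_pawsx(pawsx)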

Note that we generate PAWSX_FILTER_IDS from the German subset. In fact, we can use any of the translated languages (NOT English), since the error is systematic (I checked that the function returns the same IDs for all the translated languages!).

If you don’t want to locate the excluded IDs each time you run your code (depending on your setup), then you may wish to hard code them in a config file. These IDs are provided in a Python-copyable format below:

['306', '473', '624', '1209', '1698', '1858', '1975', '2325', '2530', '2739', '2912', '2991', '3046', '3135', '3394', '3437', '3664', '3726', '3846', '4135', '4518', '4721', '4826', '5107', '5457', '5857', '5934', '6048', '6147', '6506', '6650', '7200', '7350', '7374', '7508', '7666', '7808', '8656', '8789', '8905', '9114', '9259', '9368', '9471', '9854', '10115', '10285', '10386', '10666', '10757', '10992', '11252', '11305', '11385', '11732', '11772', '11783', '11784', '11804', '11843', '11870', '11944', '12484', '12642', '12679', '12754', '12794', '12830', '13136', '14108', '14442', '14525', '14693', '14812', '14820', '14889', '15170', '15395', '15397', '15594', '15647', '16131', '16346', '16359', '16441', '16478', '16777', '17067', '17123', '17563', '17607', '17615', '17863', '17995', '18213', '18443', '18549', '18606', '19075', '19181', '19289', '19311', '19329', '19476', '19597', '19672', '19762', '19882', '19888', '19988', '20028', '20126', '20219', '20752', '20818', '20902', '20903', '21162', '21248', '21520', '21556', '22294', '22585', '22621', '22733', '22785', '22822', '23414', '23588', '23752', '23907', '24964', '25002', '25075', '25088', '25092', '25369', '25587', '25889', '26172', '26787', '26881', '27137', '27223', '27446', '27829', '27925', '28192', '28242', '28517', '28654', '28836', '28846', '29020', '29060', '29066', '29465', '29632', '30314', '30568', '30649', '30882', '31284', '31458', '31712', '31715', '31963', '32035', '32043', '32067', '32334', '32489', '32534', '32976', '33502', '33538', '33974', '34119', '34619', '34634', '34706', '34793', '34820', '34976', '35221', '35251', '35334', '35406', '35439', '35568', '36246', '36406', '36524', '36589', '36651', '36685', '36719', '36816', '36947', '37331', '37397', '37672', '38068', '38093', '38198', '38378', '39005', '39020', '39195', '39633', '39674', '39683', '39744', '40325', '40337', '40397', '40406', '40457', '40509', '40574', '40750', '40799', '40814', '40870', '40913', '41342', '41498', '41579', '41595', '41782', '42177', '42253', '42490', '42568', '42757', '42862', '43161', '43417', '44037', '44467', '44488', '44861', '45243', '45365', '45498', '45594', '45750', '45975', '45982', '46143', '46593', '46672', '46691', '46743', '46751', '47436', '47632', '47657', '47667', '47677', '48090', '48217', '48243', '48307', '48678', 
'48687', '48973', '48994', '49183', '49219', '49312', '49358']

Using the Dataset

How you use the dataset from this point onwards is mostly up to you. Here I provide important notes on tokenising the dataset, as well as a Python class that allows you to load and use the dataset directly in a PyTorch training loop. Skip the latter part if you are interested in using TensorFlow or the Trainer from HuggingFace. I also provide two common use cases for the PAWS-X dataset.

Tokenising the Dataset

Since our problem is Sequence Classification on a pair of sentences, we need to provide both sentences as separate arguments to the tokenizer. Further, we set max_length to 128, a value taken directly from the XTREME GitHub repository. It is relatively small, meaning that we can run most models on Colab without hitting memory constraints. We truncate with ‘longest_first’ (i.e. if the sentence pair exceeds the max length, the longer sentence is truncated) and pad shorter pairs to the max length. The following code can be used to tokenise the dataset:
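
The exact code from the article is not reproduced here; a sketch with those settings, using xlm-roberta-base purely as an example checkpoint (any multilingual model would do):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # example checkpoint

def tokenise(batch):
    """Tokenise a batch of sentence pairs with the settings described above."""
    return tokenizer(
        batch["sentence1"],
        batch["sentence2"],          # second sentence passed as the text pair
        max_length=128,              # value taken from the XTREME repository
        truncation="longest_first",  # truncate the longer sentence first
        padding="max_length",        # pad shorter pairs up to 128 tokens
    )

pawsx_tokenised = {lang: ds.map(tokenise, batched=True)
                   for lang, ds in pawsx_clean.items()}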

Dataset Class for PyTorch Training Loop

The following code prepares the PAWS-X dataset (from the HuggingFace source) for training in a PyTorch-style loop.
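
The original class is not reproduced here; below is a minimal sketch that wraps one tokenised language and split so it can be fed to a DataLoader (the class name PAWSXDataset is my own):

import torch
from torch.utils.data import Dataset, DataLoader

class PAWSXDataset(Dataset):
    """Wraps one tokenised PAWS-X split for use in a PyTorch training loop."""

    def __init__(self, hf_split):
        self.data = hf_split  # a tokenised HuggingFace Dataset (one language, one split)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        ex = self.data[idx]
        return {
            "input_ids": torch.tensor(ex["input_ids"]),
            "attention_mask": torch.tensor(ex["attention_mask"]),
            "labels": torch.tensor(ex["label"]),
        }

# Example: an English training loader
train_loader = DataLoader(PAWSXDataset(pawsx_tokenised["en"]["train"]),
                          batch_size=32, shuffle=True)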

Academic Use Case: Benchmarking NLP Models

PAWS-X is useful for benchmarking NLP models. Despite being an easy dataset, there is still a sizeable gap between model and human performance (88.9% vs. 97.5% [3]), mostly because of weaker performance on the non-Latin-script languages [3]. This makes the dataset a good choice for examining whether your NLP models have a Latin-script-centric bias.

Further, the size of the dataset makes it a good starting point for developing synthetic datasets. For example, adversarial examples can be created to evaluate model robustness; because PAWS-X is an easy dataset, models can be pushed a long way with adversarial examples before they are no longer useful. Finally, the PAWS-X construction never mixes languages within a sentence pair (i.e. a pair never consists of, say, an English sentence and a Spanish sentence). Because the dataset is large, however, a good portion of it can be used to supplement the dataset with pairs that mix languages, making a model better at multilingual paraphrase detection without harming its performance on monolingual paraphrase detection. One way such cross-lingual pairs could be built is sketched below.
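
As an illustration of that last idea, here is one way cross-lingual pairs could be constructed by pairing sentence1 from one language with the parallel sentence2 from another; this is my own sketch and assumes the paraphrase label survives translation:

from datasets import Dataset

def make_cross_lingual_pairs(pawsx, lang1="en", lang2="es", split="train"):
    """Pair sentence1 from lang1 with the parallel sentence2 from lang2."""
    ds1, ds2 = pawsx[lang1][split], pawsx[lang2][split]
    assert ds1["id"] == ds2["id"], "subsets must be parallel (same ids, same order)"
    return Dataset.from_dict({
        "id": ds1["id"],
        "sentence1": ds1["sentence1"],  # e.g. English sentence
        "sentence2": ds2["sentence2"],  # e.g. Spanish sentence
        "label": ds1["label"],          # label assumed to carry over under translation
    })

en_es_pairs = make_cross_lingual_pairs(pawsx_clean, "en", "es")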

Practical Use Case: PAWS-X Fine-Tuned Plagiarism Detection

The PAWS-X dataset is built for paraphrase detection, meaning that it can be used to fine-tune simple plagiarism detection models. However, because the sentences are relatively short, such a model is unlikely to work on very long documents out of the box. As such, any pipeline using it would first have to split the text into sentences before feeding them to the model, as sketched below.
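
A rough sketch of such a pipeline, assuming a checkpoint already fine-tuned on PAWS-X is saved at ./pawsx-finetuned (a hypothetical path) and using a deliberately naive sentence splitter:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "./pawsx-finetuned"  # hypothetical path to a PAWS-X fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

def split_sentences(text):
    """Naive sentence splitter; swap in spaCy or NLTK for real use."""
    return [s.strip() for s in text.replace("?", ".").replace("!", ".").split(".") if s.strip()]

def paraphrase_scores(source_doc, suspect_doc):
    """Score every suspect sentence against every source sentence."""
    scores = []
    for suspect in split_sentences(suspect_doc):
        for source in split_sentences(source_doc):
            inputs = tokenizer(source, suspect, max_length=128,
                               truncation="longest_first", padding="max_length",
                               return_tensors="pt")
            with torch.no_grad():
                logits = model(**inputs).logits
            scores.append((suspect, source, torch.softmax(logits, dim=-1)[0, 1].item()))
    return scores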

The core logic behind using PAWS-X to fine-tune a plagiarism detection model is that “paraphrase” is a measure of sentence similarity. With this logic, it is also possible to use PAWS-X fine-tuned models for measuring sentence similarity (in support of traditional metrics, such as the cosine similarity of embeddings). The multilingual nature of PAWS-X means that similarity can be measured across languages as well (e.g. given an English sentence “I love you” and a French sentence “Je t’adore” (simply “I love you” in French), determine their similarity)*.

*While this is true, to get better performance on multilingual sentence similarity it would be a good idea to fine-tune on an enhanced PAWS-X dataset that includes cross-lingual translation pairs, as described in the section “Academic Use Case: Benchmarking NLP Models”.

Concluding Remarks

  • PAWS-X is a multilingual Sequence Classification dataset in the form of paraphrase detection
  • It has over 49k training instances per language, plus 2k validation and 2k test instances
  • It covers 7 languages, spanning 4 scripts and 4 language families
  • It is a relatively easy task and dataset
  • It is a good introduction for people interested in multilingual NLP

Author’s Note

I was motivated to write this article because I recently worked on a project dealing with multilingual datasets. In this process, I found that while there are many rich datasets available online, processing them is a painfully difficult task (many appear in different locations with different versions, there are discrepancies between the sizes of the available datasets and those reported in papers, there are translation errors, etc…). I thought to myself that others have likely gone through the same problems, and that many more will in the future, so why not write concise tutorials to make life easier for people?

This is part of a series of articles that I intend to write on Multilingual NLP. The first few will focus on datasets (next ones will be TyDiQA [4] and XQuAD [5]). Later I will shift the focus towards the tasks themselves (representing data, interesting experiments, training and evaluation tutorials).

If you are interested in my work, please consider supporting me by getting a Medium membership using my referral link:

This helps me as a portion of your membership fee comes to me (don’t worry, this is at no extra cost to you!) while giving you full access to all articles on Medium!
