
Multilingual NLP: Get Started with the TyDiQA-GoldP Dataset in 10 Minutes or Less

A hands-on tutorial for retrieving, processing and using the dataset

Introduction

TyDiQA-GoldP [1] is a difficult Extractive Question Answering dataset that is typically used for benchmarking question answering models. What makes the dataset worthwhile is the manner in which the data was created. Annotators were given the first 100 characters of random Wikipedia articles and asked to generate questions whose answers they would be interested in finding [1]. To quote an example from the paper [1], given the prompt "Apple is a fruit", a human annotator may ask "What disease did Steve Jobs die of?". This strategy simulates human curiosity, which could be one of the reasons TyDiQA-GoldP is more difficult than other Multilingual Extractive QA datasets such as XQuAD [2] and MLQA [3]. Once the questions are created, matching Wikipedia articles are found by selecting the first article that appears in the Google search results for the question. Annotators are then asked to find the best answer to the question within the article, if any such answer exists. Question-answer pairs with no answer are discarded, and for those with an answer, only the passage that contains the answer is kept.

Each instance consists of the following: a question, a context passage, an answer (text), the start span of the answer and the instance ID. The dataset covers the following languages: English (en), Bengali (bn), Korean (ko), Telugu (te), Swahili (sw), Russian (ru), Finnish (fi), Indonesian (id) and Arabic (ar). As such, it covers 5 scripts (Latin, Brahmic, Cyrillic, Hangul, Arabic) and 7 language families (Indo-European (Indo-Aryan, Germanic, Slavic), Afro-Asiatic, Uralic, Austronesian, Koreanic, Niger-Congo, Dravidian). Unlike many Multilingual NLP datasets, the original TyDiQA-GoldP is NOT parallel: instances cannot be matched across languages, since they were not created by translation. However, DeepMind [4] has created a parallel version of TyDiQA-GoldP by taking the English subset and translating it to the other languages. Table 1 shows the number of instances for each language in the original TyDiQA-GoldP dataset, while Table 2 shows statistics for the DeepMind-generated dataset. Table 3 shows an instance from the English subset of the dataset.

TyDiQA-GoldP is typically used as a benchmark for multilingual NLP and the parallel dataset appears as part of the XTREME [4] datasets by DeepMind. Overall, it is a very hard dataset, with models achieving up to 77.6 on the F1 score and 68 on the exact match [4]. For a point of comparison, the human performance is 90.1. The original TyDiQA-GoldP is relatively large and good for fine-tuning, especially for improving performance on non-Latin languages. The parallel TyDiQA-GoldP dataset is relatively small in size, making it suitable for training on publicly available GPUs (e.g. Colab).

In this article, I provide a hands-on tutorial for retrieving the dataset from multiple sources (from flat files and from HuggingFace through the datasets API), processing it (checking data validity, finding matching instances) and using it (tokenising it for training), for both the original setting and the parallel setting from DeepMind. I’ve written this article with the following in mind, to ensure a smooth user experience:

  • Under 10 mins
  • Usable scripts for quick retrieval of the dataset
  • Explanations of discrepancies in the data if any

Retrieving the Dataset

Non-Parallel Setting

In the Non-Parallel setting, both the development set and the training set can be downloaded from the TyDiQA repository as .json files. The development set can be found here, while the training set can be found here. Once downloaded, the files can be read into datasets.Dataset classes as follows:
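The sketch below shows one way to do this. The downloaded files follow the SQuAD-style JSON layout (articles → paragraphs → question-answer pairs), so we flatten them before constructing the Dataset. The file paths are placeholders for wherever you saved the files:

```python
import json
from datasets import Dataset

def squad_json_to_dataset(path: str) -> Dataset:
    """Flatten a SQuAD-style TyDiQA-GoldP .json file into a datasets.Dataset."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)["data"]

    examples = []
    for article in raw:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                examples.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    "answers": {
                        "text": [a["text"] for a in qa["answers"]],
                        "answer_start": [a["answer_start"] for a in qa["answers"]],
                    },
                })
    return Dataset.from_list(examples)

# Paths are placeholders -- point them at wherever you saved the downloaded files
train_ds = squad_json_to_dataset("tydiqa-goldp-v1.1-train.json")
dev_ds = squad_json_to_dataset("tydiqa-goldp-v1.1-dev.json")
```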

It’s worth noting that the non-parallel TyDiQA-GoldP dataset also exists on HuggingFace, and is duplicated in two separate locations! It can be downloaded from both the TyDiQA HuggingFace dataset repository and the XTREME HuggingFace dataset repository. The code for loading both as datasets.Dataset classes is shown below (personally I prefer the XTREME one because it is faster…):
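A minimal sketch, using the tydiqa dataset (its secondary_task configuration corresponds to GoldP) and the xtreme dataset (with the tydiqa configuration) on the HuggingFace Hub:

```python
from datasets import load_dataset

# GoldP is exposed as the "secondary_task" configuration of the tydiqa dataset
tydiqa_train = load_dataset("tydiqa", "secondary_task", split="train")
tydiqa_val = load_dataset("tydiqa", "secondary_task", split="validation")

# The same data via the XTREME repository (my preferred option, as it loads faster)
xtreme_train = load_dataset("xtreme", "tydiqa", split="train")
xtreme_val = load_dataset("xtreme", "tydiqa", split="validation")
```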

It’s worth noting that while the raw format from the .json files does not match that from the HuggingFace data, the datasets are identical. Both of the datasets in their raw format mix all the languages. We will see in the "Processing the Dataset" section how to create separate datasets for each language.

Parallel Setting

The dataset can only be downloaded from the XTREME repository, specifically here. Do NOT use the version that exists on the HuggingFace XTREME repository, as that is for the Non-Parallel setting only (I learnt this the hard way…).
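Once the translated training files are downloaded, they can be read in the same way as the flat files above, since they also follow the SQuAD-style layout. The filenames below are assumptions based on my local copies, so adjust them to whatever you downloaded:

```python
# Reuse squad_json_to_dataset() from the non-parallel section.
# Filenames are assumptions -- rename them to match the files you downloaded.
TRANSLATED_LANGS = ["ar", "bn", "fi", "id", "ko", "ru", "sw", "te"]

parallel_train = {
    lang: squad_json_to_dataset(f"tydiqa.translate.train.en-{lang}.json")
    for lang in TRANSLATED_LANGS
}

# The English side of the parallel data is the original English training subset,
# which can be obtained by filtering the non-parallel training set to English
# (see the "Processing the Dataset" section below).
```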

Validation data: note that while there are discrepancies in the training data, this is not the case for the validation data. Firstly, the validation data has no "parallel" setting. The validation subsets from the TyDiQA files (there called "dev") and from the XTREME/TyDiQA HuggingFace repositories are all identical. Therefore the easiest way to get the validation data is to use the functions for the non-parallel setting and specify "validation" as the split. Note that the translate-test data from the XTREME GitHub repo is NOT to be confused with the validation data.

Processing the Dataset

After retrieving the datasets, I ran some simple validation checks. These were:

  • Ensuring that there are no empty questions, contexts or answers
  • Ensuring that there is no more than one answer for the training subsets
  • Checking that the IDs are unique for each dataset

Thankfully, these tests passed for both the non-parallel setting and the parallel setting.
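For reference, here is a minimal sketch of these checks, written against the column names used by the HuggingFace copies of the dataset (id, question, context, answers):

```python
def check_no_empty_fields(ds):
    """Fail if any question, context or answer text is empty."""
    for ex in ds:
        assert ex["question"].strip(), f"empty question for id {ex['id']}"
        assert ex["context"].strip(), f"empty context for id {ex['id']}"
        assert all(t.strip() for t in ex["answers"]["text"]), f"empty answer for id {ex['id']}"

def check_single_answer(ds):
    """Fail if any example has more than one answer (training subsets only)."""
    for ex in ds:
        assert len(ex["answers"]["text"]) <= 1, f"multiple answers for id {ex['id']}"

def check_unique_ids(ds, name=""):
    """Report how many IDs are unique; fail if there are duplicates."""
    ids = ds["id"]
    print(f"{name}: {len(set(ids))}/{len(ids)} ids unique")
    assert len(ids) == len(set(ids)), "duplicate ids found"
```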

Non-Parallel Setting

This section is optional. It is only useful if you wish to split the dataset by language, keeping in mind that the dataset is not parallel.
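The sketch below splits the mixed-language dataset into one Dataset per language. It assumes that each example’s id string is prefixed with the language name (e.g. finnish-…), which is what I observed in the copies I downloaded:

```python
LANGUAGES = ["english", "arabic", "bengali", "finnish", "indonesian",
             "korean", "russian", "swahili", "telugu"]

def split_by_language(ds):
    """Split a mixed-language TyDiQA-GoldP dataset into one Dataset per language."""
    return {
        lang: ds.filter(lambda ex, lang=lang: ex["id"].startswith(lang))
        for lang in LANGUAGES
    }

per_language = split_by_language(xtreme_train)
```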

Parallel Setting

For this setting, I ran two extra tests to ensure that the data is indeed parallel:

  • Checking the dataset sizes against those reported in the literature
  • Ensuring that the dataset sizes are the same for each language

Unfortunately, both these tests failed. For the latter, I got the following dataset sizes for each language:

bn: 3585/3585 ids unique
fi: 3670/3670 ids unique
ru: 3394/3394 ids unique
ko: 3607/3607 ids unique
te: 3658/3658 ids unique
sw: 3622/3622 ids unique
id: 3667/3667 ids unique
ar: 3661/3661 ids unique

My best guess for why there are missing data points is that the translation process itself can cause errors. This is because the question answering task is not exactly trivial, and a direct translation may provide question-answer pairs that no longer match, and thus those examples are discarded. After matching the IDs to find the total number of truly parallel examples, I was left with 3150 data points, meaning that a good 15% of the dataset is lost (from the perspective of parallel data).

What I found concerning was that the size of the validation set for TyDiQA-GoldP does not seem to match any of the numbers reported in the XTREME paper. Firstly, it is alleged that the dataset has both a "dev" set and a "test" set; however, nowhere on the XTREME GitHub repo can this be found. Secondly, the sizes of the "validation" dataset do not match those reported for "dev" and "test". This is an open issue that I have raised on their GitHub page.

That being said, the functions for finding common instances and for checking if there are any empty instances are given below:
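Here is a minimal sketch of both, operating on the per-language dictionary of Datasets produced earlier (the column names are again those used by the HuggingFace copies):

```python
def find_common_ids(datasets_by_lang):
    """Return the set of IDs present in every language subset."""
    return set.intersection(*(set(ds["id"]) for ds in datasets_by_lang.values()))

def keep_common_instances(datasets_by_lang):
    """Filter every language subset down to the truly parallel instances."""
    common = find_common_ids(datasets_by_lang)
    return {
        lang: ds.filter(lambda ex: ex["id"] in common)
        for lang, ds in datasets_by_lang.items()
    }

def has_empty_instances(ds):
    """Return True if any question, context or answer in the dataset is empty."""
    return any(
        not ex["question"].strip()
        or not ex["context"].strip()
        or not all(t.strip() for t in ex["answers"]["text"])
        for ex in ds
    )
```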

(Optional – Only if you want to use the dataset as part of the PyTorch class provided in the article)

We can save the processed dataset to be used later by a PyTorch Data Class.
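A short sketch of this step; the output directory name is arbitrary:

```python
import os

def save_parallel_datasets(datasets_by_lang, out_dir="tydiqa_goldp_parallel"):
    """Save each language subset to disk so it can be reloaded later."""
    os.makedirs(out_dir, exist_ok=True)
    for lang, ds in datasets_by_lang.items():
        ds.save_to_disk(os.path.join(out_dir, lang))

# e.g. after filtering down to the truly parallel instances
save_parallel_datasets(keep_common_instances(parallel_train))
```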

Using the Dataset

In this section I provide the tokenisation parameters (and code) for TyDiQA, as well as a PyTorch Dataset class (only for the parallel case) that allows direct use in a training loop. I also provide an academic and a practical use case for the TyDiQA-GoldP dataset.

Tokenising the Dataset

Since our problem is Extractive Question Answering, we need to do some processing on each example before tokenising. Mainly, we must be careful not to truncate the answer out of a context. As a result, when providing a max length we also need to provide a stride: very long contexts are split into multiple features, so that at least one of them contains the full answer. We also set the tokeniser parameter truncation to "only_second" to ensure that only the context gets truncated. We specify max_length to be 384 and stride to be 128, taken directly from the XTREME GitHub repository. We also need to make sure that the training examples are processed differently to the validation examples. The functions for doing this are provided below:
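The sketch below follows the standard HuggingFace recipe for preparing extractive QA training features; the mBERT checkpoint is just an example choice, swap in whichever multilingual model you are fine-tuning. Validation examples are prepared along the same lines, except that the offset mappings and example IDs are kept (instead of the start/end labels) so that predicted spans can be mapped back to the original text.

```python
from transformers import AutoTokenizer

# Example checkpoint -- replace with the model you intend to fine-tune
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

MAX_LENGTH = 384   # from the XTREME repository
STRIDE = 128       # from the XTREME repository

def prepare_train_features(examples):
    """Tokenise question/context pairs and map answer spans to token positions."""
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",        # only ever truncate the context
        max_length=MAX_LENGTH,
        stride=STRIDE,
        return_overflowing_tokens=True,  # long contexts become several features
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = tokenized.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized.pop("offset_mapping")
    start_positions, end_positions = [], []

    for i, offsets in enumerate(offset_mapping):
        answer = examples["answers"][sample_map[i]]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])

        # Find the start/end of the context within this feature
        sequence_ids = tokenized.sequence_ids(i)
        ctx_start = sequence_ids.index(1)
        ctx_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

        # If the answer is not fully inside this feature, label it (0, 0)
        if offsets[ctx_start][0] > start_char or offsets[ctx_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            token_start = ctx_start
            while token_start <= ctx_end and offsets[token_start][0] <= start_char:
                token_start += 1
            start_positions.append(token_start - 1)

            token_end = ctx_end
            while token_end >= ctx_start and offsets[token_end][1] >= end_char:
                token_end -= 1
            end_positions.append(token_end + 1)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    return tokenized

tokenized_train = train_ds.map(prepare_train_features, batched=True,
                               remove_columns=train_ds.column_names)
```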

Dataset Class for PyTorch Training Loop

The following is code that prepares the TyDiQA-GoldP dataset (from the preprocessed source) for training in a PyTorch style loop.
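A minimal sketch, assuming the split was saved to disk earlier (the directory name matches the one used in the saving sketch above) and is tokenised here with prepare_train_features from the previous section:

```python
import torch
from torch.utils.data import Dataset as TorchDataset, DataLoader
from datasets import load_from_disk

class TyDiQAGoldPTorchDataset(TorchDataset):
    """A thin PyTorch wrapper around a tokenised TyDiQA-GoldP split."""

    def __init__(self, path: str):
        raw = load_from_disk(path)
        self.data = raw.map(prepare_train_features, batched=True,
                            remove_columns=raw.column_names)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        ex = self.data[idx]
        return {
            "input_ids": torch.tensor(ex["input_ids"]),
            "attention_mask": torch.tensor(ex["attention_mask"]),
            "start_positions": torch.tensor(ex["start_positions"]),
            "end_positions": torch.tensor(ex["end_positions"]),
        }

# Plug straight into a DataLoader for a PyTorch-style training loop
train_loader = DataLoader(
    TyDiQAGoldPTorchDataset("tydiqa_goldp_parallel/en"),
    batch_size=16,
    shuffle=True,
)
```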

Academic Use Case: Pushing Your QA Models to Their Limit

TyDiQA-GoldP is difficult because of the way it was created, and also because of the selection of languages (e.g. it has low resource languages like Swahili and Telugu). This makes it an excellent choice for evaluating the cross-lingual performance of your QA models.

However, it’s worth noting that because of the open issues raised above, reproducing the results you see in the literature may involve some trial and error, since it is unclear which state of the data was used to obtain them.

Practical Use Case: TyDiQA-GoldP Fine-Tuned Question Answering

The original TyDiQA-GoldP dataset is useful for fine-tuning for 2 reasons: a) the dataset is fairly large and b) it is difficult. What’s more, it contains a very diverse set of languages. Aside from covering 7 language families and 5 scripts as mentioned in the introduction, the languages in this dataset cover a wide array of interesting linguistic phenomena, such as [4]:

  • Diacritics: symbols on letters that determine pronunciation. Example from TyDiQA-GoldP: Arabic
  • Extensive Compounding: combinations of multiple words, e.g. Note+book=Notebook. Example from TyDiQA-GoldP: Telugu
  • Bound words: words that are syntactically independent, but phonologically dependent, e.g. it’s = it is. Example from TyDiQA-GoldP: Bengali
  • Inflection: modification of a word to express grammatical information, e.g. sang, sing, sung. Example from TyDiQA-GoldP: Russian
  • Derivation: creation of a noun from a verb, e.g. slow → slowness. Example from TyDiQA-GoldP: Korean

Concluding Remarks

  • TyDiQA-GoldP is a multilingual Extractive Question Answering dataset
  • By nature it is non-parallel, however a small parallel version based on the original English subset exists
  • The non-parallel dataset has between 1636 and 14805 data points per language, while the parallel one has 3150
  • It covers 9 languages, spanning 5 scripts and 7 language families
  • It is a difficult task and dataset
  • It is a good introduction for people interested in multilingual question answering because of its size, but don’t expect very high scores!

Author’s Note

It personally took me a long time to identify which TyDiQA datasets to use for parallel training and evaluation. Having found no similar articles online, I decided to write this so that there is at least some reference that summarises the different sources of the TyDiQA dataset. I hope to keep this updated if I find answers to the open issues I’ve raised.

If you are interested in this line of work, please consider supporting me by getting a Medium membership using my referral link:

Join Medium with my referral link – yousefnami

This helps me as a portion of your membership fee comes to me (don’t worry, this is at no extra cost to you!) while giving you full access to all articles on Medium!

References

GitHub Repositories

TyDiQA

GitHub – google-research-datasets/tydiqa: TyDi QA contains 200k human-annotated question-answer…

XTREME

GitHub – google-research/xtreme: XTREME is a benchmark for the evaluation of the cross-lingual…

HuggingFace Repositories

TyDiQA

tydiqa · Datasets at Hugging Face

XTREME

xtreme · Datasets at Hugging Face

Reference List

[1] Clark J et al. TYDI QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. Available from: https://aclanthology.org/2020.tacl-1.30.pdf

[2] Artetxe et al. On the Cross-lingual Transferability of Monolingual Representations. Available from: https://arxiv.org/pdf/1910.11856.pdf

[3] Lewis et al. MLQA: Evaluating Cross-lingual Extractive Question Answering. Available from: https://arxiv.org/abs/1910.07475

[4] Hu et al. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. Available from: https://arxiv.org/abs/2003.11080

Declarations

  • The TyDiQA-GoldP dataset is available for use under the Apache 2.0 license (see Licensing Information on GitHub)
  • All images, tables and code by author unless specified otherwise
