
A Visual Guide to Low-Resource NLP

An overview of recent approaches that help you train NLP models if you only have limited amounts of labeled data.


Photo by Paolo Chiabrando on Unsplash

Deep neural networks are becoming omnipresent in natural language processing (NLP) applications. However, they require large amounts of labeled training data, which is often only available for English. This is a big challenge for many languages and domains where labeled data is limited.

In recent years, a variety of methods have been proposed to tackle this situation. This article gives an overview of approaches that help you train NLP models in resource-lean scenarios, including both ideas to increase the amount of labeled data and methods following the popular pre-train-and-fine-tune paradigm. We want to give guidance that helps you decide which methods are applicable to your low-resource scenario.

This post is based on our recent survey on low-resource NLP, where you can find additional details and further discussions of open issues (but more text and fewer colorful images 😉 ).

Dimensions of Resource Availability

When discussing low-resource scenarios, usually the focus lies on the lack of labeled data. However, different types of data exist, and many low-resource methods have certain assumptions about the availability of specific data. Understanding one’s low-resource setting and the assumptions that particular methods impose is essential to select the best approach. We distinguish three dimensions of resource availability:

  • Task-specific labeled data: the most prominent dimension. Requires a domain expert to manually annotate instances, which is often both time- and cost-intensive.
  • Unlabeled language- or domain-specific text: usually easier to obtain but can be scarce for certain resource-lean scenarios. With most modern NLP approaches based on some form of pre-trained embedding, it has become an important resource to consider.
  • Auxiliary data: while less dominant in "normal" NLP, most of the low-resource approaches require some form of auxiliary data. This can be, e.g., labeled data in a different language, a knowledge base, or a machine translation tool. It is essential to take this into consideration as insights from one low-resource scenario might not be transferable to another one if a method’s assumptions on the auxiliary data are broken.

We will now give an overview of current approaches for low-resource scenarios.

Data Augmentation

In data augmentation, we start from a small amount of labeled data and create more by taking an existing instance and changing its features without changing the label. This is a popular technique in computer vision where, e.g., rotating an image of a cat does not change the label cat.

In the example below, we have a sentiment task. We can augment the sentence "the film is great" by, e.g., replacing some of the words with synonyms like "movie" or "awesome." This keeps the sentiment label unchanged. The new sentences still have a positive sentiment.

Image by Author
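As a rough illustration of this idea, here is a minimal Python sketch of synonym-replacement augmentation. The tiny synonym dictionary and the example sentence are made up for this post; in practice, synonyms would come from a resource such as WordNet or from word embeddings.

    import random

    # Tiny, hand-written synonym dictionary for illustration only;
    # in practice, synonyms would come from WordNet, embeddings, etc.
    SYNONYMS = {
        "film": ["movie"],
        "great": ["awesome", "fantastic"],
    }

    def augment(sentence, label, num_new=2):
        """Create new labeled instances by replacing words with synonyms.
        The label stays unchanged: the sentiment does not depend on the
        exact word choice."""
        tokens = sentence.split()
        new_instances = []
        for _ in range(num_new):
            new_tokens = [random.choice(SYNONYMS[t]) if t in SYNONYMS else t
                          for t in tokens]
            new_instances.append((" ".join(new_tokens), label))
        return new_instances

    print(augment("the film is great", "positive"))
    # e.g. [('the movie is awesome', 'positive'), ('the movie is fantastic', 'positive')]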

Amit Chaudhary gives a nice, visual survey on different techniques to augment NLP datasets (and his article was an inspiration for this post).

Weak Supervision

Weak supervision takes unlabeled data and assigns labels to it through a (semi-)automatic process. Different methods exist for obtaining these automatic annotations. Some are specific to a task, and others can be applied more generally.

a) Distant Supervision

Distant supervision is a popular method for tasks like Named Entity Recognition or Relation Extraction (Mintz et al., 2009). An external knowledge base is needed as auxiliary data. This can be just a list of names but also a more complex knowledge base like Wikidata.

The tokens in the unlabeled text are automatically mapped to the knowledge base. In the example below, we want to label locations automatically. A list of city names is obtained (e.g. from Wikipedia). If a sequence of tokens matches an entry in the list, it is assigned the corresponding label; in this case LOCATION.
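The matching itself can be as simple as a gazetteer lookup. The following sketch uses a hand-picked city list standing in for a real knowledge base and assigns LOCATION to any token span found in the list:

    # Small gazetteer standing in for a knowledge base such as Wikidata.
    CITIES = {("Mexico", "City"), ("Berlin",), ("Lagos",)}
    MAX_LEN = max(len(city) for city in CITIES)

    def distantly_label(tokens):
        """Label token spans that match the gazetteer as LOCATION, the rest as O."""
        labels = ["O"] * len(tokens)
        i = 0
        while i < len(tokens):
            for length in range(min(MAX_LEN, len(tokens) - i), 0, -1):
                if tuple(tokens[i:i + length]) in CITIES:  # longest match first
                    labels[i:i + length] = ["LOCATION"] * length
                    i += length
                    break
            else:  # no match starting at position i
                i += 1
        return labels

    print(distantly_label("She flew to Mexico City yesterday".split()))
    # ['O', 'O', 'O', 'LOCATION', 'LOCATION', 'O']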

b) Labeling Rules

Many insights from domain experts can be expressed in simple rules. The expert can write down a small set of rules, and the rules are then used to annotate the unlabeled data automatically. This can be a more efficient (and interesting) use of an expert’s time compared to having to manually label a large amount of instances.

In the example below, we wrote a rule for finding date terms in text. A series of 8 digits, separated by two points, will often match dates, like 14.03.1879. Similar date rules have been used, e.g., in (Strötgen & Gerz, 2013) for temporal tagging in various languages or (Hedderich et al., 2020) for detecting dates in Hausa and Yorùbá. Different applications of such labeling rules, labeling functions, or heuristics can be found, e.g., in Snorkel (Ratner et al., 2020).
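Such a rule can be written in a few lines, e.g., as a regular expression. The pattern below is only a simplified illustration of this idea:

    import re

    # Two digits, a point, two digits, a point, four digits -> often a date.
    DATE_RULE = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")

    def label_dates(text):
        """Return (start, end, label) spans produced by the date rule."""
        return [(m.start(), m.end(), "DATE") for m in DATE_RULE.finditer(text)]

    print(label_dates("Albert Einstein was born on 14.03.1879 in Ulm."))
    # [(28, 38, 'DATE')]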

c) Cross-Lingual Projections

If a task is well supported in one language but not in another, cross-lingual projections can be used (Yarowsky et al., 2001). In the example below, we assume Spanish is a low-resource language, and we only have a named-entity-recognition tool for English. We translate the sentence from Spanish to English. In English, our tool recognizes that Mexico City is a location. Through the translation, we know that Mexico City in English is Ciudad de México in Spanish (so-called alignment). We can, therefore, give Ciudad de México also the label LOCATION.

To obtain texts in two languages, one can use parallel corpora where sentences are already aligned, e.g., OPUS or JW300. Alternatively, one can use machine translation if support for the pair of languages exists.
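Conceptually, the projection step only needs the source-side labels and a word alignment. The sketch below hard-codes both for the Mexico City example; in a real pipeline, the labels would come from the English NER tool and the alignment from an alignment tool or the machine translation system itself.

    def project_labels(src_labels, alignment, num_tgt_tokens):
        """Project token labels from the source sentence to the target
        sentence via a word alignment (list of (src_index, tgt_index) pairs)."""
        tgt_labels = ["O"] * num_tgt_tokens
        for src_i, tgt_i in alignment:
            if src_labels[src_i] != "O":
                tgt_labels[tgt_i] = src_labels[src_i]
        return tgt_labels

    # English: "I live in Mexico City"   (labels from an English NER tool)
    # Spanish: "Vivo en Ciudad de México"
    src_labels = ["O", "O", "O", "LOCATION", "LOCATION"]
    # Hand-written alignment for illustration; normally produced automatically.
    alignment = [(0, 0), (1, 0), (2, 1), (3, 4), (4, 2), (4, 3)]
    print(project_labels(src_labels, alignment, num_tgt_tokens=5))
    # ['O', 'O', 'LOCATION', 'LOCATION', 'LOCATION']  -> "Ciudad de México"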

d) Noise Handling

While weak supervision allows obtaining labeled data quickly and automatically, the quality of the labels is usually lower compared to fully manually annotated data. To mitigate the negative effects of incorrect labels, a variety of label-noise handling methods have been proposed. An additional model can be trained, e.g., to detect and filter incorrect labels or to give them a lower weight. Alternatively, the noise itself can be modeled to clean the labels, or a noise-robust type of model can be used.
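As one simple, hypothetical variant of such filtering, weakly labeled instances can be dropped whenever a reference model, e.g., trained on a small clean set, confidently disagrees with the weak label:

    def filter_noisy_labels(instances, predict, threshold=0.9):
        """Keep a weakly labeled instance unless a reference model
        (e.g., trained on a small clean dataset) confidently predicts
        a different label. `predict` returns (label, confidence)."""
        cleaned = []
        for text, weak_label in instances:
            predicted, confidence = predict(text)
            if predicted == weak_label or confidence < threshold:
                cleaned.append((text, weak_label))
        return cleaned

    # Toy stand-in for a trained classifier, just for illustration.
    def toy_predict(text):
        return ("positive" if "great" in text else "negative", 0.95)

    weak_data = [("the film is great", "positive"),
                 ("the plot is boring", "positive")]  # second label is noisy
    print(filter_noisy_labels(weak_data, toy_predict))
    # [('the film is great', 'positive')]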

Pre-Trained Language Representations

Distant supervision and data augmentation generate and extend task-specific training data. In contrast, a strong focus of recent work in NLP lies in pre-trained language representations that are trained on unlabeled data. These methods reduce the need for labeled target data by transferring learned representations and models.

a) Pre-Trained Transformers

Transformer models, such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019), are trained on large-scale text corpora with language modeling objectives, e.g., predicting masked words or whether one sentence follows another, to create context-aware word representations. These models are particularly helpful for low-resource languages for which large amounts of unlabeled data are available, but task-specific labeled data is scarce (Cruz and Cheng, 2019).
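In practice, fine-tuning such a model on a small labeled dataset can look like the following minimal sketch, assuming the Hugging Face transformers library is installed:

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Load a pre-trained encoder and put a fresh classification head on top.
    model_name = "bert-base-cased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Tokenize the (small) labeled dataset and fine-tune the whole model,
    # e.g., with the transformers Trainer class or a standard PyTorch loop.
    inputs = tokenizer(["the film is great"], padding=True, return_tensors="pt")
    logits = model(**inputs).logits
    print(logits.shape)  # (1, 2): one score per sentiment class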

Jay Alammar provides a detailed description with nice visualizations of BERT and related language models in his blog post: The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning).

b) Domain-specific Pre-Training

The language of a specialized text domain can differ tremendously from what is considered the standard language. Thus, many text domains are low-resource as well. However, the majority of recent language models are pre-trained on general-domain data, such as news or web texts, which can lead to a so-called "domain gap" when they are applied to a different domain.

One solution to overcome this gap is to adapt to the target domain, either by fine-tuning the language model on in-domain text or by training a new domain-specific language model from scratch. Popular publicly available domain-adapted models include BioBERT (Lee et al., 2020), ClinicalBERT (Alsentzer et al., 2019), and SciBERT (Beltagy et al., 2019).
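Both routes are straightforward with current libraries. The sketch below assumes the Hugging Face transformers library; the SciBERT model identifier is the one listed on the model hub and may change over time.

    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # Option 1: start from an existing, publicly available domain-adapted checkpoint.
    name = "allenai/scibert_scivocab_uncased"  # identifier assumed; check the model hub
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name)

    # Option 2: adapt a general-domain model yourself by continuing its
    # masked-language-modeling pre-training on unlabeled in-domain text
    # (e.g., with DataCollatorForLanguageModeling and the Trainer class),
    # then fine-tune the adapted model on the target task.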

c) Multilingual Language Models

In cross-lingual settings, no task-specific labeled data is available in the low-resource target language. Instead, labeled data from a high-resource language is leveraged. A multilingual model can be trained on the target task in a high-resource language and, afterward, applied to the unseen target languages.

This usually requires the training of multilingual language representations by training a single model for many languages, such as multilingual BERT (Devlin et al., 2019) or XLM-RoBERTa (Conneau et al., 2020). These models are trained using unlabeled, monolingual corpora from different languages and can be used in cross- and multilingual settings due to the many languages seen during pre-training.
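A minimal sketch of this zero-shot cross-lingual transfer with XLM-RoBERTa, again assuming the Hugging Face transformers library, could look as follows:

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # One multilingual encoder covering roughly 100 languages.
    model_name = "xlm-roberta-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Step 1: fine-tune on labeled data in a high-resource language (e.g., English),
    #         exactly as in the monolingual case.
    # Step 2: apply the fine-tuned model directly to the low-resource target
    #         language, without any labeled target-language data (zero-shot).
    inputs = tokenizer("La película es genial", return_tensors="pt")
    print(model(**inputs).logits.shape)  # (1, 2)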

Conclusion

In addition to the above-mentioned methods, there exist other exciting approaches to tackle low-resource NLP, with many more to come in the future. For a more detailed overview of current methods and references but also discussions on open issues, we would like to refer to our NAACL paper A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios. Based on these insights, we are excited about how things will develop in the future and which other aspects of low-resource NLP will be addressed.

Written by Michael A. Hedderich and Lukas Lange.

Citation

If you found this work useful and want to refer to it in academic contexts, please cite the underlying NAACL paper as:

@inproceedings{hedderich-etal-2021-survey,
    title = "A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios",
    author = {Hedderich, Michael A. and Lange, Lukas and Adel, Heike and Str{\"o}tgen, Jannik and Klakow, Dietrich},
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    year = "2021",
    doi = "10.18653/v1/2021.naacl-main.201"
}
