
Unlocking inclusivity possibilities with Polyglot-NER

Fundamentals of massive multilingual Named Entity Recognition

Thoughts and Theory

Image by Nick Fewings via Unsplash

Introduction

Named Entity Recognition (NER) advances have come fast and furious in recent years, with new models made available regularly. It’s a thrilling space to watch and to be a part of. Just recently, my colleagues Sybren Jansen and Stéphan Tulkens explored the advantages and limitations of biomedical named entity recognition (BioNER), as an example.

There is still a major challenge when building trustworthy AI leveraging NER capabilities, however: how to accommodate the world’s diverse languages. The vast majority of NER models, training sets, data, and open-source code are in English. Yet of the world’s nearly 8 billion people, only 375 million are native English speakers. The English bias prevalent in NER effectively leaves a considerable part of the world out of this important technological advancement.

Polyglot (multilingual) NER models offer the possibility of a bridge between languages.

In this first blog post, I will explain why polyglot models matter and look at specific polyglot models in more detail. In subsequent posts, I’ll dive more deeply into tests and results.

Enter: The Polyglot

There are an estimated 7,000 active languages in the world. [1] NLP specialists typically divide languages into so-called Low, Medium and High resource languages. A High resource language, such as English, is one for which large amounts of labeled and unlabeled text are available online across different genres. A Low resource language is one for which very little data is available, such as Yoruba. A Medium resource language falls between these two categories.

In NLP, most research is done on English and other High resource languages. As such, there is an asymmetry in reaping the benefits of language technology. [2]

There are two common terms used to describe models that work across different languages: multilingual and polyglot. In this field, specialists also use the phrase ‘cross-lingual learning’, the process of transferring knowledge learned from one language to another. This technique is used to train polyglot models and can be leveraged to perform better on lower resource languages.

Named Entity Recognition (NER)

We will explore polyglot models through the subtask of NER. In this task, a model must detect which tokens in a text refer to a Named Entity (NE). The definition of an NE varies between domains, but the simplest definition is a token or set of tokens that refers to a specific thing in the real world. This is not entirely accurate, however: the character ‘Harry Potter’ is also an NE, even though he does not exist in the real world.
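To make the task concrete, here is a minimal sketch of what a NER system is asked to produce: each token of the input paired with a tag in the common BIO scheme. The tag names and the small helper below are illustrative only; real datasets define their own label sets.

```python
# Illustrative only: the input/output shape of the NER task in the common BIO scheme.
# B- marks the beginning of an entity, I- a continuation, O a token outside any entity.
tokens = ["Harry", "Potter", "studied", "at", "Hogwarts", "."]
tags   = ["B-PER", "I-PER",  "O",       "O",  "B-ORG",    "O"]

def spans(tokens, tags):
    """Recover entity spans from a BIO tag sequence."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

print(spans(tokens, tags))  # ['Harry Potter', 'Hogwarts']
```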

A strict definition of an NE is not necessary, as which tokens count as an NE varies per domain and dataset. For example, in the medical domain a specific condition like pneumonia can be considered an NE, but in a business domain it would not be.

The NER task becomes more difficult when dealing with multiple languages, as different languages have different conventions for indicating an NE. In English, for example, words starting with a capital letter mid-sentence are likely to be an NE. In German this is less reliable, since all nouns start with a capital letter but not every noun is an NE. Moreover, many languages do not use the Latin script and make no distinction between capital and lower-case letters. These differing conventions are an additional challenge specific to the polyglot setting.

Applying NER: Anonymization

Being able to identify Named Entities in 100 different languages is different from conversing in 100 different languages. Yet it is nonetheless a useful and impressive skill. The domain of anonymization can derive significant value from NER models. With the increasing trend of open data [3], anonymization plays a key role, as documents often need to be anonymized before being published. Additionally, the public increasingly demands a transparent government, while government officials raise concerns about privacy. The trade-off between transparency and privacy is a valid concern, and anonymization through NER can help soften its negative side.
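As a rough sketch of how NER output can drive anonymization, the snippet below redacts the character spans a NER model returns. The entity dictionaries mirror the character-offset format a typical Hugging Face NER pipeline produces; the redact() helper is hypothetical, not part of any library.

```python
# A minimal sketch of NER-based redaction (hypothetical helper, not a library function).
def redact(text, entities, mask="[REDACTED]"):
    # Work right-to-left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + mask + text[ent["end"]:]
    return text

entities = [
    {"start": 0, "end": 10, "entity_group": "PER"},   # "Jane Smith"
    {"start": 34, "end": 40, "entity_group": "LOC"},  # "Lisbon"
]
print(redact("Jane Smith filed the complaint in Lisbon.", entities))
# -> "[REDACTED] filed the complaint in [REDACTED]."
```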

Several other use cases could benefit from NER-based anonymization of documents. The European Union (EU) has expressed interest in the anonymization of court and medical documents and has funded the MAPA project [4]. This project aims to create a tool to anonymize medical and legal text in all 24 official EU languages and is still in a developmental phase.

Currently Available Models

There are several models available for use today. First, a monolingual Brazilian Portuguese model by Luz et al. [5] has been trained on the LeNER-Br dataset, which was developed specifically to train this model. LeNER stands for Legal Named Entity Recognition, and Br signifies that the data is in Brazilian Portuguese. The LeNER-Br dataset consists of 66 legal documents that are manually annotated with specific legal entities. These are not limited to the standard named entities like persons, organizations, and locations; while those types are included, the dataset also contains annotations for specific legal entities and jurisprudence, leading to a total of 6 classes. The model trained on this data is LSTM-CRF based and uses word embeddings as input. In their paper, the authors report an average F-score of 92.5 across the 6 classes.

Additionally, a polyglot model named XLM-R by Conneau et al. [6] is available for use. XLM-R is a Transformer [7] based model and is a combination of XLM [8] and RoBERTa [9]. XLM-R was trained on a hundred languages using a Masked Language Modelling approach. The included languages and the amount of data per language are shown in the table below.

Table 1: Training data of XLM-R model, taken from original paper [6]

The XLM-R model has been evaluated on four tasks, one of which is NER. While the model is trained on a hundred languages, the original paper reports NER scores for only four: English, Dutch, German, and Spanish, evaluated on the CoNLL-2003 NER dataset [10]. This is disappointing, because the model performed well on the other tasks, even for lower resource languages. It would be interesting to see its NER performance evaluated on more languages.

Table 2: Performance of XLM-R on NER compared to other models, taken from original XLM-R paper [6]

In the table above we see the NER results reported in the original XLM-R paper. Multilingual BERT (M-BERT) [11], one of the most established polyglot models, was used as a comparison for XLM-R. The top two models serve as baselines and are both monolingual. For these four languages, XLM-R outperforms M-BERT on all of them and beats the monolingual baseline on 2 out of 4.

Luckily, we can go beyond these four languages, because Plu [12] fine-tuned XLM-R on 40 languages and performed NER experiments on all of them; the model is available on Hugging Face [12]. It was fine-tuned on the PAN-X dataset [13], also known as the WikiAnn dataset, which makes use of Wikipedia data. Wikipedia has articles in over 295 different languages, and hand-labelling all these pages would be a very resource-intensive task (hand-labels are known as a gold standard). Instead, Pan et al. [13] created a so-called silver standard Named Entity corpus by leveraging Wikipedia page markup, in which Named Entities are often linked to other pages. Performance varies considerably depending on the language, but overall the model performed well: for many languages an F-score close to .9 was reached. Impressively, these 40 languages come from vastly different language families.
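If you want to try this model yourself, a minimal sketch with the transformers library looks roughly like the following. I assume the jplu/tf-xlm-r-ner-40-lang checkpoint [12] loads with TensorFlow weights, as its name suggests, and the exact pipeline arguments may vary between library versions.

```python
from transformers import pipeline

# Sketch: load the 40-language NER model fine-tuned by Plu [12] on PAN-X/WikiAnn.
# framework="tf" because the checkpoint name suggests TensorFlow weights;
# aggregation_strategy="simple" merges word pieces back into whole entities.
ner = pipeline(
    "ner",
    model="jplu/tf-xlm-r-ner-40-lang",
    framework="tf",
    aggregation_strategy="simple",
)

# The same pipeline handles sentences in different languages.
print(ner("Angela Merkel besuchte Paris im Juni."))   # German
print(ner("Angela Merkel bezocht Parijs in juni."))   # Dutch
```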

Experiments to take XLM-R further

We’ve seen exciting results from XLM-R. But there are two ways to take this model further for NER:

  1. Domain change: The XLM-R model is only trained on three NE classes: persons, organizations, and locations. While XLM-R has shown itself capable of handling different languages, we don’t yet know whether it can handle different domains. It would be interesting to see the results when we switch the domain without any additional changes. The model may be so powerful that it can be applied cross-domain out of the box, which would make it very accessible.
  2. Anonymization: At the beginning of this post, we introduced the anonymization task and its importance. The good news is that we can apply a 3-class model to a 6-class dataset by transforming the task into a binary one: predicting whether something is an NE or not.

If you want to anonymize a document, you want to redact every NE in it. In some cases, however, it is useful to control which types of entities are anonymized and which are kept in place. For the experiment, we binarize the labels: predict a 1 for a token believed to be part of an NE and a 0 for a token that is not. This way we can apply the XLM-R model to this dataset.
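A minimal sketch of this binarization, assuming BIO-style tags on both sides (the label names below are illustrative, not taken from a specific dataset):

```python
# Map any entity tag to 1 and the outside tag "O" to 0, so a 3-class model's
# predictions can be compared against a 6-class dataset's gold labels.
def binarize(tags):
    return [0 if tag == "O" else 1 for tag in tags]

gold = ["O", "B-PER", "I-PER", "O", "B-LEGISLACAO", "O"]  # 6-class legal annotation (illustrative)
pred = ["O", "B-PER", "I-PER", "O", "B-ORG",        "O"]  # 3-class XLM-R prediction (illustrative)

print(binarize(gold))  # [0, 1, 1, 0, 1, 0]
print(binarize(pred))  # [0, 1, 1, 0, 1, 0] -> the class mismatch disappears after binarization
```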

In such an experiment, two things are being tested: the effect of binarization and whether the XLM-R model can also work in a cross-domain setting. To compare the monolingual model to the binarized XLM-R, the monolingual model’s output should be binarized in the same way.
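Once both models’ outputs are binarized, the comparison itself is a plain binary token-level evaluation. A minimal sketch using scikit-learn (assumed to be available) could look like this:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Token-level binary evaluation sketch: both sequences are binarized tags as above.
gold_bin = [0, 1, 1, 0, 1, 0]
pred_bin = [0, 1, 1, 0, 0, 0]  # the model missed one entity token

print("precision:", precision_score(gold_bin, pred_bin))  # 1.0
print("recall:   ", recall_score(gold_bin, pred_bin))     # 0.67
print("F1:       ", f1_score(gold_bin, pred_bin))         # 0.8
```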

Wrap-up and what’s next

We’ve now laid the foundation of polyglot-NER as a key technique to build more inclusive NLP. In the next blog post, we will see the results of the binary domain switch experiment. What do you think the results will be? And what other experiments would you like to see performed? I’d love to hear from you!

If you’d like to learn more about Slimmer AI and the latest research from our AI Innovation team Fellowship, see: https://medium.com/slimmerai/innovation/home.


References

[1] How many languages are there in the world? (2021, February 23). Retrieved from https://www.ethnologue.com/guides/how-many-languages

[2] Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020). The State and Fate of Linguistic Diversity and Inclusion in the NLP World. arXiv, 2004.09095. Retrieved from https://arxiv.org/abs/2004.09095v3

[3] Publications Office of the European Union., Capgemini Invent., & European Data Portal. (2020). Open data and privacy. Publications Office. https://doi.org/10.2830/532195

[4] Mapa – Multilingual Anonymisation for Public Administrations. (2021, June 03). Retrieved from https://mapa-project.eu

[5] Luz de Araujo, P. H., de Campos, T. E., de Oliveira, R. R. R., Stauffer, M., Couto, S., & Bermejo, P. (2018). LeNER-Br: A Dataset for Named Entity Recognition in Brazilian Legal Text. Computational Processing of the Portuguese Language. Springer. doi: 10.1007/978-3-319-99722-3_32

[6] Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., …Stoyanov, V. (2019). Unsupervised Cross-lingual Representation Learning at Scale. arXiv, 1911.02116. Retrieved from https://arxiv.org/abs/1911.02116v2

[7] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., …Polosukhin, I. (2017). Attention Is All You Need. arXiv, 1706.03762. Retrieved from https://arxiv.org/abs/1706.03762v5

[8] Lample, G., & Conneau, A. (2019). Cross-lingual Language Model Pretraining. arXiv, 1901.07291. Retrieved from https://arxiv.org/abs/1901.07291v1

[9] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., …Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv, 1907.11692. Retrieved from https://arxiv.org/abs/1907.11692v1

[10] Sang, E. T. K., & De Meulder, F. (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. ACL Anthology, 142–147. Retrieved from https://www.aclweb.org/anthology/W03-0419

[11] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv, 1810.04805. Retrieved from https://arxiv.org/abs/1810.04805v2

[12] jplu/tf-xlm-r-ner-40-lang · Hugging Face. (2021, June 22). Retrieved from https://huggingface.co/jplu/tf-xlm-r-ner-40-lang

[13] Pan, X., Zhang, B., May, J., Nothman, J., Knight, K., & Ji, H. (2017). Cross-lingual Name Tagging and Linking for 282 Languages. ACL Anthology, 1946–1958. doi: 10.18653/v1/P17-1178

