XLM-RoBERTa: The alternative for non-English NLP

Are multilingual models closing the gap on single language models?

Branden Chan
deepset-ai

--

Tower of Babel (image from Wikipedia)

If you are doing NLP in a non-English language, you’ll often be agonising over the question “what language model should I use?” While there’s a growing number of monolingual models trained by the community, there’s also an alternative that seems to get less attention: multilingual models.

In this article, we highlight the key ingredients of the XLM-R model and explore its performance on German. We find that it outperforms our monolingual GermanBERT on three popular German datasets; while it is on par with SOTA on GermEval18 (hate speech detection), it significantly outperforms previous methods on GermEval14 (NER).

Why multilingual models?

XLM-RoBERTa comes at a time when there is a proliferation of non-English models such as Finnish BERT, French BERT (a.k.a. CamemBERT) and German BERT. Our interactions with researchers, as well as with contacts in industry, signal to us that there is a real need for cutting-edge NLP technologies that work on languages other than English.

We at deepset also see multilingual models as a great solution for companies who anticipate future expansion. In the past, we have worked with clients who currently operate in only one language but have ambitions to extend their services globally. For them, multilingual models are a form of future-proofing that ensures their existing NLP infrastructure can scale to however many regions they choose to do business in.

What’s New in XLM-RoBERTa?

The Facebook AI team released XLM-RoBERTa in November 2019 as an update to their original XLM-100 model. Both are transformer-based language models, both rely on the Masked Language Model objective, and both are capable of processing text from 100 separate languages. The biggest update that XLM-RoBERTa offers over the original is a significantly increased amount of training data. The cleaned CommonCrawl data that it is trained on takes up a whopping 2.5 TB of storage! It is several orders of magnitude larger than the Wiki-100 corpus that was used to train its predecessor, and the scale-up is particularly noticeable in the lower-resourced languages. The “RoBERTa” part comes from the fact that its training routine is the same as for the monolingual RoBERTa model; specifically, the sole training objective is the Masked Language Model. There is no Next Sentence Prediction à la BERT or Sentence Order Prediction à la ALBERT.

The increase in size of the CommonCrawl dataset over Wikipedia per language (from the XLM-RoBERTa paper)
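Since the only pretraining task is masked-language modelling, the easiest way to poke at the model is to let it fill in a blank. Below is a minimal sketch using the Hugging Face transformers library and the public xlm-roberta-base checkpoint (the example sentence is ours, not from the paper):

```python
# Minimal sketch: the pretrained MLM head of xlm-roberta-base filling in a
# masked token in a German sentence. Requires the `transformers` package.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# XLM-RoBERTa uses "<mask>" as its mask token
for prediction in fill_mask("Berlin ist die <mask> von Deutschland."):
    print(prediction["token_str"], round(prediction["score"], 3))
```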

XLM-RoBERTa now uses one large, shared SentencePiece model to tokenize text instead of the slew of language-specific tokenizers that XLM-100 relied on. Also, validation perplexity is no longer used as the stopping criterion during training, since the researchers found that downstream performance continues to improve even when perplexity does not.
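To illustrate, the sketch below (again assuming the Hugging Face transformers library and the xlm-roberta-base checkpoint) runs the same tokenizer over a German and an English sentence; both are split into pieces from the one shared SentencePiece vocabulary:

```python
# Minimal sketch: one shared SentencePiece tokenizer handles every language.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

print(tokenizer.tokenize("Maschinelles Lernen ist faszinierend."))
print(tokenizer.tokenize("Machine learning is fascinating."))
```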

Results

In the end, we evaluated XLM-RoBERTa on one classification and two NER tasks, where it showed very impressive performance. XLM-RoBERTa Large is on par with the best submission to GermEval18 (Classification). On GermEval14 (NER), the model outperforms Flair by 2.35% F1.

Results from our evaluation. Here is the leaderboard of GermEval18 and here are the reported scores from Flair.

These results were produced without extensive hyperparameter tuning, and we expect that they could improve with more tweaking of learning rates and batch sizes. Also, for the NER tasks, we believe there are gains to be made by adding a CRF layer on top of XLM-RoBERTa, as sketched below.
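We haven’t run this configuration ourselves, but as a rough sketch, here is what a token-classification head with a CRF on top of XLM-RoBERTa could look like. It assumes the Hugging Face transformers library and the pytorch-crf package; the class name, dropout value and label handling are illustrative:

```python
# Rough sketch: XLM-RoBERTa encoder + linear emission layer + CRF for NER.
# Assumes the `transformers` and `pytorch-crf` packages and a recent
# transformers version (model outputs expose .last_hidden_state).
from torch import nn
from torchcrf import CRF
from transformers import XLMRobertaModel


class XLMRCrfTagger(nn.Module):
    def __init__(self, num_tags: int, model_name: str = "xlm-roberta-base"):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)
        # Per-token tag scores ("emissions") fed into the CRF
        self.emissions = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        # labels: valid tag ids for every non-padding position
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        scores = self.emissions(self.dropout(hidden))
        mask = attention_mask.bool()
        if labels is not None:
            # The CRF returns a log-likelihood; negate it to get a training loss
            return -self.crf(scores, labels, mask=mask, reduction="mean")
        # Viterbi decoding returns the most likely tag sequence per sentence
        return self.crf.decode(scores, mask=mask)
```

Whether the CRF actually helps will depend on the dataset and the label scheme; we leave that for future experiments.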

Conclusion

The strength of these results shows that multilingual models can exhibit great performance even when evaluated on a single language, and we suggest that German NLP practitioners at least consider one of the XLM-RoBERTa variants when choosing a language model for their NLP systems. The importance of breaking the English-centric focus of NLP research has already been covered extensively by the likes of Professor Emily Bender, and we believe that research in non-English languages will only increase. We don’t find it at all inconceivable that the best models of the future might learn from text not only from different domains, but also from different languages.

Edit 17/02/20:

We had previously reported scores for CoNLL2003 that were incorrect due to a dataset issue.


Branden Chan
deepset-ai

ML Engineer at deepset.ai developing cutting edge NLP systems. || Twitter:@BrandenChan3 || LinkedIn: https://www.linkedin.com/in/branden-chan-59b291a8/