Opinion

Challenges in Developing Multilingual Language Models in Natural Language Processing (NLP)

Paul Barba
Towards Data Science
4 min read · Oct 22, 2020


Image from Lexalytics

One of the hallmarks of developing NLP solutions for enterprise customers and brands is that, more often than not, those customers serve consumers who don’t all speak the same language. While the majority of our customers run voice-of-the-customer (VoC), social listening and market research programs with a predominantly English-speaking consumer base, some serve consumers spanning more than 20 languages, and many have a heavy presence in the Nordic countries, Latin America and the Asia-Pacific region, not to mention other parts of Europe.

The challenge for NLP in other languages is that English is the language of the Internet, with nearly 300 million more English-speaking users than the next most prevalent language, Mandarin Chinese. Modern NLP requires lots of text — 16 GB to 160 GB depending on the algorithm in question (roughly 8–80 million pages of printed text) — written by many different writers and in many different domains. These disparate texts then need to be gathered, cleaned and placed into broadly available, properly annotated corpora that data scientists can access. Finally, at least a small community of deep learning professionals or enthusiasts has to do the work and make these tools available. Languages with larger, cleaner, more readily available resources will see higher-quality AI systems, and that will have a real economic impact in the future.

What results is a tiered system of language support that looks like the following:

  1. Cutting-edge models generally show up first to support English and sometimes Mandarin.
  2. Some time after those are released, large nations with many data scientists and linguists replicate the work natively. These include nations like Germany, France and Russia.
  3. For roughly the next 100 most popular languages, a single multilingual model is often trained to deliver passable results for the entire set at once.
  4. And for the rest of the world’s languages: AI doesn’t understand them, which precludes their speakers from benefiting from these new technologies.

Accuracy in NER and sentiment analysis

One way the industry has addressed the challenges of multilingual modeling is to translate from the target language into English and then perform the various NLP tasks on the translated text. If you’ve laboriously crafted a sentiment corpus in English, it’s tempting to simply translate incoming documents into English rather than redo that annotation work in every other language.
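To make that approach concrete, here is a minimal sketch of a translate-then-classify pipeline. It assumes the Hugging Face transformers library, the Helsinki-NLP/opus-mt-es-en translation model and the library’s default English sentiment classifier; these are illustrative choices, not the stack behind the work described here.

```python
# Minimal sketch of a "translate, then analyze" pipeline.
# Assumptions: the Hugging Face transformers library is installed, and the
# chosen models (Helsinki-NLP/opus-mt-es-en plus the default English
# sentiment model) stand in for whatever a production system would use.
from transformers import pipeline

# Spanish -> English machine translation
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

# English-only sentiment classifier applied to the translated text
sentiment = pipeline("sentiment-analysis")

def sentiment_via_translation(text_es: str) -> dict:
    """Translate a Spanish document into English, then score its sentiment."""
    english = translator(text_es)[0]["translation_text"]
    return sentiment(english)[0]

# Any nuance the translator drops is invisible to the classifier.
print(sentiment_via_translation("El producto es pequeño pero sorprendentemente potente."))
```

Whatever nuance is lost in the translation step never reaches the downstream classifier, which is exactly the failure mode discussed next.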

However, determining sentiment is part of the creative process of translating. Understanding that “pot calling the kettle black” is expressing negativity, and then coming up with a corresponding way of expressing the idea in another language, requires an understanding of sentiment in both languages. We also use cultural shorthands and conventions that have no direct analogues in other languages. And there are often multiple choices for translating an individual word: is something “pequeño” in Spanish “little”? Or “puny”? Or “diminutive”? Or perhaps even “cutesy”?

If your models were good enough to capture nuance while translating, they would also be good enough to perform the original task in the source language. More likely, they can’t capture that nuance, and your translation will not reflect the sentiment of the original document. Factual tasks, like question answering, are more amenable to translation approaches. Tasks requiring more nuance (predictive modelling, sentiment, emotion detection, summarization) are more likely to fail in languages other than English.

What the future holds for multilingual NLP

In the quest for the highest accuracy, models for non-English languages are trained less frequently. One promising solution in the open-source world is Google’s BERT, which offers an English-language model and a single “multilingual model” covering about 100 other languages. People are now releasing BERT models trained natively on other languages and seeing meaningful improvements (e.g., 0.928 vs. 0.906 F1 for NER). Still, in our own work we’ve seen significantly better results processing medical text through BERT in English than in Japanese. It’s likely that the Japanese training data contained too little specialized-domain content, but we expect this to improve over time.
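To illustrate what a single “multilingual model” looks like in practice, the sketch below loads the bert-base-multilingual-cased checkpoint through the Hugging Face transformers library (an assumed toolkit, not one named in this article) and runs sentences in three languages through the same weights and shared vocabulary.

```python
# Sketch of using Google's single multilingual BERT checkpoint via the
# Hugging Face transformers library (an assumed toolkit, chosen for brevity).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

# One checkpoint, one vocabulary, roughly 100 languages.
sentences = [
    "The service was excellent.",    # English
    "El servicio fue excelente.",    # Spanish
    "サービスは素晴らしかった。",      # Japanese: "The service was wonderful."
]

for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs)
    # Contextual token embeddings that a downstream NER or sentiment head would consume
    print(text, outputs.last_hidden_state.shape)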

Creating and maintaining natural language features is a lot of work, and having to do it over and over again, with new sets of native speakers to help each time, is an intimidating task. It’s tempting to just focus on a few particularly important languages and let them speak for the world. However, the world is not homogeneous. A company can have specific issues and opportunities in individual countries, and people speaking less-common languages are less likely to have their voices heard through any channels, not just digital ones.

It’s therefore in the interest of businesses to listen as carefully as possible in every language they serve, and in the interest of universities, governments and citizen scientists to contribute to the creation of text corpora and models that give speakers of their languages access to cutting-edge technology, now and in the future.
