How to Detect and Translate Languages for an NLP Project

From Text Data in Multiple Languages to a Single Language

Davis David

Published in

Towards Data Science

6 min read · Jan 9, 2021

Image by PublicDomainPictures from Pixabay

This article was updated on 20th June 2022

For Spanish speakers, you can read the translated version of this article here

Happy new year to you! 2021 is here and you did it 💪. 2020 is now behind us, and even though it has been a tough and strange year for many people around the world, there’s still a lot to celebrate. In 2020, I learned that all we need is the love & support of our loved ones, family members, and friends.

“In the face of adversity, we have a choice. We can be bitter, or we can be better. Those words are my North Star.” - Caryn Sullivan

This will be my first article for 2021, and I will talk about some language challenges a data scientist or machine learning engineer can face while working on an NLP project, and how you can solve them.

Imagine you are a data scientist assigned to an NLP project that analyzes what people post on social media (e.g. Twitter) about COVID-19. One of your first tasks is to find the different hashtags for COVID-19 (e.g. #covid19) and then start collecting all tweets related to COVID-19.

When you start to analyze the collected data, you find out that it is written in different languages from around the world, such as English, Swahili, Spanish, Chinese, and Hindi. In this case, you have two problems to solve before you can analyze the dataset: the first is to identify the language of each piece of text, and the second is to translate the data into the language of your choice (e.g. all data should be in English).

So how can we solve these two problems?

First Problem: Language Detection

The first problem is to detect the language of a particular piece of text. In this case, you can use a simple Python package called langdetect.

langdetect is a simple Python package developed by Michal Danilák that supports detection of 55 different languages out of the box (ISO 639-1 codes):

af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he,
hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl,
pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw

Install langdetect

To install langdetect, run the following command in your terminal.

pip install langdetect

Basic Example

To detect the language of a text, e.g. “Tanzania ni nchi inayoongoza kwa utalii barani afrika”, first import the detect function from langdetect and then pass the text to it.
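The code embed from the original post is not reproduced here, so the following is a minimal sketch of this step and not the author's exact code. It wraps langdetect's detect function in a small helper. The helper name detect_language and its optional detector argument are assumptions made for illustration; the argument exists only so the helper can be tried with a stand-in detector.

```python
def detect_language(text, detector=None):
    """Return the language code detected for `text`, e.g. "sw" for Swahili.

    `detector` is a hypothetical hook added for illustration: any callable
    that maps a string to a language code. It defaults to langdetect.detect.
    """
    if detector is None:
        # Imported lazily so the third-party package is only required
        # when the default detector is actually used.
        from langdetect import detect  # pip install langdetect
        detector = detect
    return detector(text)
```

With langdetect installed, calling detect_language on the Swahili sentence above and printing the result corresponds to the step described here.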

Output: “sw”

The function detects that the text provided is in Swahili (‘sw’).

You can also find out the probabilities for the top languages by using the detect_langs function.
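A minimal sketch of how the probabilities could be read, assuming (as in the langdetect package) that each candidate returned by detect_langs exposes a lang and a prob attribute. The helper name top_languages and the ranker argument are hypothetical and only there for illustration.

```python
def top_languages(text, ranker=None):
    """Return (language_code, probability) pairs, most probable first."""
    if ranker is None:
        # Lazy import: only needed when the default ranker is used.
        from langdetect import detect_langs  # pip install langdetect
        ranker = detect_langs
    # Each candidate carries the language code (.lang) and its score (.prob).
    pairs = [(candidate.lang, candidate.prob) for candidate in ranker(text)]
    # Sort explicitly so the ordering does not depend on the ranker.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

For an unambiguous sentence the list typically holds a single pair with a probability close to 1, which is consistent with the output shown next.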

Output: [sw:0.9999971710531397]

NOTE: The language detection algorithm is non-deterministic. If you run it on a text that is either too short or too ambiguous, you might get different results every time you run it.

To enforce consistent results, set DetectorFactory.seed to a fixed value (e.g. 0) before running language detection.
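A minimal sketch of that setting. DetectorFactory.seed is the attribute documented in the langdetect README; the wrapper function name seed_language_detector is hypothetical and used here only to keep the import local.

```python
def seed_language_detector(seed=0):
    """Fix langdetect's random seed so repeated detections agree."""
    from langdetect import DetectorFactory  # pip install langdetect
    DetectorFactory.seed = seed

# Call once, before the first detect() or detect_langs() call:
# seed_language_detector(0)
```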

Now you can detect any of the supported languages in your data by using the langdetect Python package.

Second Problem: Language Translation

The second problem you need to solve is to translate a text from one language to the language of your choice. In this case, you will use another useful Python package called google_trans_new.

google_trans_new is a free and unlimited Python package that implements the Google Translate API and also performs automatic language detection.

Install google_trans_new

To install google_trans_new, run the following command in your terminal.

pip install google_trans_new

Basic Example

To translate a text from one language to another, import the google_translator class from the google_trans_new module. Then create an object of the google_translator class, pass the text to its translate method, and specify the target language with the lang_tgt parameter, e.g. lang_tgt="en".
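The original code embed is missing from this copy, so here is a minimal sketch of the steps just described, not the author's exact code. The helper name translate_text and its optional translator argument are assumptions for illustration; google_translator and lang_tgt come from the google_trans_new package. The default path needs the package installed and network access.

```python
def translate_text(text, target_lang="en", translator=None):
    """Translate `text` into `target_lang`; the source language is auto-detected."""
    if translator is None:
        # Lazy import: only needed when no translator object is supplied.
        from google_trans_new import google_translator  # pip install google_trans_new
        translator = google_translator()
    return translator.translate(text, lang_tgt=target_lang)
```

Passing the Swahili sentence from the detection example to translate_text, with the default target of English, corresponds to the translation discussed next.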

In the example above we translate a Swahili sentence into the English language. Here is the output after translation.

Tanzania is the leading tourism country in Africa

By default, the translate() method detects the language of the text provided and returns its English translation. If you want to specify the source language of the text, you can use the lang_src parameter.
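A short sketch of a call with an explicit source language. The package spells the keyword lang_src; the helper name translate_with_source and the translator argument are hypothetical, added for illustration.

```python
def translate_with_source(text, source_lang, target_lang="en", translator=None):
    """Translate `text` from `source_lang` to `target_lang` without auto-detection."""
    if translator is None:
        from google_trans_new import google_translator  # pip install google_trans_new
        translator = google_translator()
    # lang_src names the source language, lang_tgt the target language.
    return translator.translate(text, lang_src=source_lang, lang_tgt=target_lang)
```

Stating the source explicitly can help with short or ambiguous texts, where automatic detection is least reliable.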

Here are all the supported language names along with their codes.

{'af': 'afrikaans', 'sq': 'albanian', 'am': 'amharic', 'ar': 'arabic', 'hy': 'armenian', 'az': 'azerbaijani', 'eu': 'basque', 'be': 'belarusian', 'bn': 'bengali', 'bs': 'bosnian', 'bg': 'bulgarian', 'ca': 'catalan', 'ceb': 'cebuano', 'ny': 'chichewa', 'zh-cn': 'chinese (simplified)', 'zh-tw': 'chinese (traditional)', 'co': 'corsican', 'hr': 'croatian', 'cs': 'czech', 'da': 'danish', 'nl': 'dutch', 'en': 'english', 'eo': 'esperanto', 'et': 'estonian', 'tl': 'filipino', 'fi': 'finnish', 'fr': 'french', 'fy': 'frisian', 'gl': 'galician', 'ka': 'georgian', 'de': 'german', 'el': 'greek', 'gu': 'gujarati', 'ht': 'haitian creole', 'ha': 'hausa', 'haw': 'hawaiian', 'iw': 'hebrew', 'hi': 'hindi', 'hmn': 'hmong', 'hu': 'hungarian', 'is': 'icelandic', 'ig': 'igbo', 'id': 'indonesian', 'ga': 'irish', 'it': 'italian', 'ja': 'japanese', 'jw': 'javanese', 'kn': 'kannada', 'kk': 'kazakh', 'km': 'khmer', 'ko': 'korean', 'ku': 'kurdish (kurmanji)', 'ky': 'kyrgyz', 'lo': 'lao', 'la': 'latin', 'lv': 'latvian', 'lt': 'lithuanian', 'lb': 'luxembourgish', 'mk': 'macedonian', 'mg': 'malagasy', 'ms': 'malay', 'ml': 'malayalam', 'mt': 'maltese', 'mi': 'maori', 'mr': 'marathi', 'mn': 'mongolian', 'my': 'myanmar (burmese)', 'ne': 'nepali', 'no': 'norwegian', 'ps': 'pashto', 'fa': 'persian', 'pl': 'polish', 'pt': 'portuguese', 'pa': 'punjabi', 'ro': 'romanian', 'ru': 'russian', 'sm': 'samoan', 'gd': 'scots gaelic', 'sr': 'serbian', 'st': 'sesotho', 'sn': 'shona', 'sd': 'sindhi', 'si': 'sinhala', 'sk': 'slovak', 'sl': 'slovenian', 'so': 'somali', 'es': 'spanish', 'su': 'sundanese', 'sw': 'swahili', 'sv': 'swedish', 'tg': 'tajik', 'ta': 'tamil', 'te': 'telugu', 'th': 'thai', 'tr': 'turkish', 'uk': 'ukrainian', 'ur': 'urdu', 'uz': 'uzbek', 'vi': 'vietnamese', 'cy': 'welsh', 'xh': 'xhosa', 'yi': 'yiddish', 'yo': 'yoruba', 'zu': 'zulu', 'fil': 'Filipino', 'he': 'Hebrew'}

Detect and Translate Python Function

I have created a simple Python function that both detects the language of a text and translates it into the language of your choice.

The function receives a text and a target language as parameters. It detects the language of the text provided. If that language is the same as the target language, it returns the text unchanged; otherwise, it translates the text into the target language.

Example:
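The original embeds for the function and its example call are not part of this copy. What follows is a minimal sketch that matches the description above, under stated assumptions: the name detect_and_translate, the default target language, and the optional detector and translator arguments are illustrative, and the sample sentence in the comment is an assumed input, not taken from the original post.

```python
def detect_and_translate(text, target_lang="en", detector=None, translator=None):
    """Return `text` in `target_lang`, translating only when it is needed."""
    if detector is None:
        from langdetect import detect  # pip install langdetect
        detector = detect
    source_lang = detector(text)

    # Already in the requested language: return the text unchanged.
    if source_lang == target_lang:
        return text

    if translator is None:
        from google_trans_new import google_translator  # pip install google_trans_new
        translator = google_translator()
    return translator.translate(text, lang_src=source_lang, lang_tgt=target_lang)

# Illustrative call (needs both packages and network access):
# sentence = "I hope that, when I have saved up, I will be able to travel to Mexico"
# print(detect_and_translate(sentence, target_lang="sw"))
```

Note that the language codes returned by langdetect and those accepted by the translator are assumed to match here; a production version may need a small mapping for codes that differ between the two packages.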

In the above source code, we translate the sentence into Swahili. Here is the output:

Natumai kwamba, nitakapojiwekea akiba, nitaweza kusafiri kwenda Mexico

Wrapping Up

In this article, you have learned how to solve two language challenges that arise when you have text data in different languages and want to translate it into a single language of your choice.

Congratulations 👏, you have made it to the end of this article!

You can download the notebook used in this article here: https://github.com/Davisy/Detect-and-Translate-Text-Data

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post! I can also be reached on Twitter @Davis_McDavid.

One last thing: Read more articles like this in the following links.

How to Deploy your NLP Model to Production as an API with Algorithmia

A simple way to deploy the NLP model on a serverless production, step by step.

medium.com

How to Use Texthero to Prepare a Text-based Dataset for Your NLP Project

A simple python toolkit to work with a text-based dataset quickly and effortlessly.

chatbotslife.com

Meet The Winners of Swahili News Classification Challenge

The first Zindi Africa NLP Virtual Hackathon focus on African Languages.

davis-david.medium.com