How to Detect and Translate Languages for an NLP Project
From Text Data in Multiple Languages to a Single Language
This article was updated on 20th June 2022
For Spanish speakers, you can read the translated version of this article here
Happy new year to you, 2021 is here and you did it 💪. 2020 is now behind us, and even though 2020 has been a tough and strange year for many people around the world, there’s still a lot to celebrate. In 2020, I learned that all we need is the love & support of our loved ones, family members, and friends.
“In the face of adversity, we have a choice. We can be bitter, or we can be better. Those words are my North Star.”- Caryn Sullivan
This will be my first article for 2021, and I will talk about some language challenges a data scientist or machine learning engineer can face while working on an NLP project, and how you can solve them.
Imagine you are a data scientist assigned to an NLP project that analyzes what people post on social media (e.g. Twitter) about COVID-19. One of your first tasks is to find different hashtags for COVID-19 (e.g. #covid19) and then start collecting all tweets related to COVID-19.
When you start to analyze the collected data, you find that it was written in many different languages, such as English, Swahili, Spanish, Chinese, and Hindi. In this case, you have two problems to solve before you can analyze the dataset: first, identify the language of each piece of text, and second, translate the text into the language of your choice (e.g. all data should be in English).
So how can we solve these two problems?
First Problem: Language Detection
The first problem is detecting the language of a given piece of text. For this, you can use a simple Python package called langdetect.
langdetect is a simple Python package developed by Michal Danilák that supports detection of 55 different languages out of the box (ISO 639-1 codes):
af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he,
hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl,
pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw
Install langdetect
To install langdetect run the following command in your terminal.
pip install langdetect
Basic Example
To detect the language of a text, e.g. “Tanzania ni nchi inayoongoza kwa utalii barani afrika”, first import the detect function from langdetect and then pass the text to it.
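Here is a minimal sketch of this step, assuming langdetect is installed. The detect function comes from langdetect; the detect_language wrapper name is only an illustration and not part of the package.

```python
def detect_language(text):
    """Return the language code that langdetect guesses for `text`."""
    # Imported here so the helper can be defined before langdetect is installed.
    from langdetect import detect

    return detect(text)


# Example call (needs `pip install langdetect`):
# print(detect_language("Tanzania ni nchi inayoongoza kwa utalii barani afrika"))
```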
Output: “sw”
The function detects that the text provided is in Swahili (‘sw’).
You can also find out the probabilities for the top languages by using the detect_langs function.
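A short sketch of that call, again assuming langdetect is installed. detect_langs is the langdetect function; detect_language_probabilities is a hypothetical wrapper name used here for illustration.

```python
def detect_language_probabilities(text):
    """Return langdetect's candidate languages for `text`."""
    # Imported here so the helper can be defined before langdetect is installed.
    from langdetect import detect_langs

    # Each item in the returned list exposes a .lang code and a .prob score.
    return detect_langs(text)


# Example call (needs `pip install langdetect`):
# print(detect_language_probabilities("Tanzania ni nchi inayoongoza kwa utalii barani afrika"))
```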
Output: [sw:0.9999971710531397]
NOTE: The language detection algorithm is non-deterministic. If you run it on a text that is either too short or too ambiguous, you might get different results each time you run it.
Call the following code before language detection in order to enforce consistent results.
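One way to do this, sketched under the assumption that langdetect is installed, is to fix the seed on DetectorFactory before any detection call. The fix_detection_seed wrapper and its default value of 0 are illustrative choices.

```python
def fix_detection_seed(seed=0):
    """Make later langdetect results reproducible by fixing the random seed."""
    # Imported here so the helper can be defined before langdetect is installed.
    from langdetect import DetectorFactory

    # Set once, before calling detect() or detect_langs().
    DetectorFactory.seed = seed


# Example call (needs `pip install langdetect`):
# fix_detection_seed()
```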
Now you can detect the language of the text in your data by using the langdetect Python package.
Second Problem: Language Translation
The second problem you need to solve is translating a text from one language to the language of your choice. In this case, you will use another useful Python package called google_trans_new.
google_trans_new is a free and unlimited Python package that implements the Google Translate API and also performs automatic language detection.
Install google_trans_new
To install google_trans_new run the following command in your terminal.
pip install google_trans_new
Basic example
To translate a text from one language to another, import the google_translator class from the google_trans_new module. Then create an object of the google_translator class, pass the text as a parameter to its translate method, and specify the target language with the lang_tgt parameter, e.g. lang_tgt="en".
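These steps can be sketched as follows, assuming google_trans_new is installed and you have an internet connection. google_translator, translate, and lang_tgt come from the package; the translate_text wrapper name is only for illustration.

```python
def translate_text(text, target_lang="en"):
    """Translate `text` into `target_lang`, letting the package detect the source."""
    # Imported here so the helper can be defined before google_trans_new is installed.
    from google_trans_new import google_translator

    translator = google_translator()
    return translator.translate(text, lang_tgt=target_lang)


# Example call (needs `pip install google_trans_new` and internet access):
# print(translate_text("Tanzania ni nchi inayoongoza kwa utalii barani afrika"))
```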
In the example above we translate a Swahili sentence into the English language. Here is the output after translation.
Tanzania is the leading tourism country in Africa
By default, the translate() method detects the language of the text provided and returns the English translation. If you want to specify the source language of the text, use the lang_src parameter.
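As a small sketch of passing the source language explicitly, assuming google_trans_new is installed: the translate_from wrapper name is hypothetical, while the source-language keyword is spelled lang_src in the package.

```python
def translate_from(text, source_lang, target_lang="en"):
    """Translate `text` from an explicitly given source language."""
    # Imported here so the helper can be defined before google_trans_new is installed.
    from google_trans_new import google_translator

    translator = google_translator()
    # lang_src names the source language, lang_tgt the target language.
    return translator.translate(text, lang_src=source_lang, lang_tgt=target_lang)


# Example call (needs `pip install google_trans_new` and internet access):
# print(translate_from("Tanzania ni nchi inayoongoza kwa utalii barani afrika", "sw"))
```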
Here are all the language names along with their shorthand codes.
{'af': 'afrikaans', 'sq': 'albanian', 'am': 'amharic', 'ar': 'arabic', 'hy': 'armenian', 'az': 'azerbaijani', 'eu': 'basque', 'be': 'belarusian', 'bn': 'bengali', 'bs': 'bosnian', 'bg': 'bulgarian', 'ca': 'catalan', 'ceb': 'cebuano', 'ny': 'chichewa', 'zh-cn': 'chinese (simplified)', 'zh-tw': 'chinese (traditional)', 'co': 'corsican', 'hr': 'croatian', 'cs': 'czech', 'da': 'danish', 'nl': 'dutch', 'en': 'english', 'eo': 'esperanto', 'et': 'estonian', 'tl': 'filipino', 'fi': 'finnish', 'fr': 'french', 'fy': 'frisian', 'gl': 'galician', 'ka': 'georgian', 'de': 'german', 'el': 'greek', 'gu': 'gujarati', 'ht': 'haitian creole', 'ha': 'hausa', 'haw': 'hawaiian', 'iw': 'hebrew', 'hi': 'hindi', 'hmn': 'hmong', 'hu': 'hungarian', 'is': 'icelandic', 'ig': 'igbo', 'id': 'indonesian', 'ga': 'irish', 'it': 'italian', 'ja': 'japanese', 'jw': 'javanese', 'kn': 'kannada', 'kk': 'kazakh', 'km': 'khmer', 'ko': 'korean', 'ku': 'kurdish (kurmanji)', 'ky': 'kyrgyz', 'lo': 'lao', 'la': 'latin', 'lv': 'latvian', 'lt': 'lithuanian', 'lb': 'luxembourgish', 'mk': 'macedonian', 'mg': 'malagasy', 'ms': 'malay', 'ml': 'malayalam', 'mt': 'maltese', 'mi': 'maori', 'mr': 'marathi', 'mn': 'mongolian', 'my': 'myanmar (burmese)', 'ne': 'nepali', 'no': 'norwegian', 'ps': 'pashto', 'fa': 'persian', 'pl': 'polish', 'pt': 'portuguese', 'pa': 'punjabi', 'ro': 'romanian', 'ru': 'russian', 'sm': 'samoan', 'gd': 'scots gaelic', 'sr': 'serbian', 'st': 'sesotho', 'sn': 'shona', 'sd': 'sindhi', 'si': 'sinhala', 'sk': 'slovak', 'sl': 'slovenian', 'so': 'somali', 'es': 'spanish', 'su': 'sundanese', 'sw': 'swahili', 'sv': 'swedish', 'tg': 'tajik', 'ta': 'tamil', 'te': 'telugu', 'th': 'thai', 'tr': 'turkish', 'uk': 'ukrainian', 'ur': 'urdu', 'uz': 'uzbek', 'vi': 'vietnamese', 'cy': 'welsh', 'xh': 'xhosa', 'yi': 'yiddish', 'yo': 'yoruba', 'zu': 'zulu', 'fil': 'Filipino', 'he': 'Hebrew'}
Detect and Translate Python Function
I have created a simple Python function that both detects the language of a text and translates it into the language of your choice.
The function receives a text and a target language as parameters. It detects the language of the text provided; if that language is the same as the target language, it returns the text unchanged, and if it is not the same, it translates the text into the target language.
Example:
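The function described above can be sketched as follows, with an example call at the end. This assumes both langdetect and google_trans_new are installed; the default target language and the English input sentence are illustrative assumptions.

```python
def detect_and_translate(text, target_lang="en"):
    """Translate `text` into `target_lang` unless it is already in that language."""
    # Imported here so the function can be defined before the packages are installed.
    from langdetect import detect
    from google_trans_new import google_translator

    detected_lang = detect(text)

    # Nothing to do when the text is already in the target language.
    if detected_lang == target_lang:
        return text

    translator = google_translator()
    return translator.translate(text, lang_src=detected_lang, lang_tgt=target_lang)


# Example call (illustrative sentence; needs both packages and internet access):
# sentence = "I hope that, when I've built up my savings, I'll be able to travel to Mexico"
# print(detect_and_translate(sentence, target_lang="sw"))
```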
In the above source code, we translate the sentence into Swahili. Here is the output:
Natumai kwamba, nitakapojiwekea akiba, nitaweza kusafiri kwenda Mexico
Wrapping Up
In this article, you have learned how to solve two language challenges when you have text data with different languages and you want to translate the data into the single language of your choice.
Congratulations 👏, you have made it to the end of this article!
You can download the notebook used in this article here: https://github.com/Davisy/Detect-and-Translate-Text-Data
If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post! I can also be reached on Twitter @Davis_McDavid.