
4 NLP Libraries for Automatic Language Identification of Text Data In Python

Comprehensive overview of open-source libraries for automatic language detection

Image from Soner Eker on Unsplash

Introduction

Most industries dealing with textual data rely on digital capabilities because they are the fastest way to process those documents. At an international level, it can be beneficial to automatically identify the underlying language of a document before any further processing. A simple use case would be for a company to detect the language of incoming textual information so that it can be routed to the relevant department for processing.

This article provides an overview of four Python libraries that can perform such a task.

Language detection libraries

This section walks you through a description of each library and its implementation for text language detection. All the source code is available in my Google Colab.

1. LangDetect

This library is a direct port of Google’s language-detection library from Java to Python and can recognize over 50 languages. Developed by Nakatani Shuyo at Cybozu Labs, Inc., langdetect comes with two main functions, both taking the text to analyze as input:

  • detect outputs the two-letter ISO 639-1 code of the identified language.
  • detect_langs outputs a list of the top candidate languages detected, with their corresponding probability scores.

    In the text variable below, I added the last sentence in French on purpose in order to introduce noise.

text = """This library is the direct port of Google's language-detection library from Java to Python. Elle est vraiment éfficace dans la détection de langue."""
First case tests, 1st and 5th execution of the language detection (Image by Author)
Second case tests, 1st and 4th execution of the language detection (Image by Author)
  • Test of the first case (detect): the language identified in the first execution (fr for French) differs from the one in the fifth execution (en for English).
  • Test of the second case (detect_langs): we get two candidate languages, French and English. In the first execution, the probability score for French is the higher one; by the fourth execution, the probability for English has become higher.

All these inconsistencies arise because the algorithm under the hood is non-deterministic, and the higher the noise in the text, the higher the inconsistency. This issue can be fixed by setting a seed before running the language detection instructions:
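A minimal sketch of the fix, reusing the text variable from above:

from langdetect import DetectorFactory, detect, detect_langs

# Fixing the seed makes the detection deterministic across executions
DetectorFactory.seed = 0

print(detect(text))
print(detect_langs(text))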

The result of the previous code will always be the same, because of the use of DetectorFactory. This library, while easy to use, is best suited for simple use cases.

2. Spacy-langdetect

spaCy is an NLP library whose features range from tokenization and named entity recognition to pre-trained models. The spacy-langdetect package adds a language detection flavor to its text processing pipeline.

  • model can be any custom language detector, meaning your own pre-trained model or one from the spaCy models hub.
  • LanguageDetector is the class that performs the language detection and uses the detect_langs function under the hood.
  • The name parameter, set to language_detector, makes it possible to access the language detection feature in the pipeline.
  • The ._.language attribute corresponds to the dictionary containing the information about the candidate language detected in the document (see the sketch below).
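Here is a minimal sketch of such a pipeline, assuming the spaCy 2.x add_pipe API that spacy-langdetect was built against (spaCy 3 requires registering the component as a factory first) and that the en_core_web_sm model is installed:

import spacy
from spacy_langdetect import LanguageDetector

# Load a pre-trained spaCy model
nlp = spacy.load("en_core_web_sm")

# Append the language detector at the end of the pipeline
nlp.add_pipe(LanguageDetector(), name="language_detector", last=True)

doc = nlp("This library adds language detection to the pipeline.")
print(doc._.language)  # e.g. {'language': 'en', 'score': 0.99...}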

Below are the results on license-free English and French texts, extracted from actuia.com and jeuneafrique.com respectively.

  • The English text yields {‘language’: ‘en’, ‘score’: 0.9999963977276909}, almost 100% confidence that the text is in English.
  • The French text yields {‘language’: ‘fr’, ‘score’: 0.9999963767662121}, almost 100% confidence that the text is in French.

3. fastText

This library was developed by the Facebook AI Research (FAIR) lab. It is built for production use cases rather than research, which makes it a fast, accurate, and lightweight tool: its compressed model takes less than 1MB of memory, and it can recognize more than 170 languages. It comes with the following two versions of the pre-trained model.

  • lid.176.bin: the faster and slightly more accurate version, but a bit large (size = 126MB).
  • lid.176.ftz: the compressed version of the model, with a file size of 917kB.

Let’s focus on the first model (lid.176.bin), since it is the more accurate of the two.

  • ft_model is the instance of the fastText pre-trained model, loaded from my pretrained_model folder.
  • .replace("\n", " ") is used so that fastText does not throw an error on multi-line input (see the sketch below).
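A minimal sketch of the loading and prediction steps; the model path and the two sample sentences are assumptions for illustration (download lid.176.bin from the fastText website first):

import fasttext

# Load the pre-trained language identification model
ft_model = fasttext.load_model("pretrained_model/lid.176.bin")

text_en = "This library was developed by the Facebook AI Research lab."
text_fr = "Elle est vraiment efficace dans la détection de langue."

# predict() expects single-line input, hence the newline replacement
print(ft_model.predict([text_en.replace("\n", " ")]))
print(ft_model.predict([text_fr.replace("\n", " ")]))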

Below are the outputs of the two print statements. [‘__label__en’] means that the first text is predicted to be in English with 89% confidence, and [‘__label__fr’] means that the second one is detected as French with 99% confidence.

([['__label__en']], [array([0.8957091], dtype=float32)])
([['__label__fr']], [array([0.99077034], dtype=float32)])

4. gcld3

This is the Google Compact Language Detector v3, a neural-network language identification library developed by Google. At the time of writing, the pre-trained model supports 107 languages and offers two main features for language identification.

  • FindLanguage corresponds to the first feature and returns the BCP-47-style code of the detected language, along with a confidence score.
  • FindTopNMostFreqLangs corresponds to the second feature, which instead returns the top candidate languages (controlled by the num_langs parameter) and their confidence scores.

Prior to using those features, we need to instantiate a detector, which requires the following two parameters.

  • min_num_bytes: the minimum number of bytes of text to consider.
  • max_num_bytes: the maximum number of bytes of text to consider.

Note: it is highly recommended to use a virtual environment when installing this library.

Below are the implementations of the two features.
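Here is a minimal sketch of both calls; the byte bounds and the sample sentence are illustrative assumptions:

import gcld3

# Instantiate the detector with illustrative byte bounds
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)

sample = "This is the Google Compact Language Detector v3."

# First feature: single best language
result = detector.FindLanguage(text=sample)
print({'language': result.language, 'probability': result.probability})

# Second feature: top candidate languages
for res in detector.FindTopNMostFreqLangs(text=sample, num_langs=2):
    print({'language': res.language, 'probability': res.probability})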

  • Results from the first feature

    {'language': 'en', 'probability': 0.9998331069946289}
    {'language': 'fr', 'probability': 0.9999961853027344}
  • Results from the second feature

I initialized num_langs to two because I wanted the top 2 languages identified for each text.

[{'language': 'en', 'probability': 0.9998331069946289}, {'language': 'und', 'probability': 0.0}]
[{'language': 'fr', 'probability': 0.9999961853027344}, {'language': 'und', 'probability': 0.0}]

We notice the und (undetermined) language code in the results because no second language was identified, even though we asked for the top two. This often happens when the first language is identified with a very high confidence score, almost 100% in our case.

Thanks for reading!

Congrats on making it this far! I hope you enjoyed the article and that it gave you a clear view of the benefits of using pre-trained models for automatic language detection. Please find additional resources below to further your learning.

Do not hesitate to add me on LinkedIn or follow me on YouTube and Twitter. It is always a pleasure to discuss AI, ML, data science, and NLP topics!

Article’s source code on Google Colab

Spacy-langdetect

Google’s langdetect

Fasttext language identification

Introduction to Google’s Compact Language Detector v3 in python

Bye for now 🏃🏾

