Lost In (Machine) Translation

Small batch machine translation of speeches and news articles (English to Chinese/Tamil, and vice versa) in under 30 lines of code, using Hugging Face’s version of MarianMT and Facebook’s Fairseq.

Chua Chin Hon
Towards Data Science


Illustration: Chua Chin Hon

*UPDATED Dec 30, 2020*:

Facebook recently released its machine translation models for English to Tamil (and vice versa), and I was eager to give them a try since Tamil is among the most under-served languages in machine learning, and pre-trained models for this language pair are pretty hard to come by.

The new notebooks and toy datasets are in the repo. Or, go here for the demo for English-to-Tamil translation of speeches and news articles, and here for Tamil-to-English translation of the same type of material.
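If you’d rather experiment outside the notebooks, the Fairseq models can be loaded via torch.hub. Here’s a minimal sketch; note that the hub identifier "transformer.wmt20.en-ta" is my assumption, so check fairseq’s list of pre-trained models for the exact name:

```python
# Minimal sketch: loading Facebook's English-to-Tamil Fairseq model.
# The hub identifier 'transformer.wmt20.en-ta' is an assumption;
# consult fairseq's pre-trained model list for the exact name.
import torch

en2ta = torch.hub.load("pytorch/fairseq", "transformer.wmt20.en-ta")
en2ta.eval()  # inference only, no training

print(en2ta.translate("The new notebooks and toy datasets are in the repo."))
```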

There are obvious problems with the quality of the translation in some parts. But machine translation gets about 70–80% of the job done, in my view, allowing human translators to work more efficiently.

Sample output for English-to-Tamil translation of news articles.

Machine translation doesn’t generate as much excitement as other emerging areas in NLP these days, in part because consumer-facing services like Google Translate have been around since April 2006.

But recent advances, particularly the work by Hugging Face in making transformer models more accessible and easy to use, have opened up interesting new possibilities for those looking to translate more than just piecemeal sentences or articles.

For one, small batch translation in multiple languages can now be run pretty efficiently from a desktop or laptop without having to subscribe to an expensive service. No doubt the translations produced by neural machine translation models are not (yet) as artful or precise as those by a skilled human translator, but they get 60% or more of the job done, in my view.

This could be a huge time saver for workplaces that run on short deadlines, such as newsrooms, to say nothing of the scarcity of skilled human translators.

Over three short notebooks, I’ll outline a simple workflow for using Hugging Face’s version of MarianMT to batch translate: three English speeches into Chinese, five English news articles into Chinese, and three Chinese speeches back into English.

Download the output files here

1. DATA SET AND TRANSLATED OUTPUT

There are two datasets for this post. The first comprises 11 speeches in four languages (English, Malay, Chinese, and Tamil) taken from the website of the Singapore Prime Minister’s Office. The second consists of five random English news stories on Covid-19 published on Singapore news outlet CNA’s website in March 2020.

The output CSV files, containing both the machine translated text and the original copy, can be downloaded here.

At the time of writing, you can tap over 1,300 open source models on Hugging Face’s model hub for machine translation. As no MarianMT models for English-Malay and English-Tamil (and vice versa) have been released to date, this series of notebooks will not deal with these two languages for now. I’ll revisit them as and when the models are available.
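The Helsinki-NLP MarianMT checkpoints on the hub follow a predictable naming pattern, opus-mt-{src}-{tgt}, so it’s easy to check whether a language pair is covered. A quick sketch (the loop and the pair list are mine for illustration):

```python
from transformers import MarianTokenizer

# MarianMT checkpoints on the Hugging Face hub follow the pattern
# Helsinki-NLP/opus-mt-{src}-{tgt}; attempting to load a tokenizer is
# a quick way to check whether a given language pair is covered.
for pair in ["en-zh", "zh-en", "en-ms", "en-ta"]:
    try:
        MarianTokenizer.from_pretrained(f"Helsinki-NLP/opus-mt-{pair}")
        print(pair, "-> available")
    except Exception:
        print(pair, "-> not available")
```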

I won’t dive into the technical details behind machine translation. Broad details of Hugging Face’s version of MarianMT can be found here, while those curious about the history of machine translation can start with this fairly recent article.

2A. ENGLISH-TO-CHINESE MACHINE TRANSLATION OF 3 SPEECHES

The three English speeches I picked for machine translation into Chinese range from 1,352 to 1,750 words. They are not highly technical but cover a wide enough range of topics — from Covid-19 to Singapore politics and domestic concerns — to stretch the model’s capability.

The text was lightly cleaned. For best results, the sentences were translated one at a time (you’ll notice a significant drop in translation quality if you run an entire speech through the model without tokenizing at the sentence level). The notebook took just minutes to run on my late-2015 iMac (32GB RAM), though runtimes will vary depending on your hardware.

Batch machine translation made easy, thanks to Hugging Face’s implementation of MarianMT
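The full notebook code isn’t reproduced here, but a minimal sketch of the sentence-by-sentence approach, assuming the Helsinki-NLP opus-mt-en-zh checkpoint and NLTK’s sentence tokenizer, looks something like this:

```python
import nltk
from transformers import MarianMTModel, MarianTokenizer

nltk.download("punkt")  # sentence tokenizer used to split the speech

model_name = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(text: str) -> str:
    # Translating sentence by sentence avoids the quality drop seen
    # when a full speech is fed to the model in one pass.
    sentences = nltk.sent_tokenize(text)
    batch = tokenizer(sentences, return_tensors="pt",
                      padding=True, truncation=True)
    generated = model.generate(**batch)
    return " ".join(tokenizer.decode(t, skip_special_tokens=True)
                    for t in generated)

print(translate("May Day is a public holiday. We are a rugged people."))
```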

Some of the familiar problems with machine translation are apparent right away, particularly with the literal translation of certain terms, phrases and idioms.

May Day, for instance, was translated as “五月节” (literally “May Festival”) instead of 劳动节 (Labour Day). A reference to “rugged people” was translated as “崎岖不平的人”, which literally means “uneven people”. Clearly the machine translation mixed up the usage of “rugged” in the context of terrain versus that of a society.

Here’s a comparison with Google Translate, using a snippet of the second speech. Results from the Hugging Face MarianMT model held up pretty well against Google’s translation, in my view:

To be sure, neither version can be used without correcting for some obvious errors. Google’s version, for instance, rendered the phrase “called the election” as “打电话给这次选举”, literally to make a telephone call to the election. The phrasing of the translated Chinese sentences is also awkward in many instances.

A skilled human translator would definitely do a better job at this point. But I think it is fair to say that even a highly experienced translator won’t be able to translate all three speeches within minutes.

The draw of machine translation, for now, appears to be scale and relative speed rather than precision. Let’s try the same model on a different genre of writing to see how it performs.

2B. ENGLISH-TO-CHINESE MACHINE TRANSLATION OF 5 NEWS ARTICLES

Speeches are more conversational, so I picked five random news articles on Covid-19 (published in March 2020) to see how the model performs against a more formal style of writing. To keep things simple, I selected articles that are under 500 words.

The workflow in this second trial is identical to the first, except for additional text cleaning rules and the translation of the articles’ headlines alongside the body text. Here’s what the output CSV file looks like:

Download a copy of the above CSV here.
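Applied to a dataframe of articles, the workflow boils down to a couple of apply() calls on top of the translate() function sketched earlier (the column and file names below are mine for illustration):

```python
import pandas as pd

# Hypothetical input file with 'headline' and 'body' columns; reuses
# the translate() function from the earlier sketch.
df = pd.read_csv("news_en.csv")
df["headline_zh"] = df["headline"].apply(translate)
df["body_zh"] = df["body"].apply(translate)
df.to_csv("news_en_zh.csv", index=False)
```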

Let’s compare one of the examples with the results from Google Translate:

Both Google and MarianMT tripped up on the opening paragraph, which was fairly long and convoluted. The two models performed slightly better on the shorter sentences/paragraphs, but the awkward literal translations of simple phrases continue to be a problem.

For instance, the phrase “as tourists stay away” was translated by the MarianMT model as “游客离家出走” or “tourists ran away from home”, while Google translated it as “游客远离了当地” or “tourists kept away from the area”.

These issues could result in misinterpretations of factual matters, and cause confusion. I haven’t conducted a full-fledged test, but based on the trials I’ve done so far, both MarianMT and Google Translate appear to do better with text that’s more conversational in nature, as opposed to more formal forms of writing.

2C. DASH APP

Plotly has released a good number of sample interactive apps for transformer-based NLP tasks, including one that works with Hugging Face’s version of MarianMT. Try it out on Colab, or via Github (I edited the app’s headline in the demo gif below).

3. CHINESE-TO-ENGLISH MACHINE TRANSLATION OF 3 SPEECHES

Machine translation of Chinese text into English tends to be a trickier task in general, as most NLP libraries and tools are English-focused. There isn’t a straightforward equivalent of NLTK’s sentence tokenizer for Chinese, for instance.

I experimented with jieba and hanlp but didn’t get very far. As a temporary workaround, I adapted a function to split the Chinese text in the dataframe column into individual sentences prior to running them via the Chinese-to-English MarianMT model.
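The adapted function isn’t reproduced here, but a minimal regex-based splitter of this kind (the function name and punctuation set are my own) might look like:

```python
import re

def split_chinese_sentences(text: str) -> list[str]:
    # Split on Chinese (and Western) end-of-sentence punctuation,
    # keeping the punctuation attached to the preceding sentence.
    parts = re.split(r"([。！？!?]+)", text)
    # re.split with a capturing group interleaves text and delimiters;
    # pair them back up so each sentence keeps its closing punctuation.
    sentences = ["".join(pair) for pair in zip(parts[0::2], parts[1::2])]
    if parts[-1].strip():  # trailing text without closing punctuation
        sentences.append(parts[-1])
    return [s.strip() for s in sentences if s.strip()]

print(split_chinese_sentences("你好。今天天气不错！我们走吧"))
```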

But even with the somewhat clumsy workaround, the batch translation of the three Chinese speeches took just about 5 minutes. Do note that these three speeches are the official Chinese versions of the three earlier English speeches. The two sets of speeches cover the same ground but there are some slight variations in the content of the Chinese speeches, which aren’t direct, word-for-word translations of the English versions.

You can download the output CSV file here. Let’s see how the results for the third speech compare with Google Translate:

The Google Translate version reads much better, and didn’t have the glaring error of translating “National Day” as “Fourth of July”. Overall, the results of Chinese-to-English machine translation appear (to me at least) considerably better than those for English-to-Chinese translation. But one possible reason is that the Chinese speeches in my sample were more simply written, and did not push the neural machine translation models that hard.

4. FOOD FOR THOUGHT

While Hugging Face has made machine translation more accessible and easier to implement, some challenges remain. One obvious technical issue is the fine-tuning of neural translation models for specific markets.

For instance, China, Singapore and Taiwan differ quite significantly in their usage of the written form of Chinese. Likewise, Bahasa Melayu and Bahasa Indonesia have noticeable differences even if they sound and look identical to non-speakers. Assembling the right datasets for such fine-tuning won’t be a trivial task.

In terms of the results achievable with the publicly available models, I would argue that the machine translated English-to-Chinese and Chinese-to-English texts aren’t ready for publication unless a skilled translator is on hand to check and make corrections. But that’s just one use case. In situations where the translated text is part of a bigger internal workflow that does not require publication, machine translation could come in very handy.

For instance, if I’m tasked with tracking disinformation campaigns on social media by Chinese or Russian state actors, it just wouldn’t make sense to translate the torrent of tweets and Facebook posts manually. Batch machine translation of these short sentences and paragraphs would be a far more efficient way to get a broad sense of the messages being peddled by the automated accounts.

Likewise, if you are tracking social media feedback for a new product or political campaign in a multi-lingual market, batch machine translation will likely be a far more efficient way of monitoring non-English comments collected from Instagram, Twitter or Facebook.

Ultimately, the growing use of machine translation will be driven by the broad decline of language skills, even in nominally multi-lingual societies like Singapore. You may scoff at some of the machine translated text above, but most working adults who have gone through Singapore’s bilingual education system are unlikely to do much better if asked to translate the text on their own.

The Github repo for this post can be found here. This is the fourth in a series on practical applications of new NLP tools. The earlier posts/notebooks focused on:

If you spot mistakes in this or any of my earlier posts, ping me at:
