How we trained a translation model from West African Pidgin to English without a single parallel sentence

“Every act of communication is a miracle of translation.”
Ken Liu

Kelechi
Towards Data Science


[Header image source: https://unsplash.com/photos/I_LgQ8JZFGE]

TL;DR: We trained a model that can translate sentences from West African Pidgin (Creole) to English, and vice versa, without showing it a single parallel sentence (a Pidgin sentence and its English equivalent) to learn from. You can skip to the Results section at the end of the article to see some example translations by our model and the link to the code on GitHub.

Introduction

Translation is an important area of research in Artificial Intelligence and, above all, in communication. Much Machine Translation work has focused on popular languages like English, French, German, Chinese and so on. However, little work has been done on African languages.

Over 1,000 languages are spoken across West and Central Africa, with over 250 of them being Nigerian. Despite the obvious diversity amongst these languages, one language significantly unifies them all — Pidgin English. There are over 75 million speakers in Nigeria alone; however, there is no known Natural Language Processing work on this language.

The problems this research addresses are the following:

  1. Provision of a Pidgin corpus and training of Pidgin Word Vectors
  2. Cross-lingual embedding of Pidgin and English
  3. Unsupervised Machine Translation from Pidgin to English

1. Obtaining Corpus and Training Word Vectors

In total, we obtained a corpus consisting of 56,048 sentences and 32,925 unique words by scraping Pidgin news websites. Below are examples of sentences from the corpus:

I. dis one na one of di first songs wey commot dis year for nigeria but as dem release am, yawa dey.

II. dem say na serious gbege if dem catch anybody with biabia for inside di campus

We initialized the word vectors with GloVe and fine-tuned them on the corpus with a CBOW model trained with 8 negative samples, a window size of 5, a dimension of 300, and a batch size of 3000 for 5 epochs.
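
As a rough illustration (this is not the exact pipeline used in the project), the CBOW training stage with these hyperparameters could look like the following gensim sketch; the corpus file name is an assumption, and the GloVe initialization step is omitted:

```python
# Minimal sketch: train CBOW word vectors on the scraped Pidgin corpus with
# gensim, using the hyperparameters reported above. The original work also
# initialized from GloVe vectors before fine-tuning; that step is omitted here.
from gensim.models import Word2Vec

# assume one sentence per line, already lower-cased and whitespace-tokenised
with open("pidgin_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

model = Word2Vec(
    sentences=sentences,
    sg=0,              # CBOW
    vector_size=300,   # embedding dimension
    window=5,          # context window size
    negative=8,        # number of negative samples
    batch_words=3000,  # batch size (in words)
    epochs=5,
)
model.wv.save_word2vec_format("pidgin_vectors.vec")
```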

2. Training Cross-Lingual Embeddings

Given the absence of parallel data, we performed Unsupervised Machine Translation, which relies on cross-lingual embeddings. A dictionary of 1,097 word pairs was scraped and manually edited for supervised alignment, and the alignment methods were evaluated on a validation set of 108 pairs.

Alignment of the word vectors was performed with the Procrustes method, in which an orthogonal weight matrix is learned to map source word vectors to target word vectors (Conneau et al., 2018), and with the Retrieval Criterion of Joulin et al. (2018). The latter outperformed the former, achieving a nearest-neighbor accuracy of 0.1282, compared to 0.0853 for Procrustes and a baseline of 0.009 (the probability of randomly picking the right nearest neighbor from the validation set of 108 pairs).
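
For intuition, the Procrustes step has a closed-form solution via SVD. Below is a minimal NumPy sketch (variable names are illustrative and not from the project's code; the retrieval-criterion alignment of Joulin et al. (2018) optimizes a different objective and is not shown):

```python
# Supervised Procrustes alignment (Conneau et al., 2018): learn an orthogonal
# matrix W that maps Pidgin vectors onto English vectors using dictionary pairs.
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """X: (n_pairs, 300) Pidgin vectors, Y: (n_pairs, 300) English vectors,
    with row i of X and row i of Y forming one dictionary pair."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # orthogonal W minimising ||X @ W - Y||_F

# A Pidgin word vector x is then mapped into the English space as x @ W.
```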

Below are some example Pidgin words with their three nearest English neighbor words after alignment, together with the corresponding cosine similarities:

pikin — child (0.7461), infant (0.5493), children (0.5357)

presido — president (0.9173), vice (0.6589), presidents (0.5875)

wahala — problem (0.7265), problems (0.6983), trouble (0.6906)
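
These neighbors are obtained by cosine similarity in the aligned space; a small illustrative sketch (the function and variable names are assumptions, not taken from the repository):

```python
# Illustrative: retrieve the top-k English nearest neighbours of a mapped
# Pidgin vector by cosine similarity. `english_words` is a list of words and
# `english_mat` the matching (V, 300) matrix of English word vectors.
import numpy as np

def nearest_english(mapped_vec, english_words, english_mat, k=3):
    eng = english_mat / np.linalg.norm(english_mat, axis=1, keepdims=True)
    q = mapped_vec / np.linalg.norm(mapped_vec)
    sims = eng @ q
    top = np.argsort(-sims)[:k]
    return [(english_words[i], float(sims[i])) for i in top]

# e.g. nearest_english(pidgin_vec["pikin"] @ W, english_words, english_mat)
```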

3. Unsupervised NMT

For this, we used a Transformer with 10 attention heads. There are 4 encoder and 4 decoder layers, with 3 encoder layers and 3 decoder layers shared across both languages.
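
To make the sharing concrete, here is a rough PyTorch sketch of the idea (this is not the project's model code; the choice of which layers are shared and the model dimension of 300 are assumptions):

```python
# Cross-lingual layer sharing: each language gets its own first encoder layer,
# while the remaining three layer instances are reused by both encoders.
# The decoders would be built analogously.
import torch.nn as nn

d_model, n_heads = 300, 10  # 300 matches the word-vector dimension (an assumption)

shared_layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True) for _ in range(3)]
)

def build_encoder() -> nn.ModuleList:
    own_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    return nn.ModuleList([own_layer, *shared_layers])

pidgin_encoder = build_encoder()
english_encoder = build_encoder()  # its last 3 layers are the same objects as in pidgin_encoder
```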

For a decoder to work well, its inputs should be produced by the encoder it was trained with, or at least come from a similar distribution. Hence, we make sure the encoder maps sentences from both the source and target languages to the same latent space, so that the decoder can translate regardless of the input language. We enforce this with adversarial training, following Lample et al. (2018a), by constraining the encoder to map the two languages to the same feature space: a discriminator is trained to classify whether an encoding came from a source or a target sentence, and the encoder is trained to fool the discriminator, so that latent representations of source and target sentences become indistinguishable.
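
A hedged sketch of this adversarial setup is below; the discriminator architecture and the 300-dimensional pooled sentence encoding are illustrative assumptions, not the project's exact configuration:

```python
# The discriminator tries to tell which language an encoding came from, while
# the encoder is trained on flipped labels so that Pidgin and English sentences
# become indistinguishable in the latent space (Lample et al., 2018a).
import torch
import torch.nn as nn

discriminator = nn.Sequential(
    nn.Linear(300, 1024), nn.LeakyReLU(0.2),
    nn.Linear(1024, 1024), nn.LeakyReLU(0.2),
    nn.Linear(1024, 1),  # logit for "this encoding came from English"
)
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(z_pd, z_en):
    # z_*: (batch, 300) pooled encoder outputs, detached so only D is updated
    logits = torch.cat([discriminator(z_pd.detach()), discriminator(z_en.detach())])
    labels = torch.cat([torch.zeros(len(z_pd), 1), torch.ones(len(z_en), 1)])
    return bce(logits, labels)

def adversarial_loss(z_pd, z_en):
    # the encoder is trained to fool D, so the language labels are flipped
    logits = torch.cat([discriminator(z_pd), discriminator(z_en)])
    labels = torch.cat([torch.ones(len(z_pd), 1), torch.zeros(len(z_en), 1)])
    return bce(logits, labels)
```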

We also make sure the same latent space is used for both language modelling and translation so that the language model can be transferred nicely to the translation task.

At each training step, we perform the following:

  1. Discriminator training, which aims to predict the language of an encoded sentence.
  2. Denoising autoencoder training on each language (this is equivalent to training a language model, as the model learns useful patterns for reconstruction and becomes robust to noisy input sentences).
  3. On-the-fly back-translation, in which a given sentence is translated with the current translation model M, and we then attempt to reconstruct the original sentence from that translation, leveraging the language model trained in step 2 above.

[Image: algorithm for our PidginUNMT]

Thus, our final objective function is:

[Image: objective function]
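
Spelled out, and following Lample et al. (2018a), on whose method this training procedure is based, the objective has roughly the following form (the notation and λ weight symbols are ours, not necessarily those used in the project):

```latex
\mathcal{L}_{total} =
    \lambda_{auto}\left(\mathcal{L}_{auto}^{pd} + \mathcal{L}_{auto}^{en}\right)
  + \lambda_{bt}\left(\mathcal{L}_{bt}^{pd \to en \to pd} + \mathcal{L}_{bt}^{en \to pd \to en}\right)
  + \lambda_{adv}\,\mathcal{L}_{adv}
```

Here, L_auto is the denoising auto-encoding loss from step 2, L_bt is the back-translation reconstruction loss from step 3, and L_adv is the adversarial loss that pushes the encoder to fool the discriminator.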

The discriminator loss is minimized in parallel.

We trained for 8 epochs on a V100 GPU (approximately 3 days). To select the best model, we evaluated on a parallel test set of sentences.

Results

The best-performing model achieves a BLEU score of 7.93 from Pidgin to English and 5.18 from English to Pidgin on a test set of 2,101 sentence pairs.
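
For context, the BLEU scores can be computed with a standard tool such as sacreBLEU; a minimal sketch, assuming plain-text files of model outputs and reference translations (the file names are illustrative):

```python
# Corpus-level BLEU over the test set with sacreBLEU. One sentence per line,
# line i of the hypothesis file matching line i of the reference file.
import sacrebleu

with open("model_outputs.en", encoding="utf-8") as h, open("references.en", encoding="utf-8") as r:
    hyps = [line.strip() for line in h]
    refs = [line.strip() for line in r]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.2f}")
```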

Below are some translations made by our model.

NB: [pd — Pidgin, en — English]

[Image: translations from the model]

From these results, we can see that the language model helps the system produce translations that are not merely word-for-word but also grammatically correct, as in translation 2.

The code, the trained translation model, aligned Pidgin word vectors and more translations by the model can be found in the project’s GitHub repository — https://github.com/keleog/PidginUNMT

Conclusion

There has been some interesting NLP work done on African languages; however, it barely scratches the surface in comparison to more popular languages. This is, to the best of our knowledge, the first NLP work performed on West African Pidgin. We hope that this work spurs the exploration of more unsupervised research for African languages.

Acknowledgements

  • Special thanks to my amazing colleagues at Instadeep for constant support and constructive feedback throughout the project.
  • Thanks to Naijalingo.com for permission to scrape their website to create a Pidgin to English dictionary.
  • Thanks to Deepquest AI and AI Saturdays Lagos for computation support.
  • The test set was obtained from the JW300 dataset [6] and preprocessed by the Masakhane group.

References

1. Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In ACL.

2. Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In ICLR.

3. Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In ICLR.

4. Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. In EMNLP.

5. Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion. In EMNLP.

6. Željko Agić and Ivan Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In ACL.

7. Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC.

Update: The BLEU scores were updated on 7/11/2019 after final testing of the model on more parallel sentences.
