

Introduction
In this article, we will build a machine translation model to translate sentences from Yorùbá to English. These sentences come from various sources such as news articles, social media conversations, spoken transcripts, and books written purely in the Yorùbá language.
Machine translation for low-resource languages is still quite rare, and it is hard to get accurate results because of the limited amount of training data available for these languages. There is a dataset available for Yorùbá text (JW300), but it covers only the religious domain. We need a model that generalizes across multiple domains; this is where ai4d.ai comes in with more general-purpose data, and all we have to do is train our model on this data and produce accurate results to secure a top position in the AI4D Yorùbá Machine Translation Challenge on Zindi.
In this project, we will be using a Helsinki NLP model, so let's first talk about the organization behind it.
Helsinki NLP
The Helsinki NLP models were trained by the Language Technology Research Group at the University of Helsinki, which is on a mission to provide machine translation for all human languages. The group also studies and develops tools for processing human language, including automatic spelling and grammar checking, machine translation, and ASR (automatic speech recognition); for more information, visit Language Technology. These models are publicly available on Hugging Face and GitHub.
Importing Essential Libraries
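A minimal set of imports covering the steps in this article might look like the following; the exact list is an assumption on my part, based on the libraries referenced later (pandas, PyTorch, transformers, tqdm):

```python
import re                      # regular expressions for text cleaning
import pandas as pd            # reading the Train/Test CSV files
import torch                   # GPU tensors and the training loop
from tqdm import tqdm          # progress bars
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM  # Helsinki NLP model classes

tqdm.pandas()                  # enables DataFrame.progress_apply used later
```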
Selecting options
- To clean your text, set the Clean variable to True; for now, I am training on the data without cleaning.
- If you want to train the model, set the Train variable to True. Both flags are sketched below.
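A minimal sketch of these two flags; the variable names follow the description above:

```python
# Configuration flags, named as described above
Clean = False   # True -> apply the text-cleaning step before training
Train = True    # True -> fine-tune the model on the AI4D data
```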
Reading Translation Data
Reading the training data with pandas gives us an initial understanding of what our dataset looks like.
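A sketch of the loading step, assuming the competition file is named Train.csv (the actual filename on Zindi may differ):

```python
# Load the parallel Yorùbá-English training data from Zindi
train_df = pd.read_csv("Train.csv")

print(train_df.shape)   # expect (10054, 3): ID, Yoruba, English
train_df.head()
```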
The training data consists of 10,054 parallel Yorùbá-English sentence pairs across three columns: ID (a unique identifier), Yoruba (the text in the Yorùbá language), and English (the English translation of the Yorùbá text). The data looks fairly clean, and there are no missing values.
Cleaning Data
To help the model perform better, the cleaning step removes punctuation marks, converts the text to lowercase, and strips digits. For now, I am keeping this feature off, but I will use it in future experiments.
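The cleaning step could look roughly like the following; the function name and exact rules are my own sketch of what is described above, not the original code:

```python
import string

def clean_text(text):
    """Lowercase the text, then strip ASCII punctuation marks and digits."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

if Clean:
    train_df["Yoruba"] = train_df["Yoruba"].apply(clean_text)
    train_df["English"] = train_df["English"].apply(clean_text)
```

Only ASCII punctuation and digits are stripped here, so the Yorùbá tone marks and diacritics stay intact.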
Loading Tokenizer and Model
We will use one of the hidden gems of machine translation, a model trained on multiple languages including Yorùbá. The Helsinki NLP models are among the best in the machine translation domain, and they support transfer learning, which means we can take the same model with the same weights and fine-tune it on our new data to get the best results. We will be using the Helsinki-NLP/opus-mt-mul-en model from Hugging Face and fine-tune it on the more general-domain Yorùbá text provided by the Artificial Intelligence for Development-Africa Network (ai4d.ai).
The model is publicly available on Hugging Face, and it's quite easy to download and fine-tune using the transformers library. We will be using a GPU, and calling .to('cuda') moves the model onto it, as shown in the loading snippet below.
- source group: Multiple languages
- target group: English
- OPUS readme: mul-eng
- model: transformer
If you have a good internet connection, it won't take long to download and load the model.
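Loading the pretrained multilingual checkpoint and its tokenizer, then moving the model to the GPU; I use the generic Auto classes here, but the Marian-specific classes (MarianTokenizer, MarianMTModel) would work equally well:

```python
model_name = "Helsinki-NLP/opus-mt-mul-en"

tokenizer = AutoTokenizer.from_pretrained(model_name)       # SentencePiece-based tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)   # Marian transformer weights
model = model.to("cuda")                                    # move the model onto the GPU
```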
Downloading: 100% 1.15k/1.15k [00:00<00:00, 23.2kB/s]
Downloading: 100% 707k/707k [00:00<00:00, 1.03MB/s]
Downloading: 100% 791k/791k [00:00<00:00, 1.05MB/s]
Downloading: 100% 1.42M/1.42M [00:00<00:00, 2.00MB/s]
Downloading: 100% 44.0/44.0 [00:00<00:00, 1.56kB/s]
Downloading: 100% 310M/310M [00:27<00:00, 12.8MB/s]
Preparing Model for Training
Optimizer
We will be using the AdamW optimizer to make our model converge faster and produce better results, with a learning rate of 0.0001. I have experimented with different optimizers available in the PyTorch library, and AdamW works best for this problem. Through hyperparameter tuning, I got the best results at a learning rate of 0.0001.
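The optimizer setup is a one-liner; a sketch using the learning rate mentioned above:

```python
from torch.optim import AdamW

# AdamW with the learning rate found through hyperparameter tuning
optimizer = AdamW(model.parameters(), lr=1e-4)
```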
I have used the ekshusingh technique for fine-tuning the Helsinki NLP model. It is fast to train and requires only a few samples to produce good results.
After some hyperparameter tuning, the final parameters are 27 epochs and a batch size of 32.
The model_train() function first divides each batch into local_X and local_y. We use the tokenizer's prepare_seq2seq_batch function to convert the text into tokens that can be used as input to our model. Then we use gradient descent to reduce the loss and print the final loss.
Training
Training the model took about 30 minutes for 27 epochs, with each epoch taking approximately 38 seconds to run. The final loss is 0.0755, which is quite good; it's evident that the model trained well, but we still need to check it against the evaluation metric.
100% 125/125 [00:38<00:00, 3.84it/s]
Loss: tensor(0.0755, device='cuda:0', grad_fn=<DivBackward0>)
Testing model
Testing on a single sample from the test dataset.
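A sketch of translating a single sentence; the input string here is just a placeholder (the article tests an actual test-set sentence, and the test file itself is loaded in the prediction section below):

```python
# Placeholder Yorùbá sentence used only to illustrate the generate/decode calls
sample = "Báwo ni o ṣe wà?"

inputs = tokenizer(sample, return_tensors="pt", truncation=True).to("cuda")

model.eval()
with torch.no_grad():
    generated = model.generate(**inputs)   # generate the translation token ids

# Without skip_special_tokens the decoded output still contains special tokens such as <pad>
print(tokenizer.batch_decode(generated))
```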
The model performed quite well, and the sentence makes sense. We still need to remove the brackets, the <pad> token, and other unnecessary punctuation marks to further clean the output.
Using re.sub and str.replace, we have removed <pad>, ', [, and ] from our text.
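A rough version of that post-processing step (the helper name is mine):

```python
def clean_prediction(text):
    """Strip the <pad> token and list artefacts from the decoded output."""
    text = re.sub(r"<pad>", "", text)       # remove the padding token
    for ch in ["'", "[", "]"]:
        text = text.replace(ch, "")         # remove quotes and square brackets
    return text.strip()

print(clean_prediction(str(tokenizer.batch_decode(generated))))
```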
The final text looks clean, and it's almost perfect for an initial evaluation.
Prediction on Test Dataset
Let's generate predictions for the test dataset and check how well our model performs. We will load the test dataset, which you can access from the Zindi platform.
We generate the predictions and batch-decode the output tensors into text. Using .progress_apply, we translate all the Yorùbá text in the test dataset and store the results in a new column named Label.
Finally, we clean our predicted translations; both steps are sketched below.
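A sketch of both steps, assuming the test file is named Test.csv and the submission expects ID and Label columns (the helper translate_text is hypothetical):

```python
test_df = pd.read_csv("Test.csv")           # test set downloaded from Zindi; filename assumed

def translate_text(text):
    """Translate one Yorùbá sentence to English with the fine-tuned model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to("cuda")
    with torch.no_grad():
        generated = model.generate(**inputs)
    return str(tokenizer.batch_decode(generated))   # raw output, still contains <pad> etc.

# .progress_apply shows a tqdm progress bar while every row is translated
test_df["Label"] = test_df["Yoruba"].progress_apply(translate_text)

# Clean the raw outputs with the same helper used earlier
test_df["Label"] = test_df["Label"].apply(clean_prediction)

# Write the submission file for upload to Zindi (filename assumed)
test_df[["ID", "Label"]].to_csv("submission.csv", index=False)
```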
You can see the final version of our test dataset, and it looks reasonably accurate. To check how our model performs on the test data, we will upload the submission file to the Zindi platform.
Metric
The competition uses the ROUGE score as its metric; the higher the score, the better your model performs. Our model's leaderboard score is 0.3025, which is not bad, and it will get you into the top 20.
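For a quick local sanity check, the rouge_score package can compute ROUGE between a reference and a prediction; note that the exact ROUGE variant Zindi uses for the leaderboard isn't specified here:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score("the cat sat on the mat",    # reference translation
                      "the cat is on the mat")     # model prediction
print(scores["rouge1"].fmeasure)                   # unigram-overlap F1
```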

Conclusion
Using the power of transfer learning, we have created a model that performs quite well on general-domain Yorùbá text. I have experimented with many publicly available models on Hugging Face, and the Helsinki NLP multilingual model opus-mt-mul-en performs best by far on our low-resource language. Machine translation research has seen a downturn because of Google Translate, but Google Translate does not cover many low-resource languages, and for some of those it does cover the translations are less accurate; fine-tuning your own machine translation model gives you the freedom to translate any low-resource language using transformers. In the end, this was my starter code, and with multiple experiments and text preprocessing I reached 14th place in the competition with a final ROUGE score of 0.35168.

You can find my model at kingabzpro/Helsinki-NLP-opus-yor-mul-en on Hugging Face.