To translate or not to translate, best practices in non-English sentiment analysis

In this article a recurrent neural network is built and trained on a diverse dataset of 50,000 Dutch reviews with positive/negative labels. Its performance is compared to the situation where the test set is translated to English and classified by a pretrained English sentiment model. Next, the model is tested without stop words and without the use of word embeddings. ConceptNet Numberbatch is examined more closely, as this multilingual word embedding outperforms other word embeddings.

Matts
Towards Data Science

--

Image from Wikimedia Commons

Summary
Translation is not a good option for sentiment analysis: it causes a drop in accuracy of 16%. For sentiment analysis, stop words and word embeddings are useful. It turns out the ConceptNet Numberbatch word embedding is superior to Word2vec/GloVe for Dutch natural language processing.

Translated versus recurrent neural networks
In this article the accuracy of a recurrent neural network on an independent test set is compared to the performance of TextBlob on the English Google translation of that test set. The Dutch model is trained on a dataset of almost 50,000 varied reviews of hotels, shopping products, food and services, each labeled positive or negative. After training, the model is used to predict the labels of the independent test set. The translated approach uses TextBlob, a widely used pretrained text analysis library in Python: it first translates the test set to English and then applies a pretrained sentiment analysis model. The results on the independent test set were as follows:

As you can see, the performance is more than 16% lower. The accuracy of TextBlob on a native English test set is around 87.5%, according to research done at Stanford. So even if the translation were perfect, the TextBlob model would already perform around 4% worse. The other 12% is due to the translation not being perfect. So in this case it would be advisable to collect a big native-language dataset.

But in smaller languages it can be difficult to acquire big datasets. Outside English and Chinese there are not many big standard datasets you can use, and web content is also less widely available. If you can manage to get a couple of hundred data points (but not 10,000 or more), then transfer learning could be an option. To apply transfer learning in this case, a big English dataset with positive and negative labels was translated to Dutch, and a model was trained on it. The weights of this model were then used to train further on a small part (1,000 reviews) of the Dutch dataset. This gave better results than the translated TextBlob model:
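The weight-reuse step can be sketched as follows. This is a minimal illustration, not the article's actual network: a toy dense model and random data stand in for the real recurrent network and the review datasets, and all sizes are assumptions.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

np.random.seed(0)

def build_model():
    # Toy stand-in for the real recurrent network; the sizes are assumptions.
    return Sequential([
        Input(shape=(10,)),
        Dense(16, activation='relu'),
        Dense(1, activation='sigmoid'),
    ])

# Step 1: train on the big translated dataset (random toy data here).
big_x, big_y = np.random.rand(200, 10), np.random.randint(0, 2, 200)
source_model = build_model()
source_model.compile(optimizer='adam', loss='binary_crossentropy')
source_model.fit(big_x, big_y, epochs=1, verbose=0)

# Step 2: copy the learned weights into a fresh model with the same
# architecture and fine-tune it on the small Dutch set.
target_model = build_model()
target_model.set_weights(source_model.get_weights())
small_x, small_y = np.random.rand(50, 10), np.random.randint(0, 2, 50)
target_model.compile(optimizer='adam', loss='binary_crossentropy')
target_model.fit(small_x, small_y, epochs=1, verbose=0)

preds = target_model.predict(small_x[:2], verbose=0)
print(preds.shape)  # (2, 1)
```

Because the fine-tuned model starts from weights that already encode sentiment, far fewer Dutch examples are needed than for training from scratch.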

Overview of the three methods in training and testing:

Without stop words
If the stop words are filtered out, the most common words in the dataset are good, tasty, hotel, book, quick and delivery. Some of these words are neutral, others carry a sense of sentiment. But it turns out the accuracy drops by more than 7%:

If the most common words in the unfiltered dataset are checked, it can be seen that the whole top 20 consists of stop words (which makes sense). The drop in performance then becomes obvious: the words “not” and “but” reverse the meaning of a sentence, and the words “too” and “also” change the meaning of a sentence as well. If the model can’t take these words into account, it can’t predict the reversed sentiment of a sentence or review.

So in this case, it’s better to keep the stop words in the dataset.

Without word embeddings
To refresh: word embeddings are representations of words as numbers (or tensors). For example, ‘the man walks the dog’ could be represented by a two-dimensional word embedding like this: [1.3, -0.2] [0.4, 1.1] [-0.3, 0.1] [1.3, -0.2] [1.2, 0.7]. So when a 2-dimensional embedding like the one above is used, every (unique) word is converted to a combination of two numbers. Word embeddings work so well because they capture the semantics of the words: words with the same meaning have similar tensor values, and the differences between word groups are similar as well. As you can see in the graph below, the performance with word embeddings is much higher.
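The example above boils down to a simple lookup table. A minimal sketch, using the toy 2-dimensional values from the text:

```python
# A toy 2-dimensional embedding for the example sentence, one entry per
# unique word (the values are the ones from the text above).
embedding = {
    'the':   [1.3, -0.2],
    'man':   [0.4, 1.1],
    'walks': [-0.3, 0.1],
    'dog':   [1.2, 0.7],
}

sentence = 'the man walks the dog'
vectors = [embedding[word] for word in sentence.split()]
print(vectors)
# [[1.3, -0.2], [0.4, 1.1], [-0.3, 0.1], [1.3, -0.2], [1.2, 0.7]]
```

Note that both occurrences of ‘the’ map to the same vector: the embedding is a property of the word, not of its position in the sentence.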

Conceptnet Numberbatch
ConceptNet Numberbatch is a multilingual set of word embeddings. It outperforms other well-known word embeddings such as Word2Vec and GloVe, as can be seen in this graph:

Graph from ConceptNet project

The word embeddings file can be downloaded here, under the download section. The file is built up with the following structure: /c/language_code/word 0.4 … 0.1. So, for example, dog would have the following structure: /c/en/dog 0.2 … 0.9. When you use Numberbatch, make sure you only load the needed language(s), to reduce redundancy. The vocabulary sizes of Numberbatch are truly impressive; see the top 15 languages below:

So how does the embedding look for Dutch? The words are represented in 300 dimensions. To compress this number to two, a t-SNE algorithm can be used. Eighteen words were tested in three categories: the Dutch translations of jacket, scarf, pants, shoes, socks and sweater in the clothing category (yellow below); tiger, eagle, spider, hawk, lion and hyena in the animal category (gray below); and crown, king, queen, prince, palace and coronation in the monarchy category (light blue below). As you can see, the words are clustered by similarity, which means this word embedding helps a model that needs to handle Dutch:
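The 300-to-2 compression step can be sketched with scikit-learn's t-SNE. Random vectors stand in here for the real Numberbatch embeddings of a few of the test words, which would normally be looked up in the embedding file:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-ins for the 300-dimensional Numberbatch vectors of some of the
# tested Dutch words; real vectors would come from the embedding file.
words = ['jas', 'sjaal', 'broek', 'tijger', 'adelaar', 'spin',
         'kroon', 'koning', 'koningin']
vectors = rng.normal(size=(len(words), 300))

# Compress 300 dimensions down to 2 for plotting; perplexity must be
# smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)
print(coords.shape)  # (9, 2)
```

Each word then gets an (x, y) coordinate that can be plotted and colored by category, as in the graph below.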

Finally, here is some code showing how to implement Numberbatch. The code in the section below opens the Numberbatch word embedding file (make sure the file only contains the part for your language; for Dutch that is /c/nl/). It creates a dictionary with all Dutch words and their Numberbatch word embedding representation in 300 dimensions. So, for example, the word dog will be saved like { ‘dog’: [0.2, 0.1, 0.5 … 0.3]}:
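The original code appeared as a screenshot; a minimal sketch of such a loader is below. The tiny in-memory sample and its 3-dimensional vectors are stand-ins for illustration, and the real file would be opened with `open(...)` instead of `io.StringIO`:

```python
import io

# Tiny stand-in for the Numberbatch file; real entries have 300 dimensions
# and the real file would be opened with open('numberbatch-nl.txt', encoding='utf-8').
sample = """/c/nl/hond 0.2 0.1 0.5
/c/nl/kat 0.3 0.0 0.4
"""

embeddings = {}
with io.StringIO(sample) as f:
    for line in f:
        parts = line.rstrip().split(' ')
        word = parts[0].split('/')[-1]          # '/c/nl/hond' -> 'hond'
        embeddings[word] = [float(x) for x in parts[1:]]

print(embeddings['hond'])  # [0.2, 0.1, 0.5]
```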

This part tokenizes your dataset. The code assigns a number to each unique word and counts all the unique words. Next, a dictionary with the words that occur in the training set is built, and the script adds the 300-dimensional embedding for each of them:
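This step can be sketched as follows, assuming an `embeddings` dictionary like the one above (tiny 3-dimensional vectors stand in for the real 300-dimensional ones, and the two-sentence corpus is made up for illustration):

```python
# Stand-in for the Numberbatch dictionary loaded earlier; 3 dimensions
# here instead of the real 300.
embeddings = {'hond': [0.2, 0.1, 0.5], 'loopt': [0.1, 0.9, 0.0]}

corpus = ['de hond loopt', 'de hond']

# Assign an integer id to every unique word (id 0 is reserved for padding).
word_index = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1

vocab_size = len(word_index) + 1
dim = 3  # 300 for real Numberbatch vectors

# Row i of the matrix holds the embedding of the word with id i;
# words missing from Numberbatch keep an all-zero row.
embedding_matrix = [[0.0] * dim for _ in range(vocab_size)]
for word, idx in word_index.items():
    if word in embeddings:
        embedding_matrix[idx] = embeddings[word]

# The dataset itself becomes sequences of word ids.
sequences = [[word_index[w] for w in s.split()] for s in corpus]
print(sequences)  # [[1, 2, 3], [1, 2]]
```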

Finally, make the first layer of your Keras network an embedding layer. This makes sure the input sentences are converted to their Numberbatch tensors:
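A minimal sketch of such a layer, with a toy 4-word vocabulary and 3-dimensional vectors standing in for the real matrix (with Numberbatch, the matrix would have one 300-dimensional row per word in the tokenizer index):

```python
import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

embedding_matrix = np.array([
    [0.0, 0.0, 0.0],   # row 0: padding
    [0.2, 0.1, 0.5],
    [0.3, 0.0, 0.4],
    [0.1, 0.9, 0.0],
])

layer = Embedding(input_dim=4, output_dim=3,
                  embeddings_initializer=Constant(embedding_matrix),
                  trainable=False)  # keep the pretrained vectors fixed

# A batch with one 'sentence' of two word ids; each id becomes its vector.
vectors = layer(np.array([[1, 2]])).numpy()
print(vectors.shape)  # (1, 2, 3)
```

Setting `trainable=False` keeps the Numberbatch vectors fixed during training; the rest of the network (for example the recurrent layers) then learns on top of them.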

Conclusion
If you are able to collect a big dataset in the native language, then a model trained on this dataset will probably give the best classification accuracy. If you can only get a couple of hundred data points, transfer learning can still give pretty good results. Translating the data and then using an English-trained model is not recommended: besides the lower accuracy, it also worsens the performance of your application, as the data first needs to be translated. As for stop words and word embeddings, in this case they were both useful. ConceptNet Numberbatch is a really good word embedding for non-English natural language processing tasks.

Next up 2021
I will write an article about neural style transfer. To already see the application resulting from that article, view the following page (in Dutch):

Schilderij laten maken

See you then!
