
Arabic Sentiment Analysis

An illustrative guide on how to perform sentiment analysis on Arabic texts

Photo by Mutia Rahmah on Unsplash

Identifying and categorizing opinions expressed in a piece of text, otherwise known as sentiment analysis, is one of the most commonly performed tasks in NLP. Arabic, despite being one of the most widely spoken languages in the world, receives little attention when it comes to sentiment analysis. This article is therefore dedicated to implementing Arabic Sentiment Analysis (ASA) in Python.

Overview

  1. The dataset
  2. Library import and data exploration
  3. Text pre-processing
  4. Sentiment Analysis with different ML algorithms
  5. Conclusion
  6. References

The Data

The dataset used in this article is made up of 1,800 tweets, each labelled as either positive or negative. It can be found here

Library import and data exploration

Figure 1

Figure 2

The two classes are very well balanced.
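As a minimal sketch of this kind of exploration, the snippet below builds a tiny stand-in DataFrame and counts the rows per class with pandas. The column names Feed and Sentiment follow the ones used later in this article; the example tweets, the label strings, and the use of an in-memory frame instead of the real file are assumptions made purely for illustration.

```python
import pandas as pd

# Toy stand-in for the tweet dataset (illustrative rows, not the real data).
# The real file would be loaded with a pandas reader instead.
df = pd.DataFrame({
    "Feed": [
        "خدمة ممتازة وسريعة",
        "خدمة سيئة وبطيئة",
        "تجربة رائعة جدا",
        "لم يعجبني هذا المنتج",
    ],
    "Sentiment": ["Positive", "Negative", "Positive", "Negative"],
})

# Basic shape and a first look at the rows.
print(df.shape)
print(df.head())

# Number of tweets per class, used to check how balanced the labels are.
class_counts = df["Sentiment"].value_counts()
print(class_counts)
```

A roughly equal count per label is what "balanced" means here; on the real data the same `value_counts()` call would show the positive and negative totals.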

Text pre-processing

As someone who is used to working with English texts, I initially found it difficult to carry over the preprocessing steps routinely used for English to Arabic. Luckily, I later came across a GitHub repository with code for cleaning Arabic text. The steps basically involve removing punctuation, Arabic diacritics (short vowels and other harakat), elongation, and stopwords (a list of which is available in the NLTK corpus).
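The cleaning steps listed above can be sketched roughly as follows. This is not the code from the repository mentioned above, only a small approximation using the standard library: the function name `clean_text`, the Unicode ranges chosen, the rule for collapsing repeated letters, and the tiny stopword set are all assumptions. In practice the stopword set could be replaced by the Arabic list from NLTK (`nltk.corpus.stopwords.words("arabic")`, which requires downloading the corpus first).

```python
import re
import string

# Arabic diacritics: fathatan through sukun (U+064B-U+0652) plus superscript alef.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Tatweel (kashida), the character used to stretch words visually.
TATWEEL = "\u0640"
# An Arabic letter repeated three or more times is treated as elongation (heuristic).
REPEATS = re.compile(r"([\u0621-\u064A])\1{2,}")
# ASCII punctuation plus a few common Arabic punctuation marks.
PUNCT = re.compile("[" + re.escape(string.punctuation + "،؛؟«»") + "]")
# Tiny illustrative stopword set, standing in for a full stopword list.
SAMPLE_STOPWORDS = {"في", "من", "على", "عن"}


def clean_text(text, stopwords=SAMPLE_STOPWORDS):
    """Remove diacritics, elongation, punctuation and stopwords from Arabic text."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = REPEATS.sub(r"\1", text)   # collapse long runs to a single letter
    text = PUNCT.sub(" ", text)       # replace punctuation with a space
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


# The escapes insert a shadda (U+0651) and three tatweel characters (U+0640).
sample = "الجو\u0651 جمي\u0640\u0640\u0640ل جدااااا في المدينة!"
print(clean_text(sample))
```

Collapsing repeated letters is a blunt heuristic: it only fires on runs of three or more, so ordinary doubled letters are left alone.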

Preprocessed data

Sentiment Analysis with different techniques

Many different feature extraction techniques can be used for sentiment analysis (SA), but for the sake of simplicity, I’ll only demonstrate word vectorization (i.e., tf-idf) here. As with any supervised learning task, the data is first divided into features (the Feed column) and labels (the Sentiment column). Next, the data is split into train and test sets, and different classifiers are fitted, starting with Logistic Regression.
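The split-then-vectorize flow described above might look like the sketch below, using scikit-learn's `train_test_split` and `TfidfVectorizer`. The toy rows, the 25% test size, and the random seed are assumptions for illustration and are not taken from the original analysis.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in data; the real dataset has 1,800 labelled tweets.
df = pd.DataFrame({
    "Feed": [
        "خدمة ممتازة وسريعة", "أحببت هذا المنتج كثيرا",
        "تجربة رائعة جدا", "الفريق متعاون ولطيف",
        "خدمة سيئة وبطيئة", "لم يعجبني هذا المنتج",
        "تجربة مخيبة للآمال", "الفريق غير متعاون",
    ],
    "Sentiment": ["Positive"] * 4 + ["Negative"] * 4,
})

X = df["Feed"]        # features: the tweet text
y = df["Sentiment"]   # labels: positive / negative

# Hold out a quarter of the rows, keeping the class ratio the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Learn the vocabulary and idf weights on the training set only,
# then reuse them to transform the test set.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Fitting the vectorizer on the training split alone keeps information from the test tweets out of the learned vocabulary and weights.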

Logistic Regression

Logistic Regression is a very common classification algorithm. It is simple to implement and can serve as a baseline for classification tasks. To keep the code short, I use the Pipeline class in scikit-learn, which chains vectorization, tf-idf transformation, and classification into a single estimator, together with grid search for hyperparameter tuning. You can read more about grid search in the official documentation here
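A rough sketch of such a pipeline is shown below: `CountVectorizer`, `TfidfTransformer`, and `LogisticRegression` are chained in a `Pipeline`, and `GridSearchCV` tunes it. The step names, the parameter grid, the two-fold cross-validation, and the toy data are assumptions chosen so the example runs quickly; they are not the settings behind the results reported in this article.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in data (illustrative only).
texts = [
    "خدمة ممتازة وسريعة", "أحببت هذا المنتج كثيرا",
    "تجربة رائعة جدا", "الفريق متعاون ولطيف",
    "خدمة سيئة وبطيئة", "لم يعجبني هذا المنتج",
    "تجربة مخيبة للآمال", "الفريق غير متعاون",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

# Vectorization -> tf-idf weighting -> classifier, as one estimator.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=2 only because the toy set is tiny; a larger value suits real data.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
print(search.predict(["منتج رائعة"]))
```

Because the vectorizer sits inside the pipeline, each cross-validation fold refits it on that fold's training portion only.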

An accuracy of 84% was achieved.

Random Forest Classifier

Naive Bayes Classifier (Multinomial)

Support Vector Machine
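As a hedged sketch, the random forest, multinomial Naive Bayes, and linear support vector machine classifiers can each be dropped into the same tf-idf pipeline and scored on a held-out set. The toy data, the split, and the default hyperparameters below are assumptions for illustration, so the printed accuracies say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data (illustrative only).
texts = [
    "خدمة ممتازة وسريعة", "أحببت هذا المنتج كثيرا",
    "تجربة رائعة جدا", "الفريق متعاون ولطيف",
    "خدمة سيئة وبطيئة", "لم يعجبني هذا المنتج",
    "تجربة مخيبة للآمال", "الفريق غير متعاون",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

classifiers = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

scores = {}
for name, clf in classifiers.items():
    # Same tf-idf features for every model; only the final step changes.
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Keeping the feature extraction fixed and changing only the last pipeline step makes the comparison between classifiers like-for-like.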

Conclusion

This article demonstrated the steps involved in Arabic sentiment analysis. The major difference between Arabic and English NLP lies in the pre-processing step. All the fitted classifiers gave accuracy scores ranging from 84% to 85%: Naive Bayes, logistic regression, and random forest each reached 84%, while the linear support vector machine improved on this by one percentage point. The models could be improved further by applying techniques such as word embeddings and recurrent neural networks, which I will try to implement in a follow-up article.

References

Multi-Class Text Classification Model Comparison and Selection

https://github.com/motazsaad/process-arabic-text/blob/master/clean_arabic_text.py

