
Identifying and categorizing the opinions expressed in a piece of text, otherwise known as sentiment analysis, is one of the most common tasks in NLP. Arabic, despite being one of the most widely spoken languages in the world, receives comparatively little attention when it comes to sentiment analysis. This article is therefore dedicated to implementing Arabic Sentiment Analysis (ASA) in Python.
Overview
- The dataset
- Library import and data exploration
- Text pre-processing
- Sentiment Analysis with different techniques
- Conclusion
- References
The Data
The dataset used in this article is made up of 1,800 tweets, each labelled as either positive or negative. It can be found here.
Library import and data exploration



The classes are very well balanced here.
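A minimal sketch of this kind of check, assuming the tweets sit in a pandas DataFrame with the Feed and Sentiment columns used later in this article. The four rows and the label strings "Positive" and "Negative" are made-up stand-ins, since the dataset file itself is not reproduced here:

```python
import pandas as pd

# Toy stand-in for the tweet dataset (hypothetical rows and label strings).
# With the real file, something like pd.read_csv(...) would replace this.
df = pd.DataFrame({
    "Feed": [
        "هذا الفيلم رائع جدا",
        "خدمة ممتازة وسريعة",
        "هذا الفيلم سيء جدا",
        "خدمة سيئة وبطيئة",
    ],
    "Sentiment": ["Positive", "Positive", "Negative", "Negative"],
})

# Basic exploration: size, first rows, and number of tweets per class
print(df.shape)
print(df.head())

counts = df["Sentiment"].value_counts()
print(counts)
```

Roughly equal counts per label are what "balanced" means here; a strong skew would call for stratified splitting or resampling.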
Text pre-processing
As someone who is used to working with English text, I initially found it difficult to carry the preprocessing steps routinely used for English over to Arabic. Luckily, I later came across a GitHub repository with code for cleaning Arabic text. The steps basically involve removing punctuation, Arabic diacritics (short vowels and other harakat), elongation, and stopwords (a list of which is available in the NLTK corpus).
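The cleaning steps listed above can be sketched roughly as follows. This is my own minimal approximation, not the code from the referenced repository: the function name `clean_arabic`, the punctuation list, the rule for collapsing repeated letters, and the tiny stopword set are all assumptions made for illustration. In practice the stopword set could be replaced with `stopwords.words("arabic")` from NLTK.

```python
import re
import string

# Arabic diacritics (tanween, short vowels, shadda, sukun): U+064B to U+0652
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Tatweel (kashida), the character used to stretch words visually
TATWEEL = "\u0640"
# ASCII punctuation plus a few common Arabic punctuation marks
ARABIC_PUNCT = "،؛؟«»"
PUNCT_TABLE = str.maketrans("", "", string.punctuation + ARABIC_PUNCT)

# Tiny illustrative stopword set (an assumption, not the NLTK list)
STOPWORDS = {"في", "من", "على", "هذا"}


def clean_arabic(text, stopwords=STOPWORDS):
    """Remove diacritics, elongation, punctuation, and stopwords."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    # Collapse any character repeated three or more times into one
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


print(clean_arabic("هذا الفيلم جميـــل جدا!"))
```

Collapsing only runs of three or more characters is a deliberate choice in this sketch, so that legitimately doubled letters are left alone.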

Sentiment Analysis with different techniques
The aim of this article is to demonstrate how different information extraction techniques can be used for sentiment analysis (SA). For the sake of simplicity, however, I’ll only demonstrate word vectorization (i.e., TF-IDF) here. As with any supervised learning task, the data is first divided into features (Feed) and labels (Sentiment). Next, the data is split into train and test sets, and different classifiers are fitted, starting with Logistic Regression.
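As a rough illustration of the feature/label split, the train/test split, and TF-IDF vectorization, here is a small sketch on made-up data. The eight sentences, the label strings, the 25% test size, and the random seed are all assumptions; only the column names Feed and Sentiment come from the dataset described above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the cleaned tweets
df = pd.DataFrame({
    "Feed": [
        "هذا الفيلم رائع جدا", "أحب هذا المنتج كثيرا",
        "خدمة ممتازة وسريعة", "تجربة جميلة ومفيدة",
        "هذا الفيلم سيء جدا", "أكره هذا المنتج كثيرا",
        "خدمة سيئة وبطيئة", "تجربة مزعجة ومملة",
    ],
    "Sentiment": ["Positive"] * 4 + ["Negative"] * 4,
})

X = df["Feed"]       # features: raw text
y = df["Sentiment"]  # labels

# Hold out part of the data for testing, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Learn the vocabulary and IDF weights on the training set only,
# then apply the same transformation to the test set
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Fitting the vectorizer on the training split alone keeps information from the test tweets out of the learned vocabulary.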
Logistic Regression
Logistic Regression is a very common classification algorithm. It is simple to implement and can serve as a baseline for classification tasks. To keep the code short, the Pipeline class in scikit-learn is used to chain vectorization, transformation, and classification, with grid search used to tune the hyperparameters. You can read more about grid search in the official documentation here.
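A sketch of how such a pipeline might look, under stated assumptions: the step names (`vect`, `tfidf`, `clf`), the parameter grid, the three-fold cross-validation, and the toy sentences are illustrative choices of mine rather than the settings used to obtain the results in this article.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the cleaned tweets
texts = [
    "هذا الفيلم رائع جدا", "أحب هذا المنتج كثيرا", "خدمة ممتازة وسريعة",
    "تجربة جميلة ومفيدة", "الطعام لذيذ والمكان جميل", "أنا سعيد جدا اليوم",
    "هذا الفيلم سيء جدا", "أكره هذا المنتج كثيرا", "خدمة سيئة وبطيئة",
    "تجربة مزعجة ومملة", "الطعام سيء والمكان قذر", "أنا حزين جدا اليوم",
]
labels = ["Positive"] * 6 + ["Negative"] * 6

# Vectorization -> TF-IDF transformation -> classification
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid search tries every combination of these values with cross-validation
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
prediction = search.predict(["المنتج رائع"])
print(prediction)
```

On toy data like this the scores are not meaningful; the accuracy figures reported in this article come from the full tweet dataset.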

An accuracy of 84% was achieved.
Random Forest Classifier
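A minimal sketch of the same idea with a random forest, assuming TF-IDF features and a made-up toy dataset. The number of trees and the random seed are arbitrary illustrative values, not tuned settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned tweets
texts = [
    "هذا الفيلم رائع جدا", "أحب هذا المنتج كثيرا",
    "خدمة ممتازة وسريعة", "تجربة جميلة ومفيدة",
    "هذا الفيلم سيء جدا", "أكره هذا المنتج كثيرا",
    "خدمة سيئة وبطيئة", "تجربة مزعجة ومملة",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

# TF-IDF features fed into an ensemble of 100 decision trees
rf_model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
rf_model.fit(texts, labels)

rf_preds = rf_model.predict(["خدمة ممتازة", "تجربة مملة"])
print(rf_preds)
```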

Naive Bayes Classifier (Multinomial)
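Multinomial Naive Bayes works directly on non-negative term weights, so it can be placed after the TF-IDF step unchanged. The sketch below assumes the default smoothing (`alpha=1.0`) and a made-up toy dataset; it also prints class probabilities, which this model provides through `predict_proba`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned tweets
texts = [
    "هذا الفيلم رائع جدا", "أحب هذا المنتج كثيرا",
    "خدمة ممتازة وسريعة", "تجربة جميلة ومفيدة",
    "هذا الفيلم سيء جدا", "أكره هذا المنتج كثيرا",
    "خدمة سيئة وبطيئة", "تجربة مزعجة ومملة",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

nb_model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=1.0))
nb_model.fit(texts, labels)

# One row per input text, one column per class (ordered as in classes_)
nb_proba = nb_model.predict_proba(["المنتج رائع", "الفيلم سيء"])
print(nb_model.classes_)
print(nb_proba)
```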

Support Vector Machine
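A sketch of a linear support vector machine on TF-IDF features, assuming scikit-learn's `LinearSVC` with its default regularization strength (`C=1.0`) and a made-up toy dataset. The signed scores from `decision_function` show how far each text falls from the separating hyperplane.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the cleaned tweets
texts = [
    "هذا الفيلم رائع جدا", "أحب هذا المنتج كثيرا",
    "خدمة ممتازة وسريعة", "تجربة جميلة ومفيدة",
    "هذا الفيلم سيء جدا", "أكره هذا المنتج كثيرا",
    "خدمة سيئة وبطيئة", "تجربة مزعجة ومملة",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

svm_model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
svm_model.fit(texts, labels)

new_texts = ["خدمة ممتازة", "تجربة مملة"]
svm_preds = svm_model.predict(new_texts)
# Positive scores map to classes_[1], negative scores to classes_[0]
svm_scores = svm_model.decision_function(new_texts)
print(svm_preds)
print(svm_scores)
```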

Conclusion
This article demonstrates the steps involved in Arabic sentiment analysis. In this workflow, the major difference between Arabic and English NLP lies in the pre-processing step. All the fitted classifiers gave impressive accuracy scores, ranging from 84% to 85%. Naive Bayes, logistic regression, and random forest each gave 84% accuracy, while the linear support vector machine improved on this by one percentage point. The models could be improved further by applying techniques such as word embeddings and recurrent neural networks, which I will try to implement in a follow-up article.
References
Multi-Class Text Classification Model Comparison and Selection
https://github.com/motazsaad/process-arabic-text/blob/master/clean_arabic_text.py