Sentiment Analysis for Low Resourced Languages on Social Media

An NLP Journey into a Genuine Linguistic Mess

Taha Tobaili
Towards Data Science

--

The world was recently taken by the social media storm, which gave every individual connected to the internet the privilege of expressing their opinion publicly, at any time and from anywhere on earth. The resulting opinionated texts prevalent on the internet gave data science research and industry an opportunity to mine this data for sentiment analysis: a quick, automated way to gauge the opinion of the masses towards products, news, events, and policies, in which each piece of text is classified as positive, negative, or neutral.

However, the complexity of classifying words into sentiment classes differs from one language to another. It relies heavily on the availability of linguistic resources and the natural sparsity¹ of the language. For example, the abundance of digital dictionaries, corpora, and labelled datasets advanced the science of sentiment analysis for English quite well, but not as well or as quickly for lower-resourced languages such as spoken dialectal Arabic², the broader domain of my PhD research.

Lexical Sparsity

Sparsity is the lexical magnitude of a language. The more forms the words of a language take, the higher the sparsity of that language:

Sparsity = ∑ word x forms[word]

In English, the inflectional forms enjoyed and enjoying and the derivational form enjoyable can easily be mapped to the positive sentiment word enjoy. In morphologically rich languages, however, inflection is not limited to tense and number but extends to subjects, objects, pronouns, clitics, and gender as well. A simple example is the word love حبّ in Arabic; it can be inflected as I-love-you (masculine) احبكَ, I-love-you (feminine) احبكِ, I-love-you (plural) احبكم, we-loved-them احببناهم, we-will-love-you (plural) سنحبكم, reaching over 100 inflections on this front. On another front, the vocabulary derived from a word is built with prefixes, infixes, suffixes, and diacritics, such that affection محبة, endearment تحابب, desirable محبب, preferable مستحب, amicable متحاب, and favourable محبذ all originate from the same word love.
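As a rough illustration of the sparsity formula above, the sketch below sums the number of forms recorded for each base word. The `forms` dictionary is a hypothetical toy lexicon that holds only the examples mentioned in this paragraph; it is not a real resource.

```python
# Toy lexicon: base word -> known inflectional and derivational forms.
# Hypothetical data, limited to the examples given in the text.
forms = {
    "enjoy": ["enjoyed", "enjoying", "enjoyable"],
    "حبّ": ["احبكَ", "احبكِ", "احبكم", "احببناهم", "سنحبكم"],
}


def sparsity(forms_by_word):
    """Sum the number of forms over all words in the lexicon."""
    return sum(len(variants) for variants in forms_by_word.values())


print(sparsity(forms))
```

With a realistic lexicon the Arabic entry alone would carry over 100 inflections, so the sum grows far faster than the number of base words.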

Background from myfreetextures

“Befriend a friend who is honest in his friendship, for the sincerity of a friendship is in an honest friend” — Unknown. (All nouns stem from صدق honesty).

The derivational morphology of Arabic works in such beautiful harmony with tri-literal words that it is simply too complex to be sorted out by a stemmer, lemmatiser, or word segmenter. The word استنكار (the act of denying), for instance, could be mapped to eleven words of different meanings: استنار انار ستر ستار سار سكر كر تنك تنكر نكر نار.

The lexical sparsity of a morphologically rich language becomes very large if there is no consistent orthography, since each form of a word may be spelled differently. Inconsistent orthography is common for languages that are solely spoken: when transcribed, there is no standard spelling to follow, so each individual writes their tongue differently. This is the case for dialectal Arabic, the spoken Arabic particular to each region of its 420 million native speakers. Arabic dialects differ across regions in word choice, interpretation, morphology, pronunciation, and speech tempo, as well as in the words borrowed from the foreign languages that influenced the regions throughout history, such as English, French, Spanish, Italian, and Turkish.

Jean-Baptiste Hilair, Yeni Camii and The Port of İstanbul, Late 18th Century

Arabizi

Dialectal Arabic was known to be a spoken-only language until the rise of digital communication³ some 30 years ago. At that time a new variety of Arabic was born: Arabizi, a portmanteau of Araby and Englizi and the main focus of my PhD research. Arabizi is a very informal transcription of spoken Dialectal Arabic in Latin script. Dialectal Arabic already lacks a consensus in orthography, and on top of that Arabs started to map their phonemes heuristically onto the script of a different language. This widened the orthographic inconsistency even further because:

  1. A word is transcribed based on how it is pronounced, hence different pronunciations of the same word are transcribed differently. For example, the one-letter vowel phoneme ي /yā/ in the positive word خير fine is pronounced as /āy/ khāyr or /eh/ kher; it is therefore common to see both khayr and kher transcriptions for the same word.
  2. There is no consistent orthography for transcribing distinct vowel phonemes. The pronunciations khāyr and kher of the same word could be transcribed as kher, kheir, khair, kheyr, khayr, or even khyr, with most or all of the vowel letters omitted.
  3. There is little consistency in how distinct consonant phonemes are transcribed, such as the gutturals ح Ḥā’, خ Khā’, ع ᶜayn, and غ Ghayn, which are articulated in the post-velar areas of the oral cavity, and the glottal stop ء Hamzah. For example, the خ Khā’ of the same word خير khāyr is standardised to some extent in some regions as the compound letters kh, the numeral 5, or even 7’. This immediately triples the number of orthographies for the word خير Khāyr.

Therefore, the possible orthographies for the simple tri-literal base word خير Khāyr are: kher, kheir, khair, kheyr, khayr, khyr, 5er, 5eir, 5air, 5eyr, 5ayr, 5yr, 7’er, 7’eir, 7’air, 7’eyr, 7’ayr, 7’yr.
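The list above is simply the cross product of the three consonant spellings and the six vowel spellings. A minimal sketch of that enumeration follows; the variable and function names are mine and purely illustrative.

```python
from itertools import product

# Spellings of the consonant خ and of the remainder of خير,
# taken from the three observations listed above.
onsets = ["kh", "5", "7’"]
endings = ["er", "eir", "air", "eyr", "ayr", "yr"]


def orthographies(onsets, endings):
    """Combine every consonant spelling with every vowel spelling."""
    return [onset + ending for onset, ending in product(onsets, endings)]


variants = orthographies(onsets, endings)
print(len(variants))
print(", ".join(variants))
```

Each additional source of variation multiplies the count rather than adding to it, which is why the number of orthographies grows so quickly even for a three-letter word.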

It is by now apparent that the magnitude of the lexical sparsity is as large as the number of inflections for each word multiplied by the number of its possible orthographies.

Sparsity = ∑ word x (morphologic x orthographic) forms[word]

Codeswitching

Finally, the lexical sparsity is topped with codeswitching: mixing Arabizi with other Latin-script languages, mainly English or French. The Arab youth of several regions constantly switch between Arabic and French or English as they speak; since Arabizi is a transcription of that spoken language, it is not surprising to see this reflected in their texts. Mixing Arabizi with French is quite normal in Algeria, Morocco, and Tunisia⁴, and with English in Lebanon⁵ and Egypt⁶. Codeswitching may occur in a conversation where each part is in a different language, in a sentence where each word-clause is in a different language, or in a sentence with intermittent use of code-switched words. This can be demonstrated in one Facebook post from Lebanon:

Codeswitching in a Facebook conversation

This codeswitching implies that Arabizi is not a stand-alone language but is tied to other languages by its nature. Hence efforts to tackle Natural Language Processing (NLP) tasks for Arabizi, especially sentiment analysis, require integrating the English or French words and phrases used by Arab natives in social texting. This causes an overlap between two languages, which is another challenge for word disambiguation, as with bait (بيت home), damn (ضمن insure or including), fine (فيني I-can), kill (كل every), or insane (انساني humane).
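To make the overlap concrete, the sketch below flags tokens that are valid in both an English vocabulary and an Arabizi lexicon and therefore cannot be resolved without context. Both word lists are hypothetical toy sets built from the examples above; a real system would load full vocabularies.

```python
# Hypothetical toy vocabularies built from the examples in the text.
english_vocab = {"bait", "damn", "fine", "kill", "insane", "home", "every"}
arabizi_lexicon = {
    "bait": "بيت",
    "damn": "ضمن",
    "fine": "فيني",
    "kill": "كل",
    "insane": "انساني",
    "kher": "خير",
}


def ambiguous_tokens(tokens, english_vocab, arabizi_lexicon):
    """Return the tokens that exist in both vocabularies."""
    return [
        token
        for token in tokens
        if token.lower() in english_vocab and token.lower() in arabizi_lexicon
    ]


print(ambiguous_tokens(["kher", "fine", "kill"], english_vocab, arabizi_lexicon))
```

Detecting the overlap is the easy part; deciding which language a flagged token belongs to still needs the surrounding words.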

Thus the lexical sparsity becomes:

Sparsity = ∑ word x (morphologic x orthographic) forms[word] + (English and French) words
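The formula can be read as a small calculation. The sketch below assumes, as a simplification, that every inflection of a word has the same number of orthographic variants, so the two counts can be multiplied per word. The inflection count and the foreign word list are made up for illustration; only the 18 orthographies of خير come from the text.

```python
def codeswitched_sparsity(morphologic, orthographic, foreign_words):
    """Sum morphologic x orthographic forms per word, then add foreign words.

    morphologic / orthographic: dicts mapping a base word to a form count.
    foreign_words: set of English and French words used alongside Arabizi.
    """
    arabizi_forms = sum(
        morphologic[word] * orthographic[word] for word in morphologic
    )
    return arabizi_forms + len(foreign_words)


morphologic = {"khayr": 4}    # hypothetical number of inflections
orthographic = {"khayr": 18}  # the 18 spellings listed earlier
foreign_words = {"fine", "nice", "merci"}  # hypothetical code-switched words

print(codeswitched_sparsity(morphologic, orthographic, foreign_words))
```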

To find out how large this could reach, we needed to address one of my PhD research questions first:

How to map Arabizi sentiment words with their inflectional and orthographic variants?

Word Matching

Lest we forget, Arabizi is extremely low-resourced: we had no starting point, particularly for the Levant dialects. We created labelled datasets and trained a language identifier to compile an Arabizi corpus of 1M comments from code-switched Facebook data.

We trained word embeddings on the Facebook corpus, whereby each word in the corpus gets projected as a real numerical vector by a neural network architecture into a new Arabizi embedding space. We then combined the cosine vector similarity with a rule-based approach to discover the inflectional forms and their orthographic variants of input sentiment words from that embedding space.
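The sketch below shows the general idea of pairing cosine similarity with a rule, not the method used in this work. The three-dimensional vectors are made-up stand-ins for trained embeddings, the candidate spellings are hypothetical, and the rule (a shared consonant skeleton once the Latin vowel letters are dropped) is an assumption chosen only for illustration.

```python
import math
import re

# Made-up toy vectors standing in for trained Arabizi embeddings.
embeddings = {
    "i7tiram": [0.90, 0.10, 0.30],
    "e7tiram": [0.88, 0.12, 0.31],   # hypothetical orthographic variant
    "i7tirami": [0.85, 0.15, 0.28],  # hypothetical inflected form
    "merci": [0.80, 0.20, 0.35],     # similar context, unrelated word
    "ghabeh": [-0.70, 0.60, 0.10],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def skeleton(word):
    """Illustrative rule: drop Latin vowel letters, keep the consonants."""
    return re.sub(r"[aeiouy]", "", word.lower())


def match_forms(seed, embeddings, threshold=0.95):
    """Keep words that are close in the space and share the seed's skeleton."""
    seed_vec, seed_skel = embeddings[seed], skeleton(seed)
    return [
        word
        for word, vec in embeddings.items()
        if word != seed
        and cosine(seed_vec, vec) >= threshold
        and skeleton(word).startswith(seed_skel)
    ]


print(match_forms("i7tiram", embeddings))
```

In this toy space merci is close to the seed by cosine similarity alone, and the rule is what filters it out; the embedding proposes candidates and the rule checks that they are forms of the same word.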

Image by author using a neural nets background

We found inflectional and orthographic forms for almost every sentiment word we input, but to our surprise some words reached over 1K forms, such as the word i7tiram احترام respect:

1,069 forms matched the positive sentiment Arabizi word i7tiram (respect)

Sentiment Analysis

What do 1K+ forms for one word entail for sentiment analysis?

The fundamental techniques for sentiment analysis are unsupervised and supervised approaches⁷, known as lexicon-based and machine learning approaches respectively.

In a lexicon-based approach, each word in the input text is looked up in pre-defined lists of positive and negative words, or in a list of words with sentiment scores, with the intuition that the number of positive or negative words in a sentence dictates the sentiment of that sentence. This is true to a large extent; however, it fails on positive or negative sentences that lack sentiment words, as in sarcasm or multi-word expressions.
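A minimal sketch of the lexicon-based idea, assuming whitespace tokenisation and two tiny hand-made word lists built from Arabizi words that appear in this article. The lists are illustrative only and are not the released lexicons.

```python
# Tiny illustrative lexicons built from words used in this article.
positive = {"kher", "khayr", "i7tiram", "jamelik", "hadamtik"}
negative = {"ghabeh"}


def classify(text):
    """Count positive and negative lexicon hits and compare them."""
    tokens = text.lower().split()
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("da5l jamelik w hadamtik"))
```

The lookup is an exact string match, so any inflection or spelling missing from the lists is silently ignored.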

In a more intelligent approach, machine learning, an algorithm learns to classify the input text as positive, negative, or neutral from training data: sentences pre-labelled by human annotators as positive, negative, or neutral. The more training data the algorithm is given, the better it becomes at learning which patterns, whether words or word co-occurrences, lead to the correct sentiment class. This approach should perform better in classifying sentences that lack sentiment words, but creating large datasets is quite expensive in terms of time and cost.
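As one possible instance of the supervised approach, the sketch below trains a multinomial Naive Bayes classifier with add-one smoothing on a made-up four-sentence English training set. The classifier choice and the data are assumptions for illustration, not the models used in this research.

```python
import math
from collections import Counter, defaultdict

# Made-up toy training set; real work needs far more annotated sentences.
train = [
    ("i enjoy this phone", "positive"),
    ("what an enjoyable day", "positive"),
    ("this is a stupid idea", "negative"),
    ("i hate this phone", "negative"),
]


def train_nb(examples):
    """Count labels and per-label word frequencies."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in text.lower().split():
            word_counts[label][token] += 1
            vocab.add(token)
    return label_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_nb(train)
print(predict("enjoyable day", model))
print(predict("stupid phone", model))
```

A form absent from the training data, such as enjoying here, contributes nothing to the score, which is exactly where lexical sparsity hurts this approach.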

The very large magnitude of lexical sparsity defies both techniques:

How can we create a lexicon of sentiment words with all their forms? On the other hand, how large should the labelled datasets be to cover all the forms of sentiment words?

Anticipating the complexity of the challenge for both approaches, I decided to induce morphologically and orthographically rich sentiment lexicons for Arabizi as automatically as possible. We release the resulting publications and resources of this work on project-rbz.

Transliteration

Although Arabizi is written in Latin script, it is Arabic anyway, so why am I treating it as a new language and resourcing it for sentiment? Can’t we just transliterate it to Arabic script?


First of all, this unorthodox transcription generates severe word ambiguity. Because Arabic has short and long vowel phonemes, and soft and emphasised consonant phonemes, an Arabizi word can easily be confused for two or more Arabic words.

Short vowels are written as diacritics and long vowels as the vowel letters ا و ي /ā, ū (or wā), and yā (or ī)/. As a result, the word غابة ghābeh forest, with the long vowel ā, can be confused with the negative word غبي ghabeh stupid, with the short vowel a, in Levantine dialect Arabizi. Since there is no distinction between short and long vowels, both can be transcribed as ghabeh.

As for the soft and emphasised consonant phoneme counterparts, ذ د ك س ت and ظ ض ق ص ط are two distinct groups of letters for soft and emphasised th, d, k, s, and t. The Arabizi word dareb could either be a transcription of the negative ضرب Ḍarb (emphasised d) hit or strike physically, or درب darb (soft d) path or route.

As such, the inconsistent orthography of vowel letters and the Latin consonant letters that map to two or more Arabic letters impact the transliteration, giving several possibilities for most Arabizi words. Transliteration works online, word by word, where the user manually selects their intended Arabic word from a list of possible transliterations after typing each Arabizi word. Yamli is known for this:
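The consonant side of this ambiguity can be sketched as a one-to-many mapping. The table below is partial and illustrative: it covers only the single-letter soft and emphasised pairs listed above plus the letters needed for the example, and it assumes every Latin vowel letter stands for an unwritten short vowel, which is a simplification.

```python
from itertools import product

# Partial, illustrative mapping from Latin consonants to Arabic letters.
# Soft/emphasised pairs follow the groups given above.
consonants = {
    "d": ["د", "ض"],
    "t": ["ت", "ط"],
    "s": ["س", "ص"],
    "k": ["ك", "ق"],
    "r": ["ر"],
    "b": ["ب"],
}
short_vowels = set("aeiou")  # assumed to be unwritten diacritics


def candidates(arabizi_word):
    """List every Arabic spelling the consonants of the word could map to."""
    options = [
        consonants[ch] for ch in arabizi_word.lower() if ch not in short_vowels
    ]
    return ["".join(letters) for letters in product(*options)]


print(candidates("dareb"))
```

For dareb this yields both درب (path) and ضرب (hit), so a transliterator has to pick between a neutral and a negative word with no help from the spelling.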

Yamli Arabic Transliterator

Second, Arabizi is a transcription of dialectal Arabic, not Modern Standard Arabic (MSA), and word choice and expressions differ among regions. For that reason, it would be naïve to develop a one-size-fits-all transliterator without catering for each dialect. Here is Google’s effort to automate the transliteration of whole text, tested on a positive Arabizi tweet from Lebanon: da5l jamelik w hadamtik / oh-my (expression) your-beauty and your-humour (feminine).

Google Translate in 2019; in 2020 it gave: inside your sentences and demolish you

In this case an Arabizi sentiment lexicon that contains the inflected positive words jamelik (your-beauty) and hadamtik (your-humour) would perform better in classifying this sentence as positive than transliterating it to wrong Arabic words and attempting to classify it afterwards.

Third, for the sake of sentiment analysis, dialectal Arabic is low-resourced: what is the point of converting the script of a low-resourced language to another script?

Original image: Benhance gallery, what I think what I say by Ghonemi

Conclusion

In a nutshell, I learned through my research that it is vitally important to fully absorb the challenges that a language poses, and to assess the available or required resources for the desired NLP task, before tackling that task.

Although Arabizi encompasses a plethora of challenges, inconsistent orthography and Latinisation are not novel phenomena. For example, Javanese dialects and Alsatian⁸ are transcribed heuristically; Greek, Farsi, Hindi, Bengali, Urdu, Telugu, and Tamil are also transcribed in Latin script, under names such as Greeklish, Finglish, Hinglish, and Binglish.

I hope that my understanding of the lexical challenges of Arabizi detailed in this read sheds some light on the complexity of sentiment analysis for low-resourced languages on social media and motivates the NLP community to explore similar approaches for resourcing other high-sparsity languages.

[1] Baly, Ramy, et al. “A sentiment treebank and morphologically enriched recursive deep models for effective sentiment analysis in Arabic.” ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 16.4 (2017): 1–21.

[2] Farha, Ibrahim Abu, and Walid Magdy. “Mazajak: An online Arabic sentiment analyser.” Proceedings of the Fourth Arabic Natural Language Processing Workshop. 2019.

[3] Yaghan, Mohammad Ali. ““Arabizi”: A contemporary style of Arabic Slang.” Design issues 24.2 (2008): 39–52.

[4] Seddah, Djamé, et al. “Building a user-generated content north-african arabizi treebank: Tackling hell.” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.

[5] Sullivan, Natalie. Writing Arabizi: Orthographic Variation in Romanized Lebanese Arabic on Twitter. Diss. 2017.

[6] Aboelezz, Mariam. “Latinised Arabic and connections to bilingual ability.” Lancaster University Postgraduate Conference in Linguistics & Language Teaching. Lancaster, UK. Vol. 3. 2009.

[7] Zhang, Lei, Shuai Wang, and Bing Liu. “Deep learning for sentiment analysis: A survey.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8.4 (2018): e1253.

[8] Millour, Alice, and Karën Fort. “Text Corpora and the Challenge of Newly Written Languages.” Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL). 2020.

Hope you enjoyed the read, I sometimes tweet about Low Resourced NLP and stuff: @TahaTobaili


PhD student in Natural Language Processing at the Knowledge Media Institute UK. Leveraging deep learning to resource low-resourced language(s).