
Artificial Intelligence for Arabic


Google's DialogFlow has no Arabic support for building chatbots, and standard natural language machine learning frameworks such as spaCy do not contain Arabic support either. Microsoft's Arabic Toolkit is being discontinued this month (July 2018). Moreover, until recently, even research models using GloVe and word2vec were not easy to obtain. That's not very helpful. In general, there is a lack of available off-the-shelf, high-quality models for interpreting the Arabic language with artificial intelligence.

Google still offers excellent APIs for AI capabilities like neural machine translation, but not the vectors (the AI stuff) used to DO the translation. These language models are important when performing common non-translation text processing tasks such as sentiment analysis, spam filtering, plagiarism detection, and much more. Moreover, these models are critical for automating enterprise tasks that require natural language understanding as part of a workflow, such as resume processing in Human Resources (HR), document clustering in governmental reports, and document prioritization in financial services. The need for Arabic AI models is quite strong.
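To make this concrete, here is a minimal sketch of how a word embedding model feeds a downstream task like sentiment analysis: represent a document as the average of its word vectors, then hand that vector to any standard classifier. The tiny embedding table below is a stand-in for a real Arabic model, not part of any library.

import numpy as np

# Toy embedding table: a stand-in for a real Arabic model
# (real vectors have 100-300 dimensions, not 4).
embeddings = {
    "خدمة": np.array([0.2, 0.1, 0.9, 0.4]),    # "service"
    "ممتازة": np.array([0.8, 0.7, 0.1, 0.3]),  # "excellent"
}

def doc_vector(tokens):
    # Average the vectors of the tokens we know; zero vector if none match.
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(4)
    return np.mean(known, axis=0)

# This feature vector could be fed to any standard classifier.
print(doc_vector(["خدمة", "ممتازة"]))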

A very nice starting point for obtaining an Arabic word embedding model is AraVec (2017), a model created by Abu Bakr Soliman and his colleagues at the Center for Informatics Science of Nile University, in Giza, Egypt. The following links lead to their article and the related code, and a short loading sketch follows them.

bakrianoo/aravec

AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP
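The AraVec models are distributed as saved gensim word2vec models, so loading one and querying it takes only a few lines. This is a minimal sketch; the file name below is an assumption, so substitute whichever AraVec model you actually downloaded.

import gensim

# Load a pre-trained AraVec model (file name is an assumption; use the
# model file you downloaded from the AraVec repository).
model = gensim.models.Word2Vec.load("full_grams_cbow_300_twitter.mdl")

# Nearest neighbours of the Arabic word "كتاب" ("book"); raises KeyError
# if the word is not in the model's vocabulary.
for word, score in model.wv.most_similar("كتاب", topn=5):
    print(word, round(score, 3))

# The raw vector itself, for use in downstream tasks.
print(model.wv["كتاب"].shape)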

In the past year or so, some articles on this topic have been published in high quality journals and conferences. Here are some of the most relevant ones:

Word embeddings for Arabic sentiment analysis – IEEE Conference Publication

Word Embeddings and Convolutional Neural Network for Arabic Sentiment Classification – ACL…

The surprisingly strong demand for AI solutions in the MENA region got us thinking about why this gap in the market exists, and our answer was to fill it ourselves. We established a joint venture called Stallion.ai to serve the MENA region with B2B Artificial Intelligence solutions for enterprise clients.

Taking this problem head-on, we decided to design an Arabic word embedding model from scratch. We scraped Wikipedia pages and books from the public domain, amounting to 14 GB of text. We have been augmenting this dataset with other large sources of text to gain additional language context and versatility.
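As a rough illustration of this step, the sketch below trains a word2vec model on a large pre-tokenized corpus with gensim. The file names and hyperparameters are illustrative assumptions, not our exact settings.

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One pre-tokenized sentence per line; LineSentence streams from disk, so a
# 14 GB corpus never has to fit in memory. The file name is hypothetical.
corpus = LineSentence("arabic_corpus_tokenized.txt")

model = Word2Vec(
    sentences=corpus,
    vector_size=300,  # embedding dimensionality ("size" in gensim 3.x)
    window=5,         # context window around each word
    min_count=10,     # ignore very rare tokens
    workers=8,        # parallel training threads
    sg=1,             # skip-gram; 0 would select CBOW
)
model.save("arabic_w2v.model")

Rather than delving further into the technological details of the work we have been undertaking, consider the practical business reasons for wanting an AI system that understands Arabic text.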

One interesting project that highlights the need for more Arabic language support is the NOOR programming language.

There is an old legal saying, "He who drafts, wins." I could not track down the attribution, and neither could the books that cite it. The idea behind this old quote is that drawing up a contract gives the drafter the ability to set its terms, and they will do so in their own favor. Similarly, it is essential for a business operating in Arabic to apply AI techniques to its original documents, rather than to machine translations of those documents. Operating on context borrowed from another language simply does not work as well as employing a real embedding model built on text from the same language.

In science fiction, and in the press, artificial intelligence is portrayed as a universal translator. Even in engineering systems like compilers (e.g. GCC), several high-level languages (e.g. Java, C, C++) are compiled into one universal intermediate language (GIMPLE) before being emitted as assembly for one of several processor targets. The structure looks like this:

How GCC understands (compiles) many language frontends into a single common representation and then emits code for a target architecture.

Having a universal representation of language like GIMPLE is excellent because we can apply common, useful optimizations to this intermediate (universal) representation. In effect, a universal language allows us to work with meaning rather than the idiosyncrasies of just one language. Unfortunately, machine learning does not represent language the way compilers do. Computer code rests on rigorous assumptions from formal language theory that do not hold for natural language like contracts and text messages. Natural language is full of ambiguity. For example, a word's synonyms in one language do not map onto the same set of synonyms in another. Code, on the other hand, leaves basically no room for ambiguity. Worse yet, the ambiguity in Arabic is not the same as it is in English. This line of thinking tells us that dedicated per-language models will outperform borrowed cross-language models that were not trained on the language we care about. However, the situation is even worse, as we will soon see.

We just discussed reasons to understand Arabic specifically, rather than borrowing understanding across languages. Now consider that within Arabic text itself there are variations we need to handle separately.

Firstly, there are dialect issues and slang, including emoji, but let's skip over those and go to the second issue: style heterogeneity. It is well established that different types of text contain different semantic information. For example, the corpus of all text from the newspaper Al Hayat does not give us enough information to understand the tweets of Nancy Ajram. Why? Because formal text and informal text are not the same thing. Machine learning works best when trained on text very similar to the text it will be evaluating.

There is even a third problem: context. Word embedding models like AraVec are an excellent first step toward supporting at least SOMETHING in Arabic. The next step must be to encode context-specific business terminology and phrases into these models. These are typically out-of-dictionary terms that the off-the-shelf models were not trained on, or existing words that mean something different in a given domain. Sometimes these words are English terms or named entities used inside Arabic documents (e.g. "كابل RS232 إلى واجهة RJ45", "RS232 cable to RJ45 interface"). Even in-dictionary words often need adjusting. For example, the word "collision" means a road accident to a road engineer, but a database key problem to a database engineer. These contexts need to be used to adjust the AI solution, and often the words involved are not even in the AI model before this adjustment takes place. On a project-by-project basis, these custom modifications are achieved using our word embedding augmentation technology.
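To illustrate the general idea (not our proprietary augmentation technology itself), here is a minimal sketch of adapting a general-purpose gensim model to a client's domain text with incremental training. The file names and data are hypothetical.

from gensim.models import Word2Vec

# Hypothetical general-purpose Arabic model trained earlier.
model = Word2Vec.load("arabic_w2v.model")

# Pre-tokenized, domain-specific sentences (e.g. from engineering manuals)
# containing out-of-dictionary terms such as "RS232" and "RJ45".
domain_sentences = [
    ["كابل", "RS232", "إلى", "واجهة", "RJ45"],
    # ... many more sentences from client documents ...
]

# Add the new terms to the vocabulary, then continue training so that
# existing words also shift toward their domain-specific meanings.
model.build_vocab(domain_sentences, update=True)
model.train(domain_sentences, total_examples=len(domain_sentences), epochs=5)
model.save("arabic_w2v_domain.model")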

To summarize the problems we are addressing at Stallion.ai: businesses that process Arabic-language documents need customized AI solutions, and the market has ignored them for far too long. We have seen new projects arise as others shut down. We have identified gaps in the market, including the following shortcomings in existing models: dialect support, understanding slang, understanding technical context and out-of-dictionary words, and understanding various kinds of text. We are exploring both industrial and academic opportunities to study and apply this technology.

Are you looking for AI help in the MENA region? Say hi to [email protected] or reach me at [email protected]

-Daniel

[email protected] ← Say hi. Lemay.ai 1(855)LEMAY-AI
