Poetic Neural Networks

Teaching A Neural Net How to Write Arabic Poetry

Nadim Kawwa
Towards Data Science



If poetry disappeared tomorrow, the stock market would not crash, bridges would stay in place, and computers would still operate.

However, poetry is uniquely valuable because it speaks to something within us that can’t be quantified or measured. In this post, we are going to attempt to generate poetry using a neural network with one additional caveat: it will be in Arabic.

In summary, this post covers the following points:

  • How we created a custom dataset
  • How to preprocess the data
  • Hyperparameter tuning for the RNN
  • Poetry output in Arabic (and English translation)

Feel free to skip the technical bits and jump straight to the output. A link to the GitHub repository is included at the bottom.

A Poet From Damascus

Nizar Qabbani (source: Edarabia.com)

Nizar Qabbani was a Syrian poet who is best remembered for his poems exploring love, nationalism, eroticism, and religion. He was also a prolific writer, which means his work provides a potentially large amount of data for our neural net to learn from.

Below is a sample of his work:

Who are you
woman entering my life like a dagger
mild as the eyes of a rabbit
soft as the skin of a plum
pure as strings of jasmine
innocent as children’s bibs
and devouring like words?

As a first step we need to create a corpus of text that contains most, if not all, of his known work. Luckily we can find websites that are solely dedicated to preserving Qabbani’s work.

Using packages such as BeautifulSoup, one can scrape the data and create a corpus containing all the available works we could find. With all of the poems gathered, the amount of data is just below 1MB, which is about 1 million characters, with about 32,000 unique words. For more information on how much data is needed, refer to Andrej Karpathy's post in the references below.
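As a rough sketch of the scraping step (the page structure and the "poem-body" class here are hypothetical placeholders, not the actual site's layout):

```python
import requests
from bs4 import BeautifulSoup

def scrape_poem(url):
    """Download one poem page and return its plain text.
    The 'poem-body' class is a hypothetical placeholder."""
    response = requests.get(url)
    response.encoding = "utf-8"  # ensure Arabic text decodes correctly
    soup = BeautifulSoup(response.text, "html.parser")
    body = soup.find("div", class_="poem-body")
    return body.get_text(separator="\n") if body else ""

# Append every scraped poem to a single corpus file:
# with open("corpus.txt", "a", encoding="utf-8") as f:
#     for url in poem_urls:  # gathered from the site's index pages
#         f.write(scrape_poem(url) + "\n\n")
```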

Although this may seem like a massive amount of text, it is in fact a very small dataset for language modeling, which will probably be a limitation for our purposes.

On The Particularity of Arabic


Unlike languages written in the Latin script, Arabic is read from right to left. In addition, there is no such thing as uppercase or lowercase characters. Furthermore, the concept of vowels and consonants differs from that of, say, English: short vowels are usually written as optional diacritics rather than as separate letters.

There are many more ways in which this language and others differ from English. Still, there have been successful examples of generating poems in languages other than English, such as Chinese (see the references at the bottom).

Preparing the Data

This step involves creating a lookup table that returns two dictionaries (a minimal sketch follows the list):

  • integer to vocab
  • vocab to integer
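A minimal sketch of what this lookup function might look like (sorting by word frequency is a common convention, not a requirement):

```python
from collections import Counter

def create_lookup_tables(words):
    """Build both dictionaries from a list of words.
    Returns (vocab_to_int, int_to_vocab)."""
    counts = Counter(words)
    # Most frequent words get the smallest integer ids
    sorted_vocab = sorted(counts, key=counts.get, reverse=True)
    int_to_vocab = {i: word for i, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: i for i, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab
```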

Next, we split the script into a word array using spaces as delimiters. However, punctuation marks like periods and exclamation marks can create multiple ids for the same word. For example, “bye” and “bye!” would generate two different word ids.

We implement a function that returns a dictionary used to tokenize symbols like “!” into “||Exclamation_Mark||”. Our list looks like:

  • Period ( . )
  • Comma ( , )
  • Return ( \n )
  • Carriage Return (\r)

This dictionary is used to tokenize the symbols and add a delimiter (space) around them. This separates each symbol into its own word, making it easier for the neural network to predict the next word.
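A sketch of the tokenization helper, covering the symbols listed above plus the exclamation mark from the earlier example:

```python
def token_lookup():
    """Map punctuation symbols to word-like tokens so that, e.g.,
    "bye" and "bye!" no longer produce two different word ids."""
    return {
        "!": "||Exclamation_Mark||",
        ".": "||Period||",
        ",": "||Comma||",
        "\n": "||Return||",
        "\r": "||Carriage_Return||",
    }

# Surround each token with spaces so it becomes its own "word":
# for symbol, token in token_lookup().items():
#     text = text.replace(symbol, " {} ".format(token))
```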

Hyperparameters & Tuning

In general, we may get better results with larger hidden_dim and n_layers values, but larger models take longer to train. Below is a list of the parameters to tune (an illustrative setup follows the list):

  • sequence_length: the number of words in each input sequence.
  • batch_size: the batch size.
  • num_epochs: the number of epochs to train for.
  • learning_rate: the learning rate for the Adam optimizer.
  • vocab_size: the number of unique tokens in our vocabulary.
  • output_size: the desired size of the output.
  • embedding_dim: the embedding dimension; smaller than vocab_size.
  • hidden_dim: the hidden dimension of our RNN.
  • n_layers: the number of layers/cells in our RNN.
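As an illustrative setup, a PyTorch-style model built from these parameters might look like the following. The concrete values are assumptions for the sketch, not the tuned settings, and the LSTM core is an assumption as well; the original implementation may differ in detail.

```python
import torch.nn as nn

# Illustrative values only; the best settings are found by experiment
sequence_length = 10      # words per input sequence
batch_size = 128
num_epochs = 10
learning_rate = 0.001     # for the Adam optimizer
embedding_dim = 256       # much smaller than vocab_size (~32,000 here)
hidden_dim = 512
n_layers = 2

class PoetryRNN(nn.Module):
    """Word-level RNN sketch: embedding -> LSTM -> linear output."""
    def __init__(self, vocab_size, output_size, embedding_dim,
                 hidden_dim, n_layers, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x, hidden=None):
        out, hidden = self.lstm(self.embedding(x), hidden)
        # Score every vocabulary word for the next position
        return self.fc(out[:, -1]), hidden
```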

Imitating the Poet

With the parameters set and the model trained, we can move on to the fun part: generating poems!

To do so, we must define the prime_word, which is the word the poem begins with. In the subsections below we present some results.
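Generation is iterative: feed the prime word in, sample the next word from the network's output distribution, append it, and repeat. A sketch under those assumptions (function and variable names are illustrative, and the punctuation tokens such as ||Period|| still need to be converted back to symbols afterwards):

```python
import torch
import torch.nn.functional as F

def generate(model, prime_word, vocab_to_int, int_to_vocab,
             length=100, sequence_length=10):
    """Generate `length` words starting from `prime_word` (sketch)."""
    model.eval()
    words = [prime_word]
    with torch.no_grad():
        for _ in range(length):
            # Feed the most recent words back into the network
            ids = [vocab_to_int[w] for w in words[-sequence_length:]]
            logits, _ = model(torch.tensor([ids]))
            probs = F.softmax(logits.squeeze(), dim=0)
            next_id = torch.multinomial(probs, 1).item()
            words.append(int_to_vocab[next_id])
    return " ".join(words)

# poem = generate(model, prime_word="أنا", ...)
```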

“أنا” (I/me):

أنا كهذا الماء ؟.
وظلت شيخ حارتنا..
وتشاهد أسماكٌ وبحار.
وتشاهد فهم أنفسنا
الخائفون من الخشب
ويضاجعون تطرف يعنيها.
تنقرض الأمة من الأصل ذكورا..
تهاجر أسماكٌ وبحار..
وعدت..
بإلغاء

Rough translation:

Am I like this water?
And I stayed the Sheikh of our neighborhood ..
And you see fish and seas.
And you watch the understanding of ourselves
Afraid of wood
And they sleep with extremism.
The nation is extinct from male origin.
Fish and seas migrate ..
I promised ..
To cancel

“يا” (O):

يا قطتي الخليفة
وباركنا في جبيني
خليفةٌ في خاصرة التعذيب الأوسط

هل سأقرأ حـيٍ ، ينسـى القط؟
والزنبق الأسود ، والشذا ومستواه
فقصها..
وبقينا نسأل أنفسنا
وفي غرف الإنعاش ،
وألوان للعشاء
قطعاً في

Rough translation:

O my successor cat
Bless us on my forehead
A caliph in the side of middle torture

Will I read alive, the cat forgets?
Black lily, scent and its level
So cut it ..
We kept asking ourselves
And in the recovery rooms,
And colors for dinner
Absolutely in

“نحن” (We):

نحن عشته
لا تحسبي أن أحبك في البيادر
في أخبار التاريخ,.
تنقرض الأمة يعنيها.
تنقرض الأمة من عارٍ فيها– الحداد..
عيونها على ذراعيها..
ومذيع الدولة في أجساد الأميره ؟

يا رب أيـن ملتفٌ نسبي

Rough translation:

We experienced it
Do not think that I love you in the Gardens
In history news ,.
The nation becomes extinct.
The nation becomes extinct from its disgrace — mourning ..
Her eyes are on her arms ..
And the state broadcaster in the princess’s bodies?

O Lord, where is the relative winding?

“امرأة” (Woman):

امرأة كلها...
يا كل عامٍ في الطبيعة..
ومذيع الدولة في جرحنا
نتفاءل جميله..
ووجدنا جسداً مغتصباً..
ومذيع الدولة ؟؟
من هؤلاء هؤلاء الهدبـا
من هؤلاء سقيت أعماقي وإرهاقي برأس أدبي؟

Rough translation:

A whole woman …
Oh every year in nature ..
The state broadcaster is in our wound
Beautiful optimism ..
We found a raped body.
And the state broadcaster ??
Of these are cilia
Who are these people watered deep down and exhausted with a literary head?

Conclusion

We can see that our attempts at poetry are neither as coherent nor as eloquent as the original author's. At points the writing was comical and broke all rules of grammar and logic.

One possible reason for our shortcomings might be insufficient training data; ideally we would want at least 3MB worth of text. In addition, there might be distinctive facets of the language itself that need to be accounted for. However, bear in mind that the RNN had to learn one of the hardest languages from scratch.

I hope you enjoyed reading this article and got a sense of what is possible in terms of text generation. It is also my wish that members of the deep learning community who are not native English speakers can envision potentially beneficial applications in their own communities.

References

  • Andrej Karpathy, “The Unreasonable Effectiveness of Recurrent Neural Networks”: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
  • Xingxing Zhang and Mirella Lapata, “Chinese Poetry Generation with Recurrent Neural Networks”, EMNLP 2014.