
UNIKUD: Adding Vowels to Hebrew Text with Deep Learning

Introducing an open-source Hebrew nakdan (נקדן) tool which uses no rule-based logic.

Hebrew text with "nikud" (vowel pointing) from the Birds’ Head Haggadah, written in the early 14th century in the region which is now Southern Germany. (Source: Wikimedia Commons)

Natural Language Processing (NLP) research has focused heavily on the English language because of its widespread usage and economic importance, but recent progress in NLP has made it easier to tackle the unique challenges posed by thousands of other languages used across the world. In this project, we applied a state-of-the-art Deep Learning architecture to a particular problem in the Hebrew language – adding vowel signs (nikud; ניקוד) to text – showing that we can leverage the limited available data for smaller languages using powerful machine learning models.

Sample UNIKUD output, from its Huggingface Spaces Streamlit deployment (Image source: Author)

We present our model, UNIKUD, an open-source tool for adding vowel signs to Hebrew text using absolutely no rule-based logic, built with a CANINE transformer network. An interactive demo is available at Huggingface Spaces, and all of our data, preprocessing and training scripts, and experiments are available on its UNIKUD DagsHub repo page. We hope that UNIKUD will be a springboard for further progress in Hebrew NLP and under/mid-resourced languages in general.

Contents:

  1. The Hebrew Writing System
  2. Introduction to UNIKUD
  3. UNIKUD Datasets
  4. Methods
  5. Results
  6. Limitations and Further Directions
  7. Conclusion
  8. References

1. The Hebrew Writing System

To understand the problem that UNIKUD addresses, we first must digress briefly to discuss the Hebrew writing system.

The Hebrew language is a Semitic language with its origins in the ancient Levant region. Today it is the most widely spoken language in Israel as well as the liturgical language of the Jewish diaspora, with approximately 5 million native speakers and millions of second-language learners (as of 2014, per Wikipedia). The Modern Hebrew writing system uses the Hebrew alphabet, which is also used for writing Yiddish, Ladino, and various other living languages, as well as classical Rabbinic literature in Aramaic.

Unlike the Latin alphabet used for English, the Hebrew alphabet is an abjad, meaning that letters represent consonants and vowels are normally unwritten. To illustrate, let’s consider the Hebrew word לחם "lehem" which means "bread":

(Image source: Author)

Hebrew is written from right to left, so the reader observes the letters L-H-M in that order. You might wonder how the reader knows that the word is pronounced "lehem" and not, say, *"laham". In fact, this must be inferred from context. If this sounds strange, keep in mind that even in reading English we must memorize the spellings and pronunciations of many words, so reading English also requires context – but reading Hebrew requires much more context.

Hebrew can also be written with vowel points, known as nikud in Hebrew. These were invented long after the Hebrew script itself and are not used in everyday writing. Instead, they are added to texts where precision or clarity is important, such as dictionaries, language-learning materials, and poetry. For example, when written with nikud, the word לחם lehem looks like this:

(Image source: Author)

or in digital text: לֶחֶם. The three dots underneath the first two letters indicate that they are followed by an "e" vowel.

The same unvocalized (no vowel points) written word in Hebrew could correspond to many possible vocalized words depending on context. For example, the written word ספר S-P-R could be read as סֵפֶר sefer ("book"), סָפַר safar ("counted"), סַפָּר sapar ("barber"), or סְפָר sfar ("frontier"). The objective of the UNIKUD model is to automatically add the correct vowel pointing to Hebrew text, using context to determine the vocalization of each word.

Side note: There are other abjads in use today, most notably the Arabic script used for writing the Arabic language. Arabic and other languages using abjads typically also have similar optional diacritics for marking vowels, and we believe that UNIKUD could be adapted for these writing systems with minor adjustments. If you’d like more information about writing systems across the world including abjads, check out my 2012 MIT Splash presentation: Writing Systems of the World’s Languages.


2. Introduction to UNIKUD

We have seen that the Hebrew language is normally written without vowels but requires context to be read; a complex system of vowel diacritics can be used to help the reader, but adding these vowels requires a deep understanding of Hebrew and of the text at hand. Various nakdanim (נקדנים) – tools for automatically adding nikud (ניקוד; vowel marks) to text – are available, but until recently most were proprietary and/or programmed using a complex set of hand-written rules that are specific to the Hebrew language and extremely laborious to produce.¹

We propose the UNIKUD model which uses absolutely no hand-written rules and instead learns how to add vowel marks to texts by itself, by training on a dataset of existing Hebrew texts. We chose the name UNIKUD as a triple play on words – a UNIque model for adding NIKUD (ניקוד, Hebrew vowel marks) to text using the UNIcode text standard.

The inspiration for this model is the recent paradigm shift in NLP with the advent of transformer neural networks; introduced by a team of researchers from Google in the 2017 paper Attention Is All You Need, transformer models such as BERT have advanced the state-of-the-art in NLP on almost every benchmark task, from machine translation to question-answering and beyond. The key features of these models are the use of a self-attention mechanism to extract contextual meaning from complex inputs encoded as vectors; for technical details the reader can refer to Peter Bloem’s lucid 2019 post "Transformers From Scratch".

Most transformer models have relied upon "subword tokenization" as a preprocessing step for textual input. In NLP, tokenization refers to splitting text into smaller units – for example, word tokenization would split the sentence I love tacos. into the word units I, love, tacos, and . (punctuation symbols are usually treated as separate word tokens). Models like BERT use subword tokenization, meaning that more complex or uncommon words could be split into "subword" units – e.g. tacos might be tokenized as the two units ta and ##cos.
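To make this concrete, here is a small sketch of subword tokenization using a pretrained WordPiece tokenizer from the transformers library. The bert-base-uncased checkpoint is used purely as an example, and the exact splits depend on the tokenizer’s learned vocabulary, so the 'ta' / '##cos' split above is only an illustration of the idea:

```python
# Illustrative sketch of subword tokenization with a WordPiece tokenizer.
# The exact splits depend on the tokenizer's learned vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("I love tacos."))
# Rare or complex words are broken into smaller pieces marked with '##',
# while common words are usually kept whole.
```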

However, word or even subword tokenization makes less sense for languages like Hebrew where a single word can be modified by internal changes to add complex meaning. For example, the Hebrew root כת״ב K-T-V refers to writing; by adding vowels inside this root we can create the complex words כָּתַב KaTaV "he wrote", יִכְתֹּב yiKhToV "he will write", מִכְתָּב miKhTaV "a letter", and more.

In their March 2021 paper, Clark et al. from Google Research introduced the CANINE transformer model, whose name stands for Character Architecture with No tokenization In Neural Encoders. Instead of taking subword tokens as input, CANINE takes in raw characters, and learns an embedding (vector representation) for each Unicode codepoint. The model is intended for languages such as Hebrew with complex morphology (word formation).
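Since CANINE consumes raw characters, its "tokenization" is essentially a mapping from each character to its Unicode code point. The sketch below illustrates the idea; the Huggingface CanineTokenizer adds special tokens and padding on top of this, which we omit here:

```python
# Minimal sketch: CANINE operates directly on Unicode code points rather than
# subword tokens, so "tokenization" is essentially ord() per character.
text = "כָּתַב"  # "he wrote", with nikud

codepoints = [ord(ch) for ch in text]
print([f"U+{cp:04X}" for cp in codepoints])
# Each code point (base letters and nikud alike) gets its own embedding
# inside the CANINE encoder.
```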

As an autoencoding character-level transformer model, CANINE is particularly suited for Hebrew vocalization, since we can frame vocalization as a character-level token classification problem. Our UNIKUD model uses CANINE as its language backbone, initialized with pretrained weights, and adds a classification head on top to convert its contextual embeddings into vowel mark predictions.

We will see that training UNIKUD requires a complex data pipeline – first, collecting and normalizing Hebrew text with vowels; second, training an auxiliary model to rewrite these texts in "full spelling" (כתיב מלא); finally, training the core UNIKUD model to predict vowel marks using our processed data. We managed this multi-stage pipeline using DagsHub, a platform which hosts repositories containing both Git-versioned script files and DVC-versioned data files and pipeline definitions. We provide the trained UNIKUD model as well as instructions for reproducing its preprocessing and training steps.


3. UNIKUD Datasets

Deep learning models are data-hungry and UNIKUD is no exception. In order to train UNIKUD to learn how to vocalize Hebrew text, we needed to collect a dataset of Hebrew text with vowel points. Hebrew is normally written without vowel signs in everyday usage, so it required careful searching to find sufficient quantities of digitized Hebrew text with vowels. We curated such data from public-domain sources, and provide them as DVC-versioned data in our repository for the benefit of the Hebrew NLP community.

We collected text with nikud (vowel marks) from the following sources:

  • Hebrew Wikipedia (most text does not have vowels, but we searched for and extracted texts with vowels)
  • The Ben-Yehuda project (an initiative to digitize classic Hebrew literature; we only used public-domain sources from this project, which often contain vowel marks)

The basic idea of UNIKUD is that we can use text with nikud for "self-supervised learning", to use the terminology of modern deep learning research. We can use the text itself to create the input and desired output of our model: e.g. a word סֵפֶר "sefer" with vowels can be converted to unvocalized ספר by automatically removing the vowel characters; the latter will be the model’s input and the former its desired output.

How Hebrew text with vowels is used to train UNIKUD. The text with vowels removed is used as the model’s input, and the original text with vowels is used as the target (what we are trying to predict). (Image source: Author)
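As a concrete illustration, here is a minimal sketch of how such (input, target) pairs can be created by stripping the nikud characters. We assume here that nikud fall in the Unicode combining ranges shown below; the project’s actual preprocessing scripts handle additional details:

```python
# Minimal sketch of the self-supervised pairing: strip nikud characters to get
# the model input, keep the original as the target. We assume here that nikud
# occupy U+05B0-U+05BD plus a few extra combining marks; the project's
# preprocessing scripts may handle more cases.
NIKUD = {chr(cp) for cp in range(0x05B0, 0x05BE)} | {"\u05BF", "\u05C1", "\u05C2", "\u05C7"}

def remove_nikud(text: str) -> str:
    return "".join(ch for ch in text if ch not in NIKUD)

target = "סֵפֶר"                    # vocalized word ("book")
model_input = remove_nikud(target)  # unvocalized ספר
print(model_input, "->", target)
```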

However, there is another feature of Modern Hebrew writing which complicates this in some cases: Over time, scribes began to add extra letters to unvocalized texts to hint at the presence of vowels. Known as matres lectionis (Hebrew אימות קריאה, literally "mothers of reading"), the letters yud (י) and vav (ו) often must be removed when vowels are added to text, as shown in the following example:

"Ktiv male" (full spelling): The red letter is only used without vowels. (Image source: Author)
"Ktiv male" (full spelling): The red letter is only used without vowels. (Image source: Author)

The spelling of words with these extra letters is colloquially called "full spelling" (ktiv male כתיב מלא, more formally כתיב חסר ניקוד) in Hebrew. Because we needed text in full spelling as the input for UNIKUD, we had to train an auxiliary model to add these extra letters to text with vowels. Therefore we needed training data of pairs of words in full and defective (not full) spellings. We collected this data from two sources:

  • Hebrew Wiktionary (ויקימילון): Most entries are in defective spelling, with vowels in the article headings and a side bar listing the corresponding full spelling.
  • Hebrew Wikisource (ויקימקור): A number of articles on classic poems and other sources contain parallel versions of the same text with both full spelling (no vowels) and defective spelling (with vowels).

We collated data from both of these sources and arranged them in a machine-readable format.

As the first preprocessing stage in our data and training pipeline, we combined data from these sources together and lightly cleaned the texts. The most important preprocessing steps were:

  • Removing some source-specific boilerplate, such as template tags encased in curly braces {{}} from Wikipedia/Wiktionary/Wikisource data.
  • Unicode Normalization – Some visual characters can be represented in multiple ways in raw Unicode. For example, Hebrew bet+dagesh can be represented either as בּ = U+05D1 (bet) & U+05BC (dagesh), or as בּ = U+FB31 (bet with dagesh). The two representations look the same visually, but the first is made up of two Unicode code points while the second is a single code point representing the composed character. We used NFC Unicode normalization to unify such cases.
  • We merged the two code points U+05BA and U+05B9, both of which can be used for the holam vowel point in Hebrew. (A minimal sketch of these normalization steps follows this list.)
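Here is a minimal sketch of these two normalization steps using Python’s standard unicodedata module; the actual cleaning code in the repository may include further rules:

```python
# Minimal sketch of the two normalization steps described above.
import unicodedata

def normalize_hebrew(text: str) -> str:
    # NFC unifies precomposed presentation forms (e.g. U+FB31, bet with dagesh)
    # with their decomposed equivalent (U+05D1 + U+05BC).
    text = unicodedata.normalize("NFC", text)
    # Merge the two holam code points into a single representation.
    return text.replace("\u05BA", "\u05B9")

print(len(normalize_hebrew("\uFB31")))  # the composed form becomes two code points
```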

Because the CANINE backbone model accepts Unicode character input with no subword tokenization required, we left punctuation and other complex characters as-is.


4. Methods

The key insight of UNIKUD is that we may treat vocalizing Hebrew text as a multilabel token classification problem, where the tokens are individual characters. A single Hebrew consonant character may be decorated with zero, one, or multiple nikud marks, which can be represented by a binary vector:

Hebrew vocalization as multilabel classification: Each Hebrew letter may be decorated with multiple nikud, which can be represented as a binary vector. UNIKUD uses this label encoding as its target. The figure is condensed for clarity but UNIKUD’s binary targets actually contain 15 entries. (Image source: Author)

The UNIKUD model consists of a CANINE transformer backbone with a classification head with 15 outputs (for each input character token). The first 14 outputs roughly correspond to Unicode code points U+05B0 through U+05C2, which are used to encode various Hebrew nikud (see Wikipedia’s Hebrew Unicode chart for details). The 15th and final output gives the probability that the character should be deleted, if it is an extra yud or vav letter used for "full spelling" (see above).
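To make the label encoding concrete, the sketch below shows one plausible way to build the 15-entry binary target for a single character; the exact choice and ordering of the 14 nikud labels in the UNIKUD code may differ:

```python
# Sketch of the per-character multilabel target. The first 14 entries stand
# for nikud code points (roughly U+05B0-U+05C2) and the last marks characters
# to delete; the exact label set and ordering in the UNIKUD code may differ.
import torch

NIKUD_CPS = [0x05B0, 0x05B1, 0x05B2, 0x05B3, 0x05B4, 0x05B5, 0x05B6,
             0x05B7, 0x05B8, 0x05B9, 0x05BB, 0x05BC, 0x05C1, 0x05C2]
N_LABELS = len(NIKUD_CPS) + 1  # +1 for the "delete this character" label

def char_target(following_nikud: list[int], delete: bool = False) -> torch.Tensor:
    """Binary target for one base character, given the nikud attached to it."""
    y = torch.zeros(N_LABELS)
    for cp in following_nikud:
        y[NIKUD_CPS.index(cp)] = 1.0
    y[-1] = float(delete)
    return y

# A letter carrying both a dagesh (U+05BC) and a qamats (U+05B8), e.g. בָּ,
# gets two positive labels:
print(char_target([0x05BC, 0x05B8]))
```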

Although there are various restrictions on which marks may appear together (for example, the shin and sin dots, which appear only on the letter ש), we did not hard-code them into our model. We found that UNIKUD was able to learn these restrictions on its own.

We implemented this model using PyTorch and Huggingface’s transformers library. In particular, we used the Huggingface port of CANINE initialized with pretrained weights (canine-c), and trained using binary cross-entropy loss. More details on optimization and training hyperparameters can be viewed at the UNIKUD DagsHub experiments page.
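For readers who want a picture of the architecture in code, here is a minimal sketch of a CANINE backbone with a 15-way multilabel head trained with binary cross-entropy; the actual UNIKUD implementation lives in the DagsHub repository and differs in its details:

```python
# Minimal sketch of the architecture described above: a pretrained CANINE
# encoder with a 15-output multilabel classification head per character,
# trained with binary cross-entropy. Not the actual UNIKUD implementation.
import torch.nn as nn
from transformers import CanineModel

class NikudHead(nn.Module):
    def __init__(self, n_labels: int = 15):
        super().__init__()
        self.backbone = CanineModel.from_pretrained("google/canine-c")
        self.classifier = nn.Linear(self.backbone.config.hidden_size, n_labels)
        self.loss_fn = nn.BCEWithLogitsLoss()

    def forward(self, input_ids, attention_mask, labels=None):
        # labels: float tensor of shape (batch, seq_len, 15), or None at inference
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(hidden)  # (batch, seq_len, 15)
        loss = self.loss_fn(logits, labels) if labels is not None else None
        return loss, logits
```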

As mentioned above, preparing the training data for UNIKUD required first training an auxiliary model – KtivMaleModel in our code – to convert text with vowels to "full spelling" (ktiv male כתיב מלא). We built this model using another copy of pretrained CANINE with a three-class classification head – for each character in the input, we predicted either 0 (no change), 1 (insert י yud) or 2 (insert ו vav). For example, given input מְנֻקָּד the model would predict 0 for all characters except for the letter ק which would be trained to predict 2 (to insert ו vav), yielding the full spelling מנוקד (vowels omitted). Once this model was trained, we used it to add "full spellings" to our dataset for use in training the main UNIKUD model.
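The sketch below shows how the auxiliary model’s per-character predictions could be turned back into a full-spelling string. Whether an insertion is anchored on the preceding or following character is an implementation detail, and the real KtivMaleModel code may handle it differently:

```python
# Sketch of turning the auxiliary model's per-character predictions
# (0 = no change, 1 = insert yud, 2 = insert vav) into a full-spelling string.
# Here we insert the extra letter after the marked character.
INSERT = {0: "", 1: "\u05D9", 2: "\u05D5"}  # yud, vav

def apply_ktiv_male(text_without_nikud: str, predictions: list[int]) -> str:
    out = []
    for ch, pred in zip(text_without_nikud, predictions):
        out.append(ch)
        out.append(INSERT[pred])
    return "".join(out)

# With a "2" (insert vav) predicted on the second letter of מנקד, we get מנוקד:
print(apply_ktiv_male("מנקד", [0, 2, 0, 0]))
```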

Our complete pipeline can be conveniently viewed as a directed graph:

Our full data and training pipeline as a Directed Acyclic Graph (DAG) via UNIKUD’s DagsHub repo page. After preprocessing the raw data, we train our auxiliary "ktiv male" model, use it to add full spellings to the primary model’s training data, and then train the primary UNIKUD model. (Image source: Author)

If our repository is cloned from DagsHub, each step of this pipeline can be reproduced with DVC, e.g. dvc repro preprocess runs data preprocessing, dvc repro train-ktiv-male trains the "full spelling" model using this preprocessed data, and so forth. As seen in the graph, the output of the final step of the pipeline is the trained UNIKUD model.

Once the model is trained, it can be used to add nikud to texts using a simple decoding method. Since we used binary targets with binary cross-entropy loss, our model’s raw outputs are logits which, after a sigmoid, can be interpreted as probabilities of the presence of each of the 14 nikud types, or of character deletion. In decoding, we modify the input text with the actions corresponding to outputs above a fixed probability threshold (0.5). For more details, see the decoding method defined within the NikudTask object in our code.
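A simplified version of this thresholded decoding might look as follows; this is only a sketch, and the real decode logic in the NikudTask object handles additional details:

```python
# Simplified sketch of the thresholded decoding described above: apply a
# sigmoid to the logits, keep every nikud whose probability exceeds the
# threshold, and drop characters whose deletion probability exceeds it.
import torch

def decode(text: str, logits: torch.Tensor, nikud_chars: list[str],
           threshold: float = 0.5) -> str:
    # logits: (len(text), 15); nikud_chars: the 14 nikud characters in label order
    probs = torch.sigmoid(logits)
    out = []
    for ch, p in zip(text, probs):
        if p[-1] > threshold:            # predicted deletion (extra yud/vav)
            continue
        out.append(ch)
        for j, nikud in enumerate(nikud_chars):
            if p[j] > threshold:         # attach each confident nikud mark
                out.append(nikud)
    return "".join(out)
```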


5. Results

We first present quantitative results, and then qualitative evaluation of selected examples. The results of training with different hyperparameter settings are visible in our repository’s DagsHub experiment table:

DagsHub experiment table showing hyperparameter settings and metrics for different training runs (Image source: Author)

We will just discuss the final UNIKUD model training (DagsHub experiment table label: unikud), but you may see the results of the ktiv male model training there as well. The final settings used in deployment are tagged with the label deployed.

We trained our model with binary cross-entropy loss, and also tracked accuracy and macro-F1 metrics. We held out 10% of our data for validation; validation metrics are visible in the eval_loss, eval_accuracy, and eval_macro_f1 columns. Accuracy is interpreted as the proportion of characters in the validation set whose labels (possibly multiple nikud, or deletion) were perfectly predicted. Macro-F1 is the F1 score averaged across different label types, which is often a more meaningful metric than accuracy for classification problems with imbalanced data (such as our case, where some nikud are far less common than others). Ideally, validation accuracy and macro-F1 should both be close to 1, while validation loss should be as close as possible to 0.
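For readers unfamiliar with macro-F1, the toy example below shows how it is computed by averaging per-label F1 scores (using scikit-learn); this is only an illustration of the metric, not the project’s exact evaluation code:

```python
# Illustration of the macro-F1 metric: F1 is computed per label type and then
# averaged, so rare labels count as much as common ones.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0]])  # toy multilabel targets
y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])
print(f1_score(y_true, y_pred, average="macro"))
```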

Our final model achieved final validation metrics: loss 0.0153, accuracy 0.9572, macro-F1 0.9248. Its experiment page shows how these metrics improved throughout training, as logged by MLFlow:

Validation metrics throughout training, from our model’s experiment page (Image source: Author)

For qualitative evaluation, we present the output of our model on various inputs representing different text styles:

כל בני אדם נולדו בני חורין ושווים בערכם ובזכויותיהם. כולם חוננו בתבונה ובמצפון, לפיכך חובה עליהם לנהוג איש ברעהו ברוח של אחוה.

כָּל בְּנֵי אָדָם נוֹלְדוּ בְנֵי חוֹרִין וְשָׁוִים בְּעָרְכָּם וּבִזְכֻוּיּוֹתֵיהֶם. כֻּלָּם חֻוֹנְנוּ בִּתְבוּנָה וּבְמַצְפּוּן, לְפִיכָךְ חוֹבָה עֲלֵיהֶם לִנְהֹג אִישׁ בְּרֵעֵהוּ בְּרוּחַ שֶׁל אַחֲוָה.

האם תנין הוא יותר ארוך או יותר ירוק?

הַאִם תַּנִּין הוּא יוֹתֵר אָרךְ אוֹ יוֹתֵר יָרוֹק?

לנמלה יש שש רגליים, אבל לי יש רק שתיים.

לַנְמָלָה יֵשׁ שֵׁשׁ רַגְלַיִם, אֲבָל לִי יֵשׁ רַק שְׁתַּיִם.

נחשים באוסטרליה, עכבישים באפריקה: מפגשים עם חיות קטלניות שכמעט נגמרו רע מאוד

נְחָשִׁים בְּאוֹסְטְרְלִיָה, עַכָּבִישִׁים בְּאַפְרִיקָה: מְפַגְּשִׁים עִם חַיּוֹת קַטְלָנִיּוֹת שֶׁכִּמְעַט נִגְמְרוּ רַע מֵאוֹד

אח שלי תודה נשמה וואלה אחלה גבר אתה

אָח שֶׁלִּי תּוֹדָה נְשָׁמָה ואֵלָה אַחלֶה גֶּבֶר אֶתָה

שלום חברים

שְׁלוֹם חַבְרִים

חתול

חתּוֹל

Our main observations from testing UNIKUD on these and various other inputs:

  • We found that UNIKUD usually performed better when the input was longer and contained more context. When given very short input (a single word or only a couple of words), it would sometimes not complete the nikud or even make mistakes, while it performed better on average on long input sentences or texts, as seen above.
  • The ktiv male step in the training pipeline is important. We see that UNIKUD does learn to remove yud and vav in words such as כולם-כֻּלָּם and רגליים-רַגְלַיִם above. In prior experiments where we trained UNIKUD without this step, it was unable to properly add nikud to such words because it never encountered ktiv male spellings during training.
  • Outputs appear more accurate on average for formal or classical texts, while UNIKUD struggles with informal Modern Hebrew. Compare the relatively high-quality output for the formal texts above to the mistakes on slang terms like וואלה.
  • The model sometimes does not output any nikud when it cannot predict a vowel with high confidence. This can be adjusted by changing the probability threshold for decoding, which is 0.5 by default.

You may also try the model out yourself at its Huggingface Spaces Streamlit deployment:

UNIKUD with Streamlit interface, available at Huggingface Spaces. Sliders control probability thresholds used in decoding. Results in this article use the default value of 0.5 for all thresholds. (Image source: Author)

6. Limitations and Further Directions

We briefly mention a few limitations of the UNIKUD model and possible directions for future research:

  • UNIKUD only uses the surface form of the text it is fed to guess the proper nikud symbols to add, so it struggles on foreign borrowings or other words that were not seen during training and whose nikud must be memorized. A promising direction would be to augment UNIKUD with a retrieval component that allows learned access to a dictionary file, similar to recent research on Retrieval-Augmented Generation where a model is given dynamic access to Wikipedia to extract relevant knowledge.
  • UNIKUD outputs probabilities for each character, and during decoding we simply treat each of these probabilities separately using a variant of greedy decoding. Nikud have various co-occurrence restrictions that could be taken into account while decoding to generate more logical output, for example using a CRF (Conditional Random Field) layer with Viterbi decoding. We also cannot easily generate multiple candidate outputs for ambiguous text (e.g. "ספר"), as can be done with autoregressive text generation models via beam search.
  • As a large deep learning model, UNIKUD is substantially slower at performing inference (adding vowels to text) when it is run on CPU rather than GPU. Running on long texts (thousands of characters or more) also requires splitting the text into segments and runtime is highly dependent on how these segments are built and fed into the model for batched inference. We made no attempt to optimize these aspects of UNIKUD.

7. Conclusion

In this project, we presented the UNIKUD model which adds nikud to Hebrew text, an open-source nakdan built with deep learning and using no hand-written rules specific to Hebrew. UNIKUD was built with a CANINE transformer backbone and required a complex data pipeline for training.

We provide our data files, model, and results on the UNIKUD DagsHub repository and in the Huggingface Spaces demo linked above.

We hope that our model will prove useful for those working with Hebrew NLP and as a test case for under- and mid-resourced languages.

We wish to thank DagsHub for sponsoring this project as part of its creators program and for providing the platform that streamlined its development.


8. References

Morris Alper. "Writing Systems of the Worlds’ Languages." 2012. https://github.com/morrisalp/WSOWL

Peter Bloem. "Transformers From Scratch." 2019. http://peterbloem.nl/blog/transformers

Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting: "CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation", 2021; arXiv:2103.06874.

Elazar Gershuni, Yuval Pinter: "Restoring Hebrew Diacritics Without a Dictionary", 2021; arXiv:2105.05209.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", 2020; arXiv:2005.11401.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: "Attention Is All You Need", 2017; arXiv:1706.03762.


¹ But see Gershuni & Pinter (2021) for a recent exception, using a somewhat different approach than ours.


Morris Alper is a data scientist located in Tel Aviv, Israel. He is the Data Science Lead and Lecturer at Israel Tech Challenge and a current MSc student of Computer Science at Tel Aviv University. Inquiries can be directed via email ([email protected]) or LinkedIn.


