Introducing an open-source Hebrew nakdan (נקדן) tool which uses no rule-based logic.

Natural Language Processing (NLP) research has focused heavily on the English language because of its widespread usage and economic importance, but recent progress in NLP has made it easier to tackle the unique challenges posed by thousands of other languages used across the world. In this project, we applied a state-of-the-art Deep Learning architecture to a particular problem in the Hebrew language – adding vowel signs (nikud; ניקוד) to text – showing that we can leverage the limited available data for smaller languages using powerful machine learning models.

We present our model, UNIKUD, an open-source tool for adding vowel signs to Hebrew text using absolutely no rule-based logic, built with a CANINE transformer network. An interactive demo is available at Huggingface Spaces, and all of our data, preprocessing and training scripts, and experiments are available on the UNIKUD DagsHub repository page. We hope that UNIKUD will be a springboard for further progress in Hebrew NLP and in under- and mid-resourced languages in general.
Contents:
- The Hebrew Writing System
- Introduction to UNIKUD
- UNIKUD Datasets
- Methods
- Results
- Limitations and Further Directions
- Conclusion
- References
1. The Hebrew Writing System
To understand the problem that UNIKUD addresses, we first must digress briefly to discuss the Hebrew writing system.
The Hebrew language is a Semitic language with its origins in the ancient Levant region. Currently the most widely spoken language in Israel as well as the liturgical language of the Jewish diaspora, Hebrew has approximately 5 million native speakers as well as millions of second-language learners (as of 2014, per Wikipedia). The Modern Hebrew writing system uses the Hebrew alphabet, which is also used for writing Yiddish, Ladino, and various other living languages, as well as classical Rabbinic literature in Aramaic.
Unlike the Latin alphabet used for English, the Hebrew alphabet is an abjad, meaning that letters represent consonants and vowels are normally unwritten. To illustrate, let’s consider the Hebrew word לחם "lehem" which means "bread":

Hebrew is written from right to left, so the reader observes the letters L-H-M in that order. You might wonder how the reader knows that the word is pronounced "lehem" and not, say, *"laham". In fact, this must be inferred from context. If this sounds strange, keep in mind that even in reading English we must memorize the spellings and pronunciations of many words, so reading English also requires context – but reading Hebrew requires much more context.
Hebrew can also be written with vowel points, known as nikud in Hebrew. These were invented long after the Hebrew script itself and are not used in ordinary writing. Instead, they are added to texts where precision or clarity is important, such as dictionaries, language learning materials, and poetry. For example, when written with nikud, the word לחם lehem appears as:

or in digital text: לֶחֶם. The three dots underneath the first two letters indicate that they are followed by an "e" vowel.
The same unvocalized (no vowel points) written word in Hebrew could correspond to many possible vocalized words depending on context. For example, the written word ספר S-P-R could be read as סֵפֶר sefer ("book"), סָפַר safar ("counted"), סַפָּר sapar ("barber"), or סְפָר sfar ("frontier"). The objective of the UNIKUD model is to automatically add the correct vowel pointing to Hebrew text, using context to determine the vocalization of each word.
Side note: There are other abjads in use today, most notably the Arabic script used for writing the Arabic language. Arabic and other languages using abjads typically also have similar optional diacritics for marking vowels, and we believe that UNIKUD could be adapted for these writing systems with minor adjustments. If you’d like more information about writing systems across the world including abjads, check out my 2012 MIT Splash presentation: Writing Systems of the World’s Languages.
2. Introduction to UNIKUD
We have seen that the Hebrew language is normally written without vowels but requires context to be read; a complex system of vowel diacritics can be used to help the reader, but adding these vowels requires a deep understanding of Hebrew and of the text at hand. Various nakdanim (נקדנים) – tools for automatically adding nikud (ניקוד; vowel marks) to texts – are available, but until recently most were proprietary and/or programmed using a complex set of hand-written rules that are specific to the Hebrew language and extremely laborious to produce.¹
We propose the UNIKUD model which uses absolutely no hand-written rules and instead learns how to add vowel marks to texts by itself, by training on a dataset of existing Hebrew texts. We chose the name UNIKUD as a triple play on words – a UNIque model for adding NIKUD (ניקוד, Hebrew vowel marks) to text using the UNIcode text standard.
The inspiration for this model is the recent paradigm shift in NLP brought about by transformer neural networks. Introduced by a team of researchers from Google in the 2017 paper Attention Is All You Need, transformer models such as BERT have advanced the state of the art on almost every NLP benchmark task, from machine translation to question answering and beyond. The key feature of these models is a self-attention mechanism that extracts contextual meaning from complex inputs encoded as vectors; for technical details the reader can refer to Peter Bloem's lucid 2019 post "Transformers From Scratch".
Most transformer models have relied upon "subword tokenization" as a preprocessing step for textual input. In NLP, tokenization refers to splitting text into smaller units – for example, word tokenization would split the sentence `I love tacos.` into the word units `I`, `love`, `tacos`, and `.` (punctuation symbols are usually treated as separate word tokens). Models like BERT use subword tokenization, meaning that more complex or uncommon words may be split into "subword" units – e.g. `tacos` might be tokenized as the two units `ta` and `##cos`.
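As a concrete illustration using the Huggingface `transformers` library (the exact subword split depends on the pretrained model's vocabulary, so the `ta` + `##cos` split is only illustrative):

```python
# Minimal illustration of word vs. subword tokenization.
# Requires the `transformers` library; the exact subwords produced depend on
# the pretrained model's vocabulary.
from transformers import AutoTokenizer

sentence = "I love tacos."

# Naive word tokenization: whitespace split with punctuation separated.
print(sentence.replace(".", " .").split())   # ['I', 'love', 'tacos', '.']

# WordPiece subword tokenization as used by BERT; an uncommon word may be
# split into pieces (e.g. 'ta' + '##cos').
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize(sentence))
```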
However, word or even subword tokenization makes less sense for languages like Hebrew where a single word can be modified by internal changes to add complex meaning. For example, the Hebrew root כת״ב K-T-V refers to writing; by adding vowels inside this root we can create the complex words כָּתַב KaTaV "he wrote", יִכְתֹּב yiKhToV "he will write", מִכְתָּב miKhTaV "a letter", and more.
In their March 2021 paper, Clark et al. from Google Research introduced the CANINE transformer model, whose name stands for Character Architecture with No tokenization In Neural Encoders. Instead of taking subword tokens as input, CANINE takes in raw characters, and learns an embedding (vector representation) for each Unicode codepoint. The model is intended for languages such as Hebrew with complex morphology (word formation).
As an autoencoding character-level transformer model, CANINE is particularly suited for Hebrew vocalization since we can frame this problem as a character token classification problem. Our UNIKUD model uses CANINE as its language backbone, initialized with pretrained weights. We add a classification head on top of CANINE to convert its contextual embeddings into vowel mark predictions.
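As a rough sketch of this kind of architecture (assuming the Huggingface `CanineModel` API; this is not the exact UNIKUD implementation), a linear classification head can be stacked on top of the CANINE backbone to produce per-character predictions:

```python
# Sketch of a CANINE backbone with a per-character classification head.
# Illustrative only; the actual UNIKUD model code may differ in details.
import torch.nn as nn
from transformers import CanineModel

class CharacterClassifier(nn.Module):
    def __init__(self, num_labels: int):
        super().__init__()
        self.backbone = CanineModel.from_pretrained("google/canine-c")
        self.head = nn.Linear(self.backbone.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        # One contextual embedding per input character (Unicode code point).
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        return self.head(hidden)  # per-character logits
```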
We will see that training UNIKUD requires a complex data pipeline – first, collecting and normalizing Hebrew text with vowels; second, training an auxiliary model to rewrite these texts in "full spelling" (כתיב מלא); and finally, training the core UNIKUD model to predict vowel marks using our processed data. We managed this multi-stage pipeline using DagsHub, a platform which hosts repositories containing both Git-versioned script files and DVC-versioned data files and pipeline definitions. We provide the trained UNIKUD model as well as instructions for reproducing its preprocessing and training steps.
3. UNIKUD Datasets
Deep learning models are data-hungry and UNIKUD is no exception. In order to train UNIKUD to learn how to vocalize Hebrew text, we needed to collect a dataset of Hebrew text with vowel points. Hebrew is normally written without vowel signs in everyday usage, so it required careful searching to find sufficient quantities of digitized Hebrew text with vowels. We curated such data from public-domain sources, and provide them as DVC-versioned data in our repository for the benefit of the Hebrew NLP community.
We collected text with nikud (vowel marks) from the following sources:
- Hebrew Wikipedia (most text does not have vowels, but we searched for and extracted texts with vowels)
- The Ben-Yehuda project (an initiative to digitize classic Hebrew literature; we only used public-domain sources from this project, which often contain vowel marks)
The basic idea of UNIKUD is that we can use text with nikud for "self-supervised learning", to use the terminology of modern deep learning research. We can use the text itself to create the input and desired output of our model: e.g. a word סֵפֶר "sefer" with vowels can be converted to unvocalized ספר by automatically removing the vowel characters; the latter will be the model’s input and the former its desired output.
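As a minimal sketch of this idea (the actual UNIKUD preprocessing scripts are more involved), nikud can be stripped from vocalized text to create input/target pairs:

```python
# Create a (model input, desired output) pair from vocalized Hebrew text by
# removing nikud. Nikud are Unicode combining marks (category "Mn"), so this
# simple filter also removes cantillation marks if present.
import unicodedata

def strip_nikud(text: str) -> str:
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

target = "סֵפֶר"                    # vocalized word ("sefer", book) – desired output
model_input = strip_nikud(target)  # "ספר" – unvocalized model input
print(model_input, "→", target)
```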

However, there is another feature of Modern Hebrew writing which complicates this in some cases: Over time, scribes began to add extra letters to unvocalized texts to hint at the presence of vowels. Known as matres lectionis (Hebrew אימות קריאה, literally "mothers of reading"), the letters yud (י) and vav (ו) often must be removed when vowels are added to text, as shown in the following example:

The spelling of words with these extra letters is colloquially called "full spelling" (ktiv male כתיב מלא, more formally כתיב חסר ניקוד) in Hebrew. Because we needed text in full spelling as the input for UNIKUD, we had to train an auxiliary model to add these extra letters to text with vowels. Therefore we needed training data of pairs of words in full and defective (not full) spellings. We collected this data from two sources:
- Hebrew Wiktionary (ויקימילון): Most entries are in defective spelling, with vowels in the article headings and a side bar listing the corresponding full spelling.
- Hebrew Wikisource (ויקימקור): A number of articles on classic poems and other sources contain parallel versions of the same text with both full spelling (no vowels) and defective spelling (with vowels).
We collated data from both of these sources and arranged them in a machine-readable format.
As the first preprocessing stage in our data and training pipeline, we combined data from these sources together and lightly cleaned the texts. The most important preprocessing steps were:
- Removing source-specific boilerplate, such as template tags encased in curly braces (`{{}}`) from Wikipedia/Wiktionary/Wikisource data.
- Unicode normalization – some characters can be represented in multiple ways in raw Unicode. For example, Hebrew bet with dagesh can be represented either as בּ = U+05D1 (bet) + U+05BC (dagesh), or as בּ = U+FB31 (bet with dagesh). The two representations look the same visually, but the first is made up of two Unicode code points while the second is a single code point representing the composed character. We used NFC Unicode normalization to unify such cases (see the sketch after this list).
- Combining the two code points U+05BA and U+05B9, which can both be used for the single holam vowel point in Hebrew.
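A minimal sketch of this cleanup step, using Python's standard `unicodedata` module (the actual preprocessing scripts in the repository may differ in details):

```python
# NFC normalization unifies precomposed presentation forms such as
# U+FB31 (bet with dagesh) with the equivalent bet + dagesh sequence,
# and we additionally map the rarer holam code point U+05BA onto U+05B9.
import unicodedata

def clean_hebrew(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    return text.replace("\u05BA", "\u05B9")

print(clean_hebrew("\uFB31") == clean_hebrew("\u05D1\u05BC"))  # True
```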
Because the CANINE backbone model accepts Unicode character input with no subword tokenization required, we left punctuation and other complex characters as-is.
4. Methods
The key insight of UNIKUD is that we may treat vocalizing Hebrew text as a multilabel token classification problem, where the tokens are individual characters. A single Hebrew consonant character may be decorated with zero, one, or multiple nikud marks, which can be represented by a binary vector:

The UNIKUD model consists of a CANINE transformer backbone with a classification head that has 15 outputs for each input character token. The first 14 outputs roughly correspond to the Unicode code points U+05B0 through U+05C2, which are used to encode the various Hebrew nikud (see Wikipedia's Hebrew Unicode chart for details). The 15th and final output gives the probability that the character should be deleted, as is needed when it is an extra yud or vav letter used for "full spelling" (see above).
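To illustrate the label encoding, here is a plausible reconstruction (not necessarily the exact label set used in the UNIKUD code): merging the two holam code points leaves 14 nikud classes, plus one deletion flag.

```python
# Build the 15-dimensional binary target vector for a single base character:
# 14 indicators for nikud marks plus one "delete this character" indicator.
# The nikud set below (U+05B0–U+05BC without U+05BA, plus the shin/sin dots
# U+05C1 and U+05C2) is a plausible reconstruction, not the exact UNIKUD list.
NIKUD = [chr(cp) for cp in range(0x05B0, 0x05BD) if cp != 0x05BA] + ["\u05C1", "\u05C2"]
assert len(NIKUD) == 14

def char_target(marks: set, delete: bool = False) -> list:
    vec = [1 if n in marks else 0 for n in NIKUD]
    vec.append(1 if delete else 0)
    return vec

# The letter ס in סֵפֶר carries a tsere (U+05B5):
print(char_target({"\u05B5"}))
```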
Although there are various restrictions on which marks may appear together (for example, dots which only appear on the letters shin or sin), we did not hard-code them into our model. We found that UNIKUD was able to learn these restrictions on its own.
We implemented this model using PyTorch and Huggingface's `transformers` library. In particular, we used the Huggingface port of CANINE initialized with pretrained weights (`canine-c`), and trained using binary cross-entropy loss. More details on optimization and training hyperparameters can be viewed at the UNIKUD DagsHub experiments page.
As mentioned above, preparing the training data for UNIKUD required first training an auxiliary model – `KtivMaleModel` in our code – to convert text with vowels to "full spelling" (ktiv male כתיב מלא). We built this model using another copy of pretrained CANINE with a three-class classification head: for each character in the input, we predicted either 0 (no change), 1 (insert י yud), or 2 (insert ו vav). For example, given the input מְנֻקָּד the model would predict 0 for all characters except the letter ק, which would be trained to predict 2 (to insert ו vav), yielding the full spelling מנוקד (vowels omitted). Once this model was trained, we used it to add "full spellings" to our dataset for use in training the main UNIKUD model.
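As a hypothetical illustration of how such per-character classes could be turned into full-spelling text (assuming here that a predicted 1 or 2 inserts the extra letter immediately before the flagged character, matching the example above; the convention in the actual `KtivMaleModel` code may differ):

```python
# Apply ktiv male predictions: 0 = keep, 1 = insert yud (י) before this
# character, 2 = insert vav (ו) before it. Hypothetical illustration only.
def apply_ktiv_male(chars: str, preds: list) -> str:
    out = []
    for ch, p in zip(chars, preds):
        if p == 1:
            out.append("\u05D9")  # yud
        elif p == 2:
            out.append("\u05D5")  # vav
        out.append(ch)
    return "".join(out)

# With nikud stripped for simplicity: מנקד with class 2 on ק gives מנוקד.
print(apply_ktiv_male("מנקד", [0, 0, 2, 0]))
```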
Our complete pipeline can be conveniently viewed as a directed graph:

If our repository is cloned from DagsHub, each step of this pipeline can be reproduced with DVC, e.g. `dvc repro preprocess` runs data preprocessing, `dvc repro train-ktiv-male` trains the "full spelling" model using the preprocessed data, and so forth. As seen in the graph, the output of the final step of the pipeline is the trained UNIKUD model.
Once the model is trained, it can be used to add nikud to texts using a simple decoding method. Since we used binary targets with binary cross-entropy loss, the model's raw outputs are logits, which can be converted (via a sigmoid) into probabilities of the presence of each of the 14 different nikud types, or of character deletion. In decoding, we modify the input text with the actions corresponding to outputs above a fixed probability threshold (0.5), as sketched below. For more details, see the decoding method defined within the `NikudTask` object in our code.
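The following is a simplified stand-in for this decoding logic (not the exact `NikudTask` implementation), applied to a single character:

```python
# Thresholded decoding for one character: apply a sigmoid to its 15 logits,
# drop the character if the deletion output exceeds the threshold, otherwise
# append every nikud mark whose probability exceeds the threshold.
import torch

def decode_char(ch: str, logits: torch.Tensor, nikud: list, threshold: float = 0.5) -> str:
    probs = torch.sigmoid(logits)      # 14 nikud probabilities + 1 deletion
    if probs[-1] > threshold:
        return ""                      # delete extra yud/vav of ktiv male
    marks = [m for m, p in zip(nikud, probs[:-1]) if p > threshold]
    return ch + "".join(marks)
```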
5. Results
We first present quantitative results, and then qualitative evaluation of selected examples. The results of training with different hyperparameter settings are visible in our repository’s DagsHub experiment table:

We will just discuss the final UNIKUD model training (DagsHub experiment table label: `unikud`), but you may also see the results of the ktiv male model training there, under its own label. The final settings used in deployment are tagged with the label `deployed`.
We trained our model with binary cross-entropy loss, and also tracked accuracy and macro-F1 metrics. We held out 10% of our data for validation; validation metrics are visible in the `eval_loss`, `eval_accuracy`, and `eval_macro_f1` columns. Accuracy is interpreted as the proportion of characters in the validation set whose labels (possibly multiple nikud, or deletion) were perfectly predicted. Macro-F1 is the F1 score averaged across the different label types, which is often a more meaningful metric than accuracy for classification problems with imbalanced data (as in our case, where some nikud are far less common than others). Ideally, validation accuracy and macro-F1 should both be close to 1, while validation loss should be as close as possible to 0.
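As a toy illustration of how these metrics behave (not the exact evaluation code), exact-match accuracy and macro-F1 over multilabel targets can be computed with scikit-learn:

```python
# Toy example of the validation metrics: exact-match accuracy over characters
# and macro-averaged F1 over label types, using scikit-learn.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])  # toy multilabel targets
y_pred = np.array([[1, 0, 0], [0, 0, 0], [0, 0, 1]])  # toy predictions

accuracy = (y_true == y_pred).all(axis=1).mean()      # fraction perfectly predicted
macro_f1 = f1_score(y_true, y_pred, average="macro")  # F1 averaged over labels
print(accuracy, macro_f1)
```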
Our final model achieved a validation loss of 0.0153, accuracy of 0.9572, and macro-F1 of 0.9248. Its experiment page shows how these metrics improved throughout training, as logged by MLflow:

For qualitative evaluation, we present the output of our model on various inputs representing different text styles:
כל בני אדם נולדו בני חורין ושווים בערכם ובזכויותיהם. כולם חוננו בתבונה ובמצפון, לפיכך חובה עליהם לנהוג איש ברעהו ברוח של אחוה.
כָּל בְּנֵי אָדָם נוֹלְדוּ בְנֵי חוֹרִין וְשָׁוִים בְּעָרְכָּם וּבִזְכֻוּיּוֹתֵיהֶם. כֻּלָּם חֻוֹנְנוּ בִּתְבוּנָה וּבְמַצְפּוּן, לְפִיכָךְ חוֹבָה עֲלֵיהֶם לִנְהֹג אִישׁ בְּרֵעֵהוּ בְּרוּחַ שֶׁל אַחֲוָה.
האם תנין הוא יותר ארוך או יותר ירוק?
הַאִם תַּנִּין הוּא יוֹתֵר אָרךְ אוֹ יוֹתֵר יָרוֹק?
לנמלה יש שש רגליים, אבל לי יש רק שתיים.
לַנְמָלָה יֵשׁ שֵׁשׁ רַגְלַיִם, אֲבָל לִי יֵשׁ רַק שְׁתַּיִם.
נחשים באוסטרליה, עכבישים באפריקה: מפגשים עם חיות קטלניות שכמעט נגמרו רע מאוד
נְחָשִׁים בְּאוֹסְטְרְלִיָה, עַכָּבִישִׁים בְּאַפְרִיקָה: מְפַגְּשִׁים עִם חַיּוֹת קַטְלָנִיּוֹת שֶׁכִּמְעַט נִגְמְרוּ רַע מֵאוֹד
אח שלי תודה נשמה וואלה אחלה גבר אתה
אָח שֶׁלִּי תּוֹדָה נְשָׁמָה ואֵלָה אַחלֶה גֶּבֶר אֶתָה
שלום חברים
שְׁלוֹם חַבְרִים
חתול
חתּוֹל
Our main observations from testing UNIKUD on these and various other inputs:
- We found that UNIKUD usually performed better when the input was longer and contained more context. When given very short input (one or only a couple of words), it would sometimes leave nikud incomplete or even make mistakes, while it performed better on average on longer input sentences or texts, as seen above.
- The ktiv male step in the training pipeline is important. We see that UNIKUD does learn to remove yud and vav in words such as כולם-כֻּלָּם and רגליים-רַגְלַיִם above. In prior experiments where we trained UNIKUD without this step, it was unable to properly add nikud to such words because it never encountered ktiv male spellings during training.
- Outputs appear more accurate on average for formal or classical texts, while UNIKUD struggles with informal Modern Hebrew. Compare the relatively high-quality output for the formal texts above to the mistakes on slang terms like וואלה.
- The model sometimes does not output any nikud when it cannot predict a vowel with high confidence. This can be adjusted by changing the probability threshold for decoding, which is 0.5 by default.
You may also try the model out yourself at its Huggingface Spaces Streamlit deployment:

6. Limitations and Further Directions
We briefly mention a few limitations of the UNIKUD model and possible directions for future research:
- UNIKUD only uses the surface form of the text it is fed to guess the proper nikud symbols to add, so it struggles on foreign borrowings or other words that were not seen during training and whose nikud must be memorized. A promising direction would be to augment UNIKUD with a retrieval component that allows learned access to a dictionary file, similar to recent research on Retrieval-Augmented Generation where a model is given dynamic access to Wikipedia to extract relevant knowledge.
- UNIKUD outputs probabilities for each character, and during decoding we simply treat each of these probabilities separately using a variant of greedy decoding. Nikud have various co-occurrence restrictions that could be taken into account while decoding to generate more consistent output, for example using a CRF (Conditional Random Field) layer with Viterbi decoding. We also cannot easily generate multiple candidate outputs for ambiguous text (e.g. "ספר"), as can be done with autoregressive text generation models via beam search.
- As a large deep learning model, UNIKUD is substantially slower at performing inference (adding vowels to text) when it is run on CPU rather than GPU. Running on long texts (thousands of characters or more) also requires splitting the text into segments and runtime is highly dependent on how these segments are built and fed into the model for batched inference. We made no attempt to optimize these aspects of UNIKUD.
7. Conclusion
In this project, we presented the UNIKUD model which adds nikud to Hebrew text, an open-source nakdan built with deep learning and using no hand-written rules specific to Hebrew. UNIKUD was built with a CANINE transformer backbone and required a complex data pipeline for training.
We provide our data files, model, and results at the following links:
- DagsHub repository (code, data, and experiments): https://dagshub.com/morrisalp/unikud
- Streamlit deployment: https://huggingface.co/spaces/malper/unikud
We hope that our model will prove useful for those working with Hebrew NLP and as a test case for under- and mid-resourced languages.
We wish to thank DagsHub for sponsoring this project as part of its creators program and for providing the platform that streamlined its development.
8. References
Morris Alper. "Writing Systems of the World's Languages." 2012. https://github.com/morrisalp/WSOWL
Peter Bloem. "Transformers From Scratch." 2019, http://peterbloem.nl/blog/transformers.
Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting: "CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation", 2021; arXiv:2103.06874.
Elazar Gershuni, Yuval Pinter: "Restoring Hebrew Diacritics Without a Dictionary", 2021; arXiv:2105.05209.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", 2020; arXiv:2005.11401.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: "Attention Is All You Need", 2017; arXiv:1706.03762.
¹ But see Gershuni & Pinter (2021) for a recent exception, using a somewhat different approach than ours.
Morris Alper is a data scientist located in Tel Aviv, Israel. He is the Data Science Lead and Lecturer at Israel Tech Challenge and a current MSc student of Computer Science at Tel Aviv University. Inquiries can be directed via email ([email protected]) or LinkedIn:
Morris Alper – Data Science Lead and Lecturer – Israel Tech Challenge | LinkedIn