
Introduction
In a previous post, we introduced Spell Checking and the need to consider contextual clues when generating corrections. We also discussed the pre-trained Spell Checking models available in the Spark-NLP library. Although pre-trained models are convenient in many situations, sometimes you need to target a specific domain like finance or a specific language like German. In this post we’re going to explore training a Contextual Spell Checker from scratch.
We will delve into the training process and data modeling involved in creating a Contextual Spell Checking model. We will be using generic Italian as our target language, but these steps can be applied to any other domain or language of your interest.
We will use the PAISÀ Corpus as our training dataset, add custom word classes to handle things like places and names, and finally see how to use the GPU with Spark-NLP to speed up training. Let’s get started!
Code for this article
You can find the entire code for this article, along with some additional examples, in the following notebook,
Training a Context Spell Checker – Italian.
Data Preparation
We’re going to download the corpus file from this URL, and then apply some cleaning to the raw text to get rid of some noise.
Let’s get started by adding some basic imports,
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
import sparknlp
spark = sparknlp.start()
Let’s continue by loading the corpus data.
from pyspark.sql.functions import *
paisaCorpusPath = "/path/to/paisa.raw.utf8"
# do some brief DS exploration, and preparation to get clean text
df = spark.read.text(paisaCorpusPath)
df = df.filter(~col('value').contains('</text')) \
       .filter(~col('value').contains('<text')) \
       .filter(~col('value').startswith('#')) \
       .limit(10000)
df.show()
And this is what we get,
+--------------------+
| value|
+--------------------+
|Davide Guglielmin...|
|Avete partecipato...|
|Otto mesi di rita...|
|Il prezzo dei big...|
|Dopo la "Shopping...|
|Il nome è tutto u...|
|Ovvero? Ovvero do...|
|Il patto ha lo sc...|
|La tre giorni org...|
|Quanto pagate per...|
|Per contro, nell'...|
|Non è poco, ovvia...|
|Dopo una lunga as...|
|La serata sarà un...|
|I primi 4 negozi ...|
|La Chaingang Rota...|
|La Chaingang Rota...|
|Tra le molte rami...|
|Un primo corollar...|
|E' proprio nel di...|
+--------------------+
only showing top 20 rows
We have 7,555,997 paragraphs there, each containing a few sentences, and since training does not require labeled data, that’s all we need.
Basic Pipeline Setup
Now that we have our data in good shape, we’re going to set up a basic pipeline to do our processing. This is the typical Spark-NLP pipeline, with data flowing as annotations through annotators.
assembler = DocumentAssembler() \
    .setInputCol("value") \
    .setOutputCol("document")

tokenizer = RecursiveTokenizer() \
    .setInputCols("document") \
    .setOutputCol("token") \
    .setPrefixes(["\"", "“", "(", "[", "\n", ".", "l'", "dell'", "nell'", "sull'", "all'", "d'", "un'"]) \
    .setSuffixes(["\"", "”", ".", ",", "?", ")", "]", "!", ";", ":"])
spellChecker = ContextSpellCheckerApproach() \
    .setInputCols("token") \
    .setOutputCol("corrected") \
    .setLanguageModelClasses(1650) \
    .setWordMaxDistance(3) \
    .setEpochs(10) \
    .addVocabClass('_NAME_', names) \
    .updateRegexClass('_DATE_', date_regex)
We start with the DocumentAssembler, to which we feed the data we prepared in the previous step. Then we continue with the RecursiveTokenizer, where we set some special options to deal with particulars of the Italian language.
Finally, we have our Spell Checker. The first two method calls are the usual ones for connecting inputs and outputs in a pipeline. Then comes setLanguageModelClasses(), a setting for the language model inside the Spell Checker; it depends on the vocabulary size, and the model uses it to control factoring in the language model.
Basically, the idea behind factoring is that we won’t treat words in isolation; instead we group them, assigning each word a specific class and an id within that class. This speeds up the internal processing both during training and during inference.
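To make the factoring idea concrete, here is a toy sketch of the mapping, purely illustrative and not Spark-NLP’s internal code: each word gets a (class id, id within class) pair, so the language model first picks a class and then a word inside it,
# Toy illustration of factoring (NOT Spark-NLP internals): map every
# vocabulary word to a (class_id, word_id_within_class) pair.
vocab = ["casa", "cane", "gatto", "mangia", "dorme", "rosso", "verde"]
num_classes = 3  # we passed 1650 to setLanguageModelClasses() for the full vocabulary

class_size = len(vocab) // num_classes + 1
word2ids = {word: divmod(i, class_size) for i, word in enumerate(sorted(vocab))}
print(word2ids)
# {'cane': (0, 0), 'casa': (0, 1), 'dorme': (0, 2), 'gatto': (1, 0), ...}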
Next comes setWordMaxDistance(3), which simply tells the Spell Checker that our errors will be at an edit distance of 3 at most.
The number of epochs is the typical gradient descent parameter that you know from neural networks, and it will be used by the neural language model inside the Spell Checker.
Last but not least, we add a special word class, in this case for handling names. Let’s get into more detail there!
Adding special classes
This section is optional; it explains the addVocabClass() and updateRegexClass() calls that we made at the end of the annotator setup code. If you don’t want to add special classes, or you prefer to come back to this later, just comment out those two lines.
At this point we have a corpus that is large enough to give us good exposure to the different flavors of the Italian language.
Still, we may want to have another mechanism for building our model vocabulary outside what we’ve seen in the corpus. This is where special classes come into play.
With this, we can unlock the power of knowledge contained in specific datasets like name dictionaries or gazetteers.
They will not only allow us to teach the spell checker to preserve certain words, but also to propose them as corrections. Unlike the words in the main vocabulary, you will be able to update them once the model is trained. For example, if you need to support a new name after training, you can do so by updating the names class alone, without re-training everything from scratch.
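As a sketch of that update-after-training workflow, the fitted ContextSpellCheckerModel exposes update methods for special classes; the exact method name and signature may vary across Spark-NLP versions, so treat the following as an assumption to check against the documentation,
# Assumption: updateVocabClass() is available on the trained ContextSpellCheckerModel
# (verify against your Spark-NLP version's docs). 'model' is the fitted pipeline
# obtained further below with pipeline.fit(df).
trained_spell = model.stages[-1]  # the spell checker stage of the fitted pipeline
trained_spell.updateVocabClass('_NAME_', ['Gianluigi', 'Rosalinda'], append=True)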
When you train a Spell Checker from scratch, as we’re doing here, you get two special classes predefined: _DATE_ and _NUM_, for handling dates and numbers respectively. You can override them and also add more classes.
Following our example about names, we will add Italian names as a special class in our Context Spell Checker in Spark-NLP,
import pandas as pd
import io
import requests
# Get a list of common Italian names
url="https://gist.githubusercontent.com/pdesterlich/2562329/raw/7c09ac44d769539c61df15d5b3c441eaebb77660/nomi_italiani.txt"
s=requests.get(url).content
# skip the first few lines of the file (comments) & capitalize the first letter
names = [name[0].upper() + name[1:] for name in s.decode('utf-8').split('\n')[7:] if name]
# visualize
names
This is what we get (truncated),
['Abaco',
'Abbondanza',
'Abbondanzia',
'Abbondanzio',
'Abbondazio',
'Abbondia',
'Abbondina',
'Abbondio',
'Abdelkrim',
'Abdellah',
'Abdenago',
'Abdon',
'Abdone',
'Abela',
'Abelarda',
'Abelardo',
'Abele',
And this is what we pass in the names variable to the Spell Checker during the annotator setup in the previous listing.
If you paid attention to how we set up the annotator, you may also have noticed that we configured a second special class, for dates, using the updateRegexClass() function call.
This serves the same purpose of adding an additional source of correction candidates, but this time for dates, so we need to pass a regex describing the dates that we want to cover.
We will focus on a specific format here, but you can add your own as well,
format: dd/mm/yyyy
regex:
([0-2][0-9]|30|31)/(01|02|03|04|05|06|07|08|09|10|11|12)/(19|20)[0-9]{2}
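For reference, this is how the pattern can be assigned to the date_regex variable used in the annotator setup above,
# Finite regex covering dates in dd/mm/yyyy format, fed to the _DATE_ special class
date_regex = "([0-2][0-9]|30|31)/(01|02|03|04|05|06|07|08|09|10|11|12)/(19|20)[0-9]{2}"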
It’s important to understand what happens behind the scenes with the regex we’ve just created. The regex is finite, in the sense that the vocabulary, or set of strings, it matches is finite. This matters because the Spell Checker will enumerate the regex and build a Levenshtein automaton with it.
Training the Model
Now that we have our pipeline ready, we will just call fit() and pass in our dataset.
from pyspark.ml import Pipeline

pipeline = Pipeline(
    stages = [
        assembler,
        tokenizer,
        spellChecker
    ])
model = pipeline.fit(df)
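Once fit() returns, you may want to persist the fitted pipeline so you don’t have to retrain every time; this is standard Spark ML serialization, and the path below is only a placeholder,
# Save the fitted pipeline to disk (placeholder path) and reload it later
model.write().overwrite().save("/tmp/italian_spell_pipeline")

from pyspark.ml import PipelineModel
loaded_model = PipelineModel.load("/tmp/italian_spell_pipeline")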
Using the GPU
You will notice that the training process is not very fast. That’s because so far we’ve only used the CPU. We can significantly speed up training just by replacing the spark-nlp library with spark-nlp-gpu.
You can do this with almost no code changes, just by passing the right flag to the Spark-NLP start() function like this,
sparknlp.start(gpu=True)
Playing with the model
Let’s see what this model can do!
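The lp object used below is a LightPipeline wrapping the fitted pipeline, which is the usual way to annotate plain strings without building a DataFrame; a minimal sketch, assuming the model variable from the training step,
from sparknlp.base import LightPipeline

# Wrap the fitted pipeline so we can annotate raw strings directly
lp = LightPipeline(model)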
lp.annotate("sonno Glorea ho lasciatth la paterte sul tavolo acanto allu fruttu")['corrected']
['sono',
'Gloria',
'ho',
'lasciato',
'la',
'patente',
'sul',
'tavolo',
'accanto',
'alla',
'frutta']
We can see that the model is indeed generating corrections that consider the context. For example, ‘allu’ was corrected to ‘alla’ even though other options like ‘alle’ or ‘allo’ were also at an edit distance of 1 from the input word; the model still made the right choice. We can also see the names word class helping the model understand and correct the proper name ‘Gloria’.
In general, this is what we can expect from this model: making the right choice when multiple similar, equally distant candidates are possible. The model can also make riskier decisions, like replacing input words that are already present in the vocabulary, but that is closer to the job of a grammar checker; think of Grammarly, for example. Still, if you want to explore how far you can stretch this model to perform such corrections, there are more examples in the companion notebook for this article.
Access the model through the Model Hub
Although this model is not perfect, I have uploaded it to the Model Hub and made it available to the community in case somebody finds it useful. You can access it in the following manner,
ContextSpellCheckerModel.pretrained("italian_spell", "it")
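If you want to plug the pretrained model into your own pipeline, the wiring looks roughly like this; a sketch that reuses the document and token stages from the training pipeline and assumes the model name shown above,
# Load the pretrained Italian spell checker and connect it after a tokenizer
italianSpell = ContextSpellCheckerModel.pretrained("italian_spell", "it") \
    .setInputCols("token") \
    .setOutputCol("corrected")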
You can also train your own models and share them with the community by following these steps.
A word on Tokenization
This section is optional, but recommended for those trying to fine-tune their models for a particular language or domain.
Tokenization is crucial in many NLP tasks, and Spell Checking is no exception. The units you get out of tokenization are very important, as they define the level at which your model will detect errors, generate corrections, and leverage the context to rank different corrections.
Here we’re using the RecursiveTokenizer, so called because it keeps applying its tokenization strategy recursively, that is, to the results of previous tokenization steps, until the strategy cannot be applied anymore.
The default behavior of the RecursiveTokenizer is to split on whitespace. To tokenize further, you can specify infix patterns, prefix patterns, and suffix patterns.
Infix patterns are characters that can appear in the middle of a token and on which you want to split. Accordingly, prefix and suffix patterns are characters you would like to split on that appear at the beginning and at the end of each token, respectively.
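As an illustration, and assuming setInfixes() is available on the RecursiveTokenizer in your Spark-NLP version, here is a sketch of adding an infix pattern so that hyphenated forms are split into separate tokens,
# Illustrative only: also split on hyphens found inside tokens, e.g. "socio-economico"
tokenizer = RecursiveTokenizer() \
    .setInputCols("document") \
    .setOutputCol("token") \
    .setInfixes(["-"])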
There are more options for the RecursiveTokenizer that you can find in the documentation, and there are other tokenizers available to you in the Spark-NLP library as well.
Conclusion
We’ve explored how to train our own Spell Checker: how to prepare the data and the pipeline, how to provide rules for special classes, how to use the GPU, and how to adjust tokenization to better match our particular needs.
The Spark-NLP library provides some reasonable default behaviors, but can also be adjusted for specific domains and use cases.
We hope you can start training and sharing your own models. In bocca al lupo (good luck)!