This article will guide you through the creation of a simple auto-correct program in Python. The project is to create two different spelling recommenders, each able to take in a user's input and recommend a correctly spelled word. Very cool!
Note: Test inputs: [‘cormulent’, ‘incendenece’, ‘validrate’].
Packages
This effort will be largely centered around the use of the nltk library. nltk stands for Natural Language Toolkit, and more info about what can be done with it can be found here.
Specifically, we'll be using the words, edit_distance, jaccard_distance and ngrams objects.
edit_distance and jaccard_distance are the metrics that will be used to determine which word is most similar to the user's input.
An n-gram is a contiguous sequence of n items from a given sample of text or speech. For example: "White House" is a bigram and carries a different meaning from "white house".
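To see what this means in code, here is a quick sketch of nltk's ngrams helper (the example strings are just illustrations):

from nltk.util import ngrams

# Word-level bigrams from a short phrase
print(list(ngrams("the white house".split(), 2)))
# [('the', 'white'), ('white', 'house')]

# Character-level trigrams of a single word (what the recommenders below will compare)
print(list(ngrams("house", 3)))
# [('h', 'o', 'u'), ('o', 'u', 's'), ('u', 's', 'e')]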
Additionally, we’ll also use pandas as a way to create an indexed series of the list of correct words.
words.words() gives a list of correctly spelled words, which is included in the nltk library as the words corpus. spellings_series is an indexed series of these words, with the output shown below the code chunk.
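A minimal sketch of this setup, assuming the variable names correct_spellings and spellings_series, looks like this:

import pandas as pd
import nltk
from nltk.corpus import words
from nltk.metrics.distance import edit_distance, jaccard_distance
from nltk.util import ngrams

nltk.download('words')  # the word corpus only needs to be downloaded once

correct_spellings = words.words()                 # list of correctly spelled words
spellings_series = pd.Series(correct_spellings)   # indexed series of those words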
spellings_series output
Recommender 1
Metric: Jaccard Distance
This is a measure of how dissimilar two sets are. I've attempted to explain it in plain English below:
The string in question will be compared iteratively with each word in spellings_series.
For each comparison instance, we count the number of letters (or, in our case, n-grams) shared by both sets, as well as the total number of distinct letters across both sets. We then divide the shared count by the total count, and multiply that by 100% to obtain a percentage similarity. This is the Jaccard Index.
The Jaccard Distance is a measure of how dissimilar two sets are, and can be found as the complement of the Jaccard Index (i.e. Jaccard Distance = 100% – Jaccard Index).
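To make this concrete, here is a small illustrative example using nltk's jaccard_distance on character trigrams (pairing 'cormulent' with 'corpulent' is just a hypothetical example):

from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams

a = set(ngrams("cormulent", 3))   # 7 distinct trigrams
b = set(ngrams("corpulent", 3))   # 7 distinct trigrams, 4 of which are shared with a

# Jaccard Index = 4 shared / 10 distinct = 0.4, so Jaccard Distance = 1 - 0.4 = 0.6
print(jaccard_distance(a, b))     # 0.6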
Defining a Jaccard function to iterate through possible words:
We are going to use an empty list with a for loop to iteratively look through spellings_series.
The jaccard function will take the arguments entries and gram_number, where entries refers to the words in question, and gram_number sets the size of the n-grams to use for the comparison. Do find more info on n-grams here, it took me some time to fully understand it!
spellings will create a list of possible words, based on the first letter of the entry string. This assumes that the first letter is not typed wrongly.
Next, distances will iteratively calculate the respective Jaccard distances for words in spellings by using the inbuilt jaccard_distance function.
Lastly, closest will give the resulting best matching word through the min function on distances. This can then be appended to the outcomes empty list, and this list will be returned.
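Putting these steps together, a sketch of the jaccard function could look like this (it assumes the imports and spellings_series from earlier):

def jaccard(entries, gram_number):
    outcomes = []
    for entry in entries:
        # candidate words that share the entry's first letter
        spellings = spellings_series[spellings_series.str.startswith(entry[0])]
        # Jaccard distance between the n-gram sets of the entry and each candidate
        distances = ((jaccard_distance(set(ngrams(entry, gram_number)),
                                       set(ngrams(word, gram_number))), word)
                     for word in spellings)
        closest = min(distances)   # tuple of (smallest distance, best matching word)
        outcomes.append(closest[1])
    return outcomes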
Now, with our amazing new jaccard function, we can create the word recommender, JDreco. This function will by default take in the list of strings "cormulent", "incendenece", "validrate", and return a suggested list of words.
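A sketch of JDreco, assuming trigrams (gram_number = 3) as the default comparison unit:

def JDreco(entries=['cormulent', 'incendenece', 'validrate']):
    return jaccard(entries, 3)

print(JDreco())   # suggested corrections for the default test inputs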
Congrats, the first recommendation model is completed!
Recommender 2
Metric: Edit Distance
Edit distance is a method of evaluating how dissimilar two strings are, based on the minimum number of operations required to transform one string into the other. Similar to before, this function will by default take in the same list of words as in Recommender 1.
The function will iteratively compare entries with the list of correct words and get their respective edit distances. Then, the word with the lowest distance will be deemed as the best matching word, appended to outcomes, and returned by the function.
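Following the same pattern as the jaccard function, a sketch of editreco could look like this (again assuming the earlier imports and spellings_series):

def editreco(entries=['cormulent', 'incendenece', 'validrate']):
    outcomes = []
    for entry in entries:
        # candidate words that share the entry's first letter
        spellings = spellings_series[spellings_series.str.startswith(entry[0])]
        # edit distance between the entry and each candidate word
        distances = ((edit_distance(entry, word), word) for word in spellings)
        closest = min(distances)   # tuple of (smallest distance, best matching word)
        outcomes.append(closest[1])
    return outcomes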
Second recommendation model completed!
User’s input
Now let's put our models to work. We will try a list of 3 words: a misspelled version of "Three words please" – "threa woeds pleese" – and see how it goes. The code below prompts the user for three separate words to create a list user_input.
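The prompt itself can be as simple as this sketch (the exact wording of the prompt is an assumption):

# collect three words from the user, e.g. threa, woeds, pleese
user_input = [input('Enter word {} of 3: '.format(i + 1)) for i in range(3)]

The resulting user_input list is then passed to JDreco and editreco below.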
Recommender 1:
Result was ‘thread’, ‘woe’, ‘plea’.
JDreco output
Recommender 2:
Result was 'tarea', 'Moed', 'please'.
editreco output
Further improvements
The results were so far from what I expected! It seems editreco performed better than JDreco by correctly suggesting 'please'.
This is largely because the algorithm is currently too 'mechanical' and judges words solely based on their individual letters. I believe this goes to show that creating a Google-level autocorrect would take a lot of time and effort.
Some future improvements that would sharpen the suggestions include taking grammar and lemmatization into account, for example through the popular machine learning library PyTorch.
Through this article, I hope that you have learned the basics of nltk, though the library is so vast I could not possibly cover everything in one article. I would say this is a data science project that is easy enough for most people to understand and code, given appropriate time and effort.
I was able to learn this through the MOOC "Applied Text Mining in Python" by the University of Michigan, hosted by Coursera. Recommended!
Do feel free to reach out to me on LinkedIn if you have questions or would like to discuss ideas on applying data science techniques in a post-Covid-19 world!
I hope that I was able to help you in learning about data science methods in one way or another!