Deep Learning: Autocorrect, Spell check for short words

Correcting the spelling of uncommon words like proper nouns using state-of-the-art neural networks. This article covers the abstract idea behind the network, its implementation and its results.

Tejas Bawaskar
Towards Data Science

--

Photo by Biswapati Acharya on Unsplash

Here is a beautiful picture of the New York City skyline. Why do I have it on a neural network article? Maybe I’m appreciating the outside world more now that I have been quarantined since Feb 15 (76 days and counting). Yes, it’s been more than 2 months! What could go wrong typing out an article at 2:45 am on a Friday night? We’ll find out.

I’ve been working at Reorg for more than a year now and it’s been incredible. Reorg is a financial-sector company that provides bankruptcy-related news and analysis on distressed-debt companies. That’s right, we make money when someone is on the verge of bankruptcy; it sounds evil, but trust me, it’s not. Reorg has a website that its subscribers use to get the latest analysis on distressed-debt companies. To find their company of interest, we provide them with a search bar that queries a database of easily over a million companies. So what can go wrong? There’s one problem: all of the entries are proper nouns (company names), so you have to type the exact name correctly into the search bar.

Unfortunately, we humans are designed to make mistakes. Solution? Robots to the rescue. *Screams in robot language*

Vision:

My vision here is to implement something similar to what Google provides, i.e. auto-suggestions (or auto predictions, as they call them) under ‘Showing results for’ or ‘Did you mean’.

Why do we have to do this? To keep users engaged on our website. Imagine if, each time you made a typo, Google displayed no results until you actually typed in the right spelling. How f*cked up would that world be?

Below are the search results for “deep learning” with errors introduced on purpose.

Source: google.com

One would notice that there is a difference between these two results. When it shows ‘Showing results for’, Google is highly confident of what you are looking for and thus shows you the search results for the predicted keywords. On the other hand, when it displays ‘Did you mean’, its confidence drops and it asks you (the user) to verify what you are actually looking for. Despite that, it still shows links to websites that it thinks are closely related to the words typed into the search box.

Scroll all the way down to AI in Action if you are only interested in the model results.

Photo by Samrat Khadka on Unsplash

Omg, mother nature, I miss thee so much! :( Alright, that’s enough Shakespeare from me. Back to the cool stuff!

Network:

We will use a sequence-to-sequence translator, as commonly used in neural machine translation, together with an attention mechanism and teacher forcing during training. For this you will need a basic understanding of RNNs and GRUs. I’ll describe the network in brief and how I turned my existing database into a training dataset.

In general, the idea is to convert a source sequence ‘X’ into some target sequence ‘Y’, where
X = x₁, x₂, …, xₙ
Y = y₁, y₂, …, yₘ

Typically, this is done by creating a representation of our source sequence ‘X’ and then modelling the conditional probability of each next character in the target sequence ‘Y’, given that representation and the characters generated so far.
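Written out, that is the standard autoregressive factorization used by sequence to sequence models:

P(Y | X) = P(y₁ | X) · P(y₂ | y₁, X) · … · P(yₘ | y₁, …, yₘ₋₁, X)

so each new character is predicted from the encoded source plus everything decoded so far.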

Why Sequence to Sequence models?

The abstract idea is to convert a given input, i.e. a source sequence, into an N-dimensional context vector that embodies the ‘meaning’ of the input. This vector is then fed to the decoder, which ‘translates’ it into the required output.
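To make that concrete, here is a minimal PyTorch-style sketch of an encoder-decoder pair (illustrative only; the framework, layer sizes and class names like CharEncoder/CharDecoder are assumptions for this sketch, not the exact model I trained). A GRU encoder compresses the input characters into a context, and a GRU decoder generates output characters from it, one at a time.

import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    def __init__(self, vocab_size=52, emb_dim=64, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                    # src: (batch, src_len) character indices
        embedded = self.embedding(src)         # (batch, src_len, emb_dim)
        outputs, hidden = self.gru(embedded)   # outputs: every encoder state, hidden: the final one
        return outputs, hidden                 # hidden plays the role of the context vector

class CharDecoder(nn.Module):
    def __init__(self, vocab_size=52, emb_dim=64, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_char, hidden):      # prev_char: (batch, 1), one character at a time
        embedded = self.embedding(prev_char)
        output, hidden = self.gru(embedded, hidden)
        logits = self.out(output.squeeze(1))   # scores over the 52-character vocabulary
        return logits, hidden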

Why the Attention mechanism? Wait, what is the attention mechanism?

The context vector that separates the encoder and the decoder holds/summarizes most of the information from the source sequence. It is ‘expected’ to be a good summary of the input in order for the decoder to start producing good results. The idea behind the attention mechanism is to revisit all of the encoder states and learn which of them are useful to the decoder at each step. Without attention, the initial and intermediate states aren’t given as much importance and are easily forgotten, causing the network to forget the earlier parts of the sequence.
To understand it better, I would suggest reading these articles here and here.
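As a rough illustration of the mechanism (a dot-product attention sketch; not necessarily the exact attention variant I used), at every decoding step the decoder scores all encoder states against its current hidden state and takes a weighted average, so even the earliest characters of the input can influence late predictions:

import torch
import torch.nn.functional as F

def dot_product_attention(decoder_hidden, encoder_outputs):
    # decoder_hidden: (batch, hidden_dim); encoder_outputs: (batch, src_len, hidden_dim)
    scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)   # (batch, src_len)
    weights = F.softmax(scores, dim=1)                                            # attention over input characters
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)         # (batch, hidden_dim)
    return context, weights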

Input:

The problem definition here is a little different from the usual one. Our database consists of a list of over 1 million company names, all proper nouns, any of which a user could be searching for. Let’s take a data science approach to cut this number down and take baby steps towards our goal. Not everyone is going to search across all the companies; users are interested in a few companies that are trending or are about to start trending. Plus, we have a history of searches that helps us dissect which areas people are focusing on and what kinds of sectors or industries they are most interested in. This eliminates a lot of noise and reduces our range of companies drastically.

That’s a data science project for some other time, but for now we will use a set of special companies that we can fairly assume people are interested in searching for. This covers over 2,500 companies, and that will be our training dataset.

Training Data:

Here comes the magic! From a single list of names that we want to predict, we create our own input and output pairs. We induce spelling mistakes in our list, not random spelling mistakes, but ones made with keys close to one another on the keyboard. For example, if we have a company called “Linkedin.com”, we create a training input with the spelling error “Lonkwdin.com”. So what’s so special about these errors? I replaced the letters ‘i’ and ‘e’ with ‘o’ and ‘w’ respectively. Those letters were not chosen at random. Look at your keyboard: the keys one would most likely hit instead of ‘i’ are ‘u’, ‘j’, ‘k’, ‘l’ and ‘o’. We pick any one of these letters and swap it into the input, and the same goes for the letter ‘e’. We then introduce a number of such variations so that the network learns these mistakes and predicts the correct output.
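Here is a small sketch of that augmentation step (the adjacency table below is a truncated, illustrative QWERTY neighbour map, not the full one I used):

import random

# Partial QWERTY adjacency map: each key lists the keys physically next to it.
KEY_NEIGHBOURS = {
    'i': 'ujklo', 'e': 'wsdr', 'a': 'qwsz', 'o': 'iklp',
    'n': 'bhjm', 'd': 'serfcx', 'l': 'kop', 'm': 'njk',
}

def add_typos(word, n_errors=2):
    """Replace up to n_errors characters with a neighbouring key."""
    chars = list(word.lower())
    positions = [i for i, ch in enumerate(chars) if ch in KEY_NEIGHBOURS]
    for pos in random.sample(positions, min(n_errors, len(positions))):
        chars[pos] = random.choice(KEY_NEIGHBOURS[chars[pos]])
    return ''.join(chars)

print(add_typos("linkedin.com"))   # e.g. "lonkwdin.com"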

I created 24 misspelled variants of each name as the augmented input and mapped them to the actual expected output. I also added one variant without mistakes, so that the network knows that when someone types the name exactly right, we still show them the expected result. Remember, the objective is for the network to learn these errors in a generalizable way, so that when someone makes a spelling error that is not in the training/testing set, we are still able to predict the right name.

This augmentation/simulation was then run over all the companies, resulting in 78,000 data points. The data was split 60–20–20 into training, validation and testing sets.
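A sketch of that split (the exact shuffling and tooling may differ from what I actually ran):

import random

def split_dataset(pairs, train=0.6, val=0.2):
    """pairs: list of (misspelled, correct) tuples; returns train/val/test splits."""
    random.shuffle(pairs)                       # shuffles in place before slicing
    n = len(pairs)
    n_train, n_val = int(n * train), int(n * val)
    return pairs[:n_train], pairs[n_train:n_train + n_val], pairs[n_train + n_val:]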

Coming back to the network:

The objective of the network is to predict the actual company name given the augmented (misspelled) input. We do have to improvise a little in how we use the neural machine translation model. That model is widely used for translating languages, which have a ton of words (other than proper nouns) that appear frequently enough for the model to learn when, where and how to use them. Our case is unfortunately not the same. Why? Because our vocabulary grows every time we introduce a new company into the dataset. This is obviously not scalable, since we can’t keep modifying the input of the network to account for these new words and retrain it from scratch.

One way to keep the input index constant is to work with characters. That way our vocabulary length stays constant no matter what. We take a total of 52 characters, covering the alphabet, digits and special symbols, and map each character to an index: ‘a’ -> 2, ‘b’ -> 3, ‘c’ -> 4 and so on. Any new company that comes in will always fall within this vocabulary of 52 characters, which lets us continue training the model from where it left off. If we instead treated every new word as a new vocabulary entry, the input index would grow by one each time and the network would have to relearn the mapping from scratch, which is not scalable.

"<>abcdefghijklmnopqrstuvwxyz &()+/.,;'?!-@0123456789"

The first two characters ‘<’ and ‘>’ are used by the decoder as ‘SOS’ (Start of String) and ‘EOS’ (End of String) markers, telling it where the string starts and ends. Why is that necessary? Because the decoder will keep predicting characters one by one even after it emits an EOS tag; the EOS tag is our cue that the prediction ends there.
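Putting the vocabulary and the start/end markers together, the encoding and decoding helpers look roughly like this (a sketch; the exact index handling in my implementation may differ):

VOCAB = "<>abcdefghijklmnopqrstuvwxyz &()+/.,;'?!-@0123456789"
CHAR2IDX = {ch: i for i, ch in enumerate(VOCAB)}   # '<' -> 0, '>' -> 1, 'a' -> 2, ...
IDX2CHAR = {i: ch for ch, i in CHAR2IDX.items()}

def encode(text):
    """Wrap a name in SOS/EOS markers and map every known character to its index."""
    return [CHAR2IDX['<']] + [CHAR2IDX[ch] for ch in text.lower() if ch in CHAR2IDX] + [CHAR2IDX['>']]

def decode(indices):
    """Turn predicted indices back into text, stopping at the EOS marker."""
    chars = []
    for i in indices:
        if IDX2CHAR[i] == '>':                 # EOS: the prediction ends here
            break
        if IDX2CHAR[i] != '<':
            chars.append(IDX2CHAR[i])
    return ''.join(chars)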

Great, now that we have established what our input looks like, with the actual company name as the output, we can start training the network and stop before it overfits.
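For reference, one training step with teacher forcing looks roughly like this (built on the toy encoder/decoder sketched earlier; the loss, optimiser and forcing ratio are illustrative choices, not my exact settings):

import random
import torch
import torch.nn as nn

def train_step(encoder, decoder, optimizer, src, tgt, teacher_forcing_ratio=0.5):
    """src, tgt: (batch, seq_len) index tensors produced by encode(), padded to a common length."""
    optimizer.zero_grad()
    criterion = nn.CrossEntropyLoss()
    encoder_outputs, hidden = encoder(src)
    decoder_input = tgt[:, 0].unsqueeze(1)                 # the '<' (SOS) token
    loss = 0.0
    for t in range(1, tgt.size(1)):
        logits, hidden = decoder(decoder_input, hidden)
        loss = loss + criterion(logits, tgt[:, t])
        if random.random() < teacher_forcing_ratio:
            decoder_input = tgt[:, t].unsqueeze(1)         # teacher forcing: feed the true character
        else:
            decoder_input = logits.argmax(1, keepdim=True) # otherwise feed the model's own guess
    loss.backward()
    optimizer.step()
    return loss.item() / (tgt.size(1) - 1)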

Results:

The model achieved 87.1% accuracy, which is not too bad with state-of-the-art results at 89–90% accuracy. The difference in accuracy comes down to the fact that my dataset consists of proper nouns, i.e. words that do not repeat themselves. That dramatically increases the complexity of the problem and thus reduces accuracy.

One way I believe this could be improved is with a bit more preprocessing of the dataset, making it easier for the model to predict and thus pushing the accuracy higher.

AI in Action:

Reorg has lately been writing about the impact of the coronavirus, with all of its related articles published under the name “Coronavirus Impact”. Below I have created a gif with a few possible combinations of errors.

These errors could plausibly occur when someone searches for ‘coronavirus impact’ in our search bar and hits enter. The gif shows how well the model still does even after many mistakes are introduced.

The letters in orange are the input from a user, and the output below is the autocomplete prediction.

Here is another example with the company “Bharti Airtel”. After removing the second word completely and introducing 2 errors in the first word, the model still predicts the correct outcome.

Another thing about this model that is absolutely crazy: the input and output lengths do not have to stay constant for the model to train. They can vary from one data point to the next. Now try to think of any other kind of model that does that. Can’t think of many, can you? *Mind blown*
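That flexibility shows up naturally at inference time: the decoder just keeps emitting characters until it produces the ‘>’ (EOS) marker, so the prediction can be shorter or longer than whatever the user typed. A greedy-decoding sketch, again assuming the toy encoder/decoder and vocabulary helpers above:

import torch

def predict(encoder, decoder, text, max_len=60):
    """Greedy decoding: feed the model's best guess back in until it emits EOS."""
    with torch.no_grad():
        src = torch.tensor([encode(text)])                 # (1, src_len)
        encoder_outputs, hidden = encoder(src)
        decoder_input = torch.tensor([[CHAR2IDX['<']]])
        result = []
        for _ in range(max_len):
            logits, hidden = decoder(decoder_input, hidden)
            next_idx = logits.argmax(1).item()
            if IDX2CHAR[next_idx] == '>':                  # stop at the EOS marker
                break
            result.append(IDX2CHAR[next_idx])
            decoder_input = torch.tensor([[next_idx]])
    return ''.join(result)

# usage (with the names from the sketches above): suggestion = predict(encoder, decoder, user_query)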
