NLP with Kotlin

A first approach using n-grams

Jose Corbacho
Towards Data Science

--

Natural Language Processing (NLP) is at the heart of many applications that fall under the “machine learning” umbrella, even though NLP techniques are their own group of algorithms and their own way of approaching artificial intelligence (take this with a pinch of salt).

NLP is used to create chatbots that help you when you ask for support, to deliver search results similar to your query, to translate texts, or to alleviate the work other applications must do by offering a representation of the text that is easier to handle.

One application of NLP is to generate or guess the next word in a sequence. As with other machine learning processes, guessing the next word requires training a model and making inferences using it.

In this example, you will see how to build a simple word generator that works at the character level.

There are different approaches to this task. In the last few years, RNNs in their different flavours have been beating other mechanisms at generating text. However, in this project I have decided to set them aside in favour of the simpler n-gram representation, as it is a much better fit for the problem at hand.

If you want to jump into the code, find it here

What is an n-gram model?

An n-gram is basically a sequence of n consecutive tokens, so the n-gram representation of a text is obtained by sliding a window of that size over it.

For instance, for the text “In a village of La Mancha, the name of which I have no desire to call to mind”, a 4-gram representation (taking words as tokens) would be “In a village of”, “a village of La”, “village of La Mancha”, and so on.
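A quick way to produce those windows in Kotlin (an illustrative snippet, not code from the project) is the standard library’s windowed function:

```kotlin
fun main() {
    val text = "In a village of La Mancha, the name of which I have no desire to call to mind"

    // Word-level 4-grams: every window of four consecutive words
    val wordGrams = text.split(" ").windowed(4).map { it.joinToString(" ") }
    println(wordGrams.take(3)) // [In a village of, a village of La, village of La Mancha]

    // The model in this project works at the character level instead
    val charGrams = text.windowed(4)
    println(charGrams.take(3)) // [In a, n a ,  a v]
}
```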

An n-gram language model uses this representation to estimate the probability of the next item in the sequence (a word, a character, a whole sentence) by counting the occurrences of the different possible continuations.

The n-gram language model used in this project is represented by a map of maps, as illustrated below.

Every entry maps an n-gram to the possible next characters in the sequence and the number of occurrences of each one found in the corpus.

This is represented by a Map<String, Map<Char, Int>>
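As a made-up illustration (the counts below are invented, purely to show the shape of the structure), a model trained with 3-character histories could contain entries like these:

```kotlin
// Invented example: each history maps to the characters that followed it
// in the corpus and how many times each one was seen.
val model: Map<String, Map<Char, Int>> = mapOf(
    "In " to mapOf('a' to 3, 't' to 1), // after "In ", 'a' was seen 3 times, 't' once
    "n a" to mapOf(' ' to 4),
    " a " to mapOf('v' to 2, 'k' to 1)
)
```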

The following code is used to train the model

Training process for character-based n-gram language model
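A minimal sketch of that training step, assuming the Map<String, Map<Char, Int>> representation above (the full version lives in the project repository), could look like this:

```kotlin
// Sketch: slide over the corpus and count, for each history of `order`
// characters, how often each possible next character follows it.
fun train(corpus: String, order: Int): Map<String, Map<Char, Int>> {
    val model = mutableMapOf<String, MutableMap<Char, Int>>()
    for (i in 0 until corpus.length - order) {
        val history = corpus.substring(i, i + order)
        val next = corpus[i + order]
        val counts = model.getOrPut(history) { mutableMapOf() }
        counts[next] = (counts[next] ?: 0) + 1
    }
    return model
}
```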

n-gram models are often used as baselines when building more complex architectures and approaches.

Generating text

To generate text using this model, the application takes an initial input (or the empty string) and finds the candidates for the next character. The chosen character is then added to the input, forming the history used in subsequent steps.
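A sketch of that loop, assuming a model built as above and a sampleNextChar helper like the one sketched in “Selecting the chars” below, could be:

```kotlin
// Sketch of the generation loop: keep the last `order` characters as the
// history, look up the candidates that followed it, pick one and append it.
// A real implementation would also handle a seed shorter than `order`
// (for instance by starting from a random history) and unseen histories
// (see the section on smoothing below).
fun generate(model: Map<String, Map<Char, Int>>, order: Int, seed: String, length: Int): String {
    val output = StringBuilder(seed)
    repeat(length) {
        val history = output.takeLast(order).toString()
        val candidates = model[history] ?: return output.toString()
        output.append(sampleNextChar(candidates))
    }
    return output.toString()
}
```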

A sample using a 5-gram model with no initial seed, trained on Don Quixote, looks like this:

chapter x.

of work of it to saddle the
hack as well as handle the character and pursued the character
went in a village of la mancha

in which treats of the character and pursued him, he made away with the character and putting in practice himself up to take up his income. the character and pursued him that keep a lance
in the character and pursuit of it too, had in his income. so, and pursuit of it too, have no desire to called quexana. the character and pursuits of the character and to saddle the character and present myself up to take a knights, scraps on saturdays, lentils on saturdays, lentils on saturdays, lean hack, an old buckler, a leagues long since it his income. the character and pursued him that keep a lance one of la mancha

That text really looks like something Cervantes would have written (if maybe not at his best moment)

Smoothing

But what happens when the model has not registered the history? Generation would simply stop, since there is no next character to choose.

In these situations smoothing techniques are used; this project relies, in particular, on Stupid Backoff.

In short, Stupid Backoff will use n-grams of a lower order to compute the probability of the missing entry
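A sketch of that idea, assuming the model also stores counts for the shorter histories (which a small extension of the training step can provide), and leaving out the constant backoff factor (0.4 in the original formulation):

```kotlin
// Sketch of Stupid Backoff for this model: if the full history has no entry,
// drop its first character and retry with the shorter, lower-order history.
fun candidatesWithBackoff(model: Map<String, Map<Char, Int>>, history: String): Map<Char, Int> {
    var h = history
    while (h.isNotEmpty()) {
        model[h]?.let { return it }
        h = h.drop(1) // back off to a lower-order n-gram
    }
    return model[""] ?: emptyMap() // "" would hold plain character counts, if trained
}
```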

Selecting the chars

Selecting the next character can be tricky depending on the corpus used to train the model. If the character with the highest probability is always selected, there is a chance of the model falling into repetition.

To avoid this issue, the next character is selected at random from the set of possible candidates linked to the current history.
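One way to do that selection (here weighting each candidate by its count, though a uniform pick over the candidates would also match the description) is:

```kotlin
import kotlin.random.Random

// Sketch: pick the next character at random, weighted by how often each
// candidate followed the current history in the corpus.
fun sampleNextChar(candidates: Map<Char, Int>, random: Random = Random.Default): Char {
    val total = candidates.values.sum() // assumes candidates is non-empty
    var r = random.nextInt(total)
    for ((char, count) in candidates) {
        r -= count
        if (r < 0) return char
    }
    return candidates.keys.first() // not reached when counts are positive
}
```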

Improvements

  • As with any other machine learning problem, a series of steps is desirable when building such a system: data inspection, cleaning, and evaluation of the model(s)
  • The corpus used to train the model could be cleaned up by removing stop words

Conclusions

This project shows how to create a character-level n-gram language model to generate text.

I’ll be using this model as the base for the second part of the project.

Stay tuned!
