
Dynamic Word Tokenization with Regex Tokenizer

A short tutorial on single-step preprocessing of text with regular expressions

Image by Amador Loureiro on Unsplash

In this tutorial, we introduce regular expressions to customize word tokenization for NLP tasks. By actively choosing which tokens or words to keep, you can quickly preprocess your text without extensive text cleaning.

In the realm of machine learning engineering, scoping, data handling, modelling and deployment are the main iterative stages of a project’s lifecycle. Of note, data cleaning and preparation belong to the early stages of any Data Science project pipeline, yet they are of paramount importance to the model’s accuracy. In simple terms:

Garbage in, garbage out.

Diagram from the lecture slides of MLOps Specialization by DeepLearning.ai (Source)

For structured tabular data, data preprocessing may take the form of imputing missing values or standardizing the values of certain classes (e.g. lowercasing strings that refer to the same class). For this tutorial, however, we will touch on a data preprocessing method for unstructured data from another sub-field, Natural Language Processing (NLP) – text data.

1. Text Preprocessing – Tokenization

If images (another form of unstructured data) are considered spatial data, then text should be considered sequential data: the information in a text is derived only after its tokens (words or characters) are processed in their complete order.

Each text token can be a word, a number, a symbol, a punctuation mark and so on; each carries a certain meaning and can thus be seen as a unit of semantics.

The process of obtaining a list of tokens from text is a preprocessing step called tokenization, after which the tokens are converted into one-hot vectors to be fed into downstream models, such as Transformers or bi-directional LSTMs, for NLP tasks like machine translation, summarization, sentiment analysis and coreference resolution, to name a few. (An n-dimensional vector embedding, encapsulating a certain meaning, can then be derived for each token through the training process.)
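
For illustration, here is a minimal sketch (the token list and vocabulary are made up) of how tokens can be mapped to vocabulary indices and then to one-hot vectors:

    tokens = ["the", "storm", "hit", "the", "town"]

    # Build a vocabulary mapping each unique token to an integer index
    vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

    def one_hot(token, vocab):
        # A vector of zeros with a single 1 at the token's vocabulary index
        vec = [0] * len(vocab)
        vec[vocab[token]] = 1
        return vec

    print(vocab)                    # {'hit': 0, 'storm': 1, 'the': 2, 'town': 3}
    print(one_hot("storm", vocab))  # [0, 1, 0, 0]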

If the NLP task processes text at the character level, tokenization is very easy, as shown in the following snippet of code:

Character level tokenization. Image by Author.
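
For reference, a minimal sketch of the same idea in plain Python (the sample string is arbitrary):

    text = "Night of the Twisters"

    # Character-level tokenization: every character, including spaces, becomes a token
    char_tokens = list(text)
    print(char_tokens)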

Nonetheless, most NLP tasks process text at the word level. Word tokenization can easily be done using popular NLP libraries in Python such as NLTK or spaCy, as shown below:

Word level tokenization. Image by Author.
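
A rough equivalent of those snippets is sketched below; it assumes the NLTK tokenizer data and the spaCy model en_core_web_sm have already been installed (newer NLTK versions may also require the 'punkt_tab' resource):

    import nltk
    import spacy
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)  # NLTK tokenizer data, if not already present
    nlp = spacy.load("en_core_web_sm")  # small English pipeline for spaCy

    text = "The storm hit Grand Island, Nebraska on June 3, 1980."

    print(word_tokenize(text))                  # NLTK word-level tokenization
    print([token.text for token in nlp(text)])  # spaCy word-level tokenization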

One issue with the above methods of tokenization is that the tokenization scheme is either fixed or not easily customizable. This problem is further exacerbated when the text is messy, containing HTML tags for instance, or comprises text you wish to omit, such as numbers, web links, emails or even symbolic expletives.

Typical word tokenizers do not automatically clean text. Image by Author.

The second issue can be solved with manual text cleaning, by substituting away unwanted text with an empty string. This is fine, provided you can account for all variations of unwanted text. However, this is tedious, especially if you have a huge corpus comprising millions of text documents, so we can still end up with uncleaned tokens being passed down to downstream models.
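
As a rough sketch of what such manual cleaning might look like (the substitution patterns below are illustrative, not exhaustive), each category of unwanted text has to be enumerated and substituted away before tokenization:

    import re

    raw = 'Not high budget!<br /><br />Check out: http://www.gitwisters.com/ or write to fan123@mail.com'

    cleaned = re.sub(r"<[^>]+>", " ", raw)               # strip HTML tags
    cleaned = re.sub(r"http\S+|www\.\S+", " ", cleaned)  # strip web links
    cleaned = re.sub(r"\S+@\S+", " ", cleaned)           # strip emails
    cleaned = re.sub(r"\d+", " ", cleaned)               # strip standalone numbers
    print(cleaned)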

What if we could combine text cleaning with tokenization in a single step? In other words, instead of filtering away unwanted text before tokenization, could we achieve rules-based tokenization based only on the text we want to include?

Enter the regular expression (regex) tokenizer! But before that, let’s see how regular expressions work.

2. Searching by Text Patterns – Regular Expression

Regular expressions (regex) are extremely useful for extracting characters from text by searching for matches of a specific search pattern. The search pattern, expressed as another string of characters, is itself called a regular expression. There are several special characters in regex, each with a specific use case, but in this tutorial we will only briefly go through some of them for illustration purposes (a short Python sketch trying a few of these patterns follows the list):

Anchors – ^ and $

^The : Matches ‘The’ if it is at the start of the string.

end$ : Matches ‘end’ if it is at the end of the string.

hello20 : Matches ‘hello20’ if the character sequence appears anywhere in the string.

Quantifiers – * + ?

love* : Matches a string that contains ‘lov’ followed by zero or more ‘e’

love+ : Matches a string that contains ‘lov’ followed by one or more ‘e’

love? : Matches a string that contains ‘lov’ followed by zero or one ‘e’

Whitespaces – \s, \S

\s : Matches a whitespace character

\S : Matches a non-whitespace character

Bracket Expressions – [ ]

[a!?:] : Matches a string that is either ‘a’, ‘!’, ‘?’ or ‘:’

[a-zA-Z0-9] : Matches a string that is any letter (lowercase or uppercase) in the alphabet or any digit 0-9.

Capturing, Non-capturing and the OR operator – ( ), ?: and |

(roar)ing : Matches and captures the string ‘roar’ if it is followed by ‘ing’.

(?:[0-9]+|[#@!]) : ' ?: ' negates capturing; it is used when you want to group an expression together with the OR operator '|', but you do not want to save the group as a captured portion of the string.

Look Ahead and Look Behind – (?=) and (?<=)

[0-9-]+(?=[.]) : Matches a sequence of digits and hyphens (a telephone number, for instance) only if it is immediately followed by a full stop.

(?<=^)[a-z@.]+ : Matches an email, for instance, that begins at the start of the string.
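
As promised, here is a short sketch trying a few of the patterns above with Python’s built-in re module (the sample strings are arbitrary):

    import re

    print(re.findall(r"^The", "The end"))              # anchor ^: 'The' only at the start
    print(re.findall(r"love?", "lov love lovee"))      # quantifier ?: 'lov' plus zero or one 'e'
    print(re.findall(r"[a-zA-Z0-9]+", "abc 123 !?"))   # bracket expression: runs of letters or digits
    print(re.findall(r"[0-9-]+(?=[.])", "call 555-0123."))         # lookahead: digits/hyphens followed by a full stop
    print(re.findall(r"(?<=^)[a-z@.]+", "user@mail.com says hi"))  # lookbehind anchored to the start of the string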

The above tutorial on regular expressions is non-exhaustive; there are several other regular expression use cases and rules, which I will leave to your exploration. I have also collated some good resources for further learning here:

Regex tutorial – A quick cheatsheet by examples

A Simple And Intuitive Guide to Regular Expressions in Python

Regular Expression (Regex) Tutorial

Regex lookahead, lookbehind and atomic groups

There are also a couple of handy websites for easily testing out regex on text strings. One of my favorites is this:

regex101: build, test, and debug regex

3. The NLTK Regex Tokenizer

Now that we have a grasp of regular expressions, we can easily search for and customize the text patterns we wish to tokenize. Say, for instance, we only wish to capture all meaningful words, excluding all external punctuation, numbers, HTML tags, web links and whatnot. To do this, we can make use of the regex tokenizer in the NLTK library to tokenize text according to a search pattern. Consider the following text, a movie review: the tokens can be captured with clear precision without any prior text cleaning.

'OK, so the FX are not high budget. But this story is based on actual events. Not something thrown together to make a couple of rich actors, even richer!! As most movies that are based on books, there are somethings that just don't fit. Only a couple of people have stated that this movie was based on real events, not a knock-off as most people believe it is!! This movie is in no way related too TWISTER! Other than both movies are about tornadoes.<br /><br />For those of you who have problems with the science of the tornadoes and storms in the movie, there are a couple of things you need to remember... The actual "night of the twisters" was June 3, 1980. So this movie was released 16 years after the actual events. Try think how far storm research has advanced in that time. It happened in a larger town; Grand Island, Nebraska is the third largest city in the state. Even though the movie calls the town something else, and says it's a small town. For the real story check out: http://www.gitwisters.com/'
Screenshot from regex101.com. Image by Author.

The regular expression applied allows us to capture tokens that are words (one or more letters, apostrophes or hyphens) preceded by either:

  • Start of string
  • Whitespace
  • Characters like [> "]

and also succeeded by either:

  • End of string
  • Whitespace
  • Characters like [: . ! ; "]

Hence, the regular expression is able to capture all meaningful words while excluding unwanted text such as web links, emails, HTML tags and so on. For the curious, the regex string we applied is shown below; feel free to customize it further to suit other contexts.

"(?:(?<=s)|(?<=^)|(?<=[>"]))[a-z-']+(?:(?=s)|(?=:s)|(?=$)|(?=[.!,;"]))"

4. Final Thoughts

Regex tokenization is a dynamic, rules-based form of tokenization. Although algorithms have in recent years gradually shifted towards model-based or data-driven tokenization, regular expressions remain a powerful tool. In hindsight, in another application, text could even be further preprocessed before tokenization, for instance by using regex to convert web links into special tokens.

If messy text can be easily and cleanly delineated, then a strong foundation is provided to any NLP model on the horizon of the MLOps lifecycle.

Thanks for reading! If you have enjoyed the content, pop by my other articles on Medium and follow me on LinkedIn.

Support me! – If you are not subscribed to Medium, and like my content, do consider supporting me by joining Medium via my referral link.
