
Tokenization Algorithms Explained

A one-stop-shop for all your tokenization needs

Sharvil · Towards Data Science · Aug 3, 2021

Introduction

For the uninitiated, let’s start by formally introducing the concept: tokenization is simply a method of splitting input text into individual, meaningful tokens that machines can further understand and process. Tokens can be words, characters, or even sub-words, depending on which splitting algorithm is employed. In this article we’ll discuss all three major categories of tokens (words, characters, and sub-words) and focus on the sub-word tokenization algorithms that most recent SOTA models make use of: Byte-Pair Encoding (BPE), Word Piece, Unigram, and Sentence Piece. By the end of this discussion, you’ll have developed a concrete understanding of each of these avenues and will be well equipped to decide which tokenization method best suits your needs.

Word Based Tokenization

As the name suggests, in word-based tokenization methods entire words, separated by punctuation, whitespace, or other delimiters, are treated as tokens. The separation boundary can be task-specific and sometimes depends on the nature of the data you’re working with. A word-based tokenizer for Twitter tweets, say, would deviate slightly from one used for tokenizing news articles.

To put it into perspective, let’s start with an example showcasing tokenization based on whitespace characters only:

Space Tokenization
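As a minimal sketch of this step (the exact example sentence from the original figure isn’t reproduced here, so the input below is illustrative), whitespace-only tokenization can be as simple as Python’s str.split:

```python
# Illustrative input; any sentence with punctuation shows the issue.
text = "Let's be honest, don't you just love tokenization!"

# Whitespace-only tokenization.
tokens = text.split()
print(tokens)
# ["Let's", 'be', 'honest,', "don't", 'you', 'just', 'love', 'tokenization!']
```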

With this approach, as you can see, “tokenization!” is considered a single token. If you were to run it on a bigger dataset, this method would result in a huge vocabulary, as innumerable groupings of words and punctuation would be treated as tokens distinct from the underlying word. To tackle this concern, let’s make an attempt to tokenize on punctuation as well as whitespace.

Space and Punctuation Tokenization
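A sketch of that idea using a simple regular expression (the exact rules in the original figure may differ) that splits out punctuation marks as separate tokens:

```python
import re

text = "Let's be honest, don't you just love tokenization!"

# Split into runs of word characters and individual punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['Let', "'", 's', 'be', 'honest', ',', 'don', "'", 't', 'you', 'just',
#  'love', 'tokenization', '!']
```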

The result looks a bit better than the previous output. Still, I hope you’ll appreciate that in English not all punctuation can be treated as a token boundary without understanding the context in which it appears. To put this into context, we would ideally want to treat the apostrophe differently in scenarios like “let’s” and “don’t”: the tokens [“do”, “n’t”] provide more meaningful semantics than [“don”, “’”, “t”], while we’d be fine with [“let”, “’s”] for “let’s”.

This brings us to the topic of rule-based word tokenizers. spaCy offers a great rule-based tokenizer which applies language-specific rules to generate semantically rich tokens. Interested readers can take a sneak peek into the rules defined by spaCy.

Rule-based Tokenization — SpaCy
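A sketch of the same illustrative sentence run through spaCy’s tokenizer; a blank English pipeline already includes the rule-based tokenizer, so no trained model download is needed:

```python
import spacy

# spacy.blank("en") gives an English pipeline containing only the
# rule-based tokenizer.
nlp = spacy.blank("en")
doc = nlp("Let's be honest, don't you just love tokenization!")
print([token.text for token in doc])
# ['Let', "'s", 'be', 'honest', ',', 'do', "n't", 'you', 'just',
#  'love', 'tokenization', '!']
```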

This approach is pretty straightforward and works well for most cases if you can bear the extra processing time and power. Even then, there’s a caveat involved: as the size of the training data increases, so does the vocabulary. Training a model with such word-based tokenization on a giant corpus would generate a ridiculously heavy model with exorbitant resource demands.

Then why not reduce the vocabulary size by using individual characters as tokens, you ask?

Character Based Tokenization

Character-based tokenization, as you can imagine, considers every base character a token. The base character set could be Unicode, ASCII, etc., as per your requirements. Continuing the example from above, the result would look something like this:

Character-based Tokenization
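A minimal sketch of character-level tokenization on the same illustrative sentence:

```python
text = "Let's be honest, don't you just love tokenization!"

# Every character, including spaces and punctuation, becomes a token.
tokens = list(text)
print(tokens[:10])
# ['L', 'e', 't', "'", 's', ' ', 'b', 'e', ' ', 'h']
```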

Of course, the complexity and size of the model drop drastically, as the vocabulary is limited to roughly a couple of hundred tokens. However, the tokens no longer carry meaningful semantics. To put this into context, the representations of “king” and “queen” under word-based tokenization contain a lot more information than the context-independent embeddings of the letters “k” and “q”. This is why language models perform poorly when trained on character-based tokens.

If accuracy is not critical to your business requirement and you’re willing to compromise it for higher speed and a less compute-intensive setup, go for it.

Otherwise, if you want the best of both worlds, a tokenization algorithm needs to find a middle ground: retain as much semantically meaningful information as possible while also keeping the model’s vocabulary size in check. Meet our new friend, sub-word tokenization!

Sub-word Based Tokenization

This is the sweet spot you should aim for if character- and word-based tokenization don’t float your boat: you are not restricted to meaningless tokens, and it is not very demanding of time and processing power either. Sub-word tokenization methods aim to represent all the words in the dataset using at most N tokens, where N is a hyperparameter decided as per your requirements. For base models it generally hovers around ~30,000 tokens. Thanks to these methods, we’re now well equipped to capture semantically rich token representations without needing an enormous vocabulary.

We’ll now discuss some of the most widely used algorithms for performing sub-word tokenization.

Byte-Pair Encoding (BPE)

Byte-Pair Encoding first requires the input text to be pre-tokenized; this can be as simple as whitespace tokenization, or you could employ a rule-based tokenizer such as spaCy for the pre-tokenization step.

Next, we form a base vocabulary, which is nothing but the collection of all unique characters present in the corpus. We also calculate the frequency of each pre-tokenized word and represent each word as a list of individual characters from the base vocabulary.

Now merging begins. As long as the maximum vocabulary size is not breached, we keep adding tokens to the vocabulary on the basis of the following criterion: the pair of adjacent tokens occurring the most times across the corpus is merged and introduced as a new token. This step is repeated until we reach the configured maximum vocab size.

Byte-Pair Encoding Algorithm
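To make the merge loop concrete, here is a compact, self-contained sketch of BPE training; the toy corpus and target vocabulary size are purely illustrative:

```python
from collections import Counter

# Tiny whitespace-pre-tokenized corpus (illustrative only).
corpus = "low low low lower lower newest newest newest widest widest"
word_freqs = Counter(corpus.split())

# Represent each word as a list of characters from the base vocabulary.
splits = {word: list(word) for word in word_freqs}
vocab = sorted({ch for word in word_freqs for ch in word})

def most_frequent_pair(splits, word_freqs):
    # Count adjacent symbol pairs, weighted by word frequency.
    pair_counts = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return pair_counts.most_common(1)[0][0] if pair_counts else None

target_vocab_size = 15  # illustrative; real models use tens of thousands
while len(vocab) < target_vocab_size:
    pair = most_frequent_pair(splits, word_freqs)
    if pair is None:
        break
    merged = "".join(pair)
    vocab.append(merged)
    # Apply the new merge to every word's symbol list.
    for word, symbols in splits.items():
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = [merged]
            else:
                i += 1

print(vocab)
```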

Word Piece

Word Piece and BPE are very similar in their approach to achieving sub-word tokenization. We saw earlier that BPE’s primary criterion was to select the candidate pair with maximum frequency. Word Piece, instead of maximizing frequency, focuses on maximizing the likelihood of a candidate pair, which can be succinctly summarized by the following score:

score(a, b) = freq(ab) / (freq(a) × freq(b))

where a, b are tokens in the vocabulary and ab is their concatenation. In other words, a merge is preferred when the pair occurs together more often than the individual frequencies of its parts would suggest.
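As an illustrative sketch of that criterion (the function name and the counts below are my own), the score rewards pairs that co-occur more often than chance:

```python
def wordpiece_score(pair_count, left_count, right_count):
    # Likelihood-style score: count(ab) / (count(a) * count(b)).
    # In log space this is log P(ab) - log P(a) - log P(b).
    return pair_count / (left_count * right_count)

# A frequent pair built from very frequent parts can score lower than a
# rarer pair built from rare parts (counts here are made up).
print(wordpiece_score(pair_count=100, left_count=1000, right_count=1000))  # 0.0001
print(wordpiece_score(pair_count=20, left_count=50, right_count=60))       # ~0.0067
```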

Unigram

One downside of BPE is that it fails to provide a ranking mechanism for choosing tokens when there is ambiguity while tokenizing a word. To put this into perspective, say our base vocab consists of the following tokens: the, th, at, eat, re, and we are asked to tokenize the word theatre. We would be presented with two options, neither of which precedes the other: (the, at, re) or (th, eat, re). BPE focuses on the best possible merge at each step, which is a greedy strategy and thus may generate unlikely results in some cases. Unigram, however, picks the most likely segmentation, taking into account token probabilities learned during training.

At each stage in training, we calculate the probability of each sub-word token and the increase in loss that would result if that token were dropped. We then remove the tokens whose removal hurts the overall loss the least, essentially the tokens that add very little value to the set.
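To show how a learned probability table resolves the theatre ambiguity, here is a small Viterbi-style segmentation sketch; the token probabilities are made up for illustration:

```python
import math

# Toy probability table for the tokens from the example above.
token_probs = {"the": 0.08, "th": 0.02, "at": 0.05, "eat": 0.01, "re": 0.06}

def best_segmentation(word, token_probs):
    # best[i] = (best log-probability, token list) for the prefix word[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_probs and best[start][1] is not None:
                score = best[start][0] + math.log(token_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

print(best_segmentation("theatre", token_probs))
# ['the', 'at', 're'] — the higher-probability segmentation wins
```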

Sentence Piece

Now we come to the last and one of the most ingenious methods the NLP community has witnessed. Sentence Piece is like the father of all the sub-word tokenization methods described in this article so far. It’s language agnostic: it works for all languages because it does not require a pre-tokenizer to handle language-specific attributes. It treats whitespace as just another symbol, which solves decoding (detokenization) issues altogether (I know, it sounds simple!). And it’s overall faster than other sub-word tokenization implementations, making it the go-to tokenizer for most use cases. For further reading, you can check out the published Sentence Piece paper.
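A sketch of how this might look with the sentencepiece Python package; the corpus path, model prefix, and vocab size below are placeholders I’ve assumed:

```python
import sentencepiece as spm

# Train a model; "corpus.txt" is a placeholder file with one sentence
# per line. This writes tok.model and tok.vocab to disk.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tok",
    vocab_size=8000,
    model_type="unigram",   # "bpe" is also supported
)

sp = spm.SentencePieceProcessor(model_file="tok.model")
pieces = sp.encode("Don't you just love tokenization!", out_type=str)
print(pieces)            # whitespace survives as the ▁ symbol
ids = sp.encode("Don't you just love tokenization!")
print(sp.decode(ids))    # lossless round trip back to the original text
```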

In this article, we walked through some of the most widely used tokenization algorithms, weighed their perks and drawbacks, and discussed situations in which each could be the right choice. Hopefully, you’re now well equipped and more comfortable dealing with different tokenization strategies!
