Hands-on Tutorials

Word, Subword, and Character-Based Tokenization: Know the Difference

The differences that anyone working on an NLP project should know

Chetna Khanna
Towards Data Science
8 min read · Jul 1, 2021


Image by Sincerely Media on Unsplash

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that gives machines (computers) the ability to understand written and spoken human language much as human beings do. NLP is almost everywhere, helping people with their daily tasks. 😍 It is such a common technology now that we often take it for granted. A few examples are spell check, autocomplete, spam detection, Alexa, and Google Assistant. Easy as it is to take NLP for granted, we should never forget that machines work with numbers, not letters, words, or sentences. So, to work with the large amount of text data readily available on the internet, we need to manipulate and clean the text, a step we commonly call text pre-processing in NLP.

Pre-processing is the first step in working with text and in building a model to solve our business problem. Pre-processing is itself a multi-stage process. In this article, we will talk only about tokenization and tokenizers. So, let’s get started. 🏄🏼

Note: We are mainly focusing on the English language.

Tokenization

Tokenization is one of the most important steps in text pre-processing. Whether you are working with traditional NLP techniques or using advanced deep-learning techniques, you cannot skip this step. 🙅🏻

Tokenization, in simple words, is the process of splitting a phrase, sentence, paragraph, or one or more text documents into smaller units. 🔪 Each of these smaller units is called a token. These tokens can be anything — a word, a subword, or even a character. Different algorithms perform tokenization differently, but the example below will give you a basic idea of the difference between the three.

Consider the following sentence/raw text.

“Let us learn tokenization.”

A word-based tokenization algorithm will break the sentence into words. The most common one is splitting based on space.

[“Let”, “us”, “learn”, “tokenization.”]

A subword-based tokenization algorithm will break the sentence into subwords.

[“Let”, “us”, “learn”, “token”, “ization.”]

A character-based tokenization algorithm will break the sentence into characters.

[“L”, “e”, “t”, “u”, “s”, “l”, “e”, “a”, “r”, “n”, “t”, “o”, “k”, “e”, “n”, “i”, “z”, “a”, “t”, “i”, “o”, “n”, “.”]

Tokens are actually the building blocks of NLP and all the NLP models process raw text at the token level. These tokens are used to form the vocabulary, which is a set of unique tokens in a corpus (a dataset in NLP). This vocabulary is then converted into numbers (IDs) and helps us in modeling. 😎
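To make this concrete, here is a minimal sketch (the tiny corpus and the whitespace splitting are just for illustration) of how tokens become a vocabulary and then IDs:

corpus = ["Let us learn tokenization .", "Let us learn NLP ."]

# collect the tokens and build a vocabulary of unique tokens
tokens = [token for sentence in corpus for token in sentence.split()]
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
print(vocab)
# {'.': 0, 'Let': 1, 'NLP': 2, 'learn': 3, 'tokenization': 4, 'us': 5}

# convert a tokenized sentence into IDs
ids = [vocab[token] for token in "Let us learn NLP .".split()]
print(ids)  # [1, 5, 3, 2, 0]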

We mentioned three different tokenization techniques here. Each works differently and has its own advantages and disadvantages. Let us go into the details of each technique to know more. 🏇🏻

Word-based tokenization

This is the most commonly used tokenization technique. It splits a piece of text into words based on a delimiter. The most commonly used delimiter is space. You can also split your text using more than one delimiter, like space and punctuation marks. Depending on the delimiter you use, you will get different word-level tokens.

Word-based tokenization can be easily done using a custom RegEx or Python’s split() method. Apart from that, there are plenty of libraries in Python — NLTK, spaCy, Keras, Gensim — that can help you perform tokenization easily.
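For instance, here is a quick sketch using the built-in split() method and a simple RegEx (the pattern shown is just one of many you could use):

import re

sentence = "Let us learn tokenization."

# split on whitespace only: punctuation stays attached to the word
print(sentence.split())
# ['Let', 'us', 'learn', 'tokenization.']

# split on whitespace and separate punctuation with a custom RegEx
print(re.findall(r"\w+|[^\w\s]", sentence))
# ['Let', 'us', 'learn', 'tokenization', '.']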

Example:

“Is it weird I don’t like coffee?”

By performing word-based tokenization with space as a delimiter, we get:

[“Is”, “it”, “weird”, “I”, “don’t”, “like”, “coffee?”]

If we look at the tokens “don’t” and “coffee?”, we will notice that these words have punctuation attached to them. What if there is another raw text (sentence) in our corpus like this: “I love coffee.” This time there will be a token “coffee.”, which can lead the model to learn two different representations of the word coffee (“coffee?” and “coffee.”) and make the representations of our tokens suboptimal. 🙆🏻

The reason we should take punctuation into account while performing tokenization is that we do not want our model to learn a different representation of the same word for every punctuation mark that can follow it. If we allow our model to do so, the number of representations it has to learn will explode (each word × the number of punctuation marks used in the language). 😳 So, let’s take punctuation into account.

[“Is”, “it”, “weird”, “I”, “don”, “’”, “t”, “like”, “coffee”, “?”]

This is better than what we had earlier. However, if we look closely, tokenization has made three tokens out of the word “don’t” — “don”, “’”, “t”. A better tokenization of “don’t” would be “do” and “n’t”. That way, if the model later sees the word “doesn’t”, it can tokenize it into “does” and “n’t”, and since it has already learned about “n’t”, it can apply that knowledge here. The problem sounds complicated but can be dealt with using some rules. 🤓
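For instance, NLTK’s Treebank tokenizer applies exactly this kind of rule; a minimal sketch, assuming NLTK is installed:

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Penn Treebank rules split the contraction "don't" into "do" and "n't"
print(tokenizer.tokenize("Is it weird I don't like coffee?"))
# ['Is', 'it', 'weird', 'I', 'do', "n't", 'like', 'coffee', '?']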

You must have noticed that state-of-the-art NLP models have their own tokenizers, because each model applies its own rules on top of splitting on spaces. Thus, the tokenizers of different NLP models can create different tokens for the same text. Space-based, punctuation-based, and rule-based tokenization are all examples of word-based tokenization.

Each word is then represented by an ID, and each ID carries a lot of information, as a word in a sentence usually holds a lot of contextual and semantic information. 😲

The technique sounds impressive, but this type of tokenization leads to a massive vocabulary. 😏 Transformer-XL, a state-of-the-art model, uses space and punctuation tokenization and has a vocabulary size of 267,735. That’s huge! Such a huge vocabulary means a huge embedding matrix at both the input and output layers, making the model heavier and more demanding in computational resources.
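To get a rough sense of the scale: assuming, just for illustration, an embedding dimension of 1,024, a vocabulary of 267,735 words already needs 267,735 × 1,024 ≈ 274 million parameters for the embedding matrix alone (the actual size depends on the model’s configuration).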

This tokenization also gives different IDs to words like “boy” and “boys”, which are almost the same word in the English language (one is singular and the other plural). We actually want our model to know that words like these are similar.

To solve this huge vocabulary problem, we can limit the number of words that get added to the vocabulary. For example, we can keep only the 5,000 most common words (based on the frequency of occurrence of words in the corpus) in our vocabulary. The model will then create IDs for those 5,000 common words and mark every remaining word as OOV (Out Of Vocabulary). But this leads to a loss of information, as the model will not learn anything about the OOV words. This can be a big compromise, because the model will learn the same OOV representation for all unknown words. 🙄
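A minimal sketch of this idea (the corpus and the vocabulary size here are only illustrative):

from collections import Counter

corpus = ["i love coffee", "i love tea", "is it weird i love coffee"]
counts = Counter(word for sentence in corpus for word in sentence.split())

VOCAB_SIZE = 4  # in practice this could be 5,000 or more
vocab = {word: idx for idx, (word, _) in enumerate(counts.most_common(VOCAB_SIZE))}
OOV_ID = len(vocab)  # one shared ID for every out-of-vocabulary word

def encode(sentence):
    return [vocab.get(word, OOV_ID) for word in sentence.split()]

print(encode("i love weird coffee"))  # rare and unseen words all get OOV_ID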

One more drawback concerns misspelled words. If the corpus has “knowledge” misspelled as “knowldge”, the model will assign the OOV token to the latter word.

Thus, to solve all these issues researchers came up with character-based tokenization.

Character-based tokenization

Character-based tokenizers split the raw text into individual characters. The logic behind this tokenization is that a language has many different words but has a fixed number of characters. This results in a very small vocabulary. 😍

For example, English text can be represented with around 256 different characters (letters, numbers, special characters), whereas English has close to 170,000 words in its vocabulary. Thus, character-based tokenization needs a much smaller vocabulary than word-based tokenization.
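In code, character-based tokenization is as simple as it sounds; a quick sketch:

text = "Let us learn tokenization."

char_tokens = list(text)
print(char_tokens[:8])        # ['L', 'e', 't', ' ', 'u', 's', ' ', 'l']
print(len(char_tokens))       # 26 tokens for a four-word sentence
print(len(set(char_tokens)))  # far fewer unique characters than unique words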

One of the major advantages of character-based tokenization is that there will be no, or very few, unknown or OOV words. The model can thus create a representation of an unknown word (a word not seen during training) from the representations of its characters. Another advantage is that misspelled words can still be tokenized character by character rather than being marked as OOV tokens and losing information.

This type of tokenization is quite simple and can greatly reduce memory and time complexity. So, is it the best or the perfect tokenization algorithm? 🤔 The answer is no (at least for the English language)! A character usually doesn’t carry any meaning or information the way a word does. 😕

Note: A few languages carry a lot of information in each character. So, character-based tokenization can be useful there.

Also, in character-based tokenization the reduced vocabulary size comes with a trade-off in sequence length. Each word is split into its individual characters, so the tokenized sequence is much longer than the initial raw text. For example, the word “knowledge” alone becomes 9 different tokens. 🙄

Note: Researchers Karpathy, Radford et al., Kalchbrenner et al., and Lee et al. have demonstrated the use of character-based tokenization and reported some impressive results. Read these papers to know more!

Character-based tokenization, despite having some issues of its own, solves a lot of the problems faced by word-based tokenization. Let us see if we can solve the issues faced by character-based tokenization too.

Subword-based tokenization

Another popular technique is subword-based tokenization, which sits between word-based and character-based tokenization. The main idea is to solve the issues faced by word-based tokenization (a very large vocabulary, a large number of OOV tokens, and different IDs for very similar words) and by character-based tokenization (very long sequences and less meaningful individual tokens).

Subword-based tokenization algorithms rely on the following two principles.

  1. Do not split the frequently used words into smaller subwords.
  2. Split the rare words into smaller meaningful subwords.

For example, “boy” should not be split, but “boys” should be split into “boy” and “s”. This helps the model learn that the word “boys” is formed from the word “boy”, with a slightly different meaning but the same root word.

At the start of this article, we split the word “tokenization” into “token” and “ization”, where “token” is the root word and “ization” is a second subword that carries additional information about the root word. This subword splitting helps the model learn that words with the same root as “token”, like “tokens” and “tokenizing”, are similar in meaning. It also helps the model learn that “tokenization” and “modernization” are made up of different root words but share the same suffix “ization” and are used in the same syntactic situations. Another example is the word “surprisingly”: subword-based tokenization will split it into “surprising” and “ly”, as these stand-alone subwords appear more frequently.

Subword-based tokenization algorithms generally use a special symbol to indicate which token is the start of a word and which token continues it. For example, “tokenization” can be split into “token” and “##ization”, where “token” marks the start of the word and “##ization” marks its continuation.

Different NLP models use different special symbols to denote subwords. “##” is used by the BERT model for every subword that is not the start of a word. Kindly note that special symbols can be added to the start of the word as well.
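A quick sketch of this in practice, assuming the Hugging Face transformers library is installed (the exact split depends on the learned vocabulary of the pretrained tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# the split shown below is illustrative; "##" marks a continuation subword
print(tokenizer.tokenize("tokenization is surprisingly fun"))
# e.g. ['token', '##ization', 'is', 'surprisingly', 'fun']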

Most models that have obtained state-of-the-art results in the English language use some kind of subword tokenization algorithm. A few common subword-based tokenization algorithms are WordPiece, used by BERT and DistilBERT; Unigram, used by XLNet and ALBERT; and Byte-Pair Encoding, used by GPT-2 and RoBERTa. 😊
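You can also train such a tokenizer yourself. Here is a small sketch of training a Byte-Pair Encoding tokenizer with the Hugging Face tokenizers library (linked in the references); the toy corpus means the learned merges are only illustrative:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace before learning merges

trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
corpus = ["let us learn tokenization", "tokens and subwords", "modernization"]
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("tokenization").tokens)  # subwords learned from the corpus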

Subword-based tokenization allows the model to have a decent vocabulary size while still being able to learn meaningful context-independent representations. It is even possible for a model to process a word it has never seen before, as the decomposition can yield known subwords. 😇 🙌🏻

Thus, we saw how tokenization methods evolved over time to accommodate the ever-growing needs of the NLP domain and to offer better solutions to these problems.

References:

  1. https://huggingface.co/docs/tokenizers/python/latest/
  2. The links to the related research papers are provided in the article.

Thank you, everyone, for reading this article. Do share your valuable feedback or suggestion. Happy reading! 📗 🖌
