
A comprehensive guide to subword tokenisers

Unboxing BPE, WordPiece and SentencePiece

Photo by Food Photographer | Jennifer Pallian on Unsplash

Tokenisation is the task of splitting text into tokens, which are then converted to numbers. These numbers are in turn used by machine learning models for further processing and training.

Splitting text into tokens is not as trivial as it sounds. The simplest way to tokenize a string is to split it on spaces.

E.g. the sentence "Let’s go to the beach today!", when tokenized on spaces, would give:

["Let’s", "go", "to", "the", "beach", "today!"] This simple step breaks a string into individual words, but it is not perfectly tokenized. Notice the words "Let’s" and "today!". If we don’t pay attention to punctuations and simply split on spaces, our vocabulary will explode. In this case, we haven’t converted the text to lowercase too. So every other word will have it’s opposite case variant and words with every possible punctuation will be part of the vocabulary.

A better tokenization of the above example would be:

["Let", "’", "s", "go", "to", "the", "beach", "today", "!"]

Both are results of rule-based word tokenization, where the algorithm splits on spaces and punctuation. Although this is a decent first step, word tokenization can suffer from an exploding-vocabulary problem, since we keep every distinct word in the corpus. Many of these words are just different surface forms of the same token and are therefore unnecessary additions to the vocabulary. For example, Transformer-XL uses space- and punctuation-based rule tokenization and has a vocabulary of more than 250,000 tokens.
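Before moving on, here is a minimal sketch of such a rule-based word tokenizer, using a simple regular expression rather than the exact rules of any particular model. It reproduces the split shown above:

```python
import re

def word_tokenize(text):
    # Keep runs of word characters as tokens and every punctuation mark
    # as its own token; whitespace is discarded.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Let's go to the beach today!"))
# ['Let', "'", 's', 'go', 'to', 'the', 'beach', 'today', '!']
```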

To deal with the exploding vocabulary, we can limit its size to a desired number. However, this keeps only the most frequently occurring words in the vocab, and everything else becomes out-of-vocabulary (OOV). When a new word is encountered at prediction time, it is either ignored or assigned the out-of-vocabulary token. Although this seems like a reasonable workaround, the OOV token carries almost no information: if two very different words like "bank" and "bake" are both OOV, they get the same id, no matter how different their meanings are.

A large vocabulary is an issue because the number of model parameters depends on it. The embedding layer’s parameter count grows with the vocabulary, and in language models the output is also a softmax over the entire vocabulary.

To reduce the vocabulary size, why not tokenize on characters instead of words?

Character tokens solve the OOV problem, but representing the input as a sequence of characters greatly increases the sequence length, which makes it challenging for the model to learn relationships between characters and form meaningful words.
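For a sense of scale, compare the character-level and word-level sequence lengths for the earlier example (reusing the word_tokenize sketch from above):

```python
sentence = "Let's go to the beach today!"

char_tokens = list(sentence)           # character-level tokenization
word_tokens = word_tokenize(sentence)  # rule-based word tokenization from earlier

print(len(char_tokens))  # 28 tokens at the character level
print(len(word_tokens))  # 9 tokens at the word level
```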

To get the best of both worlds, we can use subword tokenization!

Subword Tokenisation

The core idea behind subwords is that frequently occurring words should be in the vocabulary, whereas rare words should be split into frequent subwords. E.g. the word "refactoring" can be split into "re", "factor", and "ing". The subwords "re", "factor" and "ing" occur more frequently than the word "refactoring", while its overall meaning is kept largely intact.

There are three major subword tokenization algorithms; let us now discuss each in detail.

Byte Pair Encoding (BPE) tokenisation

BPE was introduced by Sennrich et al. in the paper Neural Machine Translation of Rare Words with Subword Units. Later, a modified (byte-level) version was also used in GPT-2. The first step in BPE is to split all the strings into words. We can use any word tokenizer for this step. For simplicity, let us use the rule-based space and punctuation tokenizer that we discussed in the previous sections.

After word tokenization, let’s assume we have the following words with their frequencies as given below:

[("car", 5), ("cable", 3), ("tablet", 1), ("watch", 2), ("chair", 5), ("mouse", 1)]

The desired vocabulary size is a hyperparameter for BPE. In this example, let’s assume we want a total of 17 tokens in the vocabulary. All the unique characters and symbols in the words are included as the base vocabulary. In this case, the base vocabulary would be

['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w'] size=14

Next, all the words are split into the base vocabulary characters, which can be represented as follows:

[('c', 'a', 'r', 5), ('c', 'a', 'b', 'l', 'e', 3), ('t', 'a', 'b', 'l', 'e', 't', 1), ('w', 'a', 't', 'c', 'h', 2), ('c', 'h', 'a', 'i', 'r', 5), ('m', 'o', 'u', 's', 'e', 1)]

The BPE algorithm then counts the occurrence of every adjacent symbol pair and picks the one with the highest frequency. In the above example, the pair "ca" occurs 5 times in car and 3 times in cable, making a total of 8 occurrences, the highest of all pairs. It is followed by 7 occurrences of "ch" (2 from watch and 5 from chair), and so on.
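A minimal sketch of this counting step, using the word frequencies from the running example (an illustration, not a production implementation):

```python
from collections import Counter

# Each word is split into base-vocabulary characters and paired with its frequency.
corpus = {
    ("c", "a", "r"): 5,
    ("c", "a", "b", "l", "e"): 3,
    ("t", "a", "b", "l", "e", "t"): 1,
    ("w", "a", "t", "c", "h"): 2,
    ("c", "h", "a", "i", "r"): 5,
    ("m", "o", "u", "s", "e"): 1,
}

def count_pairs(corpus):
    # Count adjacent symbol pairs, weighted by the frequency of the word they appear in.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for first, second in zip(symbols, symbols[1:]):
            pairs[(first, second)] += freq
    return pairs

print(count_pairs(corpus).most_common(2))
# [(('c', 'a'), 8), (('c', 'h'), 7)]
```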

The most frequent pair "ca" is added to the vocabulary and all adjacent occurrences of c and a are merged. The vocabulary now becomes

['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w', 'ca'] size=15

And the tokenized words become

[('ca', 'r', 5), ('ca', 'b', 'l', 'e', 3), ('t', 'a', 'b', 'l', 'e', 't', 1), ('w', 'a', 't', 'c', 'h', 2), ('c', 'h', 'a', 'i', 'r', 5), ('m', 'o', 'u', 's', 'e', 1)]

The next highest count is for "ch", which is added to the vocabulary, and all adjacent occurrences of c and h are merged together.

Vocab: ['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w', 'ca', 'ch'] size=16

Tokenized input: [('ca', 'r', 5), ('ca', 'b', 'l', 'e', 3), ('t', 'a', 'b', 'l', 'e', 't', 1), ('w', 'a', 't', 'ch', 2), ('ch', 'a', 'i', 'r', 5), ('m', 'o', 'u', 's', 'e', 1)]

Since the target vocab size is 17, BPE performs one more merge. It picks the pair 'ca' and 'r', which occurs 5 times (tied for the highest remaining count); the two are merged and 'car' is added to the vocabulary.

Final vocab: ['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w', 'ca', 'ch', 'car'] size=17

Final tokenized input: [('car', 5), ('ca', 'b', 'l', 'e', 3), ('t', 'a', 'b', 'l', 'e', 't', 1), ('w', 'a', 't', 'ch', 2), ('ch', 'a', 'i', 'r', 5), ('m', 'o', 'u', 's', 'e', 1)]

No further merging will be done since the vocab limit is reached.
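Putting it together, a compact training loop (continuing the pair-counting sketch above) reproduces exactly these three merges:

```python
def merge_pair(pair, corpus):
    # Replace every adjacent occurrence of `pair` with a single merged symbol.
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

vocab = sorted({ch for word in corpus for ch in word})  # the 14 base characters
target_vocab_size = 17
merges = []

while len(vocab) < target_vocab_size:
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)  # most frequent pair (ties: first one seen)
    corpus = merge_pair(best, corpus)
    merges.append(best)
    vocab.append("".join(best))

print(merges)      # [('c', 'a'), ('c', 'h'), ('ca', 'r')]
print(vocab[-3:])  # ['ca', 'ch', 'car']
```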

Now that BPE has been trained, the same merges will be applied to new words. Say we get a new word "cab": it will be tokenized into ["ca", "b"]. However, if the new word is "card", it will be split into ["car", "[UNK]"], since the letter d is not in the vocabulary. In practice this rarely happens with letters, because all characters usually occur in the corpus at least once. However, an UNK (unknown) token may still appear if a symbol such as a punctuation mark or a digit was never added to the vocabulary but is part of a new word.
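Applying the learned merges to a new word can then look like the sketch below (again continuing the example above; a real implementation would replay the merges in the same training order):

```python
def apply_bpe(word, merges, vocab):
    # Start from characters and replay the learned merges in training order.
    symbols = list(word)
    for pair in merges:
        i, out = 0, []
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    # Anything not in the vocabulary becomes the unknown token.
    return [s if s in vocab else "[UNK]" for s in symbols]

print(apply_bpe("cab", merges, vocab))   # ['ca', 'b']
print(apply_bpe("card", merges, vocab))  # ['car', '[UNK]']  ('d' never appeared in the corpus)
```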

WordPiece tokenization

WordPiece gained a lot of popularity for being the chosen tokenizer for BERT, followed by ELECTRA. WordPiece is similar to BPE in that it also starts by including all characters and symbols in its base vocabulary. We define a desired vocab size and keep adding subwords until the limit is reached. The difference between BPE and WordPiece lies in the way the symbol pairs are chosen for adding to the vocabulary.

Instead of relying on the raw frequency of the pairs, WordPiece chooses the pair whose merge maximises the likelihood of the training data. Concretely, it trains a language model starting from the base vocabulary, evaluates how much merging each candidate pair would increase the likelihood of the corpus, and adds the winning pair to the vocab. The language model is then retrained on the new vocab, and these steps are repeated until the desired vocabulary size is reached.
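One common way to express this criterion is to score each candidate pair by its frequency normalised by the frequencies of its parts, so that the chosen merge gives the largest boost to the training-data likelihood. A small sketch using the counts from the BPE example (an illustration of the idea, not Google's exact implementation):

```python
def wordpiece_score(pair_freq, first_freq, second_freq):
    # Likelihood-based merge criterion: frequency of the pair divided by
    # the product of the frequencies of its two parts.
    return pair_freq / (first_freq * second_freq)

# Character counts from the earlier corpus: c appears 15 times, a 16 times, h 7 times.
print(wordpiece_score(8, 15, 16))  # pair ('c', 'a'): ~0.033
print(wordpiece_score(7, 15, 7))   # pair ('c', 'h'): ~0.067
# Unlike BPE, WordPiece would merge ('c', 'h') first, because 'a' is itself very common.
```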

Example: "I just got a funky phone case!"

Tokenized: ["I", "just", "got", "a", "fun", "##ky", "phone", "case"]

The "##" prefix indicates that this subword should be attached to the previous token.

Unigram tokenization

Unigram tokenization also starts with setting a desired vocabulary size. However, the main difference between unigram and the previous two approaches is that we don’t start with a base vocabulary of characters only. Instead, the initial vocabulary is large, containing whole words and symbols (and, typically, their frequent substrings), and tokens are gradually removed to arrive at the final vocabulary.

The way that tokens are removed is key to the unigram tokenizer. At each step it fits a unigram language model over the corpus and removes a fixed percentage of the tokens the model can best do without, i.e. those whose removal increases the loss the least, where the loss is the negative log-likelihood of the corpus at that training step.
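To make the likelihood concrete, here is a toy sketch of how a unigram language model scores one possible segmentation of a word; the token probabilities below are made up purely for illustration:

```python
import math

# Hypothetical token probabilities from a (pruned) unigram vocabulary.
token_probs = {"re": 0.05, "factor": 0.01, "ing": 0.06, "refactor": 0.004}

def segmentation_log_likelihood(tokens, token_probs):
    # Under a unigram model, tokens are independent, so log-probabilities add up.
    return sum(math.log(token_probs[t]) for t in tokens)

# The tokenizer keeps whichever segmentation scores highest; summing these scores
# over the whole corpus (and negating) gives the loss used during pruning.
print(segmentation_log_likelihood(["re", "factor", "ing"], token_probs))
print(segmentation_log_likelihood(["refactor", "ing"], token_probs))
```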

The Unigram algorithm always keeps the base characters so that any word can be tokenized.

Unigram is mostly used in conjunction with SentencePiece.

SentencePiece

All the tokenizers discussed above assume that spaces separate words. This holds for most languages, but not for languages such as Chinese and Japanese, which do not use spaces. SentencePiece does not treat the space as a special separator; instead, it takes the string in its original raw format, i.e. along with all spaces, and then uses BPE or unigram on top of it to construct the vocabulary.

Example: "I just got a funky phone case!"

Tokenized: ["_I", "_just", "_got", "_a", "_fun", "ky", "_phone", "_case"]

The tokens can be joined back together to form a string, and "_" (SentencePiece actually uses the special character "▁") can be replaced with a space to recover the original string.
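For instance, with the sentencepiece library, training a model on raw text and round-tripping a sentence might look like this (the corpus path and vocabulary size are placeholders):

```python
import sentencepiece as spm

# Train directly on raw text; SentencePiece handles the whitespace itself.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # placeholder path to a raw-text file
    model_prefix="spm_model",
    vocab_size=8000,
    model_type="unigram",    # or "bpe"
)

sp = spm.SentencePieceProcessor(model_file="spm_model.model")
pieces = sp.encode("I just got a funky phone case!", out_type=str)
print(pieces)             # pieces carry the "▁" marker where a space preceded them
print(sp.decode(pieces))  # joining the pieces restores the original string
```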

Conclusion

Subwords solve the out-of-vocabulary problem and help reduce the number of model parameters to a large extent. Now that NLP models are getting larger and larger, subwords help keep the vocabulary size manageable.

In fact, they are a great option for dealing with spelling mistakes too! An interesting read on how subwords can help bring misspelled words closer together can be found here.

For using subwords in your projects, look no further than the excellent tokenizers library from Hugging Face. It provides all the tokenizers discussed above, along with some pre-trained ones used in models like BERT, XLNet, etc.
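For example, training a BPE tokenizer from scratch with the tokenizers library takes only a few lines (the file path and vocabulary size below are placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Byte-pair-encoding model with an unknown token, pre-tokenized on whitespace and punctuation.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=30000, special_tokens=["[UNK]"])
tokenizer.train(files=["data.txt"], trainer=trainer)  # placeholder corpus file

print(tokenizer.encode("Let's go to the beach today!").tokens)
```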

Cheers!

Eram

