Getting Started

SentencePiece Tokenizer Demystified

An in-depth dive into the inner workings of the SentencePiece tokenizer, why it’s so powerful, and why it should be your go-to tokenizer. You might just start caring about tokenization.

Jonathan Kernes
Towards Data Science
18 min read · Feb 4, 2021


Photo by Cristian Escobar on Unsplash

I’ll be the first to admit that learning about tokenization schemes can be boring, if not downright painful. Often, when training a natural language model, choosing and implementing a tokenization scheme is just one more added layer of complexity. It can complicate your production pipelines, kill the mood with a poorly timed [OOV] or [UNK], or maybe just send you into the depths of unicode hell, where you wonder why “string” != “string”.

However, like your first truly good glass of wine, or your first experience with high quality sushi, once you’ve had a taste of the good stuff, you’ll see the rest in a totally different light. This post is meant to discuss SentencePiece, and show how painless it makes the tokenization process. Along the way you might just decide that tokenization isn’t so bad after all.

What is SentencePiece?

Surprisingly, it’s not actually a tokenizer. I know, that’s misleading. It’s actually a method for selecting tokens from a precompiled list, optimizing the tokenization process based on a supplied corpus. SentencePiece [1] is the name of a package (available here [2]) which implements the Subword Regularization algorithm [3] (all by the same author, Taku Kudo). For the duration of the post, I will continue to use SentencePiece to refer to both the algorithm and its package, as that will hopefully be less confusing.

If you want instructions on how to use the actual software, take a look at the colab link at the end of the article or in Reference [18].

Strengths of SentencePiece

  1. It’s implemented in C++ and blazingly fast. You can train a tokenizer on a corpus of 10⁵ characters in seconds.
  2. It’s also blazingly fast to tokenize. This means you can use it directly on raw text data, without the need to store your tokenized data to disk.
  3. Subword regularization is like a text version of data augmentation, and can greatly improve the quality of your model.
  4. It’s whitespace agnostic. You can train non-whitespace delineated languages like Chinese and Japanese with the same ease as you would English or French.
  5. It can work at the byte level, so you **almost** never need to use [UNK] or [OOV] tokens. (This isn’t unique to SentencePiece.)
  6. It can outperform plain BPE for language-model pretraining; see this paper [17]: “Byte Pair Encoding is Suboptimal for Language Model Pretraining” (https://arxiv.org/pdf/2004.03720.pdf).

SentencePiece is powerful, but what exactly is its purpose? Alluding to point (3), it enables us to train the subword regularization task, which we now explain.

The Subword Regularization objective

Our goal in this subsection is to explicitly develop the learning objective for subword regularization. To begin, let’s consider how we might assign probabilities to sequences. In general, we like to think of sequences as containing a direction, meaning there is a forward and a backward direction. You speak words in an order. Stocks rise and fall over time. We can thus specify the distribution of a sequence in terms of its conditional distributions at earlier times. Given a sequence of unigrams X = (x_1, … , x_n), repeated application of the product rule (the chain rule of probability) lets us rewrite its probability as
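Explicitly (this is just the chain rule, writing x_{<i} for all tokens before position i):

P(X) = p(x_1) \, p(x_2 | x_1) \, p(x_3 | x_1, x_2) \cdots p(x_n | x_1, \ldots, x_{n-1}) = \prod_{i=1}^{n} p(x_i | x_{<i})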

Readers familiar with the Naive Bayes method [5] should recognize this formula. In Naive Bayes, we make a conditional independence assumption (a strong assumption) that if X is conditioned on some variable y, i.e. P(X|y), then we can assume that p(x_2|x_1, y) = p(x_2 | y), p(x_3|x_2, x_1, y) = p(x_3|y) and so on. Meaning if we are given information about this other thing y, we can totally forget about all the variables x_{j<i} that x_i depends on.

To prep ourselves for such a simplification, we consider the problem of Neural Machine Translation (NMT). There, we want to assess the probability P(Y|X) of a target sentence Y given an input sentence X. As both Y and X are sequences, we can write the probability [3]
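In the spirit of [3], with N the length of the target sequence, this reads:

P(Y | X; \theta) = \prod_{n=1}^{N} P(y_n | X, y_{<n}; \theta)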

Here, the lowercase variables indicate the actual tokens, and the uppercase the sequences of such tokens. Theta represents our model parameters (the weights of a neural network). As it stands, this formula is not quite correct. In reality, X and Y can each be formed by an exponentially large number of possible subword sequences. Think of breaking down the word “hello”. We could tokenize it in a large number of ways:

Even a simple word like “hello” can exhibit an exponential number of possible tokenizations
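To make “exponential” concrete, here is a tiny, purely illustrative helper (not part of the tokenizer code) that enumerates every contiguous segmentation of a string; a word of length n has 2^(n-1) of them before we even restrict to a vocabulary:

```python
def segmentations(word):
    """Recursively enumerate every way to split `word` into contiguous pieces."""
    if len(word) <= 1:
        return [[word]] if word else [[]]
    results = []
    for i in range(1, len(word) + 1):
        head, tail = word[:i], word[i:]
        if not tail:
            results.append([head])
        else:
            for rest in segmentations(tail):
                results.append([head] + rest)
    return results

print(len(segmentations("hello")))   # 16 == 2 ** (len("hello") - 1)
print(segmentations("hello")[:3])    # [['h','e','l','l','o'], ['h','e','l','lo'], ['h','e','ll','o']]
```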

So in reality, we should replace X and Y on the left with a specific sequence representation x and y. SentencePiece acknowledges this reality, and uses it to its advantage. We can then write the full segmentation-averaged cost function for our NMT task as
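In the notation of [3], where D is the training corpus made of |D| sentence pairs (X^{(s)}, Y^{(s)}), this marginalized objective is:

\mathcal{L}(\theta) = \sum_{s=1}^{|D|} \mathbb{E}_{\mathbf{x} \sim P(\mathbf{x}|X^{(s)}), \; \mathbf{y} \sim P(\mathbf{y}|Y^{(s)})} \big[ \log P(\mathbf{y} \,|\, \mathbf{x}; \theta) \big]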

|D| is the number of sentence pairs in the training corpus, and x and y are segmentations drawn from their respective distributions over the possible segmentations of X and Y. L is the cost function and P is as before.

This formula is slightly intimidating, but you don’t need to think too hard about it. What we can do in practice is remove the expectation E[…] and just replace x and y with a single randomized segmentation. That’s it. Everything else we would normally do for training an NMT model is unchanged (this includes a short length penalty and any other model hacks).

The “use one sample” approximation should be familiar to anyone who has worked with Variational Autoencoders [6], where the expectation value over hidden states is similarly approximated by a sample size of one.

Now that we’ve defined the training task, we have to answer the practical question: yeah, cool, but how in the world do we build a probability distribution over an exponential number of states?! Like everything else in science, we cheat and approximate like hell until someone tells us we can’t.

Unigram language models

At its heart, SentencePiece is just another plain old unigram model. What this means is that it’s too hard to consider even a joint distribution P(x_1, x_2) of just two tokens, let alone the n needed to specify P(X). So, we just make the approximation:
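That is, for a tokenization X = (x_1, …, x_n):

P(X) = P(x_1, x_2, \ldots, x_n) \approx \prod_{i=1}^{n} p(x_i)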

subject to the normalization constraint:
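Writing V for the full subword vocabulary,

\sum_{x \in V} p(x) = 1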

where every subword (known as a unigram here) occurs with probability independent of every other subword. Yes, a drastic assumption, but empirically not too harmful.

Unigram language models are quite common, and have been around for a while. A popular method called Byte Pair Encoding (BPE), first introduced in the data-compression literature by Gage [7] and later used in the context of NMT by Sennrich et al. [8] (a very readable paper, btw), is a simple method for generating subwords based on encoding the text with the fewest required bits of information. A slight variant of BPE called WordPiece is another popular tokenizer, and we refer the reader to other digestible summary articles like [9] for a better overview.

In principle, SentencePiece can be built on any unigram model. The only things we need to feed it are

  1. The unigram probabilities
  2. The training corpus

We then just train the SentencePiece tokenizer on the corpus, and we are free to perform the subword-regularized (or not) NMT training. The beauty is we don’t even need to use subword regularization if we don’t want to. We could also just use SentencePiece as a fast tokenizer that lets us handle raw text on the fly.

Despite making the drastic unigram assumption, how to train the tokenizer is not at all obvious, and that is one of the reasons for this post. We explain this in detail after the next section.

In order to actually do something concrete though, we eventually have to pick a unigram model. In this post, we will use BPE, as it’s straightforward and simple enough that we can make a home-cooked version on the spot.

Byte Pair Encoding

The general idea is to do the following:

  1. Go through your corpus and find all the “bytes” i.e. the irreducible characters from which all others can be built. This is our base. It ensures we can almost always reconstruct any unseen input.
  2. Run a sliding window over the entire corpus (the actual code is slightly different) and find the most frequent bigram. Bigrams are formed from consecutive subwords in the current list of seen subwords. So, “hello” would have the counts {“he”: 1, “el”:1, “ll”:1, “lo”: 1}.
  3. Choose the most frequent bigram, add it to the list of subwords, then merge all instances of this bigram in the corpus.
  4. Repeat until you reach your desired vocabulary size.

By always picking the most frequent bigram (i.e. byte pair), we are in essence minimizing the coding length needed to encode our corpus, hence the whole “minimizing information” thing.

In practice, it’s not efficient to loop over the entire corpus every time we want to find a new byte pair. Instead, we loop once at the start to find all words (not subwords, actual words) and create a vocabulary, which is a dictionary matching words to their word counts. We then only need to loop over the words in the dictionary each time; if a word appears 20 times, then any of its subwords will also appear at least 20 times.

BPE is an incremental and deterministic method. We can always tokenize by following the same order of merges. I want to pause and emphasize the importance of this point. If we don’t follow the same order, all hell breaks loose. In BPE we first preprocess words to be space separated, with a special </w> token to indicate a word break. Consider the word

Boat → B o a t </w>

Suppose our corpus talks a lot about snakes (boa constrictors), breakfast, and sailing, so we might see the words “boa”, “oat”, and “boat”. If we talk more about snakes than breakfast, we might get the tokenization

(For snake enthusiasts) Boat → Boa t </w>

(For breakfast enthusiasts) Boat → B oat </w>

If we never talk about boats, then the tokenizers will never line up. One will have a Boa where the other has an Oat and they won’t agree.

With that digression out of the way, let’s dive right into the BPE code, since it’s not our main focus, and it’s actually semi-written out in the original paper [8].

The standard BPE format is how we wrote Boat above: subwords separated by spaces, with an end-of-word token </w>. We choose the token “_” instead of </w> to align better with SentencePiece.

First, in the `initialize_vocab` method, we initialize the vocabulary by getting all the words and their counts, then initialize the tokens by finding all irreducible characters.

The get_bigrams method is an auxiliary method to determine the most frequent bigram. The merge_vocab method takes care of updating the vocab to use the new bigram, and returns the bigram → byte-pair merge, so that we can store the operation in a list.

Finally, the find_merges function does the work of actually finding bigrams, and the fit function just coordinates all of the helper methods.
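Putting those pieces together, a minimal sketch of the encoder could look like the following (the method names match the description above, but the internals are a simplified, home-cooked version rather than the exact code of [8]):

```python
import re
from collections import Counter, defaultdict

class BPE:
    """A minimal byte-pair encoder. Words are stored space-separated, ending in '_'."""

    def __init__(self, corpus, vocab_size):
        self.corpus = corpus
        self.vocab_size = vocab_size
        self.vocab = {}    # "B o a t _" -> count of that word in the corpus
        self.tokens = {}   # subword -> (approximate) count
        self.merges = []   # ordered list of (left, right) merges

    def initialize_vocab(self):
        # Words and their counts, with characters space-separated and '_' appended.
        words = re.findall(r"\S+", self.corpus)
        self.vocab = Counter(" ".join(list(w)) + " _" for w in words)
        # The irreducible characters form the base token set.
        self.tokens = Counter()
        for word, count in self.vocab.items():
            for ch in word.split():
                self.tokens[ch] += count

    def get_bigrams(self):
        # Count bigrams of adjacent subwords across the current vocab.
        bigrams = defaultdict(int)
        for word, count in self.vocab.items():
            symbols = word.split()
            for left, right in zip(symbols, symbols[1:]):
                bigrams[(left, right)] += count
        return bigrams

    def merge_vocab(self, bigram):
        # Replace every occurrence of "left right" with "leftright" in the vocab keys.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(bigram)) + r"(?!\S)")
        self.vocab = {pattern.sub("".join(bigram), word): count
                      for word, count in self.vocab.items()}
        return bigram

    def find_merges(self, num_merges):
        for _ in range(num_merges):
            bigrams = self.get_bigrams()
            if not bigrams:
                break
            best = max(bigrams, key=bigrams.get)
            # Approximate count for the new token; good enough for seeding SentencePiece.
            self.tokens["".join(best)] = bigrams[best]
            self.merges.append(self.merge_vocab(best))

    def fit(self):
        self.initialize_vocab()
        # Each merge adds one new token, so stop once we hit the target size.
        self.find_merges(self.vocab_size - len(self.tokens))
```

The resulting tokens dictionary (subwords with rough counts) is exactly what we will hand to the SentencePiece trainer as its seed vocabulary.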

How to train SentencePiece

Great! Now that we have our byte-pair encoder ready to manufacture subwords on the spot, we can turn to training SentencePiece. We assume that we have a large collection of candidate subwords, more than whatever desired vocab size we ultimately want. To train, we want to maximize the log-probability of obtaining a particular tokenization X = (x_1, …, x_n) of the corpus, given the unigram probabilities p(x_1), …, p(x_n). Since only the full un-tokenized sequence X is observed, the actual tokenization (x_1, …, x_n) is unobserved. This is a classic setup for using the EM algorithm [10].

So it should be easy right? Just turn to any old ML textbook and copy/paste the EM result. Well,… no. The problem is that the x_i are all of different sizes!

It turns out that the code implementation of SentencePiece uses a Bayesian method of training, whereas the paper description uses a maximum-likelihood EM method. If this is confusing, just know that generally Bayesian = harder. To see what I’m talking about, you either have to really dig into the C++ code, or just look where I tell you in references [11] and [12].

Since this point will come up, we need to show how to solve the basic Bayesian EM problem for Dirichlet processes. As it is a little off the main topic, please see Appendix I for details. For now, we’ll keep moving along.

The SentencePiece training objective is the following. We wish to maximize the log-likelihood
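In symbols, writing S(X) for the set of all candidate segmentations of the un-tokenized text X:

\log P(X) = \log \sum_{\mathbf{x} \in S(X)} P(\mathbf{x})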

Where the x are candidate unigram (subword) sequences, and S(X) denotes the set of all possible segmentations of X. Again, these are hidden variables; we only see the un-tokenized corpus! To solve this, we incorporate an EM-type algorithm. If you’re familiar with EM, you’ll notice the steps are actually backwards; we do an ME method. Despite the fancy name, it’s actually quite intuitive and straightforward. The steps are:

  1. Initialize the unigram probabilities. Remember P(x) = P(x_1)…P(x_n) so once we have the unigrams, we have the probability of any sequence. In our code, we just use the BPE frequency counts to get closer to the objective.
  2. M-step: compute the most probable unigram sequence given the current probabilities. This defines a single tokenization. Implementing this requires some thought.
  3. E-step: given the current tokenization, recompute the unigram probabilities by counting the occurrences of all subwords in the tokenization. The frequentist unigram probability is just the frequency with which that unigram occurs. In practice, it is no more difficult to Bayesian-ify this (see the Appendix) and instead compute the update written out just below this list.
Here, c_i is the count of subword (unigram) i in the current tokenization, M is the total number of subwords, and Psi is the digamma function. The arrow indicates how we Bayesian-ify.

  4. Repeat steps 2 and 3 until convergence. The log-likelihood is theoretically guaranteed to monotonically increase, so if it doesn’t, you’ve got a bug.
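The update referred to in step 3 is (the arrow shows the Bayesian-ification, derived in the Appendix with alpha set to 0):

p(x_i) = \frac{c_i}{M} \;\longrightarrow\; \frac{\exp\big(\psi(c_i)\big)}{\exp\big(\psi(M)\big)}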

We’re almost there; the final piece we need is how to compute step 2.

Finding the optimal tokenization

If all of the subwords were of the same length, this would be a classic application of the Viterbi algorithm [13]. But alas, life is not so simple. The Viterbi algorithm applies to the following problem

You have some hidden states z_1, …, z_n, and you want to transition from z_1 → z_2 → … → z_n. You know the transition matrix A_ij, giving the probability to go from z_i^{(1)} → z_j^{(2)}, where i and j index the hidden-state dimension, and the superscript is the sequence order. All transitions get the same A matrix. You can use Viterbi to construct an optimal path.

The issue is that our transitions are not only between adjacent states; a single token can span several positions at once. To understand how this may be a problem, let us represent a simple tokenization procedure diagrammatically.

Consider tokenizing the word “hello”. Let’s say we have the subwords:

{“he”, “h”, “ll”, “e”, “o”, “hell”}.

Then, we can generate the following “trellis-like” figure (it’s not quite a trellis, since we don’t show transitions to all possible hidden states, i.e. characters, and transitions are not restricted to nearest neighbors; we can jump more than one box to the right):

Source: current author. Dashed lines delineate fundamental character/byte spacing. Arrows indicate which words are reachable via allowed tokens. Each column represents a hidden state z_i^{(j)}. i has dimension given by the number of fundamental characters, the superscript indicates sequence order.

Each arrow represents a transition, and we can think of it as carrying a probability as well, given by the unigram probability of the token created from the arrow’s tail (not inclusive) to its head (inclusive). The goal is to now pick arrows such that we arrive at <eos> — end of sequence — with as high a probability as possible.

This problem has optimal substructure and can be solved with dynamic programming. Let’s say we are sitting at state (4). There are three arrows feeding into it, a red, a blue, and a green. The max probability at (4) is just the best path out of the three possible choices: coming from red, from blue, or from green. In equations:
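In general form (the red, blue, and green arrows of the figure are just the particular positions i that can reach state (4)):

p_{\max}(j) = \max_{i < j} \; p_{\max}(i) \cdot p\big(\text{subword spanning } i \to j\big)

where the max runs only over positions i from which an allowed token (an arrow) reaches j.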

We’re almost ready to code it up, but there’s one issue. How do we actually find those arrows we drew? And how do we do it efficiently? For this, we need to make use of the Trie data structure [14]. It’s a bit difficult to explain in words (pun intended!), so let’s just show what the trie looks like for our current problem:

Source: current author. This is the Trie corresponding to the subword dictionary {‘h’, ’he’, ’hell’, ’hello’}. There are additional nodes <sos>-e-<end>, and likewise for ‘o’ and ‘l’, that we have omitted for clarity.

The root node is the start-of-sequence token <sos>. Any time we encounter an <end> node, it signifies that everything on the path from <sos> to <end> is a valid subword. The root <sos> will begin with exactly one branch for every unique character in our subword list. As we grow the available subwords, we create more branches in our trie. The Trie is going to be the fundamental data structure that our tokenizer uses to store and retrieve subwords.

Here’s a basic Trie implementation in python that will fit all of our needs:
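(A minimal version; each node also carries a value slot that we will later use to store the subword’s log-probability.)

```python
import math

class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_end = False   # True if the path from the root spells a valid subword
        self.value = 0.0      # e.g. the subword's log-probability

class Trie:
    """Stores subwords so that, standing at any character in the text, we can walk
    forward and enumerate every vocabulary token that starts there."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word, value=0.0):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
        node.value = value

    def get_value(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return None
            node = node.children[ch]
        return node.value if node.is_end else None

# Example: the subword dictionary from the figure above, with made-up probabilities.
trie = Trie()
for piece, prob in [("h", 0.2), ("he", 0.4), ("hell", 0.3), ("hello", 0.1)]:
    trie.add(piece, value=math.log(prob))
```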

We’ve now got everything we need. The actual algorithm for computing sequences can vary depending on whether we want just the best sequence (Viterbi) or the n_best, so we will keep two separate functions for this. The dynamic-programming solution to this type of problem is old, and so it goes by other names; it is referred to as a forward-backward algorithm, and is a special case of the sum-product algorithm for training directed graphical models [13, pg. 613]. More sophisticated algorithms include the Forward-DP Backward-A* algorithm [15] and the Forward-Filtering and Backward-Sampling (FFBS) algorithm [16]. Our solution will be closer to the latter.

There’s one final note before showing the code. When performing the max over earlier positions i < j, we brute-force search only the positions with j − T ≤ i < j, where T is the length of the longest subword. T is probably going to be a small number, so this shouldn’t harm our O(N) algorithm.

Ok, let’s now build up our SentencePiece trainer. For the moment, you only need to pay attention to two methods: 1) forward_step and 2) backward_step.
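Here is a sketch of those two, written as standalone functions that use the Trie from above (whose node values are assumed to hold each subword’s log-probability); the actual SentencePiece code differs in its details:

```python
import math

def forward_step(text, trie):
    """Viterbi-style forward pass: for every end position, store the best
    log-probability of tokenizing text[:end] and the length of the final token."""
    N = len(text)
    best_logp = [-math.inf] * (N + 1)
    best_len = [0] * (N + 1)     # length of the best token ending at each position
    best_logp[0] = 0.0
    for start in range(N):
        if best_logp[start] == -math.inf:
            continue
        # Walk the trie to enumerate every subword beginning at `start`.
        node = trie.root
        for end in range(start + 1, N + 1):
            ch = text[end - 1]
            if ch not in node.children:
                break
            node = node.children[ch]
            if node.is_end:
                candidate = best_logp[start] + node.value   # value = log p(subword)
                if candidate > best_logp[end]:
                    best_logp[end] = candidate
                    best_len[end] = end - start
    return best_logp, best_len

def backward_step(text, best_len):
    """Retrace the stored token lengths from the end of the string to recover the
    best tokenization. Assumes every single character is in the vocabulary, so
    every position is reachable."""
    tokens, end = [], len(text)
    while end > 0:
        start = end - best_len[end]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]
```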

The forward and backward steps implement the algorithm we just talked about. The backward step hasn’t been discussed yet though. While we compute the forward step, we also store the length of the max token ending at any given index. This way, we can retrace all of the arrows that led to our max probability, since the length of the arrow fully specifies the transition! This is because the tokenized text isn’t changing, meaning the hidden states are set. We really are just choosing how many steps to jump at every character.

The full EM step is easy to put together now. We follow the steps outlined earlier, and use the Bayesian-ified EM step, which is why we had to import the digamma function from scipy.
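A sketch of one such round, reusing forward_step and backward_step from above. Note this is a simplified “hard EM”: it re-tokenizes with the single best segmentation rather than with expected counts over all segmentations, which is good enough to illustrate the update:

```python
from scipy.special import digamma

def em_step(texts, logps):
    """One EM round. M-step: Viterbi-tokenize the corpus with the current
    log-probabilities. E-step: recount subwords and Bayesian-ify the counts."""
    trie = Trie()
    for tok, lp in logps.items():
        trie.add(tok, value=lp)

    counts, corpus_logp = {}, 0.0
    for text in texts:
        best_logp, best_len = forward_step(text, trie)
        corpus_logp += best_logp[-1]
        for token in backward_step(text, best_len):
            counts[token] = counts.get(token, 0) + 1

    # Frequentist update: p_i = c_i / M.
    # Bayesian-ified update (alpha = 0): p_i = exp(digamma(c_i)) / exp(digamma(M)).
    M = sum(counts.values())
    new_logps = {tok: digamma(c) - digamma(M) for tok, c in counts.items()}
    return new_logps, corpus_logp
```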

There’s one more piece (pun intended again) to make this complete. In general, we want to be able to fix our vocab size. What SentencePiece does is first aggregate more subword tokens than it really needs. We then perform pruning “rounds”, whereby we optimize with the EM algorithm and then remove, or prune off, the least probable 20% of tokens (the probabilities were computed in the E-step). We repeat this procedure until we reach our desired vocab size.

The fit method now takes care of running the EM steps, pruning after each round, then continuing if further reductions are needed. And that’s the trainer.
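A sketch of what that loop could look like, written as a standalone function (the 20% pruning fraction and the number of EM rounds per pruning round are arbitrary choices here, and base characters are always kept so that any string stays tokenizable):

```python
import math

def fit(texts, pieces, vocab_size, prune_frac=0.2, em_rounds=2):
    """Alternate EM rounds with pruning until the vocab is (roughly) vocab_size.
    `pieces` maps each candidate subword (e.g. from the BPE step) to a count."""
    M = sum(pieces.values())
    logps = {tok: math.log(c / M) for tok, c in pieces.items()}
    base_chars = {tok for tok in pieces if len(tok) == 1}   # never prune these

    while True:
        for _ in range(em_rounds):
            logps, log_likelihood = em_step(texts, logps)
            # Hard EM drops tokens it never used; keep the base characters alive.
            for ch in base_chars:
                logps.setdefault(ch, math.log(1e-6))
        if len(logps) <= vocab_size:
            break
        # Prune the least probable ~20% of tokens, never the base characters.
        n_keep = max(vocab_size, int(len(logps) * (1 - prune_frac)))
        ranked = sorted(logps, key=logps.get, reverse=True)
        keep = set(ranked[:n_keep]) | base_chars
        pruned = {tok: lp for tok, lp in logps.items() if tok in keep}
        if len(pruned) == len(logps):   # nothing left to prune away
            break
        logps = pruned
    return logps
```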

Subword sampling

The final part is subword sampling. Disclaimer: we don’t use exactly the same algorithm as the SentencePiece package, but it’s enough to generate randomized tokenizations and provide a proof of concept.

In the forward-backward pass, we only stored the optimal sequence. To find alternative tokenizations, at each index instead of saving only the optimal ending subword, we save the n_best number of ending subwords. Now, we can tokenize by randomly sampling each final subword from the provided list. This gives us subword regularization!

Why the disclaimer? Well, if you think about it, we actually still have the same problem as before: randomly sampling ending subwords locally does not guarantee that the full tokenization will be the 2nd, 3rd, 4th, or whatever best overall. To do that properly, please see the actual SentencePiece implementation [2] or the FFBS algorithm [16].
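With that caveat, here is a sketch of the two generalized methods in the same style as forward_step and backward_step above:

```python
import math
import random

def generalized_forward_step(text, trie, n_best=3):
    """Like forward_step, but at every end position keep the n_best candidate
    (log-probability, token length) pairs instead of only the single best one."""
    N = len(text)
    best_logp = [-math.inf] * (N + 1)
    candidates = [[] for _ in range(N + 1)]
    best_logp[0] = 0.0
    for start in range(N):
        if best_logp[start] == -math.inf:
            continue
        node = trie.root
        for end in range(start + 1, N + 1):
            ch = text[end - 1]
            if ch not in node.children:
                break
            node = node.children[ch]
            if node.is_end:
                logp = best_logp[start] + node.value
                candidates[end] = sorted(candidates[end] + [(logp, end - start)],
                                         reverse=True)[:n_best]
                best_logp[end] = candidates[end][0][0]
    return candidates

def generalized_backward_step(text, candidates):
    """Walk backwards from the end, randomly picking one of the stored ending
    tokens at each step to produce a randomized tokenization."""
    tokens, end = [], len(text)
    while end > 0:
        _, length = random.choice(candidates[end])
        tokens.append(text[end - length:end])
        end -= length
    return tokens[::-1]
```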

The random sampler is provided by the generalized_forward_step and generalized_backward_step methods. Here’s an example output:

Sample 1: ['h', 'e', 'l', 'l', 'o', '_', 'w', 'or', 'l', 'd']
Sample 2: ['h', 'e', 'l', 'l', 'o', '_', 'w', 'o', 'r', 'l', 'd']
Sample 3: ['h', 'e', 'l', 'l', 'o', '_', 'w', 'or', 'l', 'd']

Conclusion

This was a long dive into SentencePiece, but I hope it has been worth the effort. Now that you know more about how it works, you can promptly forget everything and just piece together what you need from the bullet points at the top of the article (pun still intended!!):

  1. Speed speed speed. Can directly handle text data during training.
  2. Subword regularization → better models, data augmentation, improved Language Modeling pretraining
  3. Can easily tokenize non-whitespace languages like Japanese and Chinese
  4. No more [UNK] tokens (well…almost no more [UNK] tokens)

If you have any NLP tasks, please strongly consider using SentencePiece as your tokenizer. For using the actual software, I’ve found the following Google Colab tutorial to be really useful [18].

Thanks, and happy tokenizing!

Appendix

Here, we discuss how to Bayesian-ify your EM algorithm. You can also read along with Reference [12].

The discussion is based on mean field theory (variational inference). We will simply use its main result, i.e. the coupled equations for determining posterior distributions over hidden parameters/states. See Chapter 10 of [13] for further reading.

The goal of variational inference is to determine the posterior distributions of our model’s unobserved parameters and/or states. We begin by writing the model’s log evidence:
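For a single un-tokenized text X, this is the same objective we wrote down in the training section:

\log p(X) = \log \sum_{\mathbf{x} \in S(X)} P(\mathbf{x})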

Which is a repeat of our earlier definition; nothing new yet. Recall, X is the un-tokenized corpus, S(X) represents all possible segmentations, and the boldface lowercase x denotes a particular tokenization. Due to the summation inside the log, this model is intractable. To make headway, we first introduce hidden variables pi, denoting the unigram probabilities. We further condition these probabilities on a Dirichlet prior, creating a Bayesian model. We write:
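One way to write this hierarchical model, with z standing for the hidden tokenization (in the indicator form defined just below), is the factorized joint

p(X, \mathbf{z}, \pi \,|\, \alpha) = p(X, \mathbf{z} \,|\, \pi) \, p(\pi \,|\, \alpha)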

The probability p(pi|alpha) is a symmetric Dirichlet distribution:
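Writing K for the size of the subword vocabulary,

p(\pi \,|\, \alpha) = \frac{\Gamma(K\alpha)}{\Gamma(\alpha)^{K}} \prod_{k=1}^{K} \pi_k^{\alpha - 1}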

and the probability of any particular sequence is the product of its unigram probabilities
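In indicator notation, with N the sequence length and K the vocabulary size:

p(\mathbf{z} \,|\, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}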

The z_{nk} are binary-valued (they can only be 0 or 1) and represent whether the nth position in the sequence is given by the kth subword in the vocabulary. We now make the mean field approximation for the posterior distribution
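i.e. we assume the posterior factorizes as

q(\mathbf{z}, \pi) = q_z(\mathbf{z}) \, q_\pi(\pi)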

The subscripts are purely labels; they don’t represent dimensions or anything like that, they’re just names. From here, we can make use of the general formulae for computing posteriors:
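For the hidden tokenization this reads (see Chapter 10 of [13]):

\log q_z(\mathbf{z}) = \mathbb{E}_{q_\pi}\big[ \log p(X, \mathbf{z}, \pi) \big] + \text{const}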

and similarly for pi
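(the same mean-field recipe, now averaging over z):

\log q_\pi(\pi) = \mathbb{E}_{q_z}\big[ \log p(X, \mathbf{z}, \pi) \big] + \text{const}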

Inserting our definitions for the log model evidence into the expectation values and pulling out all the terms not averaged over, we find the two equations
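Keeping, in each case, only the terms that depend on the variable of interest, these work out to

\log q_\pi(\pi) = \sum_{k=1}^{K} \Big( \alpha - 1 + \sum_{n} \mathbb{E}[z_{nk}] \Big) \log \pi_k + \text{const}

\log q_z(\mathbf{z}) = \sum_{n} \sum_{k} z_{nk} \, \mathbb{E}[\log \pi_k] + \text{const}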

Now we make the following observation. The top equation is precisely in the form of a Dirichlet distribution, with a modified prior. Furthermore, since the z_nk are binary variables, we can compute their expectation as just the count of the various unigrams. We define this with the variable
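c_k \equiv \sum_{n} \mathbb{E}_{q_z}[z_{nk}]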

which again is just the unigram count. Now that we know the distribution over pi, we can compute its expectation value using the known result [13, page 687, Eq. B.21 OR go to wikipedia — way faster]
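Since q_pi is a Dirichlet with parameters c_k + alpha, this gives

\mathbb{E}[\log \pi_k] = \psi(c_k + \alpha) - \psi\Big( \sum_{j} c_j + K\alpha \Big)

with psi the digamma function.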

Putting this into the equation for log q_z(z), we find the distribution
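Exponentiating the expectation above, this is proportional to

q_z(\mathbf{z}) \propto \prod_{n} \prod_{k} \left( \frac{\exp\big(\psi(c_k + \alpha)\big)}{\exp\big(\psi(\sum_j c_j + K\alpha)\big)} \right)^{z_{nk}}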

This is exactly our categorical distribution over the weights pi! In other words, we immediately recognize that the thing inside the parentheses is indeed our weight pi. The only thing we need to mention is that when applying the Bayesian-ified method, we set alpha = 0. This has the effect of enhancing high-count unigrams and diminishing low-count ones.

References

[1] Kudo, Taku, and John Richardson. “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing.” arXiv preprint arXiv:1808.06226 (2018).

[2] https://github.com/google/sentencepiece/

[3] Kudo, Taku. “Subword regularization: Improving neural network translation models with multiple subword candidates.” arXiv preprint arXiv:1804.10959 (2018).

[4] Steven L Scott. “Bayesian methods for hidden markov models: Recursive computing in the 21st century.” Journal of the American Statistical Association (2002).

[5] http://cs229.stanford.edu/notes-spring2019/cs229-notes2.pdf

[6] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[7] Philip Gage. “A new algorithm for data compression.” C Users J. 12(2):23–38 (1994).

[8] Rico Sennrich, Barry Haddow, and Alexandra Birch. “Neural machine translation of rare words with subword units.” In Proc. of ACL (2016).

[9] https://huggingface.co/transformers/tokenizer_summary.html

[10] http://cs229.stanford.edu/notes-spring2019/cs229-notes8.pdf

[11] Specifically, go to Line 271 https://github.com/google/sentencepiece/blob/master/src/unigram_model_trainer.cc

[12] If you looked at [11] you will find the link, page 178 in particular: https://cs.stanford.edu/~pliang/papers/tutorial-acl2007-talk.pdf

[13] Bishop, Christopher M. “Pattern recognition and machine learning.” Springer (2006). Look at page 629.

[14] https://en.wikipedia.org/wiki/Trie — Sorry I don’t have a better reference, I’m really not even sure how I know this anymore.

[15] Masaaki Nagata. “A stochastic japanese morphological analyzer using a forward-dp backward-a* nbest search algorithm.” In Proc. of COLING (1994).

[16] Steven L Scott. “Bayesian methods for hidden markov models: Recursive computing in the 21st century.” Journal of the American Statistical Association (2002).

[17] Bostrom, Kaj, and Greg Durrett. “Byte pair encoding is suboptimal for language model pretraining.” arXiv preprint arXiv:2004.03720 (2020).

[18] https://colab.research.google.com/github/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

[19] Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, (2012).


I’m a physics PhD recently taken over by the allure of AI. When I’m not at the computer you might catch me listening to hip hop or watching basketball.