A small timing experiment on the new Tokenizers library — a write-up

Spoiler alert: It’s blazingly fast 🔥

Steven van de Graaf
Towards Data Science


A little over a week ago, the lovely people at Hugging Face released their new Tokenizers library to the public. Written in the Rust programming language (known, among other things, for its performance compared to Python), Tokenizers provides “an implementation of today’s most used tokenizers, with a focus on performance and versatility”.

As such, this write-up presents the results of some small timing experiments that compare the different implementations of the WordPiece tokenizer, as introduced by Wu et al. (2016) [1] and popularized with the release and publication of BERT by Devlin et al. (2018) [2]. All of the code relating to these timing experiments can be found in the Jupyter Notebook below:

🤗 Transformers vs 💥 Tokenizers

In this first timing experiment, I compared the performance (in terms of execution time) of the BERT WordPiece tokenizer as implemented in the popular Transformers library (also by Hugging Face) to that of the new Tokenizers library. For both, I tokenized (encoded) 1 million English-language sentences over 5 independent runs; the results can be found below:

Transformers vs Tokenizers timing experiment

As you can see, with a mean execution time of just 45.6 seconds, the Tokenizers implementation presents an almost 9x speed-up over the Transformers implementation (which has a mean execution time of 6 minutes and 42 seconds).
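For reference, a single run of this experiment can be set up roughly as in the sketch below. Note that the vocabulary file path, the placeholder sentence list, and the timing code are assumptions on my part rather than the exact setup from the notebook:

```python
import time

from tokenizers import BertWordPieceTokenizer  # 💥 Tokenizers
from transformers import BertTokenizer         # 🤗 Transformers

# Assumed inputs: a local BERT vocabulary file and a placeholder corpus
slow_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
fast_tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
sentences = ["This is an example sentence."] * 1_000_000  # stand-in for the real data

# Time the Transformers (pure Python) implementation, one sentence at a time
start = time.perf_counter()
for sentence in sentences:
    slow_tokenizer.encode(sentence)
print(f"Transformers: {time.perf_counter() - start:.1f} s")

# Time the Tokenizers (Rust) implementation, one sentence at a time
start = time.perf_counter()
for sentence in sentences:
    fast_tokenizer.encode(sentence)
print(f"Tokenizers:   {time.perf_counter() - start:.1f} s")
```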

🧵 Multithreaded performance

Tokenizers also provides a method to encode multiple sentences at once (in batches), which can significantly improve performance thanks to its multithreaded implementation (in Rust). Python itself also supports multithreading, however, for example via the concurrent.futures module.
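Batch encoding with Tokenizers looks roughly like the following minimal sketch (the vocabulary file and the sentence list are again placeholders):

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
sentences = ["This is an example sentence."] * 1_000_000  # placeholder corpus

# encode_batch tokenizes the whole list in a single call,
# parallelized across threads on the Rust side
encodings = tokenizer.encode_batch(sentences)
print(encodings[0].tokens)
```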

As such, and similarly to the first timing experiment, I compared the performance of the BERT WordPiece tokenizer using concurrent.futures.ThreadPoolExecutor (with both submit and map) against Tokenizers’ native encode_batch; the results can be found below:

Multithreaded performance timing experiment

As you can see, both submit and map surprisingly perform (equally) worse than non-multithreaded tokenization, most likely because Python’s Global Interpreter Lock only lets one thread execute Python code at a time, so CPU-bound work like tokenization gains nothing from threads while still paying their scheduling overhead. What is even more interesting (and impressive), however, is that the multithreaded encode_batch native to the Tokenizers library takes only 10.6 seconds to tokenize 1 million sentences!
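For completeness, the two thread-pool variants can be sketched roughly as follows; the worker count, the single-sentence encode being wrapped, and the placeholder data are assumptions on my part:

```python
from concurrent.futures import ThreadPoolExecutor

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
sentences = ["This is an example sentence."] * 1_000_000  # placeholder corpus

# Variant 1: map — apply encode to every sentence across the worker threads
with ThreadPoolExecutor(max_workers=8) as executor:
    encodings = list(executor.map(tokenizer.encode, sentences))

# Variant 2: submit — schedule one future per sentence, then collect the results
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(tokenizer.encode, s) for s in sentences]
    encodings = [f.result() for f in futures]
```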

Conclusions

As advertised, the new Tokenizers library by Hugging Face provides a significantly (almost 9x) faster BERT WordPiece tokenizer implementation than the one in the Transformers library. When tokenizing sentences in batches, the results are even more impressive: it takes only 10.6 seconds to tokenize 1 million sentences. As such, I think I can safely conclude that it’s blazingly fast 🔥!

The new Tokenizers library provides more benefits than just its impressive performance (e.g. the ability to train a tokenizer on a new vocabulary). Still, it should be said that this significant increase in performance not only allows ever larger data sets to be tokenized (on the fly), but also helps democratize these methods and techniques (e.g. through deployment on cheaper hardware, such as mobile phones and SoCs), allowing aspiring NLP enthusiasts from all backgrounds to get started with the latest and greatest in NLP research. 🤗

References

[1] Y. Wu et al., Google’s neural machine translation system: Bridging the gap between human and machine translation (2016), arXiv preprint arXiv:1609.08144.

[2] J. Devlin et al., BERT: Pre-training of deep bidirectional transformers for language understanding (2018), arXiv preprint arXiv:1810.04805.

