
What is tokenization?
Tokenization is the process of breaking text into smaller pieces called tokens. These tokens can be sentences, words, or sub-words. For example, the sentence "I won" can be tokenized into two word-tokens: "I" and "won".
The formal definition of tokens, according to linguistics, is "an individual occurrence of a linguistic unit in speech or writing, as contrasted with the type or class of linguistic unit of which it is an instance."
Traditional methods of tokenization include whitespace, punctuation, or regex tokenization. Newer language models like BERT and GPT have driven the development of more advanced sub-word tokenization methods such as byte-pair encoding, WordPiece, and SentencePiece.
Why is tokenization useful?
Tokenization allows machines to read text. Both traditional and deep-learning methods in natural language processing rely heavily on it, and it is typically a pre-processing step in NLP applications. For example, to count the number of words in a text, the text is first split into word tokens. Tokenization is also used for feature engineering in both traditional and deep-learning methods: the input text, for instance, is processed with WordPiece sub-word tokenization before it is fed into BERT’s neural network architecture.
What are word tokenizers?
Word tokenizers are one class of tokenizers that split a text into words. The resulting tokens can be used to build a bag-of-words representation of the text for models like TF-IDF, or to train word embeddings such as word2vec.
Word tokenizers in NLTK
(The Jupyter notebook for this exercise is available here)
NLTK is a commonly used Python package for natural language processing. The nltk.tokenize module offers several options for tokenizers. We will look at five options for word tokenization in this article.
Before we proceed, let us import the relevant functions from the package.
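A minimal sketch of the imports is shown below, together with a hypothetical example sentence used in the following snippets (the original notebook may use a different example):

```python
from nltk.tokenize import (
    WhitespaceTokenizer,
    wordpunct_tokenize,
    TreebankWordTokenizer,
    TweetTokenizer,
    MWETokenizer,
)

# Hypothetical example sentence used in the snippets below.
text = "Good evening Ms. Martha Jones! How are you doing todaaaay? :)"
```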

Whitespace tokenization
This is the simplest and most commonly used form of tokenization. It splits the text whenever it finds whitespace characters.
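A quick sketch using the example sentence defined above:

```python
# Split on whitespace only; punctuation stays attached to the words.
ws_tokens = WhitespaceTokenizer().tokenize(text)
print(ws_tokens)
# ['Good', 'evening', 'Ms.', 'Martha', 'Jones!', 'How', 'are', 'you',
#  'doing', 'todaaaay?', ':)']
```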

It is advantageous since it is a quick and easily understood method of tokenization. However, due to its simplicity, it does not take special cases into account. In the example, "Jones" would be a more useful token than "Jones!".
Punctuation-based tokenization
Punctuation-based tokenization is slightly more advanced than whitespace tokenization since it splits on both whitespace and punctuation marks, and it retains the punctuation marks as separate tokens.
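A sketch using NLTK's wordpunct_tokenize, which matches runs of alphanumeric characters and runs of punctuation separately:

```python
# Sequences of word characters and sequences of punctuation become separate tokens.
punct_tokens = wordpunct_tokenize(text)
print(punct_tokens)
# ['Good', 'evening', 'Ms', '.', 'Martha', 'Jones', '!', 'How', 'are',
#  'you', 'doing', 'todaaaay', '?', ':)']
```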

Punctuation-based tokenization overcomes the issue above and provides a meaningful "Jones" token. However, in cases like "Ms.", it would be better to retain the period, since "Ms." can mean something different from "Ms", "MS", or "mS" in other contexts.
Default/TreebankWordTokenizer
NLTK's default tokenization method uses regular-expression rules defined for the Penn Treebank (based on English text). It assumes that the text has already been split into sentences.
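A sketch calling TreebankWordTokenizer directly on the same example sentence (the exact output depends on the NLTK version):

```python
# Treebank-style rules: mid-sentence abbreviations like "Ms." keep their period,
# while terminal punctuation such as "!" and "?" becomes a separate token.
tb_tokens = TreebankWordTokenizer().tokenize(text)
print(tb_tokens)
# e.g. ['Good', 'evening', 'Ms.', 'Martha', 'Jones', '!', 'How', 'are',
#       'you', 'doing', 'todaaaay', '?', ':', ')']
```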

This is a very useful form of tokenization since it incorporates several linguistic rules to split the sentence into more meaningful tokens.
TweetTokenizer
Special texts, like Twitter tweets, have a characteristic structure, and the generic tokenizers mentioned above fail to produce viable tokens when applied to these datasets. NLTK offers a special tokenizer for tweets to help in this case. This is a rule-based tokenizer that can handle HTML entities, remove problematic characters, strip Twitter handles, and normalize text length by reducing repeated letters (for example, "soooo" becomes "sooo").
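A sketch with a made-up tweet (the handle, hashtag, and elongated words are purely illustrative):

```python
tweet = "@DoctorWho Martha Jones is soooo cooooool! #TeamMartha"

# strip_handles removes "@DoctorWho"; reduce_len caps repeated letters at three.
tweet_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tweet_tokenizer.tokenize(tweet))
# e.g. ['Martha', 'Jones', 'is', 'sooo', 'coool', '!', '#TeamMartha']
```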

MWETokenizer
The multi-word expression tokenizer is a rule-based, "add-on" tokenizer offered by NLTK. Once the text has been tokenized by a tokenizer of choice, some tokens can be re-grouped into multi-word expressions.
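A sketch that re-groups the base tokens produced above (the multi-word expression list is a hypothetical choice):

```python
# Merge the pair ("Martha", "Jones") back into one token after base tokenization.
mwe_tokenizer = MWETokenizer([("Martha", "Jones")], separator=" ")
base_tokens = TreebankWordTokenizer().tokenize(text)  # any base tokenizer works here
print(mwe_tokenizer.tokenize(base_tokens))
# e.g. ['Good', 'evening', 'Ms.', 'Martha Jones', '!', 'How', 'are', 'you',
#       'doing', 'todaaaay', '?', ':', ')']
```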

For example, the name Martha Jones is combined into a single token instead of being broken into two tokens. This tokenizer is very flexible since it is agnostic to the base tokenizer that was used to generate the tokens.
Conclusion
Tokenization is an integral part of natural language processing. It is the first step in teaching machines how to read, and it remains important as the field progresses into deep learning. There are three types of tokenization: sentence, word, and sub-word.
In this article, we took a deep dive into five options for word tokenization in NLTK. The findings are summarized in the table below:

| Tokenizer | How it works | Notes |
| --- | --- | --- |
| Whitespace | splits on whitespace characters | quick and simple, but punctuation stays attached to words ("Jones!") |
| Punctuation-based | splits on whitespace and punctuation | produces clean word tokens, but breaks up abbreviations like "Ms." |
| Default/Treebank | Penn Treebank regex rules | linguistically informed; assumes text is already split into sentences |
| TweetTokenizer | rules tailored to tweets | strips handles and reduces repeated letters |
| MWETokenizer | add-on over a base tokenizer | merges multi-word expressions such as "Martha Jones" |