Tokenization for Natural Language Processing

Srinivas Chakravarthy
Towards Data Science
8 min read · Jun 19, 2020


Natural language processing (NLP) is the field of building software that processes natural language. It has many applications, such as sentiment analysis, language translation, fake news detection, and grammatical error detection.

The input in natural language processing is text, and this text is collected from many different sources. It requires a lot of cleaning and processing before it can be used for analysis.

These are some of the methods of processing the data in NLP:

  • Tokenization
  • Stop words removal
  • Stemming
  • Normalization
  • Lemmatization
  • Parts of speech tagging

Tokenization

Tokenization is the process of breaking raw text into small chunks, such as words or sentences, called tokens. These tokens help in understanding the context and in building models for NLP. Tokenization helps in interpreting the meaning of the text by analyzing the sequence of words.

For example, the text “It is raining” can be tokenized into ‘It’, ‘is’, ‘raining’.


There are different methods and libraries available to perform tokenization. NLTK, spaCy, Gensim, TextBlob and Keras are some of the libraries that can be used to accomplish the task.

Tokenization can be done at either the word or the sentence level. Splitting text into words is called word tokenization, and splitting it into sentences is called sentence tokenization.

Stop words are words that do not add much meaning to a sentence, and removing them does not affect the processing of the text for the defined purpose. They are removed from the vocabulary to reduce noise and to reduce the dimension of the feature set.
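
As an illustration, a minimal sketch using NLTK's English stop word list (the sample sentence is made up):

    import nltk
    nltk.download("stopwords")   # one-time download of the stop word lists

    from nltk.corpus import stopwords

    tokens = "It is raining and I do not have an umbrella".split()
    stop_words = set(stopwords.words("english"))
    print([t for t in tokens if t.lower() not in stop_words])   # ['raining', 'umbrella']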

There are various tokenization techniques available, and the choice depends on the language and the purpose of modeling. Below are a few of the tokenization techniques used in NLP.

White Space Tokenization

This is the simplest tokenization technique. Given a sentence or paragraph, it splits the input into words whenever white space is encountered. It is the fastest tokenization technique, but it only works well for languages in which white space already separates the text into meaningful words, such as English. A minimal sketch is shown below.
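
Plain Python string splitting is all that is needed for this (the sample sentence is made up):

    # Whitespace tokenization: split on any run of spaces, tabs or newlines.
    text = "It is raining heavily in the city"
    tokens = text.split()
    print(tokens)   # ['It', 'is', 'raining', 'heavily', 'in', 'the', 'city']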

Dictionary Based Tokenization

In this method, tokens are found by looking them up in an existing dictionary. If a token is not found in the dictionary, special rules are used to tokenize it. It is a more advanced technique than the whitespace tokenizer; a greedy longest-match sketch is given below.
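
This is only an illustrative sketch with a made-up toy dictionary: always take the longest dictionary entry that matches, and fall back to keeping the single word as-is (the "special rule" here) when nothing matches.

    # Greedy longest-match dictionary tokenization (toy, illustrative dictionary).
    dictionary = {"i", "live", "in", "new", "york", "new york"}
    max_len = max(len(entry.split()) for entry in dictionary)   # longest entry, in words

    def dict_tokenize(text):
        words = text.lower().split()
        tokens, i = [], 0
        while i < len(words):
            # Try the longest span first, then shrink until an entry matches.
            for span in range(min(max_len, len(words) - i), 0, -1):
                candidate = " ".join(words[i:i + span])
                if candidate in dictionary or span == 1:
                    tokens.append(candidate)   # unknown single words fall through unchanged
                    i += span
                    break
        return tokens

    print(dict_tokenize("I live in New York"))   # ['i', 'live', 'in', 'new york']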

Rule Based Tokenization

In this technique a set of rules is created for the specific problem, and the tokenization is done based on those rules, for example rules based on the grammar of a particular language.

Regular Expression Tokenizer

This technique uses regular expressions to control how the text is split into tokens. Regular expressions can range from simple to complex and are sometimes difficult to comprehend. This technique should be preferred when the methods above do not serve the required purpose. It is a rule-based tokenizer.

Penn TreeBank Tokenization

A treebank is a corpus that provides semantic and syntactic annotation of a language, and the Penn Treebank is one of the largest treebanks ever published. This tokenization technique separates out punctuation and clitics (contracted forms that attach to other words, as in I’m and don’t) while keeping hyphenated words together.
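
NLTK provides an implementation of these rules in its TreebankWordTokenizer; a brief example:

    from nltk.tokenize import TreebankWordTokenizer

    tokenizer = TreebankWordTokenizer()
    print(tokenizer.tokenize("I'm a self-taught programmer, don't worry."))
    # ['I', "'m", 'a', 'self-taught', 'programmer', ',', 'do', "n't", 'worry', '.']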

Spacy Tokenizer

This is a modern tokenizer that is fast and easily customizable. It provides the flexibility to specify special tokens that should not be segmented, or that should be segmented using special rules; such special cases take precedence over the other tokenization operations. For example, if you want $ to always be treated as a separate token, that rule wins over the default behaviour. A hedged example is given below.
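
A hedged example using a blank English pipeline (so no trained model needs to be downloaded) and an invented special-case rule, following the pattern in spaCy's documentation:

    import spacy
    from spacy.symbols import ORTH

    nlp = spacy.blank("en")   # blank English pipeline: tokenizer only
    print([t.text for t in nlp("Let's tokenize this, quickly!")])

    # Special-case rule: always split the token "gimme" into "gim" + "me".
    nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
    print([t.text for t in nlp("gimme that book")])   # ['gim', 'me', 'that', 'book']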

Moses Tokenizer

This is an advanced tokenizer that was available before spaCy was introduced. It is essentially a collection of complex normalization and segmentation rules that works very well for a structured language like English.
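
The Moses tokenizer is available in Python through the sacremoses package (assuming it is installed); a brief sketch:

    from sacremoses import MosesTokenizer

    mt = MosesTokenizer(lang="en")
    print(mt.tokenize("Isn't tokenization fun? It costs $3.88!"))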

Subword Tokenization

Subword tokenization is very useful for applications where the pieces of a word carry meaning on their own. In this technique the most frequently used words are given unique ids, while less frequent words are split into subwords that best represent their meaning independently. For example, if the word few appears frequently in the text it will be assigned a unique id, whereas fewer and fewest, which are rarer, will be split into subwords such as few, er and est. This keeps the language model from having to learn fewer and fewest as entirely separate words, and it helps the model handle words it has never seen during training. The main types of subword tokenization are listed below; Byte-Pair Encoding and WordPiece are discussed briefly.

  • Byte-Pair Encoding (BPE)
  • WordPiece
  • Unigram Language Model
  • SentencePiece

Byte-Pair Encoding (BPE)

This technique is based on concepts from information theory and data compression. In the spirit of a variable-length code, BPE ends up using more symbols to represent less frequent words and fewer symbols to represent more frequently used words.

BPE is a bottom-up subword tokenization technique. The steps involved in the BPE algorithm are given below, followed by a small sketch in code.

  1. Start by splitting the input words into single Unicode characters; each of them corresponds to a symbol in the final vocabulary.
  2. Find the most frequently occurring pair of symbols in the current vocabulary.
  3. Add this pair to the vocabulary as a new symbol, so the vocabulary size increases by one.
  4. Repeat steps 2 and 3 until the defined number of tokens is built or no new pair of symbols occurs with the required frequency.
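
A minimal sketch of these steps in plain Python; the toy corpus and word frequencies below are made up for illustration:

    import re
    from collections import Counter

    def get_pair_counts(vocab):
        # Count how often each adjacent pair of symbols occurs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_pair(pair, vocab):
        # Replace every occurrence of the pair with a single merged symbol.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        merged = "".join(pair)
        return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

    # Toy corpus: each word is pre-split into characters, with an end-of-word marker </w>.
    vocab = {"f e w </w>": 6, "f e w e r </w>": 2, "f e w e s t </w>": 1}

    num_merges = 5   # the vocabulary budget (made up for this sketch)
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # step 2: most frequent pair
        vocab = merge_pair(best, vocab)    # step 3: add the merged symbol
        print(best, vocab)

Each printed line shows the pair that was merged; after a few merges the frequent word few becomes a single token, while the rarer fewer and fewest are still represented as sequences of subword units.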

WordPiece

WordPiece is similar to BPE except for the way a new token is added to the vocabulary. BPE merges the pair of symbols that occurs most frequently, while WordPiece also considers the frequency of the individual symbols and merges the pair with the highest value of the count below.

count(x, y) = frequency(x, y) / (frequency(x) × frequency(y))

The pair of symbols with the maximum count is the one merged into the vocabulary. Compared with BPE, this allows rarer but strongly associated tokens to be included in the vocabulary.
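
A tiny sketch with made-up frequencies shows how the score changes which pair gets merged first:

    # Made-up pair and symbol frequencies for two candidate merges.
    pair_freq   = {("t", "h"): 50, ("q", "u"): 10}
    symbol_freq = {"t": 500, "h": 200, "q": 10, "u": 40}

    def wordpiece_score(x, y):
        # frequency(x, y) / (frequency(x) * frequency(y))
        return pair_freq[(x, y)] / (symbol_freq[x] * symbol_freq[y])

    print(wordpiece_score("t", "h"))   # 50 / (500 * 200) = 0.0005
    print(wordpiece_score("q", "u"))   # 10 / (10 * 40)   = 0.025 -> merged first

Even though ('t', 'h') occurs more often, ('q', 'u') scores higher because q and u rarely appear apart, which is how WordPiece lets rarer but strongly associated pairs into the vocabulary.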

Tokenization with NLTK

NLTK (Natural Language Toolkit) is a Python library built to aid work in NLP.

word_tokenize and sent_tokenize are two very simple tokenizers available in NLTK.

word_tokenize basically returns the individual words from the string.
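
A brief example of both tokenizers (the Punkt sentence model needs a one-time download; it is named punkt, or punkt_tab in the newest NLTK releases):

    import nltk
    nltk.download("punkt")   # one-time download of the Punkt sentence models

    from nltk.tokenize import word_tokenize, sent_tokenize

    text = "It is raining. Take an umbrella, don't get wet!"
    print(word_tokenize(text))
    # ['It', 'is', 'raining', '.', 'Take', 'an', 'umbrella', ',', 'do', "n't", 'get', 'wet', '!']
    print(sent_tokenize(text))
    # ['It is raining.', "Take an umbrella, don't get wet!"]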

sent_tokenize splits the string into multiple sentences. It is derived from the PunktSentenceTokenizer class and uses a pre-trained model from tokenizers/punkt/english.pickle; pre-trained models for other languages can be selected as well. The PunktSentenceTokenizer can also be trained on our own data to make a custom sentence tokenizer.

from nltk.tokenize import PunktSentenceTokenizer
custom_sent_tokenizer = PunktSentenceTokenizer(train_data)  # train_data: raw text from our own domain

There are also some special-purpose tokenizers, such as the Multi-Word Expression tokenizer (MWETokenizer) and the Tweet tokenizer (TweetTokenizer).

The MWETokenizer takes a string that has already been divided into tokens and retokenizes it, merging multiword expressions into single tokens using a lexicon of MWEs.

Consider the sentence “He completed the task in spite of all the hurdles faced”

This is tokenized as [‘He’, ‘completed’, ‘the’, ‘task’, ‘in’, ‘spite’, ‘of’, ‘all’, ‘the’, ‘hurdles’, ‘faced’]

If we add ‘in spite of’ to the lexicon of the MWETokenizer, then when the above tokens are passed to it they are retokenized as [‘He’, ‘completed’, ‘the’, ‘task’, ‘in spite of’, ‘all’, ‘the’, ‘hurdles’, ‘faced’], as the code below shows.
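
A sketch with NLTK's MWETokenizer; the separator is set to a space so the output matches the tokens shown above (the default separator is an underscore):

    from nltk.tokenize import MWETokenizer, word_tokenize

    tokenizer = MWETokenizer([("in", "spite", "of")], separator=" ")
    tokens = word_tokenize("He completed the task in spite of all the hurdles faced")
    print(tokenizer.tokenize(tokens))
    # ['He', 'completed', 'the', 'task', 'in spite of', 'all', 'the', 'hurdles', 'faced']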

The TweetTokenizer addresses the specific needs of tweets, such as handling emojis and emoticons, hashtags and @-handles.
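
A short example based on the usage shown in NLTK's documentation; Twitter handles are stripped and elongated words are shortened:

    from nltk.tokenize import TweetTokenizer

    tt = TweetTokenizer(strip_handles=True, reduce_len=True)
    print(tt.tokenize("@remy: This is waaaaayyyy too much for you!!!!!!"))
    # [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']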

RegexpTokenizer

This tokenizer splits a sentence into tokens based on a regular expression. In the example below, the tokenizer forms tokens out of money expressions and any other non-whitespace sequences.
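
The pattern below is adapted from NLTK's documentation: it keeps word characters together, treats money expressions such as $3.88 as single tokens, and otherwise groups non-whitespace sequences.

    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r"\w+|\$[\d\.]+|\S+")
    print(tokenizer.tokenize("Good muffins cost $3.88 in New York. Please buy me two of them."))
    # ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.']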

Tokenization with Textblob

TextBlob is a Python library for processing textual data. Similar to other packages, it provides APIs for sentiment analysis, part-of-speech tagging, classification, translation and so on. Below is a code fragment that tokenizes text into sentences and words; notice in the output that the punctuation is not included among the word tokens.
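
A brief sketch (assuming the textblob package and its NLTK corpora are installed):

    from textblob import TextBlob

    blob = TextBlob("It is raining today. Don't forget your umbrella!")
    print(blob.sentences)   # a list of Sentence objects
    print(blob.words)
    # ['It', 'is', 'raining', 'today', 'Do', "n't", 'forget', 'your', 'umbrella']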

Tokenization with Gensim

Gensim is a library mainly used for topic modelling, and it also provides utility functions for tokenization.
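
A small example with gensim.utils.tokenize, which returns a generator of tokens:

    from gensim.utils import tokenize

    print(list(tokenize("It is raining today, isn't it?", lowercase=True)))
    # ['it', 'is', 'raining', 'today', 'isn', 't', 'it']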

Gensim also has a sentence tokenizer: split_sentences from its text cleaner module performs the sentence tokenization.
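
A sketch of split_sentences; note that it lives in gensim.summarization.textcleaner, a module that was removed in Gensim 4.0, so this assumes an older 3.x release:

    from gensim.summarization.textcleaner import split_sentences

    text = "It is raining. Take an umbrella. The roads are wet."
    print(split_sentences(text))
    # ['It is raining.', 'Take an umbrella.', 'The roads are wet.']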

Tokenization with Keras

Tokenization can also be done with the Keras library, using text_to_word_sequence from keras.preprocessing.text. Keras also provides a Tokenizer class: fit_on_texts builds a vocabulary from the words in the text, and texts_to_sequences then uses that vocabulary to turn text into sequences of word indices.
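
A hedged sketch, assuming a TensorFlow 2.x release where tf.keras.preprocessing.text is still available (newer Keras versions replace these utilities with the TextVectorization layer):

    from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence

    text = "It is raining. It is pouring."
    print(text_to_word_sequence(text))   # ['it', 'is', 'raining', 'it', 'is', 'pouring']

    # Build a word index over a small corpus, then map text to sequences of indices.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([text])
    print(tokenizer.word_index)                             # e.g. {'it': 1, 'is': 2, 'raining': 3, 'pouring': 4}
    print(tokenizer.texts_to_sequences(["It is raining"]))  # [[1, 2, 3]]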

Challenges in Tokenization

There are a lot of challenges in tokenization; here we discuss a few of the difficulties in segmenting words.

One of the biggest challenges in tokenization is determining the boundaries of words. In English, word boundaries are usually marked by spaces and sentence boundaries by punctuation marks, but this is not the case in all languages. In languages such as Chinese, Japanese and Korean, words are not clearly delimited by spaces, which makes it difficult to find word boundaries.

Even in English there are many symbols, such as £, $ and € followed by numerals to represent money, as well as scientific symbols such as µ and α, that create challenges in tokenization.

There are also many contracted forms in English, such as I’m (I am) and didn’t (did not), which need to be resolved, or else they cause problems in the later steps of NLP.

There is still a lot of research going on in this area of NLP, and we need to select the proper corpus and tokenization approach for the NLP task at hand.

References:

Gensim documentation: https://pypi.org/project/gensim/

NLTK documentation: https://www.nltk.org/

Keras Documentation: https://keras.io/

Authors

Srinivas Chakravarthy — srinivas.yeeda@gmail.com

Chandrasehkar Nagaraj — chandru4ni@gmail.com


Technical Product Manager at ABB Innovation Center, interested in Industrial Automation, Deep Learning, and Artificial Intelligence.