Feature Engineering with NLTK for NLP and Python

Alex Mitrani
Towards Data Science
4 min read · Oct 18, 2019

Herman Melville (1860), ‘Journal Up the Straits,’ Library of Congress

Last week I went over some of the basic functions of the Natural Language Toolkit (NLTK) for Natural Language Processing (NLP). I continued my journey into NLP by applying these basic functions to Herman Melville’s Moby Dick. The text is provided by Project Gutenberg; several of the books on that site are available through the Python NLTK package.

I detailed the cleaning process in the previous blog, where I had to clean various transcripts of two television series. That process will change depending on the task at hand. Today I am simply exploring the text with NLTK methods. The general process, sketched in code after the list below, is to:

  1. Load the text of the corpus or document
  2. Tokenize the raw text
  3. Remove stop words and punctuation
  4. Apply Stemming or Lemmatization
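
As a minimal sketch of those four steps, assuming the NLTK Gutenberg copy of Moby Dick, the English stopword list, and WordNetLemmatizer (the exact choices will vary by project):

    import nltk
    from nltk.corpus import gutenberg, stopwords
    from nltk.stem import WordNetLemmatizer

    # Assumes the relevant NLTK data has been downloaded, e.g.
    # nltk.download('gutenberg'), nltk.download('punkt'),
    # nltk.download('stopwords'), nltk.download('wordnet')

    # 1. Load the text (Moby Dick ships with NLTK's Gutenberg corpus)
    raw_text = gutenberg.raw('melville-moby_dick.txt')

    # 2. Tokenize the raw text
    tokens = nltk.word_tokenize(raw_text.lower())

    # 3. Remove stop words and punctuation
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

    # 4. Lemmatize each token (stemming with PorterStemmer works similarly)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]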

The text of the Project Gutenberg edition of Moby Dick is already fairly clean: only punctuation needed to be removed and all words needed to be lowercased. I also removed the prologue and preface from the text because they are not part of Melville’s original story.

The only punctuation mark that I kept in the document was the apostrophe. NLTK lets you supply a regex pattern that defines what counts as a single token. For instance, any word containing an apostrophe is treated as one token, i.e. d’ye is one token rather than two. The code for applying a regex pattern is nltk.regexp_tokenize(raw_text, pattern), where raw_text is a string representing a document and pattern is a string representing the regex pattern you wish to apply.
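
A minimal sketch of that call; the pattern below is an illustrative assumption, not necessarily the exact regex from the original post:

    import nltk
    from nltk.corpus import gutenberg

    raw_text = gutenberg.raw('melville-moby_dick.txt')

    # Keep apostrophes inside words so contractions like d'ye remain one token
    pattern = r"\w+(?:'\w+)*"
    tokens = nltk.regexp_tokenize(raw_text.lower(), pattern)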

Exploring the Text

A Bag of Words is a count of how many times a token (in this case a word) appears in text. This count can be document-wide, corpus-wide, or corpora-wide.

A visualization of the text data hierarchy

NLTK provides a simple way to create a bag of words without manually writing code that iterates through a list of tokens. First you create a FreqDist object from the token list. Then you can view the most common tokens with the .most_common() method.
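
A sketch of that frequency distribution, reusing the cleaned token list from the pipeline above (the name moby_dick_freqdist follows the usage later in this post):

    from nltk import FreqDist

    # `tokens` is the cleaned, stopword-free token list built earlier
    moby_dick_freqdist = FreqDist(tokens)

    # The ten most common tokens and their raw counts
    print(moby_dick_freqdist.most_common(10))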

To view the total number of unique words in the text after stopwords are removed, you can simply look at the length of the frequency distribution: len(moby_dick_freqdist). Each token’s count is a raw count, which is fine if you are only exploring one document. If you have several documents (a corpus), then normalizing the counts is necessary to compare documents of different lengths. This can be done by summing all of the counts in the frequency distribution and then dividing each token’s count by that total.

How to make a normalized frequency distribution object with NLTK
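
A sketch of what that normalization might look like (the dictionary name is an assumption):

    # Sum all raw counts in the frequency distribution
    total_count = sum(moby_dick_freqdist.values())

    # Divide each token's count by the total to get a relative frequency
    moby_dick_normalized = {token: count / total_count
                            for token, count in moby_dick_freqdist.items()}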

Bigrams, Ngrams, & the PMI Score

Each token (in the above case, each unique word) represents a dimension in the document. Moby Dick has 16,939 dimensions after stopwords are removed and before a target variable is added. To reduce the dimensionality of the document we can combine two or more words into a single token when they convey significant information together rather than separately. If we take a pair of words, this is called a bigram.

Let’s examine the sentence “I love hot dogs.” There are three pairs of words: (“I”, “love”), (“love”, “hot”), and (“hot”, “dogs”). The first two pairs do not convey any significant meaning together. However, the words hot and dogs together name a food item. If we treat this final pair as one token, the dimensionality of the sentence is reduced by one. This process is called creating bigrams. An ngram generalizes the idea: it treats any n consecutive words or characters as one token, so a bigram is simply an ngram with n = 2.
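
NLTK’s bigrams() and ngrams() helpers illustrate the idea on that example sentence:

    from nltk import bigrams, ngrams

    sentence = ["I", "love", "hot", "dogs"]

    # The three word pairs described above
    print(list(bigrams(sentence)))    # [('I', 'love'), ('love', 'hot'), ('hot', 'dogs')]

    # An ngram with n = 3 (a trigram)
    print(list(ngrams(sentence, 3)))  # [('I', 'love', 'hot'), ('love', 'hot', 'dogs')]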

NLTK provides methods for building bigrams. After importing NLTK you can store the scoring object nltk.collocations.BigramAssocMeasures() as a variable. You can then use NLTK’s collocation finder and scorer methods to view the bigrams and their normalized frequency scores.

How to make bigrams with NLTK
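
A sketch of that finder-and-scorer pattern (variable names other than bigram_measures are assumptions):

    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    bigram_measures = BigramAssocMeasures()

    # Collect candidate bigrams from the cleaned token list built earlier
    moby_dick_finder = BigramCollocationFinder.from_words(tokens)

    # Score each bigram by its normalized frequency in the text
    moby_dick_bigrams = moby_dick_finder.score_ngrams(bigram_measures.raw_freq)
    print(moby_dick_bigrams[:5])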

The score_ngrams() method returns a list of bigrams and their associated normalized frequencies.

The top five bigrams for Moby Dick

Not every pair of words throughout the token list will convey large amounts of information. NLTK provides the Pointwise Mutual Information (PMI) scorer, which assigns a statistical metric for comparing bigrams. The finder also lets you filter out token pairs that appear fewer than a minimum number of times. After building the finder you can apply a frequency filter, .apply_freq_filter(5), where five is the minimum number of times a bigram must appear. Finally you call the scorer method with the PMI measure, score_ngrams(bigram_measures.pmi).
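
A sketch of the filtering and PMI scoring, reusing the finder and measures from above:

    # Drop bigrams that appear fewer than five times
    moby_dick_finder.apply_freq_filter(5)

    # Rank the remaining bigrams by their PMI score
    moby_dick_pmi = moby_dick_finder.score_ngrams(bigram_measures.pmi)
    print(moby_dick_pmi[:5])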

The top five bigrams by PMI score for Moby Dick

Conclusion

NLTK has numerous powerful methods that allow us to evaluate text data with a few lines of code. Bigrams, ngrams, and PMI scores allow us to reduce the dimensionality of a corpus, which saves computation when we move on to more complex tasks. Once a document is cleaned, NLTK methods can be applied easily.
