
Extract keywords from documents, Unsupervised

A solution to extract keywords from documents automatically, implemented in Python with NLTK and scikit-learn.

Image by Andrew Zhu: an old Reuters news article

Imagine you have millions (maybe billions) of text documents in hand, whether they are customer support tickets, social media data, or community forum posts. There were no tags when the data was generated, and you are scratching your head over how to tag those random documents.

Manual tagging is impractical; a pre-defined tag list will soon be outdated; and hiring a vendor company to do the tagging is far too expensive.

You may say: why not use machine learning, such as neural network deep learning? But a neural network needs training data first, and it has to be training data that fits your dataset.

So, is there a solution for tagging documents that meets these requirements?

  1. Requires no pre-existing training data.
  2. Needs minimal manual interference and can run automatically.
  3. Captures new words and phrases automatically.

This article documents how I extracted keywords in Python, how it works, and the workarounds involved.

Note that the code in this article was run and tested in Jupyter Notebook. If you run a code block and get a missing-import error, the required package was imported in an earlier block, so run the blocks in order.

Core Idea

TF-IDF is a widely used algorithm that evaluates how relevant a word is to a document in a collection of documents.

In my previous article, Measure Text Weight using TF-IDF in Python and scikit-learn, I used a simple example to show how to calculate the TF-IDF value for every word in a document, both in pure Python and with the scikit-learn package.

Based on TF-IDF, unique and important words should have high TF-IDF values in a certain document. So, in theory, we should be able to leverage the text weight to extract the most important words of a document.

For example, a document that talks about scikit-learn should contain a high density of the keyword scikit-learn, while another document that talks about pandas should have a high TF-IDF value for pandas.
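Here is a quick toy sketch of that idea, using two made-up documents rather than the Reuters data, so you can see where the TF-IDF weight lands:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

toy_docs = [
    "scikit-learn provides estimators and scikit-learn provides transformers",
    "pandas provides dataframes and pandas provides series",
]
toy_vec   = TfidfVectorizer()  # the default tokenizer splits "scikit-learn" into "scikit" and "learn"
toy_tfidf = pd.DataFrame(toy_vec.fit_transform(toy_docs).toarray()
                         ,columns=toy_vec.get_feature_names())  # get_feature_names_out() on newer scikit-learn
# 'scikit'/'learn' score high only in the first row, 'pandas' only in the second,
# while the shared word 'provides' carries less weight in both
toy_tfidf.round(2)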

Steps to extract keywords from document corpus

Target documents

Since I can’t use my daily work database here, and I want to make sure you can run the keyword extraction sample code on your local machine with minimal effort, I found that the Reuters document corpus from NLTK is a good target for keyword extraction.

In case you are not familiar with the NLTK corpora, this article may help you get started with NLTK in less than an hour: Book Writing Pattern Analysis – Get start with NLTK and Python text analysis with a use case.

To download the Reuters corpus, run this Python code:

import nltk
nltk.download("reuters")

List all document ids from the corpus we just downloaded.

from nltk.corpus import reuters
reuters.fileids()

Check out one document’s content, and its category.

fileid = reuters.fileids()[202]
print(fileid,"n"
      ,reuters.raw(fileid),"n"
      ,reuters.categories(fileid),"n")

The Reuters corpus is organized into overlapping categories, so we can also get documents by category name. For the complete set of NLTK corpus operations, check out this wonderful chapter: Accessing Text Corpora and Lexical Resources
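For example, with the standard categorized-corpus API ('trade' is just one of the Reuters category names):

reuters.categories()[:10]                   # a few of the category names
reuters.fileids(categories=['trade'])[:5]   # documents tagged with 'trade'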

Build ignored words list

To save time and computation resources, we’d better exclude stop words like "am", "I", and "should". NLTK provides a good English stop words list.

from nltk.corpus import stopwords
ignored_words = list(stopwords.words('english'))

And you can also extend the list with your own stop words that are not included in the NLTK stop words list.

ignored_words.extend(
'''get see seeing seems back join 
excludes has have other that are likely like 
due since next 100 take based high day set ago still 
however long early much help sees would will say says said 
applying apply remark explain explaining
'''.split())

Build keywords vocabulary – single word

Before using TF-IDF to extract the keywords, I will build my own vocabulary list that includes both single words (e.g. "Python") and two-word phrases (e.g. "white house").

Here I will use CountVectorizer from scikit-learn to perform the single word extraction job.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
count_vec = CountVectorizer(
    ngram_range = (1,1)   #1
    ,stop_words = ignored_words
)
text_set     = [reuters.raw(fileid).lower() for fileid in reuters.fileids()] #2
tf_result    = count_vec.fit_transform(text_set)
tf_result_df = pd.DataFrame(tf_result.toarray()
                               ,columns=count_vec.get_feature_names()) #3
the_sum_s = tf_result_df.sum(axis=0) #4
the_sum_df = pd.DataFrame({ #5
    'keyword':the_sum_s.index
    ,'tf_sum':the_sum_s.values
})
the_sum_df = the_sum_df[
    the_sum_df['tf_sum']>2  #6
].sort_values(by=['tf_sum'],ascending=False)

Code #1 specifies that CountVectorizer will only count single words, a.k.a. 1-gram words. You may ask, why not use ngram_range = (1,2) and get both single words and bigrams at the same time? Because capturing bigrams at this point would pick up phrases like "they are", "I will", and "will be". Those are conjunction phrases and usually not keywords or key phrases for a document.

Another reason is to save memory: capturing bigram phrases at this stage would use a lot of memory because there are too many combinations.
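If you want to see this for yourself, here is a quick toy check (three made-up sentences, not part of the pipeline) of what ngram_range = (1,2) would pull in:

demo_vec = CountVectorizer(ngram_range=(1, 2))
demo_vec.fit(["they are nice people", "they are brothers", "i will be there"])
# the vocabulary now contains conjunction bigrams like 'they are' and 'will be'
demo_vec.get_feature_names()   # get_feature_names_out() on newer scikit-learn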

Code #2 uses a Python comprehension to load all Reuters articles in one line of code.

Code #3 transforms the count vector result into a readable Pandas DataFrame.

Code #4 produces a Series containing each keyword and its total appearance count in the corpus.

Code #5 turns the Series into a DataFrame for easier reading and data manipulation.

Code #6 keeps only the words that appear more than 2 times.

If you peek at the top 10 results with the_sum_df[:10], you will see the most frequently used words:

Top 10 most frequently used words

These words are the most frequent yet meaningless ones, and we can easily exclude them proportionally with Python slicing:

start_index     = int(len(the_sum_df)*0.01) # exclude the top 1%
my_word_df      = the_sum_df.iloc[start_index:]
my_word_df      = my_word_df[my_word_df['keyword'].str.len()>2]

The last line also removes words with 2 or fewer characters, like "vs" and "lt".

Note that I am using .iloc instead of .loc because the dataset has been reordered by TF (term frequency) value. iloc slices by row position (the sequence of rows), while loc slices by index label.
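A tiny made-up DataFrame (not the Reuters result) shows the difference:

demo = pd.DataFrame({'keyword':['mln','vs','japan'], 'tf_sum':[5, 9, 2]})
demo = demo.sort_values(by='tf_sum', ascending=False)  # index labels are now 1, 0, 2
demo.iloc[0]  # positional: the row that is physically first after sorting ('vs', 9)
demo.loc[0]   # label-based: the row whose original label is 0 ('mln', 5)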

Build keywords vocabulary – two-word phrases (bigrams)

To build a bigram phrase list, we need to consider not only how frequently two words appear together but also their relationship with neighboring words.

For example, the phrase "they are" appears together many times, but "they are" can only be followed by a limited set of words, like "they are brothers" or "they are nice people". Such phrases have high internal stickiness but low external connection flexibility.

External connection flexibility can usually be measured with information entropy: the higher the entropy value, the higher the possibility that the phrase is used together with many different words.

Phrases with both high internal stickiness (count frequency) and high external entropy are what our brains recognize as "common phrases", and these are what we want to add to our extraction vocabulary.
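To make the entropy idea concrete, here is a rough sketch on a few made-up tokens (it is not part of the NLTK-based pipeline below): it counts the words that appear right after a phrase and computes the Shannon entropy of that distribution.

from collections import Counter
from math import log2

def right_neighbor_entropy(tokens, w1, w2):
    # collect the words that appear right after the phrase (w1, w2)
    followers = [tokens[i+2] for i in range(len(tokens)-2)
                 if tokens[i] == w1 and tokens[i+1] == w2]
    counts = Counter(followers)
    total  = sum(counts.values())
    if total == 0:
        return 0.0
    # Shannon entropy of the follower distribution
    return -sum((c/total)*log2(c/total) for c in counts.values())

toy_tokens = ("they are nice people and they are brothers "
              "the white house said the white house denied "
              "the white house confirmed").split()
right_neighbor_entropy(toy_tokens, 'they', 'are')     # 2 distinct followers -> 1.0
right_neighbor_entropy(toy_tokens, 'white', 'house')  # 3 distinct followers -> ~1.58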

NLTK provides a ready-made solution to the bigram phrase extraction problem.

from nltk.collocations import BigramAssocMeasures
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import word_tokenize
text_set_words  = [word_tokenize(reuters.raw(fileid).lower()) 
                   for fileid in reuters.fileids()] #1
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_documents(text_set_words) #2
finder.apply_freq_filter(3) #3
finder.apply_word_filter(lambda w: 
                         len(w) < 3 
                         or len(w) > 15 
                         or w.lower() in ignored_words) #4
phrase_result = finder.nbest(bigram_measures.pmi, 20000) #5
colloc_strings = [w1+' '+w2 for w1,w2 in phrase_result] #6

Code #1: in this Python comprehension expression, I use word_tokenize to tokenize each document into a word list. The output will look like this:

[
    ['word1','word2',...,'wordn'], 
    ['word1','word2',...,'wordn'],
    ...
    ['word1','word2',...,'wordn']
]

Code #2 creates the bigram finder object from the tokenized document list. There is another constructor, from_words(), that processes a flat list of tokenized words.
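For reference, a minimal sketch of that from_words() alternative (note that it takes one flat token list, so bigrams can span document boundaries):

flat_words = [w for doc in text_set_words for w in doc]  # flatten all tokenized documents
finder_alt = BigramCollocationFinder.from_words(flat_words)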

Code #3 removes candidates that appear fewer than 3 times.

Code #4 removes candidates whose words are shorter than 3 characters or longer than 15, as well as those in the ignored_words list.

Code #5 uses the pmi function from BigramAssocMeasures to measure the likelihood that 2 words form a phrase. You can find out how it works in section 5.4 of Foundations of Statistical Natural Language Processing, and this link lists all the other measure functions and their source.
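In rough terms, PMI compares how often two words appear together with how often they would co-occur by chance. Here is a toy sketch of the idea behind the score, with made-up counts:

from math import log2

def pmi(count_w1, count_w2, count_w1_w2, total_words):
    # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
    p_w1    = count_w1 / total_words
    p_w2    = count_w2 / total_words
    p_w1_w2 = count_w1_w2 / total_words
    return log2(p_w1_w2 / (p_w1 * p_w2))

# a pair that almost always appears together gets a high score
pmi(count_w1=50, count_w2=60, count_w1_w2=45, total_words=1_000_000)   # ~13.9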

Code #6 transforms the result into a more readable format.

By replacing BigramAssocMeasures and BigramCollocationFinder with TrigramAssocMeasures and TrigramCollocationFinder, you get a 3-word phrase extractor. In the Reuters keyword extraction sample I will skip the 3-word phrases, but I post the sample code here in case you need it.

from nltk.collocations import TrigramAssocMeasures
from nltk.collocations import TrigramCollocationFinder
from nltk.tokenize import word_tokenize
text_set_words  = [word_tokenize(reuters.raw(fileid).lower()) 
                   for fileid in reuters.fileids()]
trigram_measures = TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_documents(text_set_words)
finder.apply_freq_filter(3)
finder.apply_word_filter(lambda w: 
                         len(w) < 3 
                         or len(w) > 15 
                         or w.lower() in ignored_words)
tri_phrase_result = finder.nbest(trigram_measures.pmi, 1000)
tri_colloc_strings = [w1+' '+w2+' '+w3 for w1,w2,w3 in tri_phrase_result] 
tri_colloc_strings[:10]

The exciting moment: use TF-IDF to measure keyword weight

Now, let’s combine the single words and 2-words phrases together to build the Reuters customized vocabulary list.

my_vocabulary = []
my_vocabulary.extend(my_word_df['keyword'].tolist()) 
my_vocabulary.extend(colloc_strings)

Let’s start the engine. Note that you should find a machine with at least 16GB of RAM to run the code: the TF-IDF calculation will take a while and may consume a large chunk of your memory.

from sklearn.feature_extraction.text import TfidfVectorizer
vec          = TfidfVectorizer(
                    analyzer     ='word'
                    ,ngram_range =(1, 2)
                    ,vocabulary  =my_vocabulary)
text_set     = [reuters.raw(fileid) for fileid in reuters.fileids()]
tf_idf       = vec.fit_transform(text_set)
result_tfidf = pd.DataFrame(tf_idf.toarray()
                            , columns=vec.get_feature_names()) #1

After transforming the result set into a DataFrame in code #1, result_tfidf holds all keywords’ TF-IDF values:
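A quick optional sanity check (not shown in the original run) is to peek at the shape and the first few rows: one row per Reuters document, one column per vocabulary entry.

result_tfidf.shape   # (number of documents, size of my_vocabulary)
result_tfidf.head()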

Check out the result

Let’s check out one of the articles and compare it with the keywords extracted by the above extractor to verify the effectiveness.

Output one of the original documents by specifying the fileid index.

file_index= 202 # change number to check different articles
fileid = reuters.fileids()[file_index]
print(fileid,"n"
        ,reuters.raw(fileid),"n"
        ,reuters.categories(fileid),"n")

This returns the fileid, the raw content, and its category. (Hmm, many years ago, the U.S. fought a tariff war with Japan.)

test/15223 
 WHITE HOUSE SAYS JAPANESE TARRIFFS LIKELY
  The White House said high U.S.
  Tariffs on Japanese electronic goods would likely be imposed as
  scheduled on April 17, despite an all-out effort by Japan to
  avoid them.
      Presidential spokesman Marlin Fitzwater made the remark one
  day before U.S. And Japanese officials are to meet under the
  emergency provisions of a July 1986 semiconductor pact to
  discuss trade and the punitive tariffs.
      Fitzwater said: "I would say Japan is applying the
  full-court press...They certainly are putting both feet forward
  in terms of explaining their position." But he added that "all
  indications are they (the tariffs) will take effect."

 ['trade']

Print out the top 10 keywords from our freshly brewed result_tfidf DataFrame object.

test_tfidf_row = result_tfidf.loc[file_index]
keywords_df = pd.DataFrame({
    'keyword':test_tfidf_row.index,
    'tf-idf':test_tfidf_row.values
})
keywords_df = keywords_df[
    keywords_df['tf-idf'] >0
].sort_values(by=['tf-idf'],ascending=False)
keywords_df[:10]

Top 10 keywords:

It looks like white and house here duplicate white house. We need to remove single words that already appear in a 2-word phrase.

bigram_words = [item.split() 
                    for item in keywords_df['keyword'].tolist() 
                    if len(item.split())==2]
bigram_words_set = set(subitem 
                        for item in bigram_words 
                        for subitem in item) 
keywords_df_new = keywords_df[~keywords_df['keyword'].isin(bigram_words_set)]

The above code first builds a word set containing all words used in 2-word phrases, then filters out the single words already used in a 2-word phrase with ~xxxx.isin(xxxx).

Other considerations

The larger the text corpus you have, the better TF-IDF will perform at extracting keywords. The Reuters corpus contains 10,788 articles, and the results show that it works. I believe this solution will work even better on larger text databases.

The above code runs in less than 2 minutes on my MacBook Air M1, which means a daily refresh of the result set is workable.

If you have hundreds of GB or even TB of data, you may need to consider rewriting the logic in C/C++ or Go, and perhaps also leverage the power of a GPU to improve performance.

The solution described in this article is far from perfect. For example, I didn’t filter out verbs and adjectives. However, the backbone of the solution can be extended to other languages.

The extracted keywords

Let’s print out the final result again.

keywords_df_new[:10]

tariffs gets the highest TF-IDF value, and the rest of the keywords look like good representations of this Reuters article. Goal reached!

