
PDFs to Word Cloud in 3 steps

A quick guide to generating a word cloud from multiple documents

A word cloud is a data visualization that depicts words in an image. The size of each word represents its weight: the larger the word, the higher its importance. In most cases, a word's frequency of occurrence is used as its weight.

Word clouds have various applications in data science. They are typically used during the exploratory data analysis phase of an NLP task. A word cloud gives a quick overview of the frequently occurring terms in a large text corpus and indicates each term's usage relative to the other terms in the corpus. It is applied in sentiment analysis tasks, where a quick summary of the overall sentiment can be visualized, and in topic modelling tasks, where the topic terms can be quickly spotted in the word cloud.
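As a minimal sketch of what "frequency as weight" means, the standard-library Counter can map each word to its number of occurrences. The sample sentence is made up for illustration.

```python
from collections import Counter

# Made-up sample text, split on whitespace for simplicity.
text = "deep learning and deep q-network learning on stock data"

# Each word's weight is simply how often it occurs.
weights = Counter(text.split())

# The two heaviest words would be drawn largest in a word cloud.
print(weights.most_common(2))  # [('deep', 2), ('learning', 2)]
```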

Here is an example of a word cloud.

Example of a word cloud (Image by Author)

A word cloud gives a quick summary of the text corpus from which it is created. Looking at the word cloud above, it is easy to see that the corpus is about reinforcement learning, in particular the deep Q-network method applied to a stock dataset.

Stuck with several PDF files?? (Image by Author)

In this article we will learn how to read multiple PDF documents, split them into keywords, and create a word cloud from the most commonly used keywords.

LET’S GET STARTED

There are three simple and straightforward steps involved in this process.

  • Step 1: Read the file
  • Step 2: Identify the keywords
  • Step 3: Create the word cloud

REQUIRED PYTHON PACKAGES

Before getting started with the implementation, here is the list of packages that need to be installed.

pip install PyPDF2
pip install textract
pip install wordcloud
pip install nltk

And here is why we need these packages:

  • PyPDF2 : to parse the PDF file (method 2)
  • textract : to parse the PDF file (method 1)
  • wordcloud : to create the word cloud image
  • nltk : to get the stopwords and to tokenize the string into keywords
  • collections : to get the mapping of keywords to their occurrence counts (part of the Python standard library, so no installation is needed)

STEP 1: READ THE FILE

There are multiple ways to read the contents of a PDF file into a string. Here we will look at two such methods.

METHOD 1: Using the textract package

The textract package supports multiple file types besides PDF. Refer to the official documentation for the other supported file types.

Here we will use the basic parser to process the contents of the file and return them as a string. By default the return type is a byte string, which needs to be decoded to obtain a regular string.
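A minimal sketch of this method could look like the following. The function name read_pdf_with_textract and the encoding parameter are illustrative assumptions, not taken from the original notebook; textract.process is the library call that returns the byte string.

```python
from pathlib import Path


def read_pdf_with_textract(filepath, encoding="utf-8"):
    """Return the text content of the file at `filepath` as a string.

    Hypothetical helper name; assumes the textract package is installed.
    """
    path = Path(filepath)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")

    # Imported here so this module can be loaded even where textract
    # is not installed; the import only happens when a file is read.
    import textract

    raw_bytes = textract.process(str(path))  # returns a byte string
    # Decode to a regular string, skipping bytes that cannot be decoded.
    return raw_bytes.decode(encoding, errors="ignore")
```

Calling read_pdf_with_textract("sample_files/some_paper.pdf") (a placeholder path) would return the document text as a single string.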

METHOD 2: Using the PyPDF2 package

PyPDF2 is an open-source Python package designed exclusively for PDF-specific functionality such as reading, splitting, merging, cropping and transforming the pages of PDF files. Refer to the PyPDF2 documentation for more details on the package.

The approach is to read the PDF from the file path, extract the text of each page, and return the contents as a single string.

Either of the above packages can be used to extract the contents of the PDF file. Personally, I found the results from textract better than those from PyPDF2 and have therefore set it as the default.
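A minimal sketch of the PyPDF2 route is shown here. The function name read_pdf_with_pypdf2 is an assumption made for illustration, and the code assumes a PyPDF2 release that provides PdfReader (older releases used PdfFileReader instead).

```python
from pathlib import Path


def read_pdf_with_pypdf2(filepath):
    """Return the text of every page in the PDF, joined by newlines.

    Hypothetical helper name; assumes a PyPDF2 version with PdfReader.
    """
    path = Path(filepath)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")

    # Imported here so this module can be loaded without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(path))
    # Pages without a text layer (e.g. scanned images) may yield no text,
    # so fall back to an empty string for them.
    page_texts = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(page_texts)
```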

STEP 2: IDENTIFY THE KEYWORDS

The next step after reading the contents of the PDF file is to split it into keywords. To achieve this we use word_tokenize from the nltk package. It accepts a string as input and divides it into a list of substrings.

Punctuation is removed from the final list of keywords. Commonly occurring words (stopwords) defined in the nltk package are removed as well. If additional words need to be ignored, they can be passed as a list to this function.

Finally, the function returns a list of keywords. By default, numbers are removed from the list and the words are converted to lower case. This can be overridden through the parameter settings.
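The keyword step can be sketched as follows. To keep the example runnable without downloading nltk data, the tokenizer and the stopword list are passed in as parameters, and the defaults are deliberately simplistic stand-ins (a regular expression and a tiny stopword set), not nltk's behaviour. All function and parameter names here are assumptions for illustration.

```python
import re
import string

# Tiny stand-in stopword set; in practice this would be
# nltk.corpus.stopwords.words("english").
BASIC_STOPWORDS = {"a", "an", "and", "the", "of", "on", "in", "is", "to", "for"}


def simple_tokenize(text):
    """Stand-in tokenizer: alphanumeric runs, keeping inner hyphens/apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", text)


def extract_keywords(text, ignore_words=(), tokenizer=simple_tokenize,
                     stopwords=BASIC_STOPWORDS, remove_numbers=True,
                     lowercase=True):
    """Split `text` into keywords, dropping punctuation, stopwords and numbers."""
    tokens = tokenizer(text)
    if lowercase:
        tokens = [token.lower() for token in tokens]

    # Stopwords and user-supplied words are compared case-insensitively.
    excluded = {w.lower() for w in stopwords} | {w.lower() for w in ignore_words}

    keywords = []
    for token in tokens:
        if all(ch in string.punctuation for ch in token):
            continue  # token consists only of punctuation (or is empty)
        if remove_numbers and token.isdigit():
            continue  # purely numeric token
        if token.lower() in excluded:
            continue  # stopword or explicitly ignored word
        keywords.append(token)
    return keywords


print(extract_keywords("The agent trades 100 stocks, and the agent learns."))
# ['agent', 'trades', 'stocks', 'agent', 'learns']
```

To follow the article's approach, pass tokenizer=nltk.word_tokenize and stopwords=nltk.corpus.stopwords.words("english") after fetching the required tokenizer and stopword data with nltk.download.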

STEP 3: CREATE THE WORD CLOUD

The last step after extracting the keywords is to draw a word cloud based on the frequency of occurrence of each word. The word that occurs more frequently will have a larger font size in the word cloud, highlighting its significance.

Here we use the WordCloud class provided in the wordcloud package. The number of words shown in the word cloud, along with its size and color settings, is configurable.
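A minimal sketch of this step is given below, assuming the wordcloud package is installed. The helper names top_frequencies and create_word_cloud and their default values are illustrative assumptions; WordCloud, generate_from_frequencies and to_file are the library's own.

```python
from collections import Counter


def top_frequencies(keywords, max_words=200):
    """Map each of the `max_words` most frequent keywords to its count."""
    return dict(Counter(keywords).most_common(max_words))


def create_word_cloud(keywords, max_words=200, width=800, height=400,
                      background_color="white", output_path=None):
    """Build a WordCloud from a keyword list; optionally save it as an image."""
    # Imported here so the counting helper works without wordcloud installed.
    from wordcloud import WordCloud

    frequencies = top_frequencies(keywords, max_words)
    cloud = WordCloud(width=width, height=height, max_words=max_words,
                      background_color=background_color)
    # Font size scales with each word's count; an empty mapping raises an error.
    cloud.generate_from_frequencies(frequencies)
    if output_path is not None:
        cloud.to_file(output_path)
    return cloud
```

In a notebook, the returned object can be displayed with matplotlib, for example plt.imshow(cloud) followed by plt.axis("off").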

PUTTING THEM TOGETHER

Once the three functions are available, we are ready to generate a word cloud for a PDF file with three lines of code.

WORD CLOUD FOR A SINGLE FILE

Word cloud with the top 200 words in the PDF file (Image by Author)

WORD CLOUD FOR ALL FILES IN A FOLDER

The same code can be extended to create a word cloud for multiple documents within a folder. This will give an overview of the commonly occurring words across documents.

In the example below, the output is generated from two sample PDF files placed in the sample_files folder (refer to the git repo to access the files).

A consolidated list of keywords from all the PDF files in the folder is created, and this list is used to generate the word cloud.
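The consolidation step can be sketched as below. The function name collect_keywords_from_folder is an assumption, and read_pdf and extract_keywords stand for whichever reader and keyword functions you use; they are passed in as arguments so the sketch does not depend on a particular PDF parser.

```python
from pathlib import Path


def collect_keywords_from_folder(folder, read_pdf, extract_keywords):
    """Return one consolidated keyword list for every .pdf file in `folder`.

    `read_pdf` takes a path and returns the file's text;
    `extract_keywords` takes that text and returns a list of keywords.
    """
    all_keywords = []
    # Sorted for a reproducible order. The pattern is case-sensitive on
    # most Linux systems, so files ending in ".PDF" would be skipped.
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        text = read_pdf(pdf_path)
        all_keywords.extend(extract_keywords(text))
    return all_keywords
```

The returned list can then be handed to the word cloud step in the same way as the keyword list of a single file.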

Word cloud with the top 100 words from all documents in a folder (Image by Author)

SOURCE CODE

Download the Jupyter notebook from the following link and use it to generate your own word cloud: wordcloud_from_docs

CONCLUSION

This article explains one of the many possible approaches to reading multiple PDF files and creating a word cloud from them.

Other approaches can be explored for the same task, such as using different PDF parsers, using TF-IDF instead of a counter, using dictionaries or dataframes with a count mapping instead of a list of all keywords, or generating fancy word cloud images. Happy exploring!
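As one illustration of the TF-IDF alternative, the sketch below computes a smoothed TF-IDF weight per term across several keyword lists using only the standard library. The function name tfidf_weights and the particular smoothing formula are assumptions chosen for this example; other TF-IDF variants exist. The resulting mapping could be passed to WordCloud's generate_from_frequencies in place of raw counts.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Sum a smoothed TF-IDF score per term over all documents.

    `documents` is a list of keyword lists, one list per document.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for keywords in documents:
        doc_freq.update(set(keywords))

    weights = Counter()
    for keywords in documents:
        if not keywords:
            continue  # skip empty documents to avoid division by zero
        for term, count in Counter(keywords).items():
            tf = count / len(keywords)
            # Smoothed IDF: terms found in every document still keep a
            # weight of 1 instead of dropping to zero.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            weights[term] += tf * idf
    return dict(weights)
```

With this weighting, a term concentrated in one document can outrank a term that is spread evenly across all of them, which a plain counter would not show.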

References:

PDF documents used as examples to create the word clouds:

Yang, Q., Zhao, Y., Huang, H. and Zheng, Z., 2022. Fusing Blockchain and AI with Metaverse: A Survey. arXiv preprint arXiv:2201.03201.

Gadekallu, T.R., Huynh-The, T., Wang, W., Yenduri, G., Ranaweera, P., Pham, Q.V., da Costa, D.B. and Liyanage, M., 2022. Blockchain for the Metaverse: A Review. arXiv preprint arXiv:2203.09738.

Bajpai, S., 2021. Application of deep reinforcement learning for Indian stock trading automation. arXiv preprint arXiv:2106.16088.

