A word cloud is a data visualization tool that depicts words in an image. The size of each word varies with its weight: the larger the word, the higher its importance. In most cases, the frequency with which a word occurs is used as its weight.
Word clouds have various applications in the data science field. They are normally used during the exploratory data analysis phase of an NLP task. A word cloud gives a quick overview of the frequently occurring terms in a large text corpus and also indicates a term's usage relative to other terms in the corpus. It is used in sentiment analysis tasks, where a quick summary of the overall sentiment can be visualized, and in topic modelling tasks, where the topic terms can be quickly spotted in the word cloud.
Here is an example of a word cloud.

A word cloud gives a quick summary of the text corpus from which it is created. Looking at the above word cloud, it is easy to identify that the text corpus is about using reinforcement learning, in particular the deep Q-network method, on a stock dataset.

In this article we will learn how to read multiple PDF documents, split them into keywords and create a word cloud with the most commonly used keywords.
LET’S GET STARTED
There are three simple and straightforward steps involved in this process.
- Step 1: Read the file
- Step 2: Identify the keywords
- Step 3: Create the word cloud
REQUIRED PYTHON PACKAGES
Before getting started with the implementation, here is the list of packages that will be required.
pip install PyPDF2
pip install textract
pip install wordcloud
pip install nltk
And here is why we need these packages:
- PyPDF2 : to parse the PDF file (method 2)
- textract : to parse the PDF file (method 1)
- wordcloud : to create the word cloud image
- nltk : to get the stopwords and to tokenize the string into keywords
- collections : part of the Python standard library (no installation needed); its Counter maps each keyword to its occurrence count
STEP 1: READ THE FILE
There are multiple ways to read the contents of a PDF file into a string. Here we will look at two such methods.
METHOD 1: Using the textract package
The textract package supports multiple file types apart from PDF. Refer to the official documentation to understand the other supported file types.
Here we will use the basic parser to process the contents of the file and return them as a string. By default the return type is a byte string, so it needs to be decoded to obtain a regular string.
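A minimal sketch of this step is shown below; the function name extract_text_textract and the UTF-8 decoding are assumptions, not the exact code from the notebook.

import textract

def extract_text_textract(filepath):
    # textract.process returns the document contents as a byte string;
    # decode it (UTF-8 assumed) to get a regular Python string.
    return textract.process(filepath).decode("utf-8")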
METHOD 2: Using the PyPDF2 package
PyPDF2 is an open-source Python package designed exclusively to perform PDF-specific functionalities like reading, splitting, merging, cropping and transforming the pages of PDF files. Refer to the official PyPDF2 documentation for more details on the package.
The code snippet below reads the PDF from the file path and returns its contents as a string.
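A sketch of what this reader could look like, assuming a recent PyPDF2 release that exposes PdfReader and extract_text (older releases use PdfFileReader and extractText); the function name extract_text_pypdf2 is illustrative.

from PyPDF2 import PdfReader

def extract_text_pypdf2(filepath):
    reader = PdfReader(filepath)
    # extract_text() can return None for pages with no extractable text,
    # so substitute an empty string before joining.
    return " ".join(page.extract_text() or "" for page in reader.pages)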
Either of the above packages can be used to extract the contents of the PDF file. Personally, I felt that the results from the textract package were better than the results from PyPDF2 and hence have set it as the default.
STEP 2: IDENTIFY THE KEYWORDS
The next step after reading the contents of the PDF file is to split them into keywords. To achieve this we use word_tokenize from the nltk package. It accepts a string as input and divides it into a list of substrings.
Punctuation is removed from the final list of keywords, and commonly occurring words (stopwords) defined in the nltk package are removed as well. In addition, if there are extra words that need to be ignored, they can be passed to the function as a list.
Finally, a list of keywords is returned from the function. By default, numbers are removed from the list and the words are converted to lower case; this behaviour can be overridden through the function's parameters.
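A possible implementation of this step is sketched below; the function name extract_keywords and its parameter names (ignore_words, drop_numbers, lowercase) are assumptions chosen to mirror the behaviour described above.

import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer model and the stopword list.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def extract_keywords(text, ignore_words=None, drop_numbers=True, lowercase=True):
    ignore = set(stopwords.words("english"))
    if ignore_words:
        ignore.update(word.lower() for word in ignore_words)

    keywords = []
    for token in word_tokenize(text):
        if lowercase:
            token = token.lower()
        if all(ch in string.punctuation for ch in token):
            continue  # skip pure punctuation tokens
        if drop_numbers and token.isnumeric():
            continue  # skip standalone numbers
        if token.lower() in ignore:
            continue  # skip stopwords and user-supplied ignore words
        keywords.append(token)
    return keywords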
STEP 3: CREATE THE WORD CLOUD
The last step after extracting the keywords is to draw a word cloud based on the frequency of occurrence of each word. Words that occur more frequently get a larger font size in the word cloud, highlighting their significance.
Here we use the WordCloud class provided by the wordcloud package. The number of words to be shown in the word cloud, along with its size and color settings, is configurable.
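A sketch of this step, where Counter from the standard-library collections module maps each keyword to its count and generate_from_frequencies builds the cloud from that mapping; the function name create_word_cloud and the width, height, background_color and max_words values are illustrative defaults.

from collections import Counter
from wordcloud import WordCloud

def create_word_cloud(keywords, max_words=100, width=800, height=400):
    # Map each keyword to its occurrence count.
    frequencies = Counter(keywords)
    cloud = WordCloud(width=width, height=height,
                      background_color="white",
                      max_words=max_words)
    cloud.generate_from_frequencies(frequencies)
    return cloud  # call .to_image().show() or .to_file(...) on the result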
PUTTING THEM TOGETHER
Once the three functions are available, we are ready to generate a word cloud for a PDF file in three lines of code.
WORD CLOUD FOR A SINGLE FILE
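A sketch of the three-line pipeline, chaining the hypothetical helpers defined above; the file path sample_files/sample.pdf and the output name wordcloud.png are placeholders.

text = extract_text_textract("sample_files/sample.pdf")    # step 1: read the file
keywords = extract_keywords(text)                          # step 2: identify the keywords
create_word_cloud(keywords).to_file("wordcloud.png")       # step 3: create the word cloud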

WORD CLOUD FOR ALL FILES IN A FOLDER
The same code can be extended to create a word cloud for multiple documents within a folder. This will give an overview of the commonly occurring words across documents.
In the example below, the output is generated from two sample PDF files placed in the sample_files folder (refer to the git repo to access the files).
A consolidated list of the keywords from all the PDF files inside the folder is created, and this keyword list is used to generate the word cloud.
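A sketch of this extension, again using the hypothetical helpers from earlier; the folder name sample_files and the output file name are placeholders.

import glob

all_keywords = []
for filepath in glob.glob("sample_files/*.pdf"):   # every PDF in the folder
    text = extract_text_textract(filepath)
    all_keywords.extend(extract_keywords(text))

create_word_cloud(all_keywords).to_file("wordcloud_all_files.png")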

SOURCE CODE
Download the Jupyter notebook from the following link and use it to generate your own wordcloud: wordcloud_from_docs
CONCLUSION
This article explains one of the many possible approaches to reading multiple PDF files and creating a word cloud from them.
Other approaches can be explored to perform the same task, such as using different PDF parsers, using TF-IDF instead of a simple counter, using dictionaries or dataframes with count mappings instead of a full keyword list, generating fancier word cloud images, etc. Happy exploring!
References:
PDF documents used as examples to create the word clouds:
Yang, Q., Zhao, Y., Huang, H. and Zheng, Z., 2022. Fusing Blockchain and AI with Metaverse: A Survey. arXiv preprint arXiv:2201.03201.
Gadekallu, T.R., Huynh-The, T., Wang, W., Yenduri, G., Ranaweera, P., Pham, Q.V., da Costa, D.B. and Liyanage, M., 2022. Blockchain for the Metaverse: A Review. arXiv preprint arXiv:2203.09738.
Bajpai, S., 2021. Application of deep reinforcement learning for Indian stock trading automation. arXiv preprint arXiv:2106.16088.