
Text Exploration with Python

Explore the interesting insights hidden in text with n-gram Word Clouds

A Step-by-Step Guide on Text Exploration

Photo by Jennifer Griffin on Unsplash

Text exploration has always been my favourite process in Text Analytics. I am always thrilled when I find something interesting. I did two text analytics projects during my studies, and in both I used Topic Modelling to study the topics discussed in the long texts.

I never read the text before doing the analysis, and I should not have to, because the text is extremely long; it is simply not practical to read it all just to understand the topics. So I performed text exploration, which gave me a rough idea of the text contents and let me know what to expect from the Topic Modelling model.

Below is an example of how I used Word Clouds to explore the text in one of my school projects. You may skip to the next part for the script and explanation.



Text Exploration in My School Project

Exploring text with a Word Cloud is a perfect and interesting way to learn what is frequently discussed in it. For example, the dating apps dataset from Kaggle contains the users’ answers to the 9 questions below [2]:

  1. About Me / Self-summary
  2. Current Goals / Aspirations
  3. My Golden Rule / My traits
  4. I could probably beat you at / Talent
  5. The last show I binged / Hobbies
  6. A perfect day / Moments
  7. I value / Needs
  8. The most private thing I’m willing to admit / Secrets
  9. What I’m looking for / Dating

The questions are shown in a different sequence for every user, so we cannot extract the answers to specific questions only.

The following is the Word Cloud generated from their answers.

Word Cloud Generated from Text. Image by Author.

The Word Cloud above was generated from the cleaned text (refer to the steps to process and clean the text in this article, Text Processing in Python). What caught my eye in the Word Cloud were "sunshine spotless mind" and "eternal sunshine spotless". These two trigrams seemed to be part of a phrase and I had never seen these terms before, so I googled them to see if they meant something.

Eternal Sunshine of the Spotless Mind.

It turns out to be the movie "Eternal Sunshine of the Spotless Mind". This is probably an answer to question 5, "The last show I binged / Hobbies". Another title that was frequently mentioned is probably "The Big Bang Theory", which shows up as "big bang theory" in the Word Cloud.

What can we see from here?

"Eternal Sunshine of the Spotless Mind" and "The Big Bang Theory" are the two most popular drama series at the time.

Is this insight significant? We cannot tell before further analysis is performed, but we can form a hypothesis from it. For instance, users may tend to accept a match that watches the same series as them, since they share this information.

I wished to see more of the users’ expectations of their match. Hence, I extracted every sentence that contains the word "You" (you may see how I extracted the sentences in this article, Text Extraction using Regular Expression (Python)).
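For reference, the extraction boils down to a simple regular-expression filter. Below is a minimal sketch of the idea, not the original project script; the answers list is a made-up placeholder.

import re

# minimal sketch: keep only the sentences that mention "you"
# 'answers' is a hypothetical stand-in for the users' free-text answers
answers = [
    "I love hiking. I hope you enjoy the outdoors too.",
    "Looking for someone funny. You should message me if you like sci-fi.",
]

you_sentences = []
for answer in answers:
    # split the answer roughly into sentences, then keep those containing "you"
    for sentence in re.split(r'(?<=[.!?])\s+', answer):
        if re.search(r'\byou\b', sentence, flags=re.IGNORECASE):
            you_sentences.append(sentence)

print(you_sentences)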

Word Cloud Generated from Extracted Text. Image Created by Author.

Now we can see what is related to "You". Perhaps we can ignore YouTube.com? Anyway, from the Word Cloud above, we can make a hypothesis that the users are looking forward to meeting new people who have a great sense of humour and want a long-term relationship. Again, the information from a Word Cloud may not be complete as it only shows the frequent phrases; we can further verify it with Topic Modelling or other Natural Language Processing techniques.

Interesting?


Well, that’s all from my past project. Now let us get started with the process of text exploration with Python.


Introduction to the Datasets

I found an interesting dataset on Kaggle recently and thought it would be worth exploring.

The text data I found is the Medium Articles dataset from Kaggle, which contains the author, claps, reading time, link, title and text of 337 articles related to Machine Learning, AI and Data Science.

In the following text exploration, I will use only the titles of the articles to study the popular topics among the authors.


1. Import Libraries

import re
import pandas as pd
# text processing
import nltk
from nltk.tokenize import WordPunctTokenizer
nltk.download('stopwords')
from nltk.corpus import stopwords
# needed for the nltk.pos_tag function
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
# n-gram frequency counting for the bigram and trigram Word Clouds
from sklearn.feature_extraction.text import CountVectorizer
# visualization
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from wordcloud import WordCloud

2. Import Data

df = pd.read_csv("articles.csv")
print(df.shape)
print(df.columns)
df.head()
DataFrame. Image by Author.

3. Text Processing

The process here is similar to the process in my previous article, Text Processing in Python. Hence, I will paste the script here but skip the explanations of the repeated parts to avoid redundancy.

a. Tokenization

Split the titles into a list of tokens.

# change the DataFrame column into a list of titles
title = df['title'].values
# join all titles into one string for tokenization
title_text = " ".join(str(t) for t in title)

word_punct_token = WordPunctTokenizer().tokenize(title_text)

There are 4099 tokens from the titles.
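If you want to verify the count on your own copy of the dataset, a quick length check is enough (the exact number may differ slightly between dataset versions):

# quick sanity check on the tokenization result
print(len(word_punct_token))   # 4099 on my copy of the dataset
print(word_punct_token[:10])   # peek at the first few tokens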

b. Normalization

Remove unwanted tokens.

clean_token = []
for token in word_punct_token:
    new_token = re.sub(r'[^a-zA-Z]+', '', token) # remove characters that are not alphabetical
    if new_token != "" and len(new_token) >= 2: # remove empty and single-character values
        vowels = len([v for v in new_token if v in "aeiou"])
        if vowels != 0: # remove tokens that only contain consonants
            new_token = new_token.lower() # change to lower case
            clean_token.append(new_token)
# Get the list of stop words
stop_words = stopwords.words('english')
stop_words.extend(["could","though","would","also","us"])
# Remove the stopwords from the list of tokens
tokens = [x for x in clean_token if x not in stop_words]

There are 2214 tokens left after we removed the non-alphabetical values, single-character tokens, tokens that only contain consonants and stopwords that do not carry much insight. We removed almost half of the tokens.
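A quick comparison of the list lengths shows how much this step trimmed (again, the exact numbers depend on the dataset version):

# compare the token counts before and after normalization
print(len(word_punct_token))   # raw tokens, e.g. 4099
print(len(clean_token))        # after removing non-alphabetical and noisy tokens
print(len(tokens))             # after removing stopwords, e.g. 2214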

The dataset used in this example is small, hence removing these tokens does not improve the speed of the model significantly, but it will be crucial when we are analyzing a gigantic dataset.

c. POS Tag and Lemmatization

Label the Part-of-Speech of each word and return the word to its base form accordingly.

# POS tag every token and save into a DataFrame
data_tagset = nltk.pos_tag(tokens)
df_tagset = pd.DataFrame(data_tagset, columns=['Word', 'Tag'])
# to focus on nouns, adjectives and verbs
tagset_allowed = ['NN','NNS','NNP','NNPS','JJ','JJR','JJS','VB','VBD','VBG','VBN','VBP','VBZ']
new_tagset = df_tagset.loc[df_tagset['Tag'].isin(tagset_allowed)]
text = [str(x) for x in new_tagset['Word']]
tag = [x for x in new_tagset['Tag'] if x != '']

There are more than 30 types of POS tags, but the tags that carry meaningful insight mostly fall into the categories of Nouns, Adjectives and Verbs. So, we can filter out the other tags from our model.
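If you are curious which tags actually dominate the titles before filtering, a simple frequency count on the tagged DataFrame shows it (an optional check, not required for the pipeline):

# optional check: see which POS tags dominate the titles before filtering
print(df_tagset['Tag'].value_counts().head(10))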

# Create lemmatizer object
lemmatizer = WordNetLemmatizer()

# Lemmatize each word as a noun, an adjective and a verb
lemmatize_text = []
for word in text:
    output = [word, lemmatizer.lemmatize(word, pos='n'), lemmatizer.lemmatize(word, pos='a'), lemmatizer.lemmatize(word, pos='v')]
    lemmatize_text.append(output)

# create DataFrame using the original words and their lemmas
df = pd.DataFrame(lemmatize_text, columns=['Word', 'Lemmatized Noun', 'Lemmatized Adjective', 'Lemmatized Verb'])
df['Tag'] = tag
DataFrame after Lemmatization. Image by Author.

The script above creates three columns, which store the lemmatized nouns, lemmatized adjectives and lemmatized verbs. When the tag of a word is a noun, the base form of the word is reflected in the Lemmatized Noun column; likewise, the adjective base form goes into the Lemmatized Adjective column and the verb base form into the Lemmatized Verb column.
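To see why the POS argument matters, here is a small illustration reusing the lemmatizer object created above; the example words are chosen purely for demonstration:

# the same token can lemmatize differently depending on the POS argument
print(lemmatizer.lemmatize("learning", pos='n'))  # 'learning' (kept as a noun)
print(lemmatizer.lemmatize("learning", pos='v'))  # 'learn' (reduced to the verb base form)
print(lemmatizer.lemmatize("better", pos='a'))    # 'good' (adjective base form)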

At this stage, every category of Part-of-Speech is further divided into subcategories. According to [1], Nouns are further divided into

  1. Singular or mass noun (NN),
  2. Singular proper noun (NNP),
  3. Plural proper noun (NNPS), and
  4. Plural noun (NNS).

Adjectives and Verbs are also further divided into subcategories. This creates a little more work when we want to select the tokens by group later. Hence, the subcategories will be replaced with their main category.

# replace the tag subcategories with a single character for simplicity
df = df.replace(['NN','NNS','NNP','NNPS'],'n')
df = df.replace(['JJ','JJR','JJS'],'a')
df = df.replace(['VBG','VBP','VB','VBD','VBN','VBZ'],'v')

Then, a new column "Lemmatized Word" with the base form of the word will be created with the following script.

# take the lemmatized noun when the tag is a noun, the lemmatized adjective
# when the tag is an adjective, and the lemmatized verb when the tag is a verb
df_lemmatized = df.copy()
df_lemmatized['Tempt Lemmatized Word'] = df_lemmatized['Lemmatized Noun'] + ' | ' + df_lemmatized['Lemmatized Adjective'] + ' | ' + df_lemmatized['Lemmatized Verb']
lemma_word = df_lemmatized['Tempt Lemmatized Word']
tag = df_lemmatized['Tag']
i = 0
new_word = []
while i < len(tag):
    words = lemma_word[i].split(' | ')
    if tag[i] == 'n':
        word = words[0]
    elif tag[i] == 'a':
        word = words[1]
    elif tag[i] == 'v':
        word = words[2]
    new_word.append(word)
    i += 1

df_lemmatized['Lemmatized Word'] = new_word
df_lemmatized.head()
Lemmatized Word Created in the DataFrame. Image by Author.

The last step in text processing is to convert the Lemmatized Word column into a list for the next process.

lemma_word = [str(x) for x in df_lemmatized['Lemmatized Word']]

Now, we are ready to create Word Cloud to explore the text!


4. Text Exploration

Normally I create Word Clouds with n-grams in increasing order. So, we will start with unigrams, followed by bigrams and trigrams.
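For readers new to n-grams, here is a toy illustration of what the three levels look like for a single made-up phrase, using the CountVectorizer imported earlier (purely for demonstration, not part of the dataset):

# toy example of the three n-gram levels
sample = ["deep learning neural network"]
for n in (1, 2, 3):
    vec = CountVectorizer(ngram_range=(n, n))
    vec.fit(sample)
    print(n, sorted(vec.vocabulary_))
# 1 ['deep', 'learning', 'network', 'neural']
# 2 ['deep learning', 'learning neural', 'neural network']
# 3 ['deep learning neural', 'learning neural network']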

Unigram

Nouns, Adjectives and Verbs are all meaningful, so a Word Cloud will be created for each tag category.

a. Noun

# select only noun for word cloud
tagset = df_lemmatized
tagset_allowed = ['n']
new_tagset = tagset.loc[tagset['Tag'].isin(tagset_allowed)]
text = ' '.join(str(x) for x in new_tagset['Lemmatized Noun'])
wordcloud = WordCloud(width = 1600, height = 800, max_words = 200, background_color = 'white').generate(text)
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off")
#plt.savefig('Vis/Noun_WordCloud.png') # if you want to save the WordCloud
plt.show()
Word Cloud of Nouns. Image by Author.

Based on the Word Cloud of Nouns, the nouns frequently used in the titles are medium, machine, network, learning, intelligence, and data science.

b. Adjective

# select only adjectives for word cloud
tagset = df_lemmatized
tagset_allowed = ['a']
new_tagset = tagset.loc[tagset['Tag'].isin(tagset_allowed)]
text = ' '.join(str(x) for x in new_tagset['Lemmatized Adjective'])
wordcloud = WordCloud(width = 1600, height = 800, max_words = 200, background_color = 'white').generate(text)
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off")
#plt.savefig('Vis/Adjectives.png')
plt.show()
Word Cloud of Adjectives. Image by Author.

The frequently used adjectives in the titles are neural, deep, artificial, big, new and fun.

c. Verb

# select only verbs for word cloud
tagset = df_lemmatized
tagset_allowed = ['v']
new_tagset = tagset.loc[tagset['Tag'].isin(tagset_allowed)]
text = ' '.join(str(x) for x in new_tagset['Lemmatized Verb'])
wordcloud = WordCloud(width = 1600, height = 800, max_words = 200, background_color = 'white').generate(text)
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off")
#plt.savefig('Vis/Verbs.png') # if you want to save the WordCloud
plt.show()
Word Cloud of Verbs. Image by Author.

The verbs that frequently appear in the titles are learning (learn), use, deep, understand, happen and interview. At this stage, we can roughly see the topics that are frequently discussed by combining the Word Clouds of Nouns, Adjectives and Verbs. The topics are neural networks, data science, machine learning, deep learning and artificial intelligence.

Then, based on the Word Cloud of Verbs, we can identify the objective of the articles. For example,

  1. Use, learn, explain, build – Articles with these keywords are tutorials on using a package, tool or algorithm
  2. Interview – Articles with this keyword offer advice or guidance on interviews

Bigram

For the bigram and trigram Word Clouds, we will need to use CountVectorizer to calculate the frequencies.

# Using CountVectorizer to compute the frequency of bigrams
tagset_allowed = ['a','n','v']
new_tagset = df_lemmatized.loc[df_lemmatized['Tag'].isin(tagset_allowed)]
text = [' '.join(str(x) for x in new_tagset['Lemmatized Word'])]
vectorizer = CountVectorizer(ngram_range=(2, 2))
bag_of_words = vectorizer.fit_transform(text)
# sum the counts of each bigram and sort them by frequency
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
print(words_freq[:100])
# Generating the word cloud and saving it as a jpg image
words_dict = dict(words_freq)
WC_height = 800
WC_width = 1600
WC_max_words = 200
wordCloud = WordCloud(max_words=WC_max_words, height=WC_height, width=WC_width, background_color='white')
wordCloud.generate_from_frequencies(words_dict)
plt.title('Most frequently occurring bigrams')
plt.imshow(wordCloud, interpolation='bilinear')
plt.axis("off")
plt.show()
wordCloud.to_file('wordcloud_bigram_title.jpg')
Bigram Word Cloud. Image by Author.

From the Bigram Word Cloud, we can see more meaningful phrases than with unigrams. For example, reinforcement learning, learn TensorFlow, raspberry pi, beginner guide, image segmentation and silicon valley.

Let’s see what we can find with the Trigram Word Cloud.

Trigram

As with the bigrams, we have to calculate the frequencies with CountVectorizer first.

# Using CountVectorizer to compute the frequency of trigrams
vectorizer = CountVectorizer(ngram_range=(3, 3))
bag_of_words = vectorizer.fit_transform(text)
# sum the counts of each trigram and sort them by frequency
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
print(words_freq[:100])
# Generating the word cloud and saving it as a jpg image
words_dict = dict(words_freq)
WC_height = 800
WC_width = 1600
WC_max_words = 200
wordCloud = WordCloud(max_words=WC_max_words, height=WC_height, width=WC_width, background_color='white')
wordCloud.generate_from_frequencies(words_dict)
plt.title('Most frequently occurring trigrams')
plt.imshow(wordCloud, interpolation='bilinear')
plt.axis("off")
plt.show()
wordCloud.to_file('wordcloud_trigram_title.jpg')
Trigram Word Cloud. Image by Author.

The frequent phrases are mostly similar to the bigrams, except for ‘detect object deep’, ‘recognition deep learning’ and ‘deep learning raspberry’. The first two trigrams likely refer to object detection with deep learning and image recognition with deep learning respectively, while I am not sure about the third one.

Let’s find out.

# search the original article titles for the keyword "raspberry pi"
# (reuse the `title` array created during tokenization; `df` now holds the lemmatization result)
keyline = []
for line in title:
    line = str(line).lower()
    result = re.search(r"(^|[^a-z])" + "raspberry pi" + r"([^a-z]|$)", line)
    if result is not None:
        keyline.append(line)
print(keyline)
The Titles Containing "Raspberry Pi". Image by Author

The Raspberry Pi is used to detect objects with deep learning.

The duplicated articles are not removed because they have different links.


Insight from Word Cloud of Medium Articles’ Title

From the Word Clouds created for the Medium articles’ titles, we can see that most titles are formed with general or popular terms, like Machine Learning, Artificial Intelligence, Deep Learning and Neural Network. These were the popular topics at the time. However, the dataset does not contain the time each article was published, so we are not able to identify how the choice of words affects the articles’ view rate.


Conclusion

Generally, the average frequency of occurrence of unigrams is higher than that of bigrams and trigrams. Hence, to avoid meaningful bigrams and trigrams being buried, we should create separate Word Clouds for unigrams, bigrams, and trigrams.

Moreover, bigrams and trigrams also help us to understand the text better, as a unigram can be ambiguous and its meaning may change depending on the word attached to it. For example, low quality versus good quality.

This also brings up another reason why text exploration is essential in certain cases. Imagine you are exploring the reviews of a furniture store. With topic modelling, you may get the general topics people are discussing, like the chair, table, cupboard and so on. With text exploration via the Word Cloud, you learn the phrases people mention frequently in the reviews, for example, table low quality, slow delivery, etc.

Lastly, although this is not shown in my example, we may face the situation where we need to go back to text processing after discovering frequently occurring but meaningless tokens during text exploration. Removing them may reveal the insightful tokens hiding under them and further reduce the time to train the model.
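In practice, going back usually just means extending the stop-word list with the newly spotted tokens and re-running the normalization step. A minimal sketch, with made-up example tokens:

# sketch: extend the stop word list with frequent but meaningless tokens
# spotted during exploration, then rebuild the token list
extra_noise = ["com", "http", "www"]   # hypothetical examples, adjust to your data
stop_words.extend(extra_noise)
tokens = [x for x in clean_token if x not in stop_words]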

Word Cloud is just one of the methods for text exploration. It may not be the best, but it is the most fun method!


Some Side Notes

If you are interested in the difference between text processing with NLTK and SpaCy, see Text Processing in Python.

If you are interested in extracting the sentences which contain keywords from a list, see Text Extraction using Regular Expression (Python).

Stay Connected

Subscribe on YouTube

Reference

[1] D. Jurafsky and J. H. Martin, "Sequence Labeling for Parts of Speech and Named Entities," in Speech and Language Processing, 2020, p. 4.

[2] A. Kim and A. Escobedo-Land, "OkCupid Data for Introductory Statistics and Data Science Courses," Journal of Statistics Education, vol. 23, 07 2015.


Congrats and thanks for reading to the end. Hope you enjoy this article. ☺️

Photo by Ian Schneider on Unsplash
