
Businesses interact with their customers to better understand them and also to improve their products and services. This interaction can take the form of emails, textual social media posts (e.g. Twitter), customer reviews (e.g. Amazon), etc. It would be inefficient and cost-prohibitive to have human representatives look through all of these forms of textual communications and then route the communications to the relevant teams to review, take action on and/or respond to customers. One inexpensive method to group such interactions and to assign them to relevant teams is using topic modeling.
Topic modeling in the context of Natural Language Processing (NLP) is a type of unsupervised (i.e. the data is not labeled) Machine Learning task where an algorithm is tasked with assigning topics to a collection of documents, based on the documents' contents. A given document typically contains multiple topics in different proportions. For example, if a document is about cars, we would expect the names of cars to appear more prominently than words from unrelated topics (e.g. names of animals), while we expect common words such as "the" and "are" to appear in roughly equal proportions regardless of topic. Topic models implement mathematical approaches to quantify the probability of such topics for a given collection of documents.
In this post, we are going to expand our depth of NLP knowledge, as part of the requirements of a data scientist role. We will first build some foundational knowledge around the concepts of Tokenization, Parts of Speech and Named-Entity Recognition. Then we will implement a sentiment analysis exercise and finally use Latent Dirichlet Allocation for topic modeling.
Similar to my other posts, learning will be achieved through practice questions and answers. I will include hints and explanations in the questions as needed to make the journey easier. Lastly, the notebook that I used to create this exercise is linked at the bottom of the post, which you can download, run and follow along with.
Let’s get started!
(All images, unless otherwise noted, are by the author.)
Data Set
In order to implement the concepts covered in this post, we will use a data set from the UCI Machine Learning Repository, which is based on the paper "From Group to Individual Labels using Deep Features" (Kotzias et al., 2015) and can be downloaded from this link (CC BY 4.0).
Let’s start with importing some of the libraries we will be using today, then read the data set and look at the top 10 rows of the dataframe. There are comments included before each command to explain these steps further.
# Import libraries
import pandas as pd
import numpy as np
import nltk
# Making the full width of the columns viewable
pd.set_option('display.max_colwidth', None)
# Making all rows viewable
pd.set_option('display.max_rows', None)
# Read the data set and drop the df['label'] column
df = pd.read_csv('imdb_labelled.csv').drop('label', axis = 1)
# Remove stray newline and tab/label remnants to clean up the text
df = df.replace(['\n', '\t1', '\t0'],'', regex=True)
# Return top 10 rows of the dataframe
df.head(10)
Results:

Tutorial + Questions and Answers
Tokenization
Tokenization refers to breaking down textual strings into smaller substrings. These substrings can be at different levels. For example, one tokenization strategy breaks a given string down into sentences, while other tokenizers break a sentence down into smaller tokens, such as words, bigrams, etc. For this exercise, we only need to break strings down into sentences and words, so I will not go deeper into other tokenization strategies. If you are interested in learning more, I have another post linked here, where I cover tokens, bigrams, and n-grams in more detail.
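As a quick illustration, here is a minimal sketch of both levels of tokenization applied to a made-up string (depending on your environment, you may first need to run nltk.download('punkt') to fetch the tokenizer models):
# Import the sentence- and word-level tokenizers
from nltk import sent_tokenize, word_tokenize
# A made-up string containing two sentences
text = "The movie was great. I would watch it again!"
# Sentence-level tokenization returns a list of sentences
sent_tokenize(text)
# ['The movie was great.', 'I would watch it again!']
# Word-level tokenization returns a list of word tokens
word_tokenize(text)
# ['The', 'movie', 'was', 'great', '.', 'I', 'would', 'watch', 'it', 'again', '!']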
Question 1:
Define a function named "make_sentences" that accepts a Series as its argument, defaults to top 15 rows of the "text" column of our dataframe, breaks down each entry into sentences and returns a list of such sentences. Then apply that function to the top 10 rows of the dataframe.
_Hint: Use nltk.sent_tokenize, which divides a given string into a list of substrings at the sentence level._
Answer:
# Import packages
from nltk import sent_tokenize
# Define the function
def make_sentences(text = df['text'].head(15)):
    # Apply sent_tokenize to each entry of the Series via a lambda function
    return text.apply(lambda x: sent_tokenize(x)).tolist()
# Return the results for the top 10 rows of the dataframe
make_sentences(df['text'].head(10))
Results:

Parts of Speech
So far, we can divide given strings into sentences consisting of collections of words. Words can be grouped into lexical categories (similar to classes in classification machine learning tasks), including nouns, verbs, adjectives, adverbs, etc. These lexical groups are called Parts of Speech (POS) in NLP, and the process of automatically assigning POS to words is called POS tagging, which is a common step of NLP pipelines.
POS tagging can be useful in various NLP tasks. For example, in machine translation the task is to produce a translation (in the target language) of an input text (in the original language). If the input text includes a person's name, we would not want the machine translation model to translate it. One way of ensuring this is to tag the person's name as an entity so that the model bypasses it: everything else in the sentence is translated except for that one tagged entity. Then, as part of the post-processing steps, the tagged entity is mapped into its right place in the final translation output.
There are various methodologies for creating tagging strategies, such as regex-based approaches or trained machine learning models. In today's exercise we will rely on the POS tagger provided by NLTK. Let's look at an example to better understand this concept.
Let's start by creating a sample string, running it through NLTK's POS tagger and reviewing the results.
# Create a sample sentence
sample = 'I am impressed by Amazon delivering so quickly in Japan!'
# Import required libraries
from nltk import word_tokenize, pos_tag
# Break down the sample into word tokens
tokens = word_tokenize(sample)
# Create POS tagging
pos = pos_tag(tokens)
# Return the POS tag results
pos
Results:

Now we can see what the tagging results look like. For example, "quickly" is tagged as "RB", which stands for adverb, and "Amazon" is tagged as "NNP", which stands for a singular proper noun. NLTK provides documentation for these tags. For example, if we want to see what "RB" refers to, we can run the following command:
nltk.help.upenn_tagset('RB')
Results:

And if you’d like to see all the tags, you can run the same command with no arguments.
Named-Entity Recognition
Now we have POS tagging for each word in the sentence, but not all nouns are equal. For example, "Amazon" and "Japan" are both tagged as "NNP", but one is the name of a corporation and the other is a country. Named-Entity Recognition, or NER (aka named-entity chunking), extracts information from a given textual input by classifying named entities into pre-defined classes, such as persons, organizations, locations, etc. Let's look at an example to see how this works.
Question 2:
First break down the sample sentence into tokens, then apply POS tagging, followed by a named-entity recognition and return the results.
Answer:
# Import required packages
from nltk import word_tokenize, pos_tag, ne_chunk
# Break down the sample into tokens
tokens = word_tokenize(sample)
# Create POS tagging
part_of_speech = pos_tag(tokens)
# Create named-entity chunks
named_entity_chunks = ne_chunk(part_of_speech)
# Return named_entity_chunks
named_entity_chunks
Results:

Let's look at the results, in particular for "Amazon" and "Japan", since we know these two are entities. Amazon is classified as a "PERSON", which is an improvement opportunity for our algorithm; I would have preferred a class of "Corporation" or similar. "Japan", on the other hand, is classified as a "GPE", which stands for geo-political entity. That sounds right! We have now seen how NER can help us further break down nouns into entity classes.
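If we only need the labeled entities themselves, one option is to walk the chunked result and pull out the subtrees that carry an entity label. The snippet below is a minimal sketch of that idea, applied to the named_entity_chunks variable from Question 2 (the expected output is indicative, not verified against the notebook):
# Import the Tree class that NLTK uses to represent chunks
from nltk.tree import Tree
# Collect (entity text, entity label) pairs from the chunked result
entities = []
for chunk in named_entity_chunks:
    # Labeled entities come back as subtrees; other tokens remain (word, tag) tuples
    if isinstance(chunk, Tree):
        entity_text = " ".join(token for token, tag in chunk.leaves())
        entities.append((entity_text, chunk.label()))
# Return the collected entities
entities
# Expected to look something like [('Amazon', 'PERSON'), ('Japan', 'GPE')]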
Now that we have learned how to perform POS tagging and NER, let’s create a function that can automatically implement these tasks.
Question 3:
Define a function named "make_chunks" that accepts a list of sentences as the argument, defaulting to the "make_sentences" function defined in Question 1, and returns a dictionary (which will be referred to as the outer dictionary) where the key of the outer dictionary is an integer referring to the row number of the entry. The value of the outer dictionary is a dictionary itself (which will be referred to as the inner dictionary) where the key of the inner dictionary is the sentence number and the value of the inner dictionary is the result of the named-entity recognition (similar to Question 2). As an example, looking back at the results of Question 1, the sixth row of the results was the following list of sentences:
['The rest of the movie lacks art, charm, meaning...',
"If it's about emptiness, it works I guess because it's empty."],
Therefore, running the function defined in Question 3 with default arguments is expected to return the below for the sixth row:
5: {
0: [('The', 'DT'),
('rest', 'NN'),
('of', 'IN'),
('the', 'DT'),
('movie', 'NN'),
('lacks', 'VBZ'),
('art', 'RB'),
(',', ','),
('charm', 'NN'),
(',', ','),
('meaning', 'NN'),
('...', ':')
],
1: [('If', 'IN'),
('it', 'PRP'),
("'s", 'VBZ'),
('about', 'IN'),
('emptiness', 'NN'),
(',', ','),
('it', 'PRP'),
('works', 'VBZ'),
('I', 'PRP'),
('guess', 'NN'),
('because', 'IN'),
('it', 'PRP'),
("'s", 'VBZ'),
('empty', 'JJ'),
('.', '.')
]
},
Answer:
In order to define this function, we will iterate at two levels: the outer loop goes over the rows and the inner loop goes over the sentences of each row, applying tokenization, POS tagging and NER, similar to the example covered before this question.
# Define the function
def make_chunks(make_sentences = make_sentences):
    # Create the outer dictionary with row number as key
    row_dict = dict()
    # Form the first iteration
    for i, row in enumerate(make_sentences()):
        # Create the inner dictionary with sentence number as key
        sent_dict = dict()
        # Form the second iteration
        for j, sent in enumerate(row):
            # Tokenize
            w = word_tokenize(sent)
            # POS tagging
            pos = pos_tag(w)
            # Add named-entity chunks as the values to the inner dictionary
            sent_dict[j] = list(ne_chunk(pos))
        # Add the inner dictionary as values to the outer dictionary
        row_dict[i] = sent_dict
    # Return the outer dictionary
    return row_dict
# Test on the sixth row of the dataframe
make_chunks()[5]
Results:
{0: [('The', 'DT'),
('rest', 'NN'),
('of', 'IN'),
('the', 'DT'),
('movie', 'NN'),
('lacks', 'VBZ'),
('art', 'RB'),
(',', ','),
('charm', 'NN'),
(',', ','),
('meaning', 'NN'),
('...', ':')],
1: [('If', 'IN'),
('it', 'PRP'),
("'s", 'VBZ'),
('about', 'IN'),
('emptiness', 'NN'),
(',', ','),
('it', 'PRP'),
('works', 'VBZ'),
('I', 'PRP'),
('guess', 'NN'),
('because', 'IN'),
('it', 'PRP'),
("'s", 'VBZ'),
('empty', 'JJ'),
('.', '.')]}
As expected, the results match the example provided in the question.
Sentiment Analysis
In the field of NLP, sentiment analysis is used to identify, quantify, extract and study subjective information from textual data. In this exercise we will use the polarity score, a float within the range [-1.0, 1.0] that indicates whether a text has a positive or negative sentiment. This level of familiarity will suffice for the purposes of this post, but if you are interested in learning more, please refer to my post about sentiment analysis linked here. Let's look at an example together.
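Before wrapping this into a function, here is a minimal sketch of what the polarity scores look like for a single made-up sentence; the 'compound' value is the overall score within [-1.0, 1.0] (you may first need to run nltk.download('vader_lexicon')):
# Import the analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Create an instance of the analyzer
sia = SentimentIntensityAnalyzer()
# Score a made-up sentence; the result is a dict with 'neg', 'neu', 'pos' and 'compound' keys
sia.polarity_scores("The acting was wonderful but the plot was a mess.")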
Question 4:
Create a function that accepts a list of sentences as its argument, defaulting to the "make_sentences" function defined in Question 1, and returns a dataframe with two columns, "sentence" and "sentiment". Please use NLTK's "SentimentIntensityAnalyzer" for the sentiment analysis. Finally, run the function with the default argument and return the results.
Answer:
# Import the package
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Define the function
def sentiment_analyzer(make_sentences = make_sentences):
    # Create an instance of SentimentIntensityAnalyzer
    sia = SentimentIntensityAnalyzer()
    # Create the dataframe with two columns as described
    df = {
        'sentence' : []
        , 'sentiment' : []
    }
    # Create two loops to add the column values
    for i, row in enumerate(make_sentences()):
        for sent in row:
            # Add the sentence to the dataframe
            df['sentence'].append(sent)
            # Add the compound polarity score to the sentiment column of the dataframe
            df['sentiment'].append(sia.polarity_scores(sent)['compound'])
    # Return the dataframe
    return pd.DataFrame(df)
# Run the function
sentiment_analyzer()
Results:

Topic Modeling – Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is one of the common models used for topic modeling. While exploring the mathematical details of LDA is beyond the scope of this work, we can think of it as a model that connects words to topics and documents. For example, when a collection of documents is provided to the LDA model, it looks at the words in each document and assigns topics, with their respective probabilities, to each document.
Luckily for us, LDA can be easily implemented in scikit-learn. Scikit-learn's LDA class accepts a Document Term Matrix (DTM) as an argument; therefore, let's review what DTMs are and then look at an example of topic modeling using scikit-learn's LDA model.
Document Term Matrix
A DTM is a matrix that represents the frequency of terms occurring in a collection of documents. Let's look at two sentences to understand what a DTM is.
Let’s say that we have the following two sentences (each sentence is considered a "document" in this example):
sentence_1 = 'He is walking down the street.'
sentence_2 = 'She walked up then walked down the street yesterday.'
The DTM of the above two sentences will be:

A DTM can be created using scikit-learn's CountVectorizer, as sketched below. This will suffice for the purposes of our current exercise, but if you are interested in learning more about DTMs, visit my post on sentiment analysis, linked here.
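As a minimal sketch, the DTM of the two sentences above can be reproduced with CountVectorizer (the column order may differ from the table shown above):
# Import the required packages
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# The two "documents" from the example above
docs = [
    'He is walking down the street.',
    'She walked up then walked down the street yesterday.'
]
# Fit the vectorizer and build the DTM
vect = CountVectorizer()
dtm_example = vect.fit_transform(docs)
# Display the DTM with one row per document and one column per term
pd.DataFrame(dtm_example.toarray(), columns=vect.get_feature_names_out())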
Let’s implement what we have learned so far. We will implement the following steps:
- Import packages required for DTM and LDA and instantiate them
- Create a DTM of the "text" column of our dataframe
- Use LDA to create topics for the provided DTM
# Step 1 - Import packages
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
# Create an instance of the imported packages
lda = LatentDirichletAllocation()
cvect = CountVectorizer(stop_words='english')
# Step 2 - Create a DTM of the "text" column of the dataframe
dtm = cvect.fit_transform(df['text'])
# Step 3 - Create list of topics using LDA
topics = lda.fit_transform(dtm)
Now that we have fit the model, let's look at which words are included in each of the topics. The results of the model can be viewed using lda.components_. Let's look at an example.
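As a quick sanity check, lda.components_ is a 2-D array with one row per topic and one column per term in the DTM; since we did not set n_components, scikit-learn's default of 10 topics applies:
# Each row of lda.components_ is one topic; each column is a term from the DTM
lda.components_.shape
# Expected: (10, number_of_terms), since n_components defaults to 10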
Question 5:
Define a function named "top_n_words" that accepts two arguments:
- "feature_names", which are the feature names resulting from the DTM
- "n", which is the number of rows and words that will be returned.
This function accepts the two arguments above and returns n topic with the top n words within that topic. For example, results for the first topic could be as follows:
Topic 0: film ve really end movies just dialogue choked drama worthless
Finally, run the function and return the top 10 words in each topic.
Answer:
# Define the function
def top_n_words(feature_names, n):
    for topic_index, topic in enumerate(lda.components_):
        # Create the "Topic {index}:" portion of the output
        output = "Topic %d: " % topic_index
        # Add the top n words of that topic
        output += " ".join([feature_names[i] for i in topic.argsort()[:-n - 1:-1]])
        # Print the output
        print(output)
    # Print an empty line at the end
    print()
# Get the feature names from the DTM
feature_names = cvect.get_feature_names_out()
# Run the function for top 10 words of each topic
top_n_words(feature_names, n=10)
Results:

Question 6:
Define a function that accepts two arguments, a "search_word" and "n", and returns the top "n" words from the topic that the provided "search_word" most likely belongs to. The results should be in the form of a dataframe containing two columns: the first column is the "probability" (the topic weight) of each word, and the second column is the "feature", i.e. the word associated with the identified topic. Finally, run the function with "action" as the "search_word" and return the top 10 words related to that topic.
Answer:
# Define the function
def word_feature(search_word, n=10):
    # Transform the search word using the fitted vectorizer
    search_word_cvect = cvect.transform(np.array([search_word]))
    # Get the topic distribution of the search word
    probabilities = lda.transform(search_word_cvect)
    # Select the components of the most likely topic
    topics = lda.components_[np.argmax(probabilities)]
    # Get the feature (word) names from the vectorizer
    features = cvect.get_feature_names_out()
    # Return the top n features of that topic by weight
    return pd.DataFrame({'probability': topics, 'feature': features}).nlargest(n, 'probability')
# Run the function for the given search_word and return top 10 results
word_feature('action', n=10)
Results:

Notebook with Practice Questions
Below is the notebook with both questions and answers that you can download and practice.
Conclusion
In this post, we talked about how machine learning, and specifically topic modeling, can be used to group a collection of documents together, which can then facilitate downstream tasks for businesses dealing with large volumes of incoming textual data in the form of emails, social media posts, customer reviews, etc. We started by building the required intuition and foundation and covered tokenization, parts of speech and named-entity recognition, followed by sentiment analysis and an implementation of topic modeling using Latent Dirichlet Allocation.
Thanks for Reading!
If you found this post helpful, please follow me on Medium and subscribe to receive my latest posts!