Introduction
Topic modeling is a Natural Language Processing (NLP) task that uses unsupervised learning methods to extract the main topics from text data. "Unsupervised" here means that there is no training data with associated topic labels. Instead, the algorithm tries to discover the underlying patterns, in this case the topics, directly from the data itself.
Various algorithms are widely used for topic modeling. In my previous two articles, I introduced two of them, LDA and GSDMM:
Let us Extract some Topics from Text Data – Part I: Latent Dirichlet Allocation (LDA)
Let us Extract some Topics from Text Data – Part II: Gibbs Sampling Dirichlet Multinomial Mixture…
In this article, we will look at Non-Negative Matrix Factorization (NMF), an unsupervised algorithm derived from linear algebra. Code implementations in Python will also be introduced.
What is NMF?
One interesting thing about NMF is that, unlike other topic modeling algorithms that are used exclusively for topic modeling, it is used for a bunch of other applications as well, including recommendation systems. NMF was developed in the field of linear algebra to identify the latent or hidden structure within data. In a nutshell, NMF decomposes (or factorizes, hence the name) high-dimensional vectors into a lower-dimensional representation. All the coefficients of these lower-dimensional vectors are non-negative, as implied by the algorithm's name: "non-negative" matrix factorization.
Say the original matrix for factorization is A. NMF breaks A down into two matrices, W and H. W is called the features matrix and contains the topics NMF found, and H is called the components matrix and contains the coefficients (weights) for those topics. If matrix A has shape M x N and the number of topics is k, then the features matrix and the components matrix will have shapes (M x k) and (k x N) respectively.
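As a quick toy illustration of these shapes (the matrix below is random and purely for demonstration; the NMF parameters are illustrative, not tuned):

import numpy as np
from sklearn.decomposition import NMF

# A toy "document-term" matrix: 6 documents x 5 terms, factorized into k = 2 topics
A = np.random.default_rng(0).random((6, 5))
nmf = NMF(n_components=2, init='random', random_state=0)
W = nmf.fit_transform(A)  # features matrix, shape (6, 2)
H = nmf.components_       # components matrix, shape (2, 5)
print(W.shape, H.shape)   # (6, 2) (2, 5)
print(np.linalg.norm(A - W @ H))  # reconstruction error: A is only approximated by W x H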
Some Math
Since NMF is an unsupervised algorithm, it needs some measure of how similar or dissimilar different documents are to one another. Several mathematical metrics are used to define this sense of similarity or distance; I introduce the two most popular ones in this section.
The first one is the Generalized Kullback–Leibler divergence which is defined as the following.
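For two non-negative vectors (or matrices) p and q, the generalized form is

$$D_{KL}(p \,\|\, q) = \sum_{i} \left( p_i \log \frac{p_i}{q_i} - p_i + q_i \right)$$

which reduces to the more familiar $\sum_{i} p_i \log (p_i / q_i)$ when p and q are probability distributions.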
Kullback–Leibler divergence can be implemented as follows.
from math import log2

def kullback_leibler_divergence(p, q):
    # D_KL(p || q), here computed with log base 2
    return sum(p[i] * log2(p[i] / q[i]) for i in range(len(p)))

# There is a built-in, element-wise function in the SciPy package
# that you can use directly as well (sum its output to get the divergence)
from scipy.special import kl_div
You can read more about this divergence here.
Another popular metric is the Frobenius norm, which is the square root of the sum of the absolute squares of its elements. It is also referred to as the Euclidean norm. The mathematical formulation is the following.
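$$\|A\|_F = \sqrt{\sum_{i} \sum_{j} |a_{ij}|^2}$$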
You can implement this using the NumPy package.
import numpy as np

# Assuming some array a has already been defined
frobenius_norm = np.linalg.norm(a)
For optimization, one of two algorithms is typically used: the Coordinate Descent solver or the Multiplicative Update solver. Which optimization algorithm is better depends on the type of data you are dealing with and its characteristics (e.g. sparsity), so trying both and seeing which returns more coherent, higher-quality topics may be the best approach. Evaluating topic quality is covered in a later section of this article. Refer to this article to get a deeper understanding of these two optimization algorithms.
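In scikit-learn, for instance, the solver (and the loss it minimizes) is selected via the NMF parameters; a minimal sketch with illustrative, untuned values:

from sklearn.decomposition import NMF

# Coordinate Descent solver (the default), minimizing the Frobenius norm
nmf_cd = NMF(n_components=4, solver='cd', init='nndsvd', random_state=42)

# Multiplicative Update solver, here paired with the generalized KL divergence
# (a beta_loss other than 'frobenius' requires solver='mu')
nmf_mu = NMF(n_components=4, solver='mu', beta_loss='kullback-leibler',
             init='nndsvda', random_state=42, max_iter=500)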
Implementation
Now, let us dive into the code implementation of NMF. As always, we start with installing packages we need and importing them to our environment.
!pip install gensim

import nltk
from nltk.stem import *

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

# Prepare list of stopwords for removal
stopwords = set(nltk.corpus.stopwords.words('english'))

import numpy as np
import pandas as pd

# NMF implementation in sklearn
from sklearn.decomposition import NMF
# TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
# Dataset we use for this tutorial
from sklearn.datasets import fetch_20newsgroups

from scipy import linalg  # For linear algebra operations

import matplotlib.pyplot as plt
%matplotlib inline

np.set_printoptions(suppress=True)
Then, we read in the dataset for this tutorial. We will be using the 20 NewsGroup data that is available to everyone via the sklearn package. It uses the Apache Version 2.0 License. As a reminder, I used this same dataset in my previous two topic modeling articles for the sake of consistency across tutorials.
# Categories
cats = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')

# Read in 20 NewsGroup Data
fetch20newsgroups = fetch_20newsgroups(subset='train', categories=cats, remove=remove)

# Transform fetch20newsgroups object to pandas dataframe
df = pd.DataFrame(fetch20newsgroups.data, columns=['text'])
We go through some traditional text pre-processing steps to make sure the quality of the modeled topics does not suffer.
# Remove Punctuations
import string
PUNCT_TO_REMOVE = string.punctuation
df.text = df.text.apply(lambda x: x.translate(str.maketrans('', '', PUNCT_TO_REMOVE)))
# Remove URLs
import re

def remove_url(text):
    return re.sub(r'https?:\S*', '', text)

df.text = df.text.apply(remove_url)

# Remove mentions and hashtags
def remove_mentions_and_tags(text):
    text = re.sub(r'@\S*', '', text)
    return re.sub(r'#\S*', '', text)

df.text = df.text.apply(remove_mentions_and_tags)
# Make all the text lower case
df.text = df.text.str.lower()

# Remove words that exist across almost all documents (almost like removing
# stopwords) after tokenization
extraneous_words = ['article', 'organization', 'subject', 'line', 'author', 'from',
                    'lines', 'people', 're', 'writes', 'distribution',
                    'x', 'nntppostinghost', 'think', 'university']

docs = []
for news in df.text:
    # Keep only nouns that are neither stopwords nor extraneous words
    words = [w for w in nltk.tokenize.word_tokenize(news)
             if (w not in stopwords) and (w not in extraneous_words)
             and (nltk.pos_tag([w])[0][1] in ['NN', 'NNS'])]
    docs.append(words)
We apply TF-IDF vectorization.
# Initialize TF-IDF Vectorizer
vect = TfidfVectorizer(
    min_df=10,
    max_df=0.85,
    max_features=5000,
    ngram_range=(1, 3),   # Include unigrams, bi-grams and tri-grams
    stop_words='english'  # Remove Stopwords
)

# Apply Transformation
X = vect.fit_transform(df.text)

# Create an NMF instance with 4 components
model = NMF(n_components=4, init='nndsvd', random_state=42)

# Fit the model
model.fit(X)
We store the features and components matrices separately.
# Features Matrix
nmf_features = model.transform(X)
# Components Matrix
components_df = pd.DataFrame(model.components_,
columns=vect.get_feature_names_out())
terms = list(vect.get_feature_names_out())
Get the words with the highest coefficient values for each topic (from the components matrix)
From the components matrix, we can look at the most significant words in each modeled topic by retrieving the tokens associated with the highest coefficient values.
# Top 20 words of importance in each of the topics
for topic in range(components_df.shape[0]):
    topic_df = components_df.iloc[topic]
    print(f'For topic {topic+1} the words with the highest value are:')
    print(topic_df.nlargest(20))
    print('\n')
Let us look at the output for the first topic as an example.
The important words associated with the first topic are things like god, jesus, bible, and believe, which are words we would expect to see in documents related to atheism (the actual topic of topic number one).
Topic Descriptors
Based on the information above, we can print out some top n key words for each topic.
def get_descriptor(terms, H, topic_index, top):
    # Reverse sort the values to sort the indices
    top_indices = np.argsort(H.iloc[topic_index, :])[::-1]
    # Get the terms corresponding to the top-ranked indices
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append(terms[term_index])
    return top_terms

k = 4  # four topics
descriptors = []
for topic_index in range(k):
    descriptors.append(get_descriptor(terms, components_df, topic_index, 10))
    str_descriptor = ", ".join(descriptors[topic_index])
    print("Topic %02d: %s" % (topic_index + 1, str_descriptor))
Then we get:
Topic 01: god, people, dont, think, just, jesus, say, bible, believe, does
Topic 02: thanks, files, graphics, image, file, know, program, format, help, need
Topic 03: space, nasa, launch, shuttle, lunar, orbit, moon, station, data, earth
Topic 04: sank manhattan sea, stay blew, away sank, away sank manhattan, sank manhattan, said queens stay, said queens, bob beauchaine, bronx away sank, bronx away
Topic of jth document
We can also find out which topic the j-th document falls under from the features matrix.
# Topic of the jth document
j = 100
print(pd.DataFrame(nmf_features).loc[j])

# Add 1 to the index to get the topic number since the index starts from 0, not 1
print("\n {}th Document belongs to Topic {} ".format(
    j, np.argmax(pd.DataFrame(nmf_features).loc[j]) + 1))
>>
0 0.113558
1 0.016734
2 0.015008
3 0.000000
Name: 100, dtype: float64
100th Document belongs to Topic 1
Predict topic of new document
For a new document that has been previously unseen by the algorithm, we can also predict which topic this document will fall under.
# New Document to predict
new_doc = """Jesus is the light of the world"""

# Transform the new document with the fitted TF-IDF vectorizer
X_new = vect.transform([new_doc])

# Transform the TF-IDF vector into the NMF feature space
nmf_features_new = model.transform(X_new)

# The idxmax function returns the index of the maximum element
# in the specified axis
print("This new document belongs to Topic {}".format(
    pd.DataFrame(nmf_features_new).idxmax(axis=1).iloc[0] + 1))
>>
This new document belongs to Topic 1
Evaluation of best number of topics
In the above code, we chose four topics because we already had prior knowledge of how many topics are in the data we are dealing with. In most cases, however, we do not have such knowledge, and random guessing would not be the ideal approach. Is there a programmatic way of doing it? Yes!
For automatically determining the best number of topics, we use the same method used for the two previously introduced algorithms, LDA and GSDMM: gensim's coherence score. Unfortunately, there is no way to use sklearn's implementation of NMF jointly with gensim's coherence model, so we need to rely on gensim's implementation of NMF for this purpose. The later part of this article uses the gensim implementation of NMF to determine the optimal number of topics via coherence scores, and then uses the sklearn implementation of NMF to do the actual training and topic extraction.
Reviewing the Topic Quality
Since NMF is a linear algebra algorithm, we have a nice way to evaluate the quality of each topic: looking at the average residuals of the documents in each topic group.
Assuming we used the Frobenius norm as the distance metric for minimization, we can calculate the residual for each document as follows.
A = vect.transform(df.text)  # Original matrix (after TF-IDF vectorization is applied)
W = model.transform(A)       # Features matrix (documents x topics)
H = model.components_        # Components matrix (topics x terms)

# Get the residuals for each document
r = np.zeros(A.shape[0])
for row in range(A.shape[0]):
    # 'fro' here means we are using the Frobenius norm as the distance metric
    r[row] = np.linalg.norm(A[row, :] - W[row, :].dot(H), 'fro')
Then, we calculate the mean of these residuals within each topic group; whichever topic has the lowest average residual is the topic with the best quality! Take a look at this article to learn more about this process.
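A minimal sketch of that step, using the nmf_features and r arrays computed above:

# Assign each document to its dominant topic, then average the residuals per topic
dominant_topic = np.argmax(nmf_features, axis=1) + 1  # +1 so topic numbers start at 1
mean_residuals = (pd.DataFrame({'topic': dominant_topic, 'residual': r})
                  .groupby('topic')['residual'].mean())
print(mean_residuals.sort_values())  # lowest average residual ~ best-fitting topic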
Bonus
As I mentioned earlier, gensim has its own NMF implementation as well. While the underlying math and principles are the same as in the sklearn implementation, it can be used jointly with gensim's coherence model to determine the optimal number of topics to be modeled, whereas sklearn does not offer any coherence score models as of now. To learn more about gensim's implementation of NMF, refer to this documentation.
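A minimal sketch of that workflow, assuming the tokenized docs list built during pre-processing (the candidate range of topic numbers here is arbitrary):

from gensim.corpora import Dictionary
from gensim.models.nmf import Nmf
from gensim.models import CoherenceModel

# Build a gensim dictionary and bag-of-words corpus from the tokenized documents
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit a gensim NMF model for each candidate number of topics and score its coherence
coherence_scores = {}
for k in range(3, 9):
    nmf = Nmf(corpus=corpus, id2word=dictionary, num_topics=k, random_state=42)
    cm = CoherenceModel(model=nmf, texts=docs, dictionary=dictionary, coherence='c_v')
    coherence_scores[k] = cm.get_coherence()

# Pick the number of topics with the highest coherence
best_k = max(coherence_scores, key=coherence_scores.get)
print(coherence_scores, best_k)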
MiniBatch NMF is an NMF variant tailored toward bigger datasets. It has a partial_fit method that is called multiple times consecutively on different chunks of a dataset so as to implement out-of-core or online learning in mini batches. To learn more about this algorithm, refer to its documentation.
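A minimal sketch, assuming scikit-learn >= 1.1 and the TF-IDF matrix X from earlier (batch and chunk sizes are illustrative):

from sklearn.decomposition import MiniBatchNMF  # requires scikit-learn >= 1.1

mb_model = MiniBatchNMF(n_components=4, batch_size=512, random_state=42)

# Feed the TF-IDF matrix in chunks, e.g. when the full dataset does not fit in memory at once
chunk_size = 500
for start in range(0, X.shape[0], chunk_size):
    mb_model.partial_fit(X[start:start + chunk_size])

mb_features = mb_model.transform(X)  # document-topic matrix, analogous to nmf_features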
Conclusion
In this article, I introduced the NMF algorithm, a member of the family of linear algebra algorithms that is used not only for topic modeling but also for a broad array of applications, from recommendation systems to dimensionality reduction. NMF's strength lies in its intuitive and straightforward math, as long as the user has some understanding of linear algebra. It is hard to tell which topic modeling algorithm works best for the particular project and dataset you are dealing with, so I recommend running several of the most popular topic modeling algorithms and eyeballing the key words in each of the modeled topics before proceeding to any next steps.
If you found this post helpful, consider supporting me by signing up on medium via the following link : )
You will have access to so many useful and interesting articles and posts from not only me but also other authors!
About the Author
Data Scientist. 1st Year PhD student in Informatics at UC Irvine. Main research interest is applying SOTA ML/DL/NLP methods on health and medical related big data to extract interesting insights that will inform patients, doctors and policy makers.
Former research area specialist at the Criminal Justice Administrative Records System (CJARS) economics lab at the University of Michigan, working on statistical report generation, automated data quality review, building data pipelines and data standardization & harmonization. Former Data Science Intern at Spotify, Inc. (NYC).
He loves sports, working out, cooking good Asian food, watching kdramas and making / performing music and, most importantly, worshiping Jesus Christ, our Lord. Check out his website!