
NLP: Topic Modeling to Identify Clusters

Derive Topics from Long Text

This is Part 3 of a 4-part series covering:

  1. Pre-processing and Cleaning
  2. Text Summarization
  3. Topic Modeling using Latent Dirichlet Allocation (LDA) - we are here
  4. Clustering

In this article, we will use summarized versions of long text documents to find which topics make up each document. We summarize the text before topic modeling because some documents contain extra detail, while others carry just the gist.

Wait, but why model topics? What does that even mean?

Topic modeling is a technique for discovering the abstract "topics" that occur in a collection of documents. It is a frequently used text-mining tool for uncovering hidden semantic structures in a text body.
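To make that concrete, here is a minimal sketch (using gensim on a toy corpus, not the code from this project) showing that a "topic" is simply a weighted list of words learned from the documents:

import gensim
from gensim import corpora

# toy corpus: each document is a list of tokens
docs = [["patient", "exam", "blood", "pressure"],
        ["allergy", "sneezing", "nasal", "spray"],
        ["blood", "pressure", "medication", "dose"]]

dictionary = corpora.Dictionary(docs)               # token -> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

lda = gensim.models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                             passes=10, random_state=42)
for topic_id, words in lda.print_topics():
    print(topic_id, words)  # each topic is a weighted mix of words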

Image by Author: Original Text document

We want to keep only crisp and concise information when identifying topics for each long document. So, we summarized the text to look like this:

Image by Author: Summarized text

Topic Modeling

We will not do any further preprocessing, because the text was essentially preprocessed during the initial cleaning, and the summaries contain only key phrases.

Just as we first verified that summarization was feasible in the previous sections, let us now inspect whether clustering will be viable.

import spacy
from numpy import dot
from numpy.linalg import norm

# assumes a spaCy model that ships with word vectors, e.g. en_core_web_md
nlp = spacy.load("en_core_web_md")
exam = nlp.vocab["exam"]

# cosine similarity between two vectors
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))

# all lowercase vocabulary entries that have a vector, excluding "exam" itself
allWords = list({w for w in nlp.vocab if w.has_vector and w.orth_.islower() and w.lower_ != "exam"})

# sort by similarity to "exam", most similar first
allWords.sort(key=lambda w: cosine(w.vector, exam.vector), reverse=True)

print("Top 5 most similar words to exam:")
for word in allWords[:5]:
    print(word.orth_)

Wow! So does that mean we can even find similarities between documents? Yes! And that means we can potentially find clusters of these different documents!
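For example, here is a small sketch (assuming a vector-enabled spaCy model such as en_core_web_md is loaded as nlp) of document-level similarity, which spaCy computes from the averaged word vectors:

doc1 = nlp("The patient complained of chest pain and shortness of breath.")
doc2 = nlp("Shortness of breath and chest discomfort were reported.")
doc3 = nlp("The weather was sunny with a light breeze.")

print(doc1.similarity(doc2))  # relatively high: similar symptoms
print(doc1.similarity(doc3))  # relatively low: unrelated content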

Finding the Optimum Number of Topics

Latent Dirichlet Allocation (LDA) is a Bayesian probabilistic model of text documents. It explains sets of observations through unobserved groups, which account for why some parts of the data are similar.

Observations are words from documents. Each document is an amalgamation of a small number of topics. Each word’s presence is attributable to one of the document’s topics.
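A common way to find the optimum number of topics (a sketch, assuming a gensim dictionary and bag-of-words corpus named dictionary and corpus, plus the tokenized summaries in texts) is to train LDA for several topic counts and compare coherence scores:

from gensim.models import LdaModel, CoherenceModel

coherence_scores = []
for k in range(5, 30, 5):
    lda_k = LdaModel(corpus, num_topics=k, id2word=dictionary,
                     passes=10, random_state=42)
    cm = CoherenceModel(model=lda_k, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    coherence_scores.append((k, cm.get_coherence()))

# pick the topic count with the highest coherence
best_k, best_score = max(coherence_scores, key=lambda pair: pair[1])
print("Best number of topics: %d (coherence %.3f)" % (best_k, best_score))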

How many of you actually understood what happened in the section above? Let me know if you did not, and I will be happy to write another article on interpreting this dictionary!

Can we VISUALIZE the topics above though? Yes, we can.
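The interactive view below can be produced with a tool like pyLDAvis (a sketch under that assumption; a trained model lda, plus corpus and dictionary, are assumed):

import pyLDAvis
import pyLDAvis.gensim_models  # named pyLDAvis.gensim in older releases

vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")  # open in a browser and click through topics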

Image by Author: Click on every topic and understand how it is made up of terms

This shows that topics are getting assigned based on the disease/condition being talked about. Great, let us see the topic propensity among the transcripts.

When LDA found the topics above, it was essentially looking for groups of words that occur together. This means we have a probability score for every chart against every topic. But we are not equally confident in all of them, so we only extract assignments with a propensity above 90%.
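In gensim terms, that filtering might look like this (a sketch, assuming the trained model lda, the bag-of-words corpus, and a 0.9 cutoff):

THRESHOLD = 0.9

final_topics = []
for doc_bow in corpus:
    # (topic_id, propensity) pairs for this chart
    topic_probs = lda.get_document_topics(doc_bow)
    confident = [(t, p) for t, p in topic_probs if p >= THRESHOLD]
    # keep the most confident topic, or None if nothing clears the cutoff
    final_topics.append(max(confident, key=lambda tp: tp[1]) if confident else None)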

If you look closely, not all charts have been assigned a topic. That is because those assignments did not meet our cutoff criteria for topic propensity.

Image by Author: Chart note with the final topic and the propensity of that topic (in the range 0–1)

Let us visualize the topics/clusters and the frequency of charts we have identified:
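A frequency plot like the one below can be drawn with pandas and matplotlib (a sketch, assuming the final topic labels live in a DataFrame column df["final_topic"]):

import matplotlib.pyplot as plt

df["final_topic"].value_counts().plot(kind="bar")
plt.xlabel("Topic")
plt.ylabel("Number of charts")
plt.title("Topic Frequency")
plt.show()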

Image by Author: Topic Frequency

Conclusion

Here, we can see that _topic21 has the most charts. It is closely followed by _topic15 and _topic14.

Up next: clustering these text documents!

If you want to try the entire code yourself or follow along, go to my published jupyter notebook on GitHub: https://github.com/gaurikatyagi/Natural-Language-Processing/blob/master/Introdution%20to%20NLP-Clustering%20Text.ipynb

