
Topic Modeling with Latent Semantic Analysis

Exploring a popular approach towards extracting topics from text

Photo by Brett Jordan on Unsplash

Consider the sheer amount of text that is currently in circulation. News articles, blog posts, online reviews, emails, and resumes are all examples of text data that exist in vast quantities.

Due to the massive influx of unstructured data in the form of these documents, we need an automated way to analyze such large volumes of text.

That is where topic modeling comes into play. Topic modeling is an unsupervised learning approach that allows us to extract topics from documents.

It plays a vital role in many applications such as document clustering and information retrieval.

Here, we provide an overview of one of the most popular methods of topic modeling: Latent Semantic Analysis.

An important note

Before covering Latent Semantic Analysis, it is important to understand what a "topic" even means in NLP.

A topic is defined by a collection of words that are strongly associated. For instance, the words "potato", "soup", and "eat" could represent the topic "food".

Since documents are not restricted to a limited set of words, they usually contain multiple topics. We can assign a document to a topic by finding the topic that the document is most strongly associated with.

Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a method that allows us to extract topics from documents by converting their text into word-topic and document-topic matrices.

The procedure for LSA is relatively straightforward:

  1. Convert the text corpus into a document-term matrix
  2. Implement truncated singular value decomposition
  3. Encode the words/documents with the extracted topics

Simple, right?

Ok, I may have glossed over some details. Let’s go over each step one at a time.

1. Convert raw text into a document-term matrix

Before deriving topics from documents, the text has to be converted into a document-term matrix. This is typically done with a bag-of-words or TF-IDF representation.
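As a rough sketch of what this step produces, consider a made-up three-document corpus vectorized with scikit-learn’s CountVectorizer (swapping in TfidfVectorizer would give TF-IDF weights instead):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A made-up three-document corpus for illustration
corpus = [
    "I ate potato soup for lunch",
    "the soup was too hot to eat",
    "this guitar sounds great",
]

# Bag of words: each row is a document, each column counts one word
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(doc_term_matrix.toarray())
```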

2. Implement truncated singular value decomposition

Truncated singular value decomposition (SVD) is at the heart of LSA. The operation is key to obtaining topics from the given collection of documents.

Mathematically, it can be expressed with the following formula:

A ≈ U S Vᵀ

The formula looks intimidating at first glance, but it is quite simple.

In layman’s terms, the operation decomposes the high-dimensional document-term matrix A into the product of 3 smaller matrices (U, S, and V).

The variable A represents the document-term matrix, with a count-based value assigned to each document-word pairing. The matrix has n x m dimensions, with n representing the number of documents and m representing the number of words.

The variable U represents the document-topic matrix. Essentially, its values show the strength of association between each document and its derived topics. The matrix has n x r dimensions, with n representing the number of documents and r representing the number of topics.

The variable S represents a diagonal matrix that evaluates the "strength" of each topic in the collection of documents. The matrix has r x r dimensions, with r representing the number of topics.

The variable V represents the word-topic matrix. Its values show the strength of association between each word and the derived topics. The matrix has m x r dimensions, with m representing the number of words and r representing the number of topics.

Note that while the number of documents and words in a corpus is always constant, the number of topics is not fixed: it is chosen by whoever runs the operation. As a result, the output of an SVD depends on the number of topics you wish to extract. For example, an SVD that extracts 3 topics will yield different matrices than one that extracts 4 topics.
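To make these dimensions concrete, here is a minimal NumPy sketch that decomposes a made-up 4 x 6 document-term matrix, keeping r = 2 topics:

```python
import numpy as np

# Toy document-term matrix: n = 4 documents, m = 6 words (counts made up)
A = np.array([
    [2, 1, 0, 0, 1, 0],
    [1, 2, 0, 0, 0, 1],
    [0, 0, 3, 1, 0, 0],
    [0, 0, 1, 2, 1, 0],
], dtype=float)

# Full SVD, then keep only the top r = 2 topics
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2
U_r = U[:, :r]        # document-topic matrix, n x r -> (4, 2)
S_r = np.diag(s[:r])  # topic-strength diagonal matrix, r x r -> (2, 2)
V_r = Vt[:r, :].T     # word-topic matrix, m x r -> (6, 2)

# The rank-2 product approximates the original matrix
A_approx = U_r @ S_r @ V_r.T
print(np.round(A_approx, 2))
```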

3. Encode the words/documents with the derived topics

With the SVD operation, we are able to convert the document-term matrix into a document-topic matrix (U) and a word-topic matrix (V). These matrices allow us to find the words with the strongest association with each topic.

We can use this information to decide what each derived topic represents.

We can also determine which documents belong to which topic.
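Here is a self-contained sketch of both uses, based on scikit-learn’s TruncatedSVD; the corpus and the choice of 2 topics are purely illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Made-up corpus with two rough themes: food and music gear
corpus = [
    "potato soup is my favorite food to eat",
    "i eat potato stew and soup",
    "the guitar pedal shapes the sound",
    "this amp and pedal sound great",
]

vectorizer = CountVectorizer()
A = vectorizer.fit_transform(corpus)           # document-term matrix
words = vectorizer.get_feature_names_out()

svd = TruncatedSVD(n_components=2)             # extract r = 2 topics
doc_topic = svd.fit_transform(A)               # roughly U*S, shape (4, 2)
word_topic = svd.components_.T                 # word-topic matrix, shape (m, 2)

# Words with the strongest association to each topic
for t in range(2):
    top = np.argsort(np.abs(word_topic[:, t]))[::-1][:3]
    print(f"topic {t}:", [words[i] for i in top])

# Assign each document to its highest-magnitude topic
print(np.abs(doc_topic).argmax(axis=1))
```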

Limitations

LSA enables us to uncover the underlying topics in documents with speed and efficiency. That being said, it does have its own drawbacks.

For starters, some information loss is inevitable when conducting LSA.

When documents are converted into a document-term matrix, word order is completely neglected. Since word order plays a big role in the semantic value of words, omitting it leads to information loss during the topic modeling process.

Furthermore, LSA is unable to account for homonymy or polysemy. Since the technique assigns each word a single representation regardless of the context it appears in, it cannot distinguish between the multiple meanings of a word or separate those meanings by their use in the text.

It is also difficult to determine the optimal number of topics for a given set of documents. While there are several schools of thought with regard to finding the ideal number of topics to represent a collection of documents, there isn’t a sure-fire way to achieve this.

Finally, LSA can lack interpretability. Even after successfully extracting topics, each defined by a set of strongly associated words, it can be challenging to draw insights from them, since it is difficult to determine what topic each set of terms represents.

Case Study

Now that we have given a rundown of what LSA does, let’s see how we can implement it in Python.

This case study will primarily utilize the Gensim library, an open-source library that specializes in topic modeling.

We will use a dataset containing reviews of musical instruments and see how we can unearth the main topics from them. The dataset (copyright-free) can be obtained here [1].

Here is a preview of the data:

Code Output (Created By Author)

The first step is to convert these reviews into a document-term matrix.

For that, we will have to perform some preprocessing on the text. This entails lowercasing all the text, removing punctuation, stop words, and short words (i.e., words with fewer than 3 characters), and reducing every word to its base form with stemming.

All of this can be achieved with Gensim’s preprocess_string function, which turns a given text into a list of processed tokens.
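Applied to a made-up review string, it looks like this:

```python
from gensim.parsing.preprocessing import preprocess_string

# A made-up review string for illustration
review = "These strings are great! They stayed in tune and sound amazing."

# preprocess_string lowercases, strips punctuation and numbers, removes
# stop words and short words, and stems the remaining tokens by default
print(preprocess_string(review))
```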

Here’s a quick preview of the text after the preprocessing.

Code Output (Created By Author)

Now, we can convert these processed reviews into a document-term matrix with the bag of words model.
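A minimal sketch of this step with Gensim’s Dictionary and doc2bow, using two made-up reviews in place of the full dataset:

```python
from gensim.corpora import Dictionary
from gensim.parsing.preprocessing import preprocess_string

# Two made-up reviews standing in for the full dataset
reviews = [
    "Works great as a guitar cable, no noise at all.",
    "The pedal feels cheap but the sound is decent.",
]
processed_reviews = [preprocess_string(r) for r in reviews]

# Map each token to an integer id, then encode every review
# as a list of (token_id, count) pairs
dictionary = Dictionary(processed_reviews)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in processed_reviews]
print(bow_corpus[0])
```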

Next, we have to implement the truncated singular value decomposition on this matrix. In the Gensim library, we can use the LsiModel class to build a model that performs SVD on the given matrix.

However, before we can create the lower dimensional matrices, we need to determine the number of topics that should be extracted from these reviews.

One approach to finding the best number of topics is to use the coherence score metric. The coherence score essentially shows how semantically similar the words within each topic are, with a higher score corresponding to greater similarity.

Again, we can compute the coherence score with Gensim. Let’s see what it is for 2 to 10 topics.
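A sketch of that sweep, reusing the processed_reviews, dictionary, and bow_corpus names from the previous sketches:

```python
from gensim.models import CoherenceModel, LsiModel

# Assumes processed_reviews, dictionary, and bow_corpus from the steps above
for num_topics in range(2, 11):
    model = LsiModel(bow_corpus, num_topics=num_topics, id2word=dictionary)
    cm = CoherenceModel(model=model, texts=processed_reviews,
                        dictionary=dictionary, coherence="c_v")
    print(num_topics, round(cm.get_coherence(), 3))
```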

Code Output (Created By Author)

The coherence score is highest with 2 topics, so that is the number of topics we will extract when performing SVD.
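Fitting the final model is then a one-liner (again assuming the bow_corpus and dictionary built earlier):

```python
from gensim.models import LsiModel

# Fit an LSA model that extracts 2 topics from the bag-of-words corpus
lsa_model = LsiModel(bow_corpus, num_topics=2, id2word=dictionary)
```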

We are able to obtain 2 topics from the document-term matrix. As a result, we can see which words have the strongest association with each topic and infer what these topics represent.

Let’s look at the 5 words most strongly associated with each topic.
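One way to pull these out is with the model’s show_topics method, roughly like so:

```python
# Print the 5 highest-weighted words for each of the 2 topics
for topic_id, terms in lsa_model.show_topics(num_topics=2, num_words=5,
                                             formatted=False):
    print(topic_id, [(word, round(weight, 3)) for word, weight in terms])
```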

Code Output (Created By Author)

Based on the given words, topic 0 may represent reviews that address the sound or noise that is made when using the product, while topic 1 may represent reviews that address the pieces of equipment themselves.

Additionally, we can see what values the model assigns for every document and topic pairing.

As previously mentioned, documents usually contain multiple topics. However, some topics are more strongly associated with a given document than others. So, we can determine which topic a document belongs to by finding the one that registers the highest value by magnitude.

Let’s look at a sample review as an example.
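A sketch of how these per-document scores can be retrieved with the fitted model (index 0 is just an illustrative choice):

```python
# Pick one review's bag-of-words vector (index 0 for illustration)
sample_bow = bow_corpus[0]

# Applying the model with the bracket operator returns (topic_id, score) pairs
sample_scores = lsa_model[sample_bow]
print(sample_scores)

# Assign the review to the topic with the largest score by magnitude
assigned_topic = max(sample_scores, key=lambda pair: abs(pair[1]))[0]
print("assigned topic:", assigned_topic)
```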

Code Output (Created By Author)

The sample review registers a score of 0.88 and 0.22 for topics 0 and 1, respectively. Although both topics are present in the review, topic 0 has a higher value than topic 1, so we can assign this review to topic 0.

Let’s see a review belonging to each topic.

Code Output (Created By Author)

The sample text from topic 0 discusses the sound of the reviewer’s instrument after buying a tube screamer, whereas the sample text from topic 1 focuses more on the quality of the purchased pedal itself. This is in line with our interpretation of the two derived topics.

Conclusion

Photo by Prateek Katyal on Unsplash

You’ve now gained some insight into how one can find the underlying topics in a collection of documents using LSA.

While this technique is relatively quick and efficient, it should be used with caution given its limitations.

I wish you the best of luck in your NLP endeavors!

Reference

  1. McAuley, J. (n.d.). Amazon product data. Retrieved March 1, 2022, from http://jmcauley.ucsd.edu/data/amazon/
