
Supervised & Unsupervised Approach to Topic Modelling in Python

Build a Topic Modelling Pipeline from Scratch in Python

Image taken from Unsplash by v2osk

This article provides a high-level intuition behind topic modelling and its associated applications. It then does a deep dive into the different ways one can approach a problem which requires topic modelling, and how to solve it in both a supervised and an unsupervised manner. I place an emphasis on restructuring the data and the initial problem such that the solution can be executed in a variety of ways. The following table of contents breaks down the article.

Table of Contents

  • What is Topic Modelling?
  • Applications of Topic Modelling
  • Supervised vs Unsupervised Learning
  • Problem Breakdown
  • Requirements
  • Data
  • Load Data
  • Cleaning & Preprocessing
  • Data Statistics
  • Unsupervised Learning
  • Train Model
  • Visualization
  • Topic Analysis
  • Supervised Learning
  • Keyword Statistics
  • Generate Labels
  • Train Model
  • Evaluation
  • Concluding Remarks
  • Resources

What is Topic Modelling?

Topic modelling is a subfield of natural language processing (NLP) and text mining which aims to build models that parse bodies of text in order to identify the topics mapped to them. These models assist in identifying the big-picture topics associated with documents at scale. Topic modelling is a useful tool for understanding and organizing large collections of text data, and it can help organizations make sense of large amounts of unstructured data.

Applications of Topic Modelling

  • Document classification – Categorize documents into various topics
  • Social media analysis – Identify the major topics associated with posts by users on social media
  • Recommender systems – Recommend products to users based on the topics they’re interested in. A common application is customized advertisement recommendation based on the topics a user is interested in. For example, if a user is into cars, they might respond well to advertisements from car brands like Honda or Toyota.

Supervised vs Unsupervised Learning

There’s a clear distinction between supervised and unsupervised learning. Supervised learning trains models given labels mapped to the initial dataset, whereas unsupervised learning trains models with no labelled information present. Topic modelling is generally an unsupervised learning approach, but this article will cover both a supervised and an unsupervised learning approach to topic modelling.

The supervised learning approach will consist of binary classification. Binary classification maps the input data to exactly 2 targets, whereas multi-class classification maps the input data to more than 2 targets. A binary classification topic model indicates whether or not the input article maps to a topic we’ve labelled. A multi-class classification topic model identifies the topic an article is most likely to fall under given a set of topics. This article will showcase the implementation of the binary classification approach.

Problem Breakdown

The problem this article is aiming to solve is to identify the main topics associated with research papers given the summary of the paper. Based on the topics identified, the user can then infer whether this paper is of interest to them or not. We will be using the arXiv database to query and fetch several research papers across a variety of domains.

Requirements

The following are the modules and versions required to follow along with this tutorial. The version of Python in my environment is 3.10.0. If an error occurs during execution, check the versions of the modules you’re referencing, as version mismatches are a common problem when collaborating across platforms.

pandas>=1.3.5
numpy>=1.22.4
arxiv>=1.4.2
Unidecode>=1.3.6
nltk>=3.7
gensim>=4.2.0
wordcloud>=1.8.2.2
pyLDAvis==2.1.2

If you don’t have the arxiv package installed, [here](https://pypi.org/project/arxiv/) is the library documentation to install it through the command line. Similarly, you can install the gensim package in Python with the instructions here.

Data

Based on the arXiv terms of use for their API, it is completely free to use and its use is encouraged. For more information regarding their terms of use, please reference their documentation, which you can find here.

In this article, I will show you how to hit the API through Python to collect the following information necessary for the models we’re building today. If you want to hit this API through other programming languages, or just want more information on how to use the API, I highly encourage you to reference their documentation which you can find here.

Load Data

The code provided below will consist of importing the required modules, setting up constants to be used throughout the project and defining a function to query and load data from arXiv given a set of prompts.
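The original script isn’t reproduced here, but a minimal sketch of the query logic using the arxiv package might look like the following. The search prompts, the number of results and the column names are assumptions; adjust them to the domains you want to cover.

import arxiv
import pandas as pd

# Example prompts -- these queries are assumptions, swap in the domains you care about
QUERIES = ['machine learning', 'quantum computing', 'mathematics']
MAX_RESULTS = 500

def fetch_papers(query: str, max_results: int = MAX_RESULTS) -> pd.DataFrame:
    """Query arXiv and return a DataFrame with the title and summary of each paper."""
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.Relevance,
    )
    rows = [
        {'title': result.title, 'summary': result.summary, 'published': result.published}
        for result in search.results()
    ]
    return pd.DataFrame(rows)

df = pd.concat([fetch_papers(q) for q in QUERIES], ignore_index=True)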

That script should yield a resulting pandas DataFrame looking similar to the screenshot below:

Data queried from arXiv. Image provided by the author.

Cleaning & Preprocessing

Now we can clean and preprocess the summary associated with each article. The cleaning and preprocessing phase of working with text data is essential for optimizing the performance of the underlying models: the lower the quality of the data you feed into the model, the lower its performance will be in production settings. Furthermore, how much data you clean, preprocess and reduce will impact the training and inference time of the model, which improves both the experiments you run and the performance in production. Topic modelling algorithms rely on the frequency of words within a document to identify patterns and topics, so any irrelevant information passed in can skew the results.

The text preprocessing we will be doing will consist of the following:

  • Unidecoding the input data. This is critical when working with data in different languages, since it transliterates accented characters such as à into a, which matters during the cleaning phase.
  • Lowering the text such that all upper case characters are now lower case.

The text cleaning we will be doing will consist of the following:

  • Removing punctuations
  • Removing stop words

When removing stop words, be mindful of the data you’re working with. The reason to remove stop words is that they don’t provide any new information, and removing them aids in optimizing the performance of the model. The instances where you wouldn’t want to remove stop words are when the context around the sentence matters; keeping them is useful for things like sentiment analysis and summarization. For our use case of topic modelling, however, we can proceed with removing stop words.
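The cleaning code isn’t shown above, so here is a minimal sketch of what such a function could look like, assuming the raw text lives in a summary column. The exact regular expressions and column names are assumptions.

import re
import string
import nltk
from unidecode import unidecode
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
STOPWORDS = set(stopwords.words('english'))

def clean_summary(text: str) -> str:
    """Transliterate to ASCII, lowercase, strip punctuation and remove stop words."""
    text = unidecode(text).lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\s+', ' ', text).strip()
    return ' '.join(word for word in text.split() if word not in STOPWORDS)

df['cleaned_summary'] = df['summary'].apply(clean_summary)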

The output of the cleaning step should yield a new column called cleaned_summary . The resulting dataset should look similar to the image shown below.

Transformed initial dataset through cleaning and preprocessing of the summary column. Image provided by the author.

Data Statistics

Now let’s investigate the word count breakdown associated with the cleaned dataset and identify the underlying distribution associated with the word count.
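A quick way to generate this distribution is sketched below; it assumes the cleaned text is in the cleaned_summary column and that matplotlib (pulled in as a dependency of wordcloud) is available for plotting.

import matplotlib.pyplot as plt

word_counts = df['cleaned_summary'].str.split().str.len()

word_counts.hist(bins=30)
plt.xlabel('Word count')
plt.ylabel('Number of articles')
plt.title('Word count distribution of cleaned summaries')
plt.show()

# Fraction of articles with fewer than 100 words
print((word_counts < 100).mean())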

Word count distribution on the cleaned summaries associated with a sample of research papers from arXiv. Image provided by the author.

Based on this, it seems that out of the 1,548 articles we queried from arXiv, roughly 700 have fewer than 100 words, which corresponds to 45.1% of the data.

Unsupervised Learning

We will be using LDA as the topic modelling algorithm in Python for the unsupervised learning approach to identifying the topics of research papers. LDA is a common approach to topic modelling and is the same approach large organizations like AWS provide as a service through their Comprehend tool. This approach essentially outlines the kind of backend processing AWS would run to generate topics for documents in an unsupervised manner. At least this way, you won’t have to pay for it (aside from compute costs, which depend on the quantity of data you’re working with).

Train Model
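The training script isn’t embedded here, but a minimal gensim sketch would look something like the following. The number of topics, the number of passes and the filtering thresholds are assumptions you should tune for your own corpus.

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Tokenize the cleaned summaries
tokenized = [doc.split() for doc in df['cleaned_summary']]

# Build the dictionary and bag-of-words corpus, dropping very rare and very common terms
dictionary = Dictionary(tokenized)
dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# Train the LDA model -- 10 topics is an assumption matching the word clouds shown later
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    passes=10,
    random_state=42,
)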

Visualization

Now that we have the model object trained on the data we prepared, we can create a few visualizations which provide insight into the topics and keywords the model has identified for each article.

We’re going to be using the pyLDAvis library for the following visualization. Be aware that more recent versions of this library don’t support the visualization in Jupyter notebooks. I highly encourage installing the specific version 2.1.2, as outlined in the requirements section of this article. This thread on Stack Overflow highlights the difficulty of generating the visualization on other versions.
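A minimal sketch of the pyLDAvis call, assuming version 2.1.2 and the lda_model, corpus and dictionary objects from the training step:

import pyLDAvis
import pyLDAvis.gensim  # in newer releases this module is named pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)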

LDA topic visualized for top 30 most frequent terms per topic. Image provided by the author.
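The word cloud script isn’t reproduced above; a sketch of how it could be generated with the wordcloud package, one cloud per learned topic, is shown below.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

for topic_id in range(lda_model.num_topics):
    # show_topic returns (word, weight) pairs for the given topic
    frequencies = dict(lda_model.show_topic(topic_id, topn=30))
    cloud = WordCloud(background_color='white').generate_from_frequencies(frequencies)
    plt.figure()
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Topic {topic_id}')
    plt.show()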
Word cloud of the topics identified by the LDA model. Image provided by the author.

There are 10 word cloud images created by the script above, but only 2 are showcased in this article. As you can see, there is quite a bit of overlap between the topics (namely with terms like model and models). Based on this, further preprocessing and cleaning would be required, such as stemming each word and removing additional stop words like use, show, first, also, may, one, number, etc. The model development process is an iterative one, and this highlights the importance of feeding high quality data into the model.

Regardless, we can also see that the two topics identified by this approach are quite distinct. The first topic seems to centre on quantum computing and deep learning, whereas the second topic is centred on machine learning, automated machine learning and data.

Topic Analysis
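To reproduce the frequency chart below, you can count how often each topic is assigned to a document with a probability above 0.3. A minimal sketch, assuming the corpus and lda_model objects from earlier:

from collections import Counter

topic_counts = Counter()
for bow in corpus:
    # Keep only topics assigned to the document with probability greater than 0.3
    for topic_id, probability in lda_model.get_document_topics(bow, minimum_probability=0.3):
        topic_counts[topic_id] += 1

print(topic_counts.most_common())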

Frequency of learned topics with the threshold greater than 0.3. Image provided by the author.

It seems that a majority of the articles which the model was trained on fall under the first and fifth topics.

Term frequency of the top 30 words. Image provided by the author.

Based on this, it is clear that further data cleaning and preprocessing can be done. Since data, model and models are the most frequent terms, we wouldn’t want the model to be influenced by these words, as they’re not distinctive enough.

Top relevant keywords and document count associated with each topic. Image provided by the author.

As corroborated by the two images prior, the most popular topics being predicted are 5 and 0. Both topics use words like model and data, which should be removed in a further iteration. This was the first iteration of the model development process for this approach. The model which goes into production will never be the first model you train; it is imperative to take the results of models from previous iterations and use them to shape the changes required for models in future iterations.

Supervised Learning

The supervised learning approach to topic modelling will consist of generating topic labels to train binary classification models. This can be done by identifying the keywords associated with the topics we’re interested in labelling and predicting. I will mainly focus on three topics: machine learning, NLP (natural language processing) and mathematics.

This is the set of keywords I’ve identified per topic. By no means is this list exhaustive but it is good enough to begin with.

topics_dct = {
    'machinelearning': [
        'machinelearning', 'clustering', 'classification', 'regression',
        'supervised machine learning', 'unsupervised machine learning'
    ],
    'mathematics': [
        'mathematics', 'graph theory', 'combinatorics', 'calculus',
        'linear algebra', 'probability', 'statistics', 'trigonometry', 
        'topology', 'differential equations', 'differentiate', 'algebra'
    ],
    'nlp': [
        'natural language', 'topic modelling', 'sentiment analysis', 
        'translation', 'chat bot', 'text analysis', 'text mining', 
        'semantic analysis', 'summarization', 'linguistic processing', 
        'language recognition', 'text processing', 'language models', 
        'linguistic', 'sequencetosequence', 'neural machine translation', 
        'word embeddings', 'word2vec'
    ]
}

We can parse the cleaned summary associated with each article, identify which summaries contain any of the keywords we’re interested in, and link them back to the topic those keywords are mapped to. This provides us with a label for each of the topics above. We can then use TF-IDF to transform the input summaries into vectors to feed into the model.

Keyword Statistics
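The statistics below can be reproduced with a simple count over the cleaned summaries. This sketch assumes the topics_dct dictionary defined above and the cleaned_summary column.

# Number of articles that mention at least one keyword for each topic
keyword_article_counts = {
    topic: int(df['cleaned_summary'].apply(lambda s: any(kw in s for kw in keywords)).sum())
    for topic, keywords in topics_dct.items()
}

# Total number of occurrences of each individual keyword across all articles
keyword_frequencies = {
    kw: int(df['cleaned_summary'].str.count(kw).sum())
    for keywords in topics_dct.values()
    for kw in keywords
}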

Count of articles with the corresponding keyword counts. Image provided by the author.
Frequency of keyword occurrences throughout the articles. Image provided by the author.

Generate Labels
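Label generation can be done with a simple keyword match per topic, as sketched below; the label column names mirror the keys of topics_dct and are assumptions about how the original notebook names them.

# One binary label column per topic: 1 if any of the topic's keywords appear in the summary
for topic, keywords in topics_dct.items():
    df[topic] = df['cleaned_summary'].apply(
        lambda summary: int(any(kw in summary for kw in keywords))
    )

# Positive-label counts per topic
print(df[list(topics_dct)].sum())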

From the 1548 documents we are working with, given the keywords associated to the topics defined above, this is the count of documents which have a positive label for the corresponding topic. Image provided by the author.
The corresponding data frame after the label generation. Image provided by the author.

Train Model
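The training script isn’t embedded here; a minimal sketch follows, assuming scikit-learn is installed (it isn’t listed in the requirements above) and using a TF-IDF vectorizer feeding a gradient boosting classifier inside an sklearn pipeline, trained separately per topic label. The machinelearning label is used purely as an example.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

target = 'machinelearning'  # repeat for 'mathematics' and 'nlp'

X_train, X_test, y_train, y_test = train_test_split(
    df['cleaned_summary'],
    df[target],
    test_size=0.2,
    random_state=42,
    stratify=df[target],
)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', GradientBoostingClassifier()),
])
pipeline.fit(X_train, y_train)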

The script above will yield the following sklearn pipeline corresponding to the cleaned summaries and labels we’ve generated above. Image provided by the author.

Evaluation

Since we generated a holdout set during the training phase, we can now run the trained models against the holdout set to measure the performance of the model. Be advised that because we’re working with a small sample of data with a class imbalance, there is a high likelihood that the trained model is overfit. This can be resolved fairly easily (in our case) by increasing the number of articles we label and train the model on, which simply means querying arXiv for a larger dataset and generating better keywords for labelling the articles. This might not be an easy issue to resolve if you’re working with a different dataset.
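A minimal sketch of scoring the trained pipeline against the holdout set from the previous step:

from sklearn.metrics import classification_report

y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))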

I would also highly encourage you to try out multiple classification models and not just the gradient boosting classifier. As stated before, iteration is an integral part of the machine learning development cycle!

Concluding Remarks

This article aimed to be a tutorial providing the reader with both a supervised and an unsupervised learning approach to topic modelling. I hope I was able to outline how a change in mindset when looking at the underlying data can impact and broaden the approach to solving a particular problem.

I also hope that this article outlined the importance of iteration in machine learning. The model which goes into production will never be the first model you train; it is imperative to take the results of models from previous iterations and use them to shape the changes required for models in future iterations.

I hope that it is also clear that the results of the unsupervised learning approach can influence the supervised learning approach. It could also bring forth a semi-supervised learning approach to topic modelling where you train a binary classification model on the results of the LDA model.

If you want to download the jupyter notebook associated with this tutorial, I have provided it here.

Resources


If you enjoyed reading through the article I wrote today, here are a few others I’ve written around the topic of natural language processing which you might also enjoy!

Text Similarity w/ Levenshtein Distance in Python

Word2Vec Explained

Text Summarization in Python with Jaro-Winkler and PageRank

Identifying Tweet Sentiment in Python

