
Is LDA Topic Modeling Dead?

Overcome LDA’s Shortcomings with Embedded Topic Models

Dan Robinson
Towards Data Science
14 min read · Jul 6, 2022


The 2003 paper, Latent Dirichlet Allocation, established LDA as what is now probably the best known and most widely used algorithm for topic modeling (Blei et al. 2003). Yet despite its ubiquity and longevity, those experienced with LDA are familiar with its limitations. In addition to its instability, detailed below, LDA requires more than a little text pre-processing to obtain good results. Even putting aside the implementation details, LDA suffers from a more general problem that plagues topic modelers regardless of the algorithm they use — the lack of a ground truth upon which to evaluate their models.

The only reliable and consistent way to establish a ground truth for topic models is to empanel experts to create a common corpus topic vocabulary and then to have multiple annotators apply the vocabulary to the text, a time-consuming and expensive process. This 'by-hand' methodology, practiced in the social sciences for decades, is exactly the problem that unsupervised topic modeling seeks to address.

The practical limitations to obtaining an objective standard against which a model can be measured and evaluated have undoubtedly led many ML/AI practitioners to pass over topic modeling for other, less fraught endeavors. Yet despite its shortcomings, topic modeling in general, and LDA specifically, have proved sufficiently useful to retain their popularity. One paper addressing the issue pithily summarizes topic modeling's continued appeal in the face of its drawbacks, observing that

…although there is no guarantee that a ‘topic’ will correspond to a recognizable theme or event or discourse, they often do so in ways that other methods do not (Nguyen et al. 2020) (emphasis added).

For these authors, and many others, topic modeling has proved to be ‘good enough’ to warrant their continued attention.

This article explores the idea that a new technique, topic modeling with language embeddings, effectively addresses two of the most glaring issues encountered when using LDA. This new approach is detailed in the paper BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure (Grootendorst 2022). BERTopic is an end-to-end tool for producing topic models from embedding data. By default, it leverages HDBSCAN to identify topics contained within language embeddings. Although UMAP, which BERTopic uses by default to reduce the dimensionality of the language embeddings, is stochastic and subject to run-over-run variations, in my experience BERTopic produces much more stable and predictable topic groupings than LDA does. Secondly, because BERTopic topic models are distinct from the embedding data they summarize, it is possible to evaluate how well a particular run does or doesn't represent the underlying data structure. This feature effectively creates a model-by-model ground truth that can be used for evaluation and tuning. There is nothing comparable with LDA.

TL;DR

Use BERTopic, not LDA! LDA is a powerful tool for topic modeling but its instability is a major, often unacknowledged, stumbling block. BERTopic doesn't suffer from the instability problem. Importantly, this article seeks to demonstrate that word embeddings, used as the basis for topic modeling, can effectively create a ground truth upon which a given topic model can be evaluated and tuned. From a practical standpoint, BERTopic is also easier to use: it requires no text pre-processing and, as demonstrated below, is much less resource-intensive than LDA.

Disclaimers and Links

I have no official relationship to the BERTopic project (nor with LDA or EnsembleLDA, for that matter). As a companion to this article I've created a Tableau presentation that will allow readers to interactively explore the models created for the article. The data used in this article are publicly licensed (CC0) and can be found on Kaggle. I've pushed some technical details and notes into an appendix found at the end of the article.

Wobbly Ground

The case against LDA unfolds along two axes. The first is the algorithm's inherent instability. I have previously written about LDA topic model instability and the difficulties inherent in establishing an objectively correct number of target topics for a given LDA model. All the models discussed in this article were generated from the same corpus. LDA or EnsembleLDA was run against the corpus in three different configurations, each configuration producing a pair of models; each pair is evaluated to determine how well the two models agree with one another.

The first LDA run, using default parameters for the Gensim LDA implementation, resulted in the following heatmap, which compares the document/topic assignments for each model's topics:

Uncorrelated Topic Models. Image by author.

The orange cell represents the number of documents which occur in the first model's topic 2 and the second's topic 4 and registers a 62% overlap. This was by a large margin the best-correlated topic pair for this run. Despite the fact that each model rested on the same data and identical parameters, these models represent very different sets of topics. However, it is possible to get better results by increasing the amount of processing power we throw at the problem.
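For readers who want to reproduce this kind of comparison, here is a minimal sketch of the idea using Gensim's LdaModel and a pandas cross-tabulation. The toy documents, the twelve-topic setting, and the seed values are illustrative stand-ins, not the article's actual corpus or configuration:

```python
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy tokenized documents; a real run would use the full pre-processed corpus.
tokenized_docs = [
    ["striker", "scored", "late", "goal"],
    ["central", "bank", "raised", "interest", "rates"],
    ["striker", "missed", "penalty", "shootout"],
    ["markets", "fell", "after", "rates", "decision"],
]
dictionary = Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Two runs with identical (mostly default) parameters but different seeds.
lda_a = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=12, random_state=1)
lda_b = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=12, random_state=2)

def dominant_topic(lda, bow):
    # Highest-probability topic for a single bag-of-words document.
    return max(lda.get_document_topics(bow, minimum_probability=0.0), key=lambda t: t[1])[0]

topics_a = [dominant_topic(lda_a, bow) for bow in bow_corpus]
topics_b = [dominant_topic(lda_b, bow) for bow in bow_corpus]

# Cross-tabulating the two runs' assignments gives the kind of overlap
# matrix visualized in the heatmaps above.
print(pd.crosstab(pd.Series(topics_a, name="run_a"), pd.Series(topics_b, name="run_b")))
```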

EnsembleLDA explicitly deals with the LDA model instability issue by harmonizing multiple models into a single model instance. Using the same data and settings as above, we see a marked improvement in model correlation:

Improved model correlation. Image by author.

Yet this improvement comes at a cost. Running on a Colab+ instance, the first two poorly correlated runs each took only a matter of seconds to generate. The above improved run took more than three hours per model. If we throw more resources at the problem, increasing processing time to over nine hours per model, the results continue to improve:

Image by author.

But still, even though the final set of models is better correlated, the impact of LDA model instability is notable. In the first run, an arbitrary number of topics, twelve, was chosen and fed as a parameter to the model. The subsequent four models, however, were generated with EnsembleLDA. One compelling feature of EnsembleLDA is that the algorithm, unsupervised, will converge on an optimized number of topics. Yet while EnsembleLDA does a decent job of selecting the number of topics, we must note that in these examples there is still no agreement about how many topics there are: the EnsembleLDA models found eight, eleven, ten and nine topics respectively. So even though the later models show significantly reduced drift in their document/topic assignments, there is still no agreement on the number of topics in this dataset.
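Gensim ships an EnsembleLda implementation; the sketch below shows roughly how such a run is set up. The toy documents and parameter values (number of sub-models, passes, and so on) are illustrative assumptions, not the settings used for the article's models:

```python
from gensim.corpora import Dictionary
from gensim.models import EnsembleLda

# Toy corpus; a real run would use the full tokenized document set.
tokenized_docs = [
    ["striker", "scored", "late", "goal"],
    ["central", "bank", "raised", "interest", "rates"],
    ["striker", "missed", "penalty", "shootout"],
    ["markets", "fell", "after", "rates", "decision"],
]
dictionary = Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# EnsembleLda trains several LDA models and keeps only topics that recur
# reliably across them, converging on a topic count without supervision.
ensemble = EnsembleLda(corpus=bow_corpus, id2word=dictionary,
                       num_topics=12, num_models=8, passes=5)

stable_lda = ensemble.generate_gensim_representation()
if stable_lda is not None:
    # The number of stable topics the ensemble settled on.
    print(stable_lda.num_topics)
```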

What about other approaches to judging the correctness of LDA models? One of the most common is to use NPMI-based coherence scores (Lau et al. 2014). There are a number of different variations; here we use 'c_v', a common choice. The scores for the first two models are .291 and .295 respectively. These are poorer than those of the second set of models, which came in at .578 and .566. However, even though the last set of models shows noticeably better agreement between its topics than the penultimate pair, its scores, .551 and .549, are worse than the second set's. While NPMI-based metrics perform well in the lab, in my experience they often fall well short of their promise.
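For reference, a coherence score of this kind can be computed with Gensim's CoherenceModel; the sketch below uses 'c_v', with toy documents standing in for the real corpus:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Toy documents; in practice use the same tokenized corpus the model was trained on.
tokenized_docs = [
    ["striker", "scored", "late", "goal"],
    ["central", "bank", "raised", "interest", "rates"],
    ["striker", "missed", "penalty"],
]
dictionary = Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2)

# 'c_v' coherence needs the raw token lists, not just the bag-of-words corpus.
coherence = CoherenceModel(model=lda, texts=tokenized_docs,
                           dictionary=dictionary, coherence="c_v")
print(coherence.get_coherence())
```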

Using Visualizations to Understand the Models

Visualizing the documents in scatterplots provides an informative visual reference that helps in understanding the distribution of each of the thirty thousand documents within each model. Here is a 2D t-SNE reduction of one of the two best-correlated EnsembleLDA models:

2D t-SNE Projection of LDA output. Readers are encouraged to explore the plot with the interactive version. Image by author.

We can clearly see spatial separation between the topics. The interactive version of this visualization allows the user to zoom in and hover over each dot, which represents an individual document. In this way it is possible to get a sense of how each document's content led it to be categorized in a particular way.
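A sketch of the reduction behind plots like this one follows, assuming a document/topic weight matrix has already been pulled out of the LDA model (for a Gensim model, each row could be built from get_document_topics with minimum_probability=0); here random data stands in for the real weights:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for an (n_documents x n_topics) LDA document/topic weight matrix.
doc_topic = np.random.dirichlet(alpha=[0.1] * 10, size=500)

# Reduce each document's topic-weight vector to a 2D point for plotting.
xy = TSNE(n_components=2, init="pca", random_state=42).fit_transform(doc_topic)

# Color each point by the document's dominant topic in the scatterplot.
dominant_topic = doc_topic.argmax(axis=1)
```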

The BERTopic Alternative

I recommend reading up on BERTopic's architecture and approach in its documentation and in the articles and papers published by its author, Maarten Grootendorst, if you are unfamiliar with the package. In short, BERTopic employs HDBSCAN (or any other clustering mechanism you care to use) to determine topics within a corpus. It then uses a variant of TF-IDF, c-TF-IDF, which, instead of extracting meaningful vocabularies from individual documents, aggregates each topic's documents and extracts a vocabulary for the topic as a whole. While I consider c-TF-IDF to be a significant contribution in its own right, in this article I will focus on topic discovery through the HDBSCAN clustering of the BERT sentence embeddings and compare it to the LDA models above, without touching on the c-TF-IDF vocabulary discovery phase.
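A minimal sketch of a BERTopic run with an explicit HDBSCAN clusterer is shown below. The 20-newsgroups data is just a convenient stand-in corpus, and the HDBSCAN parameter values are illustrative rather than the ones tuned for this article:

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sklearn.datasets import fetch_20newsgroups

# Stand-in corpus; the article's models were built from a news-article dataset.
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

# Explicit clusterer; BERTopic uses HDBSCAN by default, but passing one in
# makes the clustering parameters visible and tunable.
hdbscan_model = HDBSCAN(min_cluster_size=15, min_samples=5,
                        metric="euclidean", prediction_data=True)

topic_model = BERTopic(hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

# Documents HDBSCAN cannot confidently place are assigned topic -1.
print(topic_model.get_topic_info())
```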

The issue of model instability is practically non-existent with BERTopic. Here is a heatmap comparing the document assignments from two different models created with the same corpus and using the same parameters:

Image by author.

By design, HDBSCAN will not attempt to categorize documents which fall below a threshold; those are assigned to category -1. Also, BERTopic re-numbers the topics from largest to smallest total document count, which accounts for the clean diagonal in the matrix above; this renumbering is a purely cosmetic feature of BERTopic.

It is notable that consecutive BERTopic runs will largely produce the same number of topics. While there is some drift and variation from model to model in terms of document assignment, compared to LDA the degree of instability is minimal. Lastly, these models were created in a matter of minutes using a Colab+ account with the GPU enabled, a huge practical difference compared with the resource-intensive LDA implementations, which required hours of compute time.

[Almost] Nothing Up My Sleeve

The careful reader will notice that the BERTopic models used above only produced six topics (in addition to the -1 'topic'). When working with this corpus I discovered that when HDBSCAN was run against the embeddings there was a 'natural' segmentation in which sports stories encompassed approximately one sixth of the data and were broken into five topic areas (interestingly organized as: Soccer/Rugby/Cricket, Race Cars, Golf, Tennis, Boxing, Swimming/Running/Olympics), while all the rest were organized into a super-cluster of non-sports related documents. As a result, I split the dataset into two parts, sports and not-sports. When BERTopic was run on these two segmented corpora, the sports segment retained the same internal organization as in the original corpus, dividing the sports into six segments. However, the 'non-sports' blob that was formerly a single category broke down into ten separate topics. My assumption about why HDBSCAN had trouble subdividing the non-sports super-cluster is that the inherent geometries of the data, when it was all in a single group, were simply beyond HDBSCAN's ability to segment the larger set into finer groupings while simultaneously dividing up the sports topics. When the corpus was broken into two separate parts, everything fell into place.

Because of this, and because there are no tools (yet) within BERTopic to perform this kind of operation, the topic groupings below are assembled from two different BERTopic models. The tools and techniques used to arrive at the correct HDBSCAN parameters are beyond the scope of this article.

A Ground Truth Embedded in… Embeddings?

Below is a 2D t-SNE projection of the BERT embeddings, overlaid with the final BERTopic topics and with the -1 documents removed:

Image by author.

At the upper left side of the image the large pink cluster is mostly soccer with rugby and cricket. The other sports: Race Cars, Golf, Tennis, Boxing, and Swimming/Running/Olympics, are in the five clusters immediately below and to the left of it. The Tableau presentation allows the user to freely traverse the dataset and see down to the document level to learn more about how BERTopic segmented this corpus.
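Scatterplots of this kind can be reproduced by computing the sentence embeddings yourself, handing them to BERTopic, and then reducing those same embeddings to 2D. The sentence-transformer model name, the stand-in corpus, and the plotting details below are illustrative assumptions, not the article's exact pipeline:

```python
import matplotlib.pyplot as plt
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.manifold import TSNE

# Stand-in corpus and a commonly used sentence-embedding model.
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(docs, show_progress_bar=False)

# Fit BERTopic on the precomputed embeddings so the same vectors can be reused.
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs, embeddings)

# Reduce the raw embeddings to 2D and color each document by its topic,
# dropping the -1 outlier documents as in the figure above.
xy = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
keep = [i for i, t in enumerate(topics) if t != -1]
plt.scatter(xy[keep, 0], xy[keep, 1],
            c=[topics[i] for i in keep], cmap="tab20", s=5)
plt.show()
```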

Other segmentations demonstrate BERTopic's ability to make fine-grained, valuable distinctions between topics, an ability that seems far outside LDA's grasp. One example of these subtle distinctions can be seen in the large central green cluster and the two separated green clusters to its left. These are all part of one topic, whose topic words are:

Dog, animals, animal, dogs, species, just, like, cat, zoo, time

The leftmost cluster seems to be mostly or entirely about dogs, the middle one about cats, and the large grouping (spatially organized from left to right) about exotic animals, wild animals, and marine life, with a grouping at the edges of this larger cluster having to do with archeological issues and a smattering of biological science documents thrown in. This kind of clear spatial/semantic organization can be found throughout the BERTopic model.

We can use the scatterplots to compare the two different models created by BERTopic and LDA. First we can overlay the LDA topics onto the BERTopic model:

Image by author.

The LDA topics are generally well correlated with the coordinate positions derived from the BERT embeddings. In other words, the smaller number of LDA topics more or less fits into the topic groupings that BERTopic generated. Using a heatmap of the same BERT-to-LDA mappings, we see another view of the same, reasonably well ordered data:

BERT topic/document distribution down the left compared to LDA distributions across the top. Image by author.

Topics 0 and 1 in the LDA model are the sports-related stories. We can see that LDA broke these documents into two clusters and BERTopic into six. Six other BERTopic topics roughly correspond to LDA topics, but the remaining four BERTopic clusters have no clear relationship to the LDA topics.

The reverse view, projecting the BERTopic topics onto the LDA coordinates, reveals a more chaotic picture:

Image by author.

In this case we can see that while some areas are broadly congruent, where the models disagree things are quite disorganized. For example, the pink area on the right with the two 'eyes' contains the sports stories. Both models did a good job of selecting out the soccer stories. However, the LDA model was unable to differentiate effectively between the different kinds of sports.

The flipped heatmap shows the level of confusion in different form:

LDA Topics down the left compared with BERT topics across the top. Image by author.

Only three LDA topics can be said to be correlated to any degree with their BERTopic equivalents.

So What?

The two models above, one created with BERTopic and the other with EnsembleLDA, are clearly different and it is hard to find much more than passing agreement about their respective document/topic assignments. Where do we turn to rationalize this disagreement? Topic modelers deal with this kind of uncertainty all the time. Since there is no ground-truth upon which to base an objective measure, modelers are left to arrive at the best informed subjective judgement that they can muster.

Yet I argue that embeddings-based topic models offer an alternative to resorting to purely subjective measures. With the above data, analysis of the scatterplots left me confident that the BERTopic model is more precise in its topic cluster definitions. Furthermore, I was unable to find a way to leverage the LDA scatterplots to discern anything meaningful about the organization of the underlying data, or about what could be done to improve the model. While each individual LDA scatterplot is a decent representation of that particular model, as far as I could tell it says nothing about the underlying semantic structure of the documents themselves. Over and over, the BERT-based scatterplots revealed surprisingly well organized and important information about the documents, expressed in rationalized spatial relationships. There is nothing comparable to be found in the LDA visualizations.

Hopefully this article has sparked an interest in topic modeling with embeddings in general, and BERTopic in particular. The simple practical facts presented here (that BERTopic requires no text pre-processing, the demonstrated issues of LDA model instability, and the dramatic difference in compute resources required between the two) are hopefully enough to pique the interest of both beginning and experienced topic modelers.

However, beyond these practical considerations is the intuition that the embeddings BERTopic uses for its topic model effectively establish a heretofore elusive ground truth for topic modeling. Based on my explorations so far, it seems reasonable to argue that the BERT embeddings create a firmer ground upon which to build topic models than LDA does.

When we view LDA output we are seeing the results of a mathematical process that has only the immediate corpus upon which to operate and from which to extrapolate semantic meaning. Each LDA model represents a small, closed universe of relationships. When using embeddings we are connecting our data to a much, much larger body of information that claims, at some level, to represent language itself. It seems to me that this larger relationship is visible when examining embeddings-based model data.

Appendix

The data used for these examples is a randomly selected 30,000-article subset of a larger publicly licensed dataset called News Articles (CC0 license).

As noted in the article, I stepped outside of BERTopic to arrive at optimized HDBSCAN parameters, which were then used with BERTopic to produce the output shown here. The method was to extract the embeddings (actually the UMAP reduction of the embeddings) from the BERTopic model and then to run a series of experiments that varied HDBSCAN's min_samples and min_cluster_size parameters. The output was judged on the number of topic clusters identified and the number of outlier (-1) document assignments. What I found with this dataset was that 'natural' numbers of topics emerged when randomly selected parameter values were tried over dozens of runs. With this data the numbers of topics clustered around 3 and 7, and then jumped to over 50. I chose parameters that produced these cluster configurations with the smallest number of outliers and then ran scatterplots. Based on these results I determined that it might make sense to break the corpus into two parts and re-run the tuning experiments. The final result is what is shown above and forms the data for the Tableau presentation. I hope to write up this technique in more detail in the future; those interested in the specifics are encouraged to contact me via my LinkedIn account found in my profile.
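A rough sketch of that tuning loop follows. The idea that the UMAP-reduced embeddings can be pulled from a fitted BERTopic model via topic_model.umap_model.embedding_ reflects my reading of the library's internals and should be treated as an assumption; here random data stands in for them, and the parameter ranges are illustrative:

```python
import numpy as np
from hdbscan import HDBSCAN

# Stand-in for the UMAP-reduced embedding matrix extracted from a fitted
# BERTopic model (roughly: reduced = topic_model.umap_model.embedding_).
reduced_embeddings = np.random.randn(1000, 5)

rng = np.random.default_rng(0)
results = []
for _ in range(40):  # dozens of randomly chosen parameter combinations
    min_cluster_size = int(rng.integers(20, 200))
    min_samples = int(rng.integers(5, 50))
    labels = HDBSCAN(min_cluster_size=min_cluster_size,
                     min_samples=min_samples).fit_predict(reduced_embeddings)
    n_topics = len(set(labels)) - (1 if -1 in labels else 0)
    n_outliers = int((labels == -1).sum())
    results.append((min_cluster_size, min_samples, n_topics, n_outliers))

# Look for 'natural' topic counts that recur across runs, then prefer the
# parameter pairs that reach those counts with the fewest -1 outliers.
results.sort(key=lambda r: (r[2], r[3]))
print(results[:5])
```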

Academics have acknowledged and addressed the lack of a ground truth as an obstacle for topic modeling. To read more about this issue I suggest:

Nguyen, D., Liakata, M., DeDeo, S., Eisenstein, J., Mimno, D., Tromble, R., & Winters, J. (2020). How We Do Things With Words: Analyzing Text as Social and Cultural Data. Frontiers in Artificial Intelligence, 3, 62.

O’Connor, B., Bamman, D., & Smith, N. A. (n.d.). Computational text analysis for social science: Model assumptions and complexity. Retrieved June 28, 2022, from https://people.cs.umass.edu/~wallach/workshops/nips2011css/papers/OConnor.pdf

Topic Modeling in the Humanities: An Overview. (n.d.). Retrieved June 28, 2022, from https://mith.umd.edu/news/topic-modeling-in-the-humanities-an-overview/

Bibliography

Referenced in the article:

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022. https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2203.05794

Lau, J. H., Newman, D., & Baldwin, T. (2014). Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 530–539.


I'm a thirty-year+ technology consultant who gave up tech for a while and am now edging back in. Connect with me at https://www.linkedin.com/in/daninberkeley/