Intuitive Guide to Correlated Topic Models

A neat extension to LDA which can be the backbone for even more advanced topic models

Theo Lebryk
Towards Data Science


Ever since David Blei, Andrew Ng, and Michael Jordan (no not the basketball player; no not the actor) came out with Latent Dirichlet Allocation (LDA) in 2003, topic modeling has been one of the most well-trafficked models in all of data science. Fun fact: that original LDA paper is the most cited paper from the Journal of Machine Learning Research.

Topic modeling, simply put, is finding out what a document is about. Unlike in supervised document classification, we typically don't know the topics in advance. Instead, we take an unsupervised approach and uncover latent topics in the documents. LDA is particularly useful for document modeling, classification, and even collaborative filtering.

In this post, I’m going to introduce an important, but somewhat overlooked, extension to LDA: the correlated topic model (CTM).

Motivating CTM: Limitations of LDA

We’re going to build from the ground up by first reviewing LDA, then showing how CTM tries to improve on it. There are a couple of great LDA guides already on Medium, which I highly recommend as a supplement to this brief overview.

LDA models documents as a distribution of topics and topics as a distribution of words. Our word generation process consists of selecting a topic from the document’s distribution of topics (a multinomial distribution we’ll call Θᵢ where i just means this Θ is specific to document i) and selecting a word from that topic’s distribution of words (another multinomial usually called β). These distributions are generated from Dirichlet distributions. For now, we’re only focused on the first of these Dirichlets that produces Θᵢ, the topic distributions for a document.
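
To make this generative story concrete, here's a minimal sketch of it in Python. The topic count, vocabulary size, and hyperparameters below are made up purely for illustration; in practice, LDA learns β and the Θs from data rather than sampling them from known priors.

```python
import numpy as np

rng = np.random.default_rng(0)

k, vocab_size, n_words = 3, 10, 8    # topics, vocabulary size, words in this document
alpha = np.full(k, 0.5)              # Dirichlet hyperparameter for topic proportions
eta = np.full(vocab_size, 0.1)       # Dirichlet hyperparameter for word distributions

# Each topic is a distribution over the vocabulary (a row of beta).
beta = rng.dirichlet(eta, size=k)            # shape (k, vocab_size)

# Generative process for a single document i:
theta_i = rng.dirichlet(alpha)               # topic proportions for document i
doc = []
for _ in range(n_words):
    z = rng.choice(k, p=theta_i)             # pick a topic from the document's topic mix
    w = rng.choice(vocab_size, p=beta[z])    # pick a word from that topic's word distribution
    doc.append(w)

print(theta_i, doc)   # topic proportions for the document, plus its 8 word ids
```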

A plate diagram for LDA. The number in the bottom right corner of a box means we repeat the operations inside it that many times. M is the number of documents in the corpus; N is the number of words in a document; k is the number of topics. z is the specific topic that Θ spits out, which dictates which β probability vector we use to select our final word w. a and b are hyperparameters for the respective Dirichlet distributions.

For those less familiar with the Dirichlet, think of it as the multivariate generalization of the beta. Instead of producing a single probability for a Bernoulli trial, it produces an entire probability vector over multiple classes, in our case the k topics in our model. In other words, the Dirichlet is the conjugate prior for the multinomial.
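
To see why conjugacy is so convenient, here's a tiny worked example (the counts are hypothetical): start with a Dirichlet prior over three topics, observe some multinomial topic counts, and the posterior is again a Dirichlet whose parameters are just the prior parameters plus the counts.

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet prior over 3 topics
counts = np.array([7, 2, 1])        # hypothetical topic counts observed in a document

# Conjugacy: Dirichlet prior + multinomial counts -> Dirichlet posterior,
# with the counts simply added to the prior's parameters.
posterior_alpha = alpha + counts    # Dirichlet(8, 3, 2)
posterior_mean = posterior_alpha / posterior_alpha.sum()
print(posterior_mean)               # roughly [0.62, 0.23, 0.15]
```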

I know this is a bit jargony and quick, but even if you don’t totally understand, there are two takeaways to note:

  1. The fact that the Dirichlet is the conjugate prior for the multinomial makes the math come out nicely when we actually start working with the data. We still need some approximation to get our final topics and their distributions, which we can do via variational inference or other techniques like collapsed Gibbs sampling.
  2. The Dirichlet produces topic probabilities that are independent-ish (in math terms: it produces a neutral vector). Recall that for the multinomial, if you drop one of the categories, the rest of the probabilities just renormalize around the remaining probability left to allocate; the remaining categories don't change relative sizes. The Dirichlet is designed to generate probabilities with this in mind, meaning we are unable to model correlated topics (see the quick sketch below).
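
You can check that second takeaway empirically. In the quick sketch below (hyperparameters picked arbitrarily), the off-diagonal entries of the sample correlation matrix of Dirichlet draws all come out negative: one topic's share can only grow at the expense of the others, so a Dirichlet can never express positively correlated topics.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 1.0, 1.0, 0.5])        # arbitrary hyperparameters for 4 topics

samples = rng.dirichlet(alpha, size=100_000)  # 100k topic-proportion vectors
corr = np.corrcoef(samples, rowvar=False)     # correlations between topic proportions

print(np.round(corr, 2))                      # every off-diagonal entry is negative
```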

What if we think topics are highly correlated? This isn’t too far-fetched: one latent topic might contain a bunch of words about food while another topic might contain words about health and fitness. These two topics are probably at least somewhat dependent. A document about health and fitness is more likely to also be about food than a randomly chosen document.

The CTM solution

We’re going to leave the rest of the generative model alone and just focus on how we generate Θs. In order to introduce correlations, we’re going to start with the multivariate normal distribution instead of the Dirichlet. If the Dirichlet is the multivariate generalization of the beta, the multivariate normal is just the multivariate generalization of the normal. Just as before, our goal is to generate k values (one for each topic) for our probability vector.

With the univariate normal, we need a mean and a standard deviation. With the multivariate normal, we need k means and k variances, and we're also going to factor in the covariances between the different topics. In total, we're feeding the multivariate normal two parameters: a length-k vector of means (μ) and a k×k covariance matrix (Σ).

At this point, our model will be spitting out a vector of k values of arbitrary size, which we'll call η. These numbers could be negative or they could be enormous. Ultimately, our goal is a probability vector: non-negative numbers that sum to one (a point on the simplex).

To map the results of our multivariate normal into probabilities, we're going to pass all the values through a variant of the logistic function. Specifically, for every output ηⱼ of the multivariate normal (where j corresponds to a topic), we apply the transformation:

Θⱼ = exp(ηⱼ) / Σₗ exp(ηₗ)

In other words, we exponentiate each value and divide by the sum of all the exponentiated values (you may recognize this as the softmax function).

The end result of passing the multivariate normal through this logistic transformation is called the logistic normal (or logit-normal) distribution. In sum, all we've done is swap the Dirichlet distribution, Θ ~ Dir(α), for a logistic normal distribution, Θ ~ f(N(μ, Σ)), where f is the logistic transformation and N is the multivariate normal.
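
Here's what that swap looks like as a minimal sketch in Python, with a made-up μ and Σ: draw η from a multivariate normal and push it through the logistic transformation to land on the simplex. Because Σ gives topics 0 and 1 a positive covariance, their proportions now tend to rise and fall together across documents, which is exactly what the Dirichlet couldn't express.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
mu = np.zeros(k)                       # means for the k topics (illustrative)
Sigma = np.array([[1.0, 0.8, 0.0],     # topics 0 and 1 are positively correlated
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

def logistic(eta):
    """Map an arbitrary real vector eta onto the simplex (the transformation above)."""
    e = np.exp(eta - eta.max())        # subtract the max for numerical stability
    return e / e.sum()

etas = rng.multivariate_normal(mu, Sigma, size=10_000)   # one eta per "document"
thetas = np.apply_along_axis(logistic, 1, etas)          # topic proportions per document

print(np.round(np.corrcoef(thetas, rowvar=False), 2))    # topics 0 and 1 correlate positively
```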

CTM plate diagram. “Log” represents the logistic transformation listed above.

This little switch has a couple important implications. Unfortunately, the multivariate normal, plus the logistic transformation at the end, doesn’t have the same neat properties as the Dirichlet.

Thus, we can no longer lean on conjugacy the way LDA's standard variational inference does; there are no neat closed-form updates. CTM instead uses a mean-field variational inference scheme with extra approximations, and the mechanics are… involved. Let's just leave it that we now have additional variational parameters which require iterative numerical methods to optimize. For our purposes, the main takeaway is that training might take a bit longer for CTM.

The payoff for this longer training is that we not only get better topics, we also get the added benefit of seeing the relationships between topics. In LDA, the researcher would have to go in after training and try to sort out correlations manually. With CTM, we need look no further than the model itself to observe correlations. Maybe books about leadership also use a lot of religious language and vice versa, which could show up as a correlation. Maybe we find superset-subset relationships through correlations (e.g. a topic with general sports words is probably correlated with a topic with basketball words).

CTMs are particularly helpful if:

  • We have a hunch in advance that topics are likely to be correlated.
  • We are modeling a lot of topics, in which case at least a few are bound to be correlated.

However, I should qualify that correlation is not some form of semantic distance between topics. We're merely observing that certain topics tend to appear, or not appear, together. Because CTM relies on parameter inference/estimation rather than a closed-form solution, it's best not to read too deeply into the precise correlations between topics. Still, when it comes to providing broad brushstrokes or motivating further inquiry during EDA, these sorts of insights can be pretty interesting.
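
With those caveats in mind, here's a minimal sketch of the kind of post-hoc inspection you might do. It assumes you already have a fitted covariance matrix Σ from some CTM implementation (the matrix below is made up); we just rescale the covariances among the ηs into a correlation matrix and list the most strongly correlated topic pairs.

```python
import numpy as np

# Hypothetical fitted covariance matrix Sigma from a trained CTM with 4 topics.
Sigma = np.array([[ 1.00,  0.45, -0.20,  0.05],
                  [ 0.45,  0.80, -0.10,  0.02],
                  [-0.20, -0.10,  1.20,  0.60],
                  [ 0.05,  0.02,  0.60,  0.90]])

# Rescale the covariance matrix into a correlation matrix.
std = np.sqrt(np.diag(Sigma))
corr = Sigma / np.outer(std, std)

# Rank topic pairs from most to least correlated (in absolute value).
k = len(corr)
pairs = [(i, j, corr[i, j]) for i in range(k) for j in range(i + 1, k)]
for i, j, c in sorted(pairs, key=lambda p: -abs(p[2])):
    print(f"topic {i} <-> topic {j}: correlation {c:+.2f}")
```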

Conclusion

To recap:

  • Topic modeling is an unsupervised method of finding the latent topics that a document is about.
  • The most common, well-known method of topic modeling is latent Dirichlet allocation (LDA). In LDA, we model documents as distributions of (independent-ish) topics, which are themselves distributions of words.
  • We can improve on this baseline if we allow correlations between topics.
  • This comes at the cost of longer training time.
  • However, it enables the researcher to examine relationships between topics.

To be honest, CTM isn't exactly an out-of-the-box darling of the ML community. However, CTM definitely has its uses (see CTM papers from the last three years on COVID-19 research, higher education research, and climate change news), and the intuition behind CTM is a helpful starting point for some pretty cool models.

Two pretty cool CTM extensions come to mind: the embedded topic model and the structural topic model.

For now, I’m linking to the original papers, but I’m hoping to put out more intuitive guides to embedded and structural topic models in the future. Stay tuned!

[1] D. Blei, A. Ng, and M. Jordan, Latent Dirichlet Allocation (2003), Journal of Machine Learning Research.

[2] D. Blei and J. Lafferty, Correlated Topic Models (2005), NIPS’05: Proceedings of the 18th International Conference on Neural Information Processing Systems.
