Introduction to The Structural Topic Model (STM)

A unique way to use topic modelling for social science research

Theo Lebryk
Towards Data Science


One of the coolest things about topic modelling is that it has applications in a variety of areas. It can help motivate inquiry, provide unique insights into texts, and give you new ways to organize documents.

The Structural Topic Model (STM) is a form of topic modelling designed specifically with social science research in mind. STM allows us to incorporate metadata into our model and uncover how different documents might talk about the same underlying topic using different word choices.

Motivating STM

Topic modelling refers to uncovering latent topics within a corpus of documents. The most famous topic model is probably Latent Dirichlet Allocation (LDA). LDA’s basic premise is to model documents as distributions over topics (topic prevalence) and topics as distributions over words (topic content). Check out this medium guide for some LDA basics.

LDA is great, but it does make some restrictive assumptions:

  1. Topics within a document are independent of one another. In English: knowing that document 1 contains latent topic 1 gives us no information about whether it also contains latent topic 2, 3, etc.
  2. The distribution of words within a topic (i.e. topic content) is stationary. In English: topic 1 uses the same distribution of words in document 1 as it does in documents 2, 3, etc.
  3. Topics can be modeled entirely based on the text of the document. In English: LDA only looks at the text of the document when determining topics, and doesn’t take any other information (author, date, source) into account.

To address the first assumption of independence, check out Correlated Topic Models (CTM). STM builds on CTM but adds a couple more features on top.

From a social science or humanities perspective, assumption two is somewhat suboptimal. Two documents could be about the same topic, say a protest, yet come at the topic from different angles. Perhaps one tends to highlight “police brutality” or “peaceful protesters” whereas the other uses terms like “law and order” and “radical rioters.” Both documents can be said to be about the same topic, but the topic content (i.e. the words that make up the topic) varies from document to document.

As for the third assumption, we might imagine that both the topic prevalence and the topic content for a specific document are correlated with “metadata” about the document. For instance, certain sources may be more likely to write about politics, or to write about politics in a particular way. Metadata can include date published, author, publication, likes on social media, or any number of categorical or numerical variables about a document.

The STM solution

In LDA, our topic prevalence and content came from Dirichlet distributions with hyperparameters we set in advance, often referred to as α and β. With STM, our topic prevalence and content come from document metadata. We’re going to call the matrix of document metadata used to generate topic prevalence “X” and the matrix of document metadata used to generate topic content “Y.” To keep things simple, however, we’ll say that X = Y (i.e. we’re using the same metadata in both cases) and that both are d×p matrices, where d is the number of documents in the corpus and p is the number of metadata features we will use.
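To make the metadata matrix concrete, here’s a minimal sketch in base R; the data frame and its columns (source, date) are hypothetical:

```r
# Hypothetical metadata for d = 4 documents
meta <- data.frame(
  source = c("OutletA", "OutletA", "OutletB", "OutletB"),
  date   = c(2019, 2020, 2020, 2021)
)

# model.matrix() expands the covariates into a numeric d x p design
# matrix (categorical variables become indicator columns); this is
# the role X (and, since X = Y here, Y) plays in the model
X <- model.matrix(~ source + date, data = meta)
```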

Topic prevalence

For topic prevalence, our goal is to get to a probability vector for each document, which we’ll call θᵢ. We need to go from Xᵢ, a 1×p metadata vector for a specific document, to θᵢ, a 1×k vector whose entries sum to one and essentially correspond to how much of that document is made up of a given topic. For our purposes, i just means this θ is specific to a given document i. (Apologies: subscript “i” would usually be subscript “d,” as in “document,” in the literature, but I cannot find a Unicode subscript “d” for the life of me.)

To get to θᵢ, we’re going to multiply Xᵢ by a p×k matrix of weights called “τ.” Where does τ come from? Well, we’re going to call τ’s columns “γ.” The default setting in the R STM package is γₚ,ₖ ~ N(0, σₖ²), where σₖ² ~ Half-Cauchy(1, 1).

Interpretation-wise, the Half-Cauchy(1, 1) prior means most parameters will start around zero (the mean of the normal distribution is zero, and the Half-Cauchy(1, 1) keeps the variance small). During the inference steps, we’ll learn the actual parameter values, but the prior will drag those values towards zero, meaning that only metadata that is highly correlated with a topic will end up being influential. The Half-Cauchy prior shrinks non-influential coefficients towards zero but does not actually zero out parameters. If you have a ton of metadata which might not be relevant to the topics of a document (for example, one-hot encoded features), we might want to induce sparsity. For these cases, the R package supports an L1 (elastic net) penalty on γ instead.
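Here’s a minimal sketch of how you’d switch priors in the R stm package; docs, vocab, and meta are assumed to come from stm’s prepDocuments, and the covariates source and date are hypothetical:

```r
library(stm)

# gamma.prior defaults to "Pooled" (the Normal/Half-Cauchy setup
# above); "L1" swaps in the sparsity-inducing elastic net penalty
# on the prevalence weights
fit <- stm(documents = docs, vocab = vocab, K = 20,
           prevalence = ~ source + s(date),
           data = meta, gamma.prior = "L1")
```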

Before we get too bogged down in STM’s implementation details, the most important thing is that we’ve done a linear transformation from a 1×p metadata vector into a 1×k vector roughly corresponding to topic prevalence.

We still don’t have a probability vector yet. The resulting 1×k vector from Xᵢτ will serve as the mean of the logistic normal distribution that will ultimately generate θᵢ:

θᵢ ~ LogisticNormal(Xᵢτ, Σ)

What this means is that we’re treating the result of Xᵢτ as the mean of a multivariate normal with covariance Σ. In fact, Σ is the k×k topic covariance matrix (see CTM for a more thorough explanation), which is how we break assumption one and incorporate correlations between topics into our model. We then transform the resulting vector from our multivariate normal (which is called η) into probabilities by passing it through the softmax (multinomial logistic) function.

From here on, it’s just regular LDA: we generate per-word topics (z) from a multinomial over these topic probabilities, i.e. zᵢ,ₙ ~ Multinomial(θᵢ) for every word n in document i. Now that we have the topic for each word in the document, we still need to figure out the probabilities of different words for a given topic, which we’ll explore in the next section.
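To summarize the prevalence side in one place, here’s the generative sketch in the notation above, writing the logistic transformation as an explicit softmax:

```latex
\eta_i \sim \mathcal{N}(X_i \tau,\ \Sigma)
\qquad
\theta_{i,j} = \frac{\exp(\eta_{i,j})}{\sum_{l=1}^{k} \exp(\eta_{i,l})},
\quad j = 1, \dots, k
\qquad
z_{i,n} \sim \mathrm{Multinomial}(\theta_i)
```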

Topic content

For the topic content of a document, we start with Yᵢ, our 1×p vector of metadata for a given document i. Our end goal is a k×V matrix called βᵢ, where V is the length of our vocabulary. In βᵢ, each row k is simply a probability vector, where each column value is the probability of that topic generating a given word. As the i subscript indicates, this β is specific to a single given document.

To get word probabilities for a single topic in a single document (βᵢᵏ), we’re going to start with a baseline word frequency vector we’ll call m. m is a 1×V vector that basically represents the global word frequencies: usually this means the log-transformed rate of any given word across the corpus, but it could also be set in advance according to some known word frequencies.

From there, we’re going to deviate from that baseline by adding some vectors we’ll call κ. Note that κ (kappa) is not the same as k. It’s confusing, I know, but remember that κ is a set of deviations away from the global word frequencies, while k is the number of topics.

First, we’re going to add a topic deviation from the baseline called κₖᵗ. The superscript “t” is just a reminder that this is our “topic” deviation, and the “k” subscript is a further reminder that this deviation is specific to a topic k. Slightly more precisely, κᵗ is a k×V matrix, and κₖᵗ means we are only looking at row k right now. In any case, at this point it’s basically like regular LDA, as κₖᵗ is the non-document-specific impact of a topic on word frequencies.

Second, we’re going to improve on LDA by adding a document deviation to the baseline, which we’ll call κᵧᵢᶜ (pretend that “γi” is actually “yᵢ”… Unicode bests me again). The “c” superscript simply signals that this is the “covariate” deviation (covariate being another word for metadata), and the γi/yᵢ indicates that this covariate comes from the specific document’s metadata. In other words, κᵧᵢᶜ is a deviation based on how the model thinks the metadata will affect our word frequencies.

Finally, for good measure, we’re going to add an interaction deviation between the topic and the covariates (aka metadata): κᵧᵢ,ₖ,ᵥⁱ. Nothing to write home about here; this last κ term merely covers our bases in case there’s an “interaction” (which the “i” superscript indicates) between the metadata and a given topic when it comes to impacting word frequencies.

Once we add all of that together, we just apply a softmax transformation, and we’ll have our word probabilities for a given topic within a given document:

βᵢ,ₖ,ᵥ ∝ exp(mᵥ + κₖ,ᵥᵗ + κᵧᵢ,ᵥᶜ + κᵧᵢ,ₖ,ᵥⁱ)

Full formula for βᵢ,ₖ,ᵥ. βᵢ,ₖ,ᵥ means: for a given document i and given topic k, what is the probability of a given word v.

Where κ comes from isn’t all that important, as we learn it over the course of posterior inference. While κ may seem a bit opaque, put simply, it is what gives us document-specific topic content. Once we’ve built an STM model, we can see the most probable words (i.e. the highest entries in βᵢᵏ) for the same topic k across different documents.
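The R stm package exposes this directly. A minimal sketch, assuming a fitted model fit that was estimated with a content covariate (e.g. content = ~ source):

```r
# For content-covariate models, sageLabels() shows the top words
# for each topic broken out by covariate level, i.e. how the same
# topic is talked about differently across groups
sageLabels(fit, n = 7)

# The more familiar per-topic word labels
labelTopics(fit, n = 7)
```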

With that, we’re basically done! Here’s the final plate diagram.

STM plate diagram. Recall that z is the topic our θ generates, and w is the word selected from that topic, based on β. M is the number of documents in the corpus; N is the number of words in a document. Boxes are kind of like loops, meaning we repeat the innermost process (z and w) for every word in every document.

And here’s the gross-looking posterior distribution:

Posterior distribution. Recall that η is essentially our topic prevalence before the logistic transformation that converts it into a probability.

Super Brief Inference Sidebar

What that big equation is saying is that in STM, we’re given the documents’ words (W), the documents’ metadata (X, Y), and the number of topics (k). We need to estimate the underlying topic prevalences (η), concrete topic assignments (z), covariate and interaction deviations from the baseline topic content (κ), the parameters that map from metadata to topic prevalence (γ), and the correlations between topics (Σ).

If you are interested in inference, STM’s algorithm of choice is “a fast variant of nonconjugate variational expectation-maximization (EM).” Broadly speaking, we first estimate the topic content and prevalence parameters (the expectation/E-step), then update all the parameters using a couple of different algorithms (the maximization/M-step). We’re testing how likely those parameters are given the data, updating the parameters based on our data, and iterating until the change in a parameter drops below some threshold (i.e. the model converges). This seems complicated, and it no doubt is tough computationally, but at a high level it boils down to regular Bayesian inference.

Two quick notes. First, I’ve included k as a hyperparameter we set in advance. The R STM package has a “searchK” function which will try to find the optimal k automatically. It’s not worth getting into how it determines the optimal k, and the posterior, as far as we’re concerned, treats k as a given, so we’ve included it here as a hyperparameter. The original authors, however, don’t include it in their posterior equation. Second, they don’t include m either. Recall that we can set m in advance or derive it from W. Either way, it isn’t something we’re estimating, but it does affect κ, so I’ve included it in our posterior.
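For reference, a minimal searchK call might look like this (again assuming prepared docs, vocab, and meta, with a hypothetical covariate source):

```r
# Fit candidate models at several values of k and collect
# diagnostics (held-out likelihood, semantic coherence,
# exclusivity, residuals) to compare them
k_search <- searchK(docs, vocab, K = c(5, 10, 15, 20),
                    prevalence = ~ source, data = meta)
plot(k_search)
```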

Conclusion

To recap: STM not only enables higher-quality models, but also provides insights into the corpus, such as how metadata affects the words a document uses within a topic. With that, you can hopefully understand the method behind papers about ideological polarization on climate change, what state judges tweet about, and when the pope talks politics.

You’re also hopefully ready to start playing around with STM in R (or, if R gives you headaches, you can use this GUI instead). I’d highly recommend taking a look at the STM homepage, which has links to the methods papers, more STM packages, and more published papers using STM.
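To get you started, an end-to-end workflow might look roughly like the sketch below; the data frame df and its columns (text, source, date) are hypothetical placeholders:

```r
library(stm)

# Clean and tokenize the raw text, carrying the metadata along
processed <- textProcessor(df$text, metadata = df)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Fit an STM with both prevalence and content covariates
fit <- stm(documents = out$documents, vocab = out$vocab, K = 20,
           prevalence = ~ source + s(date), content = ~ source,
           data = out$meta)

# Estimate how metadata relates to topic prevalence
effects <- estimateEffect(1:20 ~ source, fit, metadata = out$meta)
summary(effects)
```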

[1] M. Roberts, B. Stewart, D. Tingley, and E. Airoldi, The Structural Topic Model and Applied Social Science (2013), Prepared for the NIPS 2013 Workshop on Topic Models: Computation, Application, and Evaluation.

[2] M. Roberts, B. Stewart, and E. Airoldi, A Model of Text for Experimentation in the Social Sciences (2016), Journal of the American Statistical Association.

[3] D. Blei, A. Ng, and M. Jordan, Latent Dirichlet Allocation (2003), Journal of Machine Learning Research.

[4] D. Blei and J. Lafferty, Correlated Topic Models (2005), NIPS’05: Proceedings of the 18th International Conference on Neural Information Processing Systems.
