An Introduction to Topic-Noise Models

Learn how to use topic-noise models (1/3)

Rob Churchill
Towards Data Science

--

Words matter. And these days, it can be hard to cut through the noise to find the words that matter most. In this series of articles, we will introduce a new type of model, the topic-noise model, and show you how to use it on social media text data sets to generate more interpretable topics. Social media data comes in many forms: text, audio, images, and video, to name a few. In this article, when we use the phrase social media data, we are referring to the text data found in posts and profiles. This article explains how topic-noise models differ from traditional topic models. It then introduces the gdtm Python package and presents examples of how to use the original topic-noise models, Topic-Noise Discriminator (TND) and Noiseless Latent Dirichlet Allocation (NLDA). The second and third articles in this series introduce semi-supervised and temporal variants of topic-noise models and show how to use them effectively with social media data sets to build higher-quality topics.

Note: This series is intended to be a high-level explanation of topic-noise models and a demonstration of how to use them in practice. For all the nitty-gritty details, you can read the research paper here.

Introduction

In social media data sets, especially domain-specific data sets like those pertaining to Covid-19 or the 2020 Presidential Election, most traditional topic models struggle to effectively filter noise from topics. While it’s easy to preprocess away stopwords (common words like the “ifs, ands, and buts”) and other common types of noise, we have to rely on the underlying model to remove less obvious types of noise, like words that don’t belong in the domain (spam words) or words that are too common in the domain (flood words). Traditional topic models, like Latent Dirichlet Allocation (LDA) [Blei et al., 2003], have a tough time cutting through the noise inherent in social media data. As a result, the generated topics are often incoherent — in other words, difficult for humans to interpret.

Topic-noise models were invented to generate more coherent, interpretable topics than traditional topic models when using text from noisy domains like social media. Topic-Noise Discriminator (TND) [Churchill and Singh, 2021 (2)] is the original topic-noise model. TND jointly approximates the topic and noise distributions of a data set, allowing for more accurate noise removal and more coherent topics. It can be used as a standalone model to produce topics, or its noise distribution can be combined with the topic distributions of other models. We advocate the latter, because TND has been shown empirically to work best in an ensemble with more traditional topic models. In this series of articles, we will describe how we ensemble TND and LDA to create a model called Noiseless Latent Dirichlet Allocation (NLDA), and we will show you how to use it effectively.

What are topic models?

Before we describe a topic-noise model, we should all be on the same page about what a topic model is. Figure 1 shows the process of determining topics for documents at a very high level. A topic model (black box) takes a set of documents (far left), and returns a set of k topics (four in Figure 1) that summarize the documents (far right). A topic is a set of related words. In Figure 1, we see that our example data set consists of books from the fantasy genre. We say that the domain of the data set is fantasy books. For the rest of these articles, we will be dealing with domain-specific data sets, i.e. data sets consisting of documents/posts that all refer to the same broad subject area.

Figure 1. What is a topic model?

The topic model, represented by the black box, scans through the documents in turn, observing which words repeatedly appear together in the same context. When words appear together in the same document, their probabilities of being in the same topic increase. After observing thousands of documents (or more), the model converges to a set of topics, each assigning high probabilities to words that frequently appear together. For a more complete explanation of topic models, refer to The Evolution of Topic Modeling [Churchill and Singh, 2021 (3)].
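To make the black box concrete, here is a toy run of a traditional topic model using Gensim’s LdaModel. The four tiny “documents” and the choice of two topics are purely illustrative:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Four toy "documents" from a fantasy-book domain
docs = [
    ["wand", "spell", "wizard", "magic"],
    ["dragon", "troll", "goblin", "creature"],
    ["journey", "mountain", "road", "travel"],
    ["wand", "magic", "curse", "spell"],
]

dictionary = Dictionary(docs)                   # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# Print the highest-probability words for each topic
for topic_id, words in lda.show_topics(num_topics=2, num_words=4,
                                       formatted=False):
    print(topic_id, [w for w, _ in words])
```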

In Figure 1, the topic model returns four topics, about Magic, Evil, Travel, and Creatures, respectively. Normally, topics don’t come with titles like they do here; researchers need to determine these topic categories on their own. Most topic models also provide a probability for each word in the document collection. In Figure 1, we show the words that had the highest probability for each topic. For example, wand has the highest probability in the Magic topic. These topics can be interpreted by humans to get a better understanding of the data set as a whole, or they can be used to classify documents with their most probable topics, which we see on the far right of Figure 1. In this example, we see that Travel is the dominant topic for The Hobbit and The Lord of the Rings, but not for Harry Potter. In general, topic models are pretty good at identifying topics in traditional documents like books, newspaper articles, and research papers, because those documents have less noise.

What is wrong with topic models?

Social media documents (posts) are much shorter than books. So, when a topic model looks for words that repeatedly appear together within and across documents, it often cannot find enough repetition. Tweets are particularly short, and because of the ease of posting to Twitter, tweets are often poorly edited. Words simply cannot repeatedly appear together in a ten-word document the way they do in a book. Social media documents therefore present a unique challenge to topic models: there is often not enough repetition of words that belong in the same topic to distinguish between topic and noise words. Also, two posts about the same topic may not contain any of the same words at all. Consider two sentences, “It’s not rocket science, clean your fingers and palms to avoid catching coronavirus,” and “It’s just like magic! Covid can be contained by washing our hands!” Both are ostensibly about the same topic of Covid-19 cleanliness, but contain none of the same content words.

This problem manifests itself in topics that are noisy and difficult to interpret (incoherent). To demonstrate this, we created a fake data set consisting of imaginary tweets written from the perspective of Harry Potter characters about the Covid-19 pandemic (the domain here is the Covid-19 pandemic). An example of a document from this fake data set is, “Coronavirus outbreaks have led to the cancellation of the Hogwarts quidditch season.” After generating topics using these imaginary tweets (Figure 2), we can see how domain-specific noise infects topics. In Figure 2, each column is a topic.

Figure 2. Noisy topics.

First, we can see that three of the topics contain a variant of the word Covid-19. These words belong in the domain, but they are so general within the domain that they do not help delineate between topics. If every topic contains the same word, it becomes harder to tell the topics apart. Second, several words related to Harry Potter appear in the topics (werewolf, cloak, invisibility, and quidditch). While these words might be relevant topic words in the domain of fantasy books, they do not add to our understanding of the pandemic (unless we have an unusual social media data set).

Figure 3. Better topics.

Figure 3 shows a better version of the topics, where noise words have been filtered out and replaced with more meaningful topic words. In Figure 3, each topic is a column. The identification and removal of noise words within topics is the goal of topic-noise models. Another interesting observation about the topics in Figure 3 is that some of the words are actually parts of an expected phrase, or n-gram. For instance, social distance, wear mask, and virtual learning are all common phrases associated with the Covid-19 pandemic. These phrases and others like them can be incorporated directly into topic models, but it is not unusual to focus on unigrams due to the issue of sparsity in short documents. Two-word phrases tend to appear less frequently in a document collection than single words, so in a topic model that mixes unigrams and n-grams, the n-grams will naturally not rise to the top of the topic word list.
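If you do want to feed phrases into a topic model, one common approach is to mine frequent bigrams from the corpus before modeling. Here is a small sketch using Gensim’s Phrases, with min_count and threshold loosened so that bigrams fire on a tiny invented corpus (real data should use stricter settings):

```python
from gensim.models.phrases import Phrases, Phraser

docs = [
    ["wear", "mask", "social", "distance"],
    ["social", "distance", "save", "lives"],
    ["wear", "mask", "stop", "spread"],
]

# Thresholds loosened for this toy corpus; tune them on real data
bigram = Phraser(Phrases(docs, min_count=1, threshold=0.1))

# Frequent pairs are merged into single tokens like 'social_distance'
print([bigram[doc] for doc in docs])
```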

Let’s look back at the example tweets from our fake data set to see one last important facet of short documents.

“It’s not rocket science, clean your fingers and palms to avoid catching coronavirus.”

“It’s just like magic! Covid can be contained by washing our hands!”

Because social media documents are much shorter than traditional texts, they rarely contain all of the topics in the topic set. The second imaginary tweet contains words that refer mostly to the Covid-19 cleanliness topic, with one word (magic) that may refer to another topic. Given a larger set of documents, we might be able to find other words that frequently appear with magic and form a topic around it. Now let’s take a look at how the Topic-Noise Discriminator works to remove noise from topics.

Topic-Noise Discriminator (TND)

Instead of just generating topics from a set of documents, topic-noise models attempt to separate topic and noise words probabilistically through the generative process. Topic-Noise Discriminator (TND) does this by probabilistically assigning each word in each document to either a topic or the noise distribution, based on the word’s prior probability of being in each distribution. The result is a global noise distribution that approximates the likelihood of a word being a noise word, and a topic-word distribution that approximates the probability of each word belonging to each topic. Figure 4 shows how we think about the construction of TND.

Figure 4. A topic-noise model generates a noise distribution alongside the topic distribution.

When we go to finalize topics, we look at each topic individually and decide whether each word belongs in that topic or in the noise. We decide based on the word’s frequency in the topic and noise distributions. We can think of it as a scale, where one side is weighted by topic frequency and the other by noise frequency. Figure 5 shows the scale analogy. A higher frequency on one side means that we are more likely to assign the word to that side.

Figure 5. The Scales of TND.

We have described how the important new facets of Topic-Noise Discriminator work, but how do they come together to produce topics and a noise distribution? Let’s take a look at the high-level algorithm:

Given a data set D of documents and hyperparameters k, α, β₀, and β₁:

For each document d in D:

1. Probabilistically choose a topic zᵢ from the topic distribution of d

2. For each word w in d, assign w to zᵢ or the noise distribution H, based on the probabilities of w in zᵢ and H (Figure 5).

3. Re-approximate the topic distribution of d given the new topic-word assignments

Repeat the above for X iterations (usually 500 or 1,000 in practice)

The hyperparameter k defines the number of topics in the topic set. The hyperparameter α controls how many topics there are per document. A higher α setting allows for more topics in each document. In social media documents, we set α to a low value, because the size of documents rarely allows for multiple topics per document. The hyperparameter β₀ controls how many topics a word can be in. The higher we set β₀, the more topics a word can be spread over. The final hyperparameter, β₁, controls the skew of a word towards a topic (and away from the noise distribution). The higher we set β₁, the more likely any given word will be a topic word instead of a noise word. You can think of β₁ as placing an extra weight on the Topic Frequency side of the scales in Figure 5.
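To make the loop concrete, here is a heavily simplified Python sketch of the assignment step. It is illustrative only: the real Mallet implementation performs full Gibbs sampling, with the per-document topic draw governed by α and the topic smoothing governed by β₀; here those are collapsed into a uniform topic draw and a +1 smoothing term to keep the loop readable.

```python
import random
from collections import defaultdict

def tnd_sketch(docs, k, beta1, iterations=500):
    """Toy skeleton of TND's topic-vs-noise assignment loop (not gdtm code)."""
    topic_counts = [defaultdict(int) for _ in range(k)]  # word counts per topic
    noise_counts = defaultdict(int)                      # global noise counts
    assignments = {}  # (doc_index, position) -> topic index, or None for noise

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's previous assignment before resampling
                prev = assignments.get((d, i), 'unset')
                if prev is None:
                    noise_counts[w] -= 1
                elif prev != 'unset':
                    topic_counts[prev][w] -= 1

                z = random.randrange(k)  # stand-in for the per-document topic draw

                # The scales of Figure 5: beta1 adds weight to the topic side
                p_topic = (topic_counts[z][w] + beta1) / (
                    topic_counts[z][w] + beta1 + noise_counts[w] + 1.0)
                if random.random() < p_topic:
                    topic_counts[z][w] += 1
                    assignments[(d, i)] = z
                else:
                    noise_counts[w] += 1
                    assignments[(d, i)] = None

    return topic_counts, noise_counts
```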

Over a large number of iterations, each word will have a probability of being a topic word and a probability of being a noise word. We can then take a look at the topics and noise distribution to get a better understanding of our data.

Figure 6 depicts how we can look at a few social media documents once we know the topic and noise distributions. We can see in Figure 6 how the words in each document affect the topic distribution for that document. The first document consists of a mixture of the green and blue topics, with some noise (purple), whereas the second consists of the yellow topic and some noise. If the noise words were in topics instead of their own distribution, we might classify these two documents as consisting of the same generic topic, resulting in less interpretable topics and document classifications.

Figure 6. The interaction between topics and noise in social media documents.

Noiseless Latent Dirichlet Allocation (NLDA)

While TND identifies highly coherent topics, qualitatively its topics are not always as intuitive as those generated using other topic models. Therefore, we suggest combining the noise information from TND with traditional topic models. In this way, TND provides accurate noise removal while maintaining the topic quality and performance that people expect from state-of-the-art topic models on more traditional document collections. We created Noiseless Latent Dirichlet Allocation (NLDA), an ensemble of TND and LDA, to demonstrate the effectiveness of this approach. We take the noise distribution of TND and the topic distribution of LDA, and combine them to create more coherent, interpretable topics than we would get from either model alone (Figure 7).

Figure 7. The NLDA ensemble.

We combine TND and LDA in much the same way that we choose whether a word belongs to the noise distribution or to a topic within TND itself. In NLDA, we compare the frequency of a word in TND’s noise distribution to the frequency of the word in LDA’s topic distribution, using the same scales approach shown in Figure 5. A scaling parameter φ allows the noise and topic distributions to be compared even if they were not generated using the same number of topics, k, by scaling the underlying frequencies to comparable values.
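A sketch of that filtering step in Python, assuming we already have LDA’s topic-word frequencies and TND’s noise frequencies. The function and argument names are hypothetical, and the decision is shown as a deterministic cutoff for readability, whereas the actual model makes the call probabilistically, as in Figure 5:

```python
def nlda_filter(lda_topics, noise_freq, phi, top_words=10):
    """Sketch of NLDA's ensembling step (hypothetical helper, not gdtm code).

    lda_topics: list of topics, each a list of (word, topic_frequency)
                pairs from LDA, sorted by probability
    noise_freq: word -> frequency in TND's noise distribution
    phi:        scaling factor that makes the two distributions comparable
                when they were built with different numbers of topics
    """
    cleaned = []
    for topic in lda_topics:
        kept = [word for word, freq in topic
                # Keep the word if the topic side of the scale wins
                if freq > phi * noise_freq.get(word, 0)]
        cleaned.append(kept[:top_words])
    return cleaned
```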

Topic-Noise Models in Action

Now that we have a better understanding of what topic models are, why they need to be improved, and how topic-noise models account for the noisy nature of social media data, it is time to put them to use. In this section, we will introduce the Georgetown DataLab Topic Model package (gdtm), and get LDA, TND, and NLDA up and running!

Data Sets

We think it’s important to show how models perform on different data sets. If a model only works on the example data set, what’s the point? For the rest of this article, we are going to leave the magical world of Harry Potter-based data sets behind, and focus on two real-world data sets: one about the Covid-19 pandemic, and another about the 2020 United States Presidential election. These data sets were collected using Twitter’s API between March 2020 and March 2021, and between January 2020 and November 2020, respectively, using keywords related to their respective domains. They contain 1 million and 1.2 million documents, respectively, which is large by topic modeling standards. We applied the following preprocessing to the Covid-19 and Election data sets: tokenization, URL removal, punctuation removal, lowercasing, and stopword removal. For more details about preprocessing for social media, we refer you to textPrep [Churchill and Singh, 2021 (5)].
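As a rough stand-in for that pipeline (textPrep provides a proper, configurable version), the steps look something like this. The stopword list here is a toy one; real lists are much longer:

```python
import re
import string

# Toy stopword list for illustration; use a full list in practice
STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'but', 'is', 'it', 'to', 'of'}

def preprocess(tweet):
    tweet = re.sub(r'https?://\S+', '', tweet)                      # URL removal
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))  # punctuation removal
    tokens = tweet.lower().split()                                  # lowercase + tokenize
    return [t for t in tokens if t not in STOPWORDS]                # stopword removal

print(preprocess("It's just like magic! Covid can be contained by washing our hands!"))
```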

Due to Twitter’s privacy policies, these two data sets are not publicly available. For your own experiments, we provide a small sample data set of 196 preprocessed tweets taken from the Election data set above (with metadata removed to abide by Twitter’s policies), as well as a link to a larger public domain data set from Kaggle that covers the 2020 election over a slightly different time period.

Setting up

First, let’s make sure we have the right code. Topic-noise models are implemented in Java, building on the Mallet [McCallum, 2002] implementation of LDA [Blei et al., 2003], which we think is the best topic model implementation available. For simplicity, we built Python wrappers, modeled on the old Mallet LDA wrapper from Gensim [Řehůřek and Sojka, 2010].

Note: gdtm works in Python 3.6 and up. We have not tested it on older versions of Python. This tutorial assumes that you are using macOS or Linux.

Navigate to your working directory in your terminal, enable whichever virtual environment you plan on using, and pip install the gdtm package. You can find everything you need to know about gdtm in its documentation.

> pip install gdtm

Once you have the Python package installed, all you need is the Mallet (Java) implementation of whichever topic-noise model you are going to use. You can find an implementation of TND in the Topic-Noise Models Source repository. Download the mallet-tnd folder from the repository and note its path, wherever it ends up on your computer (path/to/tnd). Mallet LDA can be found here, but we also provide a stripped-down version in our source repository, in the mallet-lda folder. Download it from either spot, and note its path on your computer (path/to/lda). Now that we have all of the code that we need, we can start playing with topic-noise models!

Loading Data Sets

Choose and download your data set for experimenting. You can use one of your own, or one of the two options provided in the Data Sets subsection above.

Note: Topic-noise models are best used on data sets of tens or hundreds of thousands of tweets or other social media posts. The training of the noise distribution is accomplished using a randomized algorithm. With smaller data sets, TND is not always able to learn an accurate noise distribution, so don’t expect to see great results with the sample data set! If you want to play with a larger data set to see the model’s true effect, we suggest using the Kaggle data set described above. The US Election 2020 Kaggle data set contains 1.72 million tweets about the election, collected using the Twitter API between October 15, 2020 and November 8, 2020. It was released to the public domain, meaning you are free to use it for whatever purpose you wish. You will need to do some preprocessing and data wrangling before feeding the larger data set into a model. We suggest starting with a subset of a couple hundred thousand tweets.

Data sets can be loaded in whatever way you find convenient, but the final data structure to be passed into the model should consist of a list of documents, where each document is itself a list of words.

You can load the sample data set using a built-in function from gdtm:

Figure 8. Loading data using the built-in function
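In code, the call looks roughly like this. The helper name and import path below follow the gdtm documentation; double-check them against the version you install:

```python
from gdtm.helpers.common import load_flat_dataset

# Path to the sample data set; adjust to wherever you saved it
path = 'data/sample_tweets.csv'

# Each line of the file is one document; words are space-delimited,
# so the loaded data set is a list of lists of words
dataset = load_flat_dataset(path, delimiter=' ')
```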

The sample data set is a space-delimited CSV file. If you make your own files of this type, be careful not to accidentally include extra spaces. As you can see in Figure 8, the loading function accepts a delimiter argument, so you may use whichever delimiter you prefer in your own data sets.

Running TND and NLDA

Now that we have our data loaded into the proper format, we can run our models and generate topics in just a few lines of code. In the gist below (Figure 9), we specify the path to the Java implementation, instantiate and run the model, and then retrieve our topics and noise. All we have to worry about is choosing the right parameters for our data.

Note: for domain-specific social media data, k should always be at least 20, and beta1 should be between 9 and 25. We like to use k = 30 and beta1 = 25 for Twitter, and beta1 = 9 or 16 for Reddit comments. We increase beta1 when data sets are noisier to try to keep more words in topics that truly belong there. For a full explanation of this parameter, see the research paper (Section IV.B.).

Figure 9. Running TND on our sample data set.
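A sketch of that gist, with keyword arguments named after the parameters discussed above; check the gdtm documentation for the exact signature:

```python
from gdtm.models import TND
from gdtm.helpers.common import load_flat_dataset

dataset = load_flat_dataset('data/sample_tweets.csv', delimiter=' ')
tnd_path = 'path/to/tnd/'  # the mallet-tnd folder downloaded earlier

# k = 30 and beta1 = 25, matching the note above for Twitter data
model = TND(dataset=dataset, mallet_tnd_path=tnd_path,
            k=30, beta1=25, noise_words_max=200, top_words=20)

topics = model.get_topics()
noise = model.get_noise_distribution()
```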

Running NLDA is similar to running TND, except this time we are running an ensemble of two models, TND and LDA. Thankfully, we don’t have to worry about doing any of the ensembling ourselves; gdtm takes care of it for us. We just provide the paths to TND and LDA, and specify our parameters. Retrieving the final topics and noise works the same way.

Figure 10. Running NLDA on our sample data set
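A sketch of the NLDA call, again with keyword names mirroring the parameters discussed in the text (tnd_noise_words_max, top_words, and the two Mallet paths); consult the gdtm docs for the authoritative list:

```python
from gdtm.models import NLDA
from gdtm.helpers.common import load_flat_dataset

dataset = load_flat_dataset('data/sample_tweets.csv', delimiter=' ')
tnd_path = 'path/to/tnd/'  # the mallet-tnd folder
lda_path = 'path/to/lda/'  # the mallet-lda folder

# Same parameters as before, plus the path to LDA; gdtm handles the ensembling
model = NLDA(dataset=dataset, mallet_tnd_path=tnd_path,
             mallet_lda_path=lda_path, k=30, beta1=25,
             tnd_noise_words_max=200, top_words=20)

topics = model.get_topics()
noise = model.get_noise_distribution()
```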

Saving the Results

Now, what do we actually get when we call get_topics() and get_noise_distribution()?

As we might expect, get_topics() returns the most probable words for each topic. The top_words parameter that we passed into NLDA and TND decides how many words are returned for each topic.

The get_noise_distribution() function is similar. It takes the most probable noise words in the noise distribution and returns them in a list, ordered from most to least probable. Each entry is a tuple of the word and its frequency in the noise distribution: (word, frequency). The noise_words_max parameter from TND (or tnd_noise_words_max from NLDA) dictates how many words are returned from the distribution.

The top_words and noise_words_max parameters exist because the topic-word and noise distributions are distributions over the entire vocabulary. These parameters are purely for convenience, to save a bit of time when analyzing the results.
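Concretely, the two return values can be inspected like this, continuing from the snippets above (structures as just described):

```python
# topics: a list of k word lists, each capped at top_words entries
for i, topic in enumerate(topics):
    print(i, topic[:5])

# noise: (word, frequency) tuples, ordered from most to least probable,
# capped at noise_words_max entries
for word, frequency in noise[:5]:
    print(word, frequency)
```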

We can save the topics to a CSV easily using gdtm.

Figure 11. Saving topics is simple with gdtm.
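The gist in Figure 11 uses gdtm’s helper for this; if you prefer not to depend on the helper’s exact name, a few lines of standard-library Python produce the same one-topic-per-row CSV layout:

```python
import csv

# topics as returned by get_topics() above
with open('election_topics.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for topic in topics:
        writer.writerow(topic)  # one topic per row, as in Figure 12
```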

Interpreting the Results

Now that we have saved our topics, let’s finally look at them. The topics below are from the full 2020 Election data set that the sample tweets were taken from. In the CSV, each row is a topic (topics don’t come with titles; it’s up to users to decide what to call them).

Figure 12. Topics discovered using NLDA on a large corpus of tweets about the 2020 Election. Each row is a topic.

For the sake of simplicity and space, we show the top ten words of the first seven topics in the topic set. We can see that most topics are reasonably coherent. Topics include Gun Violence (row 1), Political Parties (row 2), a general Twitter topic (row 3), Covid-19 (row 4), Pro-Biden Phrases and Hashtags (row 5), Democratic Voters (row 6), and Pro-Trump Phrases and Hashtags (row 7).

There are many ways to judge topic models, and they do not always agree with each other. Some methods, like perplexity and topic coherence, rely on pure math to decide whether topics are well-formed, while others rely on human judgement. In our experience, topic coherence and human judgement-based methods are far better indicators of topic quality. While these methods are not covered in this article, you can find more details in The Evolution of Topic Modeling [Churchill and Singh, 2021 (3)].

These topics are certainly not perfect or all-encompassing. However, most are highly interpretable by humans, and trends in topics are easy to identify. In row three, the topic containing general Twitter words also contains noise words like lets, youre, and hey. These words are too general to count as topic words, but they leak into the most general topic in the topic set. More specific topics like Covid-19 and Gun Violence contain very little noise, if any. We have the noise-filtering properties of topic-noise models to thank for this interpretability and coherence.

Comparing Topic-Noise Models to Traditional Topic Models

To wrap this article up, we will look at a couple of examples of topics generated on the Covid-19 and Election data sets using LDA, TND, and NLDA. If you recall, these are the large Twitter data sets that we collected over different periods of 2020 and 2021. We like to show the performance of models on different data sets to demonstrate that these models are consistent across domains.

The first topic, shown in Figure 13, is from the Covid-19 data set. The topic is about Masks and Social Distancing, and was generated on a data set of one million tweets about the Covid-19 pandemic. The green words are words that we believe belong in the topic.

Note: Topics are subjective. What we see as good and bad topic words, and what you see as good and bad topic words, will differ based on our respective experience within the domain and our perspectives. Feel free to disagree with our analysis of the topics below.

Figure 13. Topic as it was found by LDA, TND, and NLDA. Domain: Covid-19

As we can see, LDA includes words like China, city, and Wuhan, each of which likely belongs in some topic in the domain, but not in this one. TND contains words like fight, lets, and message; these might belong in this topic, but we felt they were a bit too general to be good topic words in this context. NLDA’s topic, by contrast, contains words that are a more intuitive fit.

Finally, let’s look at a topic generated by LDA, TND, and NLDA on the Election data set. Figure 14 shows the topic, about Mail-In Voting. This topic was especially relevant to this election because of the pandemic measures in place in many states. As we can see, LDA and TND both have reasonable words, but for multiple topics combined. Many of these words are common within the domain, but do not pertain specifically to Mail-in Voting, like maga, trump2020, kag, republican, and political. However, NLDA again gives us a far more interpretable, coherent topic. While some words are noise, we see many words that reference mail-in voting systems and the drama related to its implementation in different states (Florida was one of the states in which mail-in voting was most hotly debated). In general, most of the words are connected to a single topic.

Figure 14. Topic as it was found by LDA, TND, and NLDA. Domain: Election 2020

This brief assessment of topics for two different domains highlights that different, distinct sets of words can represent the same topic. There is not one right answer. Therefore, it is important to assess the quality of topics both quantitatively and qualitatively.

Conclusion

In this article, we learned how topic models work, how topic-noise models work, and how they improve the quality of topics we derive from social media data. We learned how to set up a coding environment for topic-noise models with the gdtm package, how to run the models, and how to save and interpret the results. We showed the quality of topics that a topic-noise model can generate, and compared topics generated by NLDA and TND to those generated by LDA.

In the next article, we will look at a semi-supervised topic-noise model that allows the user to guide the model to an even better set of topics by leveraging the prior domain knowledge of the user.

This article was co-authored by Lisa Singh, Professor of Computer Science and Director of the Massive Data Institute at Georgetown University.

All images unless otherwise noted are by the author.

References

[1] D. Blei, A. Ng, and M. Jordan, Latent Dirichlet Allocation (2003), Journal of Machine Learning Research 3, 993–1022.

[2] R. Churchill and L. Singh, Topic-Noise Models: Modeling Topic and Noise Distributions in Social Media Post Collections (2021), International Conference on Data Mining (ICDM), 71–80.

[3] R. Churchill and L. Singh, The Evolution of Topic Modeling (2021), ACM Computing Surveys (CSUR).

[4] A.K. McCallum, MALLET: A Machine Learning for Language Toolkit (2002), http://mallet.cs.umass.edu.

[5] R. Churchill and L. Singh, textPrep: A Text Preprocessing Toolkit for Topic Modeling on Social Media Data (2021), International Conference on Data Science, Technology and Applications (DATA).

[6] R. Řehůřek and P. Sojka, Software Framework for Topic Modelling with Large Corpora (2010), LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50.
