
Clustering Contextual Embeddings for Topic Modelling

Extract topics by clustering contextual sentence embeddings


Photo by Luisa Denu on Unsplash

TL;DR

We conduct extensive experiments comparing clustering-based topic models and conventional neural topic models (NTMs), showing an alternative way to extract topics from documents.

Check out our NAACL 2022 paper 📄 "Is Neural Topic Modelling Better than Clustering? An Empirical Study on Clustering with Contextual Embeddings for Topics" and the official GitHub repository for more details.


Paradigm 1: Conventional Topic Modelling

Topic modelling is an unsupervised method for extracting semantic themes from documents. From the traditional Latent Dirichlet Allocation (LDA)¹ model to neural topic models (NTMs), topic modelling has advanced significantly. However, these topic models typically rely on Bag-of-Words (BoW) document representations, which limits their performance.

Later, contextualized word and sentence embeddings produced by various pre-trained language models such as BERT² emerged and have shown state-of-the-art results across many Natural Language Processing (NLP) tasks. Recent works such as CombinedTM³ and ZeroShotTM⁴ incorporate these contextualized embeddings into NTMs and achieve better modelling performance than conventional NTMs.

Despite the promising results, such NTMs suffer from computational overhead, and the current integration of pre-trained embeddings into the NTM architecture is naive.

With high-quality contextualized document representations, do we really need sophisticated NTMs to obtain coherent and interpretable topics?

Paradigm 2: Clustering-based Topic Modelling

Instead, we explore an alternative way to model topics from documents: a simple clustering framework with contextualized embeddings, as shown below.

The architecture of our method. Image by author.

We first encode the pre-processed documents into contextualized embeddings using pre-trained language models. We then reduce the dimensionality of the embeddings (e.g., with UMAP) before applying a clustering method (e.g., K-Means) to group similar documents; each cluster is regarded as a topic. Finally, we adopt a weighting method to select representative words for each topic. Reducing the embedding dimensionality is optional but can save runtime (more details below).

This clustering-based topic modelling paradigm is not new. Top2vec⁵ jointly learns word and document vectors, treats the centroid of each dense document area as a topic vector, and takes the n closest word vectors as the topic words; BERTopic⁶ adopts a similar approach, but uses its proposed c-TF-IDF to select topic words within each cluster; Sia et al. (2020)⁷ cluster vocabulary-level word embeddings and obtain the top words from each cluster using weighting and re-ranking.

However, the above methods ignore the promising NTMs proposed recently: the performance of clustering-based topic models has not yet been compared against that of conventional topic models.

Our Proposed Work

Here, we introduce our NAACL 2022 paper 📄 "Is Neural Topic Modelling Better than Clustering? An Empirical Study on Clustering with Contextual Embeddings for Topics" and its official GitHub repository. To the best of our knowledge, we are the first to compare clustering-based topic models with NTMs using contextualized embeddings produced by various transformer-based models. Moreover, we propose new word selection methods that combine global word importance with local term frequency within each cluster.

We use the [Health News in Twitter](https://archive.ics.uci.edu/ml/datasets/Health+News+in+Twitter)¹⁰ dataset (retrieved from UCI Machine Learning Repository¹¹) as an example to show that Paradigm 2: Clustering-based Topic Modelling can also extract high-quality topics.

1. Data Preparation

We first load the preprocessed dataset using OCTIS:
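A minimal sketch of this step with OCTIS is shown below; the folder path is an assumption, so point it at your local copy of the preprocessed dataset.

```python
from octis.dataset.dataset import Dataset

# Load the preprocessed corpus with OCTIS; "data/health_news_twitter" is a
# placeholder path for your local copy of the preprocessed dataset.
dataset = Dataset()
dataset.load_custom_dataset_from_folder("data/health_news_twitter")

# OCTIS stores each document as a list of tokens; join them back into strings.
documents = [" ".join(tokens) for tokens in dataset.get_corpus()]
print("length of documents:", len(documents))
print("preprocessed documents:", documents[:10])
```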

The documents are tokenized, giving us 3929 preprocessed sentences:

length of documents: 3929
preprocessed documents: [
  'breast cancer risk test devised',
  'workload harming care bma poll',
  'short people heart risk greater',
  'new approach hiv promising',
  'coalition undermined nhs doctors',
  'review case nhs manager',
  'day empty going',
  'overhaul needed end life care',
  'care dying needs overhaul',
  'nhs labour tory key policies',
  ...
]

2. Embedding

Second, we need to convert these documents into vector representations. We can use any embedding model; here, we use the pretrained princeton-nlp/unsup-simcse-bert-base-uncased model from SimCSE.
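A hedged sketch of the encoding step using Hugging Face Transformers follows (the official SimCSE package can be used instead; the batch size and the use of the [CLS] token as the sentence embedding are illustrative assumptions):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "princeton-nlp/unsup-simcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

embeddings = []
with torch.no_grad():
    for i in range(0, len(documents), 64):  # encode in mini-batches
        batch = documents[i:i + 64]
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
        # Use the [CLS] token representation as the sentence embedding.
        embeddings.append(model(**inputs).last_hidden_state[:, 0])
embeddings = torch.cat(embeddings).numpy()
print(embeddings.shape)  # e.g., (3929, 768)
```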

3. Reduce & Cluster Embeddings

Now that we have embedding representations of the documents, we can group similar documents together by applying a clustering method. Reducing the embedding size beforehand can effectively save runtime.
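A minimal sketch with UMAP and K-Means; the number of UMAP components and the random seeds are illustrative choices, not the paper's exact settings:

```python
import umap
from sklearn.cluster import KMeans

# Optionally reduce the 768-dimensional embeddings before clustering.
reduced_embeddings = umap.UMAP(n_components=5, random_state=42).fit_transform(embeddings)

# One cluster per topic: here we ask for 5 topics.
kmeans = KMeans(n_clusters=5, random_state=42)
cluster_labels = kmeans.fit_predict(reduced_embeddings)
```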

4. Select Topic Words from Clusters

Finally, we can apply a weighting method to select topic words from each cluster:
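Below is a simplified sketch of the TFIDF × IDFᵢ weighting, the method that works best in our experiments (see the official repository for the exact implementation): global TF-IDF scores are aggregated per cluster and then down-weighted for words that appear in many clusters.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_words_per_cluster(documents, cluster_labels, n_clusters, top_n=6):
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(documents)            # (n_docs, vocab_size)
    vocab = np.array(vectorizer.get_feature_names_out())

    # Sum the global TF-IDF scores of each word within every cluster.
    cluster_scores = np.vstack([
        np.asarray(tfidf[cluster_labels == k].sum(axis=0)).ravel()
        for k in range(n_clusters)
    ])                                                      # (n_clusters, vocab_size)

    # IDF_i: penalize words that occur in many clusters.
    clusters_containing_word = (cluster_scores > 0).sum(axis=0)
    idf_i = np.log(n_clusters / np.maximum(clusters_containing_word, 1))
    weighted = cluster_scores * idf_i

    return [vocab[np.argsort(weighted[k])[::-1][:top_n]].tolist()
            for k in range(n_clusters)]

topics = top_words_per_cluster(documents, cluster_labels, n_clusters=5)
for k, words in enumerate(topics):
    print(f"{k}: {words}")
```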

The evaluation results on Health News in Twitter with 5 topics are:

num_topics: 5 td: 1.0 npmi: 0.106 cv: 0.812

and the example topics are:

Topic:
0: ['nhs', 'amp', 'care', 'hospital', 'health', 'mental']
1: ['ebola', 'vaccine', 'flu', 'liberia', 'leone', 'virus']
2: ['dementia', 'obesity', 'alzheimer', 'diabetes', 'brain', 'disabled']
3: ['cancer', 'heart', 'breast', 'surgery', 'transplant', 'lung']
4: ['cigarette', 'cigarettes', 'pollution', 'smoking', 'sugar', 'drug']

We can see that the topics found by Paradigm 2: Clustering-based Topic Modelling can also be highly coherent and diverse, even though the documents are relatively short. Although all topics are about health, we can distinguish that topic 1 is about infectious viruses and diseases, topic 2 is about chronic diseases and conditions, etc.
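The topic diversity (td) and coherence (npmi, cv) scores above can be computed with OCTIS's evaluation metrics; a minimal sketch follows (topk=6 matches the six words shown per topic, whereas the paper reports the top 10 words):

```python
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

model_output = {"topics": topics}   # list of lists of topic words
texts = dataset.get_corpus()        # tokenized reference corpus

td = TopicDiversity(topk=6).score(model_output)
npmi = Coherence(texts=texts, topk=6, measure="c_npmi").score(model_output)
cv = Coherence(texts=texts, topk=6, measure="c_v").score(model_output)
print(f"num_topics: 5 td: {td:.3f} npmi: {npmi:.3f} cv: {cv:.3f}")
```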


For comparison, we also use CombinedTM, a model from Paradigm 1: Conventional Topic Modelling, to extract topics from the same Health News in Twitter dataset, using the same princeton-nlp/unsup-simcse-bert-base-uncased embeddings.
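A hedged sketch of running CombinedTM with the contextualized-topic-models package is below; the preparation step loads the SimCSE checkpoint through sentence-transformers, whose default pooling may differ slightly from the paper's setup.

```python
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

# Pair each document's contextualized embedding with its Bag-of-Words representation.
qt = TopicModelDataPreparation("princeton-nlp/unsup-simcse-bert-base-uncased")
training_dataset = qt.fit(text_for_contextual=documents, text_for_bow=documents)

# Train CombinedTM with 5 topics on top of the 768-dimensional embeddings.
ctm = CombinedTM(bow_size=len(qt.vocab), contextual_size=768, n_components=5)
ctm.fit(training_dataset)
print(ctm.get_topic_lists(6))
```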

The evaluation results on Health News in Twitter with 5 topics are:

num_topics: 5 td: 1.0 npmi: -0.267 cv: 0.401

and the example topics are:

Topic:
0: ['cancer', 'amp', 'drug', 'death', 'audio', 'test']
1: ['food', 'obesity', 'cigarette', 'smokers', 'link', 'smoking']
2: ['ebola', 'mers', 'liberia', 'vaccine', 'malaria', 'virus']
3: ['babies', 'brain', 'sperm', 'man', 'human', 'cell']
4: ['nhs', 'health', 'mental', 'care', 'staff', 'hospital']

We can see that some topics are not coherent, for instance, topics 0 and 3.


We summarise our findings on clustering-based topic modelling below.

Directly clustering high-quality embeddings can generate good topics.

Experiments show that high-quality embeddings are critical for clustering-based topic modelling. We experiment with different embeddings, including BERT, RoBERTa⁸, SimCSE⁹, etc., on three datasets of varying document lengths. Clustering RoBERTa embeddings achieves similar or worse results than contextualized NTMs, demonstrating that embedding quality matters.

The recent DiffCSE can achieve slightly higher performance on some datasets!

Topic coherence (NPMI and CV) and topic diversity (TU) of the top 10 words for different embeddings. Image from paper¹²

The word selection method is vital.

Once we have a group of clustered documents, selecting representative topic words is vital for identifying the semantics of the topics. Unlike previously proposed methods, we capture global word importance together with local term frequency within each cluster, and compare four different methods, shown below:

Formulas of the four word selection methods (TFᵢ, TFIDFᵢ, TFIDF × TFᵢ, and TFIDF × IDFᵢ). Images by author.

We find that TFIDF × IDFᵢ achieves significantly better results than all other methods. This indicates that TFIDF marks out the words important to each document in the entire corpus, while IDFᵢ penalizes words common to multiple clusters. Conversely, the other three methods ignore that frequent words in one cluster may also be prevalent in other clusters, so selecting such words leads to low topic diversity.

Comparison between different topic word selection methods. Image from paper¹²

Embedding dimensionality negligibly affects topic quality.

We apply UMAP to reduce the dimensionality of the sentence embeddings before clustering. We find that reducing dimensionality before clustering has a negligible impact on performance, but can save runtime.

Example of Usage

As documented on GitHub, you can choose a word selection method from [tfidf_idfi, tfidf_tfi, tfidfi, tfi]. If you prefer not to reduce the embedding dimensionality with UMAP, simply set dim_size=-1. You can then train the model and obtain the evaluation results and topics.

The expected output should be similar to the evaluation results and topics shown earlier.

Conclusion

In this article, we introduce a clustering-based method that can generate commendable topics as long as high-quality contextualized embeddings are used together with an appropriate topic word selection method. Compared with neural topic models, clustering-based models are simpler, more efficient, and more robust to varying document lengths and numbers of topics, making them a viable alternative in many situations.


[1]: Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

[2]: Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[3]: Bianchi, F., Terragni, S. and Hovy, D., 2020. Pre-training is a hot topic: Contextualized document embeddings improve topic coherence. arXiv preprint arXiv:2004.03974.

[4]: Bianchi, F., Terragni, S., Hovy, D., Nozza, D. and Fersini, E., 2020. Cross-lingual contextualized topic models with zero-shot learning. arXiv preprint arXiv:2004.07737.

[5]: Angelov, D., 2020. Top2vec: Distributed representations of topics. arXiv preprint arXiv:2008.09470.

[6]: Grootendorst, M., 2022. BERTopic: Neural Topic Modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.

[7]: Sia, S., Dalmia, A. and Mielke, S.J., 2020. Tired of topic models? clusters of pretrained word embeddings make for fast and good topics too!. arXiv preprint arXiv:2004.14914.

[8]: Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V., 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

[9]: Gao, T., Yao, X. and Chen, D., 2021. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.

[10]: Karami, A., Gangopadhyay, A., Zhou, B., & Kharrazi, H. (2017). Fuzzy approach topic discovery in health and medical corpora. International Journal of Fuzzy Systems, 1–12.

[11]: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[12]: Zhang, Z., Fang, M., Chen, L., & Namazi-Rad, M. R. (2022). Is Neural Topic Modelling Better than Clustering? An Empirical Study on Clustering with Contextual Embeddings for Topics. arXiv preprint arXiv:2204.09874.

