
TL;DR
We conduct extensive experiments comparing clustering-based topic models and conventional neural topic models (NTMs), showing an alternative way to extract topics from documents.
Check out our NAACL 2022 paper 📄 "Is Neural Topic Modelling Better than Clustering? An Empirical Study on Clustering with Contextual Embeddings for Topics" and the official GitHub repository for more details.
Paradigm 1: Conventional Topic Modelling
Topic modelling is an unsupervised method for extracting semantic themes from documents. From the traditional Latent Dirichlet Allocation (LDA)¹ model to neural topic models (NTMs), topic modelling has advanced significantly. However, these topic models typically employ Bag-of-Words (BoW) document representations, which limits model performance.
Later, contextualized word and sentence embeddings produced by pre-trained language models such as BERT² emerged and have shown state-of-the-art results on multiple Natural Language Processing (NLP) tasks. Recent works such as CombinedTM³ and ZeroShotTM⁴ incorporate these contextualized embeddings into NTMs, achieving better modelling performance than conventional NTMs.
Despite the promising results, such NTMs suffer from computational overhead, and the current integration of pre-trained embeddings into NTM architectures is naive.
With high-quality contextualized document representations, do we really need sophisticated NTMs to obtain coherent and interpretable topics?
Paradigm 2: Clustering-based Topic Modelling
Instead, we explore an alternative way to model topics from documents. We use a simple clustering framework with contextualized embeddings for topic modelling, as shown below.

We first encode the pre-processed documents into contextualized embeddings using a pre-trained language model. We then reduce the dimensionality of the embeddings (e.g., with UMAP) before applying a clustering method (e.g., K-Means) to group similar documents; each cluster is regarded as a topic. Finally, we adopt a weighting method to select representative words for each topic. Reducing the embedding dimensionality is optional but can save runtime (more details below).
This clustering-based topic modelling paradigm is not new. Top2vec⁵ jointly learns word and document vectors, treats the centroid of each dense area as a topic vector, and takes the n closest word vectors as the topic words; BERTopic⁶ adopts a similar approach, but uses its proposed c-TF-IDF to select topic words within each cluster; Sia et al. (2020)⁷ cluster vocabulary-level word embeddings and obtain top words from each cluster via weighting and re-ranking.
However, the above methods ignore the promising NTMs proposed recently: the performance of clustering-based topic models has not yet been compared with that of conventional topic models.
Our Proposed Work
Here, we introduce our NAACL 2022 paper 📄 "Is Neural Topic Modelling Better than Clustering? An Empirical Study on Clustering with Contextual Embeddings for Topics" and the official GitHub repository. To the best of our knowledge, we are the first to compare clustering-based topic models with NTMs, using contextualized embeddings produced by various transformer-based models. Moreover, we propose new word selection methods that combine global word importance with local term frequency within each cluster.
We use the [Health News in Twitter](https://archive.ics.uci.edu/ml/datasets/Health+News+in+Twitter)¹⁰ dataset (retrieved from the UCI Machine Learning Repository¹¹) as an example to show that Paradigm 2: Clustering-based Topic Modelling can also extract high-quality topics.
1. Data Preparation
We first load the preprocessed dataset using OCTIS:
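A minimal sketch with OCTIS follows (the folder path is illustrative; OCTIS expects a dataset in its own corpus format):

```python
from octis.dataset.dataset import Dataset

# Load a preprocessed dataset in OCTIS format (path is illustrative)
dataset = Dataset()
dataset.load_custom_dataset_from_folder("./data/health_news_twitter")

# OCTIS stores each document as a list of tokens; join them back into strings
preprocessed_docs = [" ".join(tokens) for tokens in dataset.get_corpus()]
print("length of documents:", len(preprocessed_docs))
print("preprocessed documents:", preprocessed_docs[:10])
```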
The documents are tokenized, giving us 3929 preprocessed sentences:
length of documents: 3929
preprocessed documents: [
'breast cancer risk test devised',
'workload harming care bma poll',
'short people heart risk greater',
'new approach hiv promising',
'coalition undermined nhs doctors',
'review case nhs manager',
'day empty going',
'overhaul needed end life care',
'care dying needs overhaul',
'nhs labour tory key policies',
...
]
2. Embedding
Secondly, we need to convert these documents into vector representations. We can use any embeddings; here, we use the pretrained princeton-nlp/unsup-simcse-bert-base-uncased model from SimCSE.
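A minimal sketch using Hugging Face transformers (the batch size and pooling choice are our assumptions; SimCSE typically uses the [CLS] representation):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "princeton-nlp/unsup-simcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

def embed(docs, batch_size=64):
    chunks = []
    for i in range(0, len(docs), batch_size):
        inputs = tokenizer(docs[i:i + batch_size], padding=True,
                           truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # Use the [CLS] token representation as the sentence embedding
        chunks.append(outputs.last_hidden_state[:, 0])
    return torch.cat(chunks).numpy()

embeddings = embed(preprocessed_docs)  # shape: (3929, 768)
```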
3. Reduce & Cluster Embeddings
Now that we have embedding representations of the documents, we can group similar documents together by applying a clustering method. Reducing the embedding size beforehand can effectively save runtime.
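A minimal sketch with umap-learn and scikit-learn (the numbers of components and clusters are illustrative choices):

```python
import umap
from sklearn.cluster import KMeans

# Optional: reduce the 768-dim embeddings to a handful of dimensions to save runtime
reduced = umap.UMAP(n_components=5, metric="cosine",
                    random_state=42).fit_transform(embeddings)

# Group similar documents; each cluster will be treated as one topic
kmeans = KMeans(n_clusters=5, random_state=42).fit(reduced)
labels = kmeans.labels_
```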
4. Select Topic Words from Clusters
Finally, we can apply a weighting method to select topic words from each cluster:
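Below is a minimal re-implementation sketch of the TFIDF × IDFᵢ idea (our own illustration, not the repo's exact code):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_words_per_cluster(docs, labels, n_words=6):
    """Score words by global TF-IDF aggregated per cluster, times IDF over clusters."""
    labels = np.asarray(labels)
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)  # global TF-IDF, shape (n_docs, vocab)
    vocab = np.array(vectorizer.get_feature_names_out())

    clusters = np.unique(labels)
    # Aggregate the global TF-IDF score of each word within each cluster
    scores = np.vstack([np.asarray(tfidf[labels == c].sum(axis=0)).ravel()
                        for c in clusters])
    # IDF over clusters: penalise words that appear in many clusters
    idf_i = np.log(len(clusters) / (scores > 0).sum(axis=0))

    return {c: vocab[np.argsort(scores[i] * idf_i)[::-1][:n_words]].tolist()
            for i, c in enumerate(clusters)}

topics = top_words_per_cluster(preprocessed_docs, labels)
```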
The evaluation results on Health News in Twitter with 5 topics are:
num_topics: 5 td: 1.0 npmi: 0.106 cv: 0.812
and the example topics are:
Topic:
0: ['nhs', 'amp', 'care', 'hospital', 'health', 'mental']
1: ['ebola', 'vaccine', 'flu', 'liberia', 'leone', 'virus']
2: ['dementia', 'obesity', 'alzheimer', 'diabetes', 'brain', 'disabled']
3: ['cancer', 'heart', 'breast', 'surgery', 'transplant', 'lung']
4: ['cigarette', 'cigarettes', 'pollution', 'smoking', 'sugar', 'drug']
We can see that the topics found by Paradigm 2: Clustering-based Topic Modelling can also be highly coherent and diverse, even though the documents are relatively short. Although all topics are about health, we can distinguish that topic 1 is about infectious viruses and diseases, topic 2 is about disease symptoms, etc.
For comparison, we also use CombinedTM, a Paradigm 1: Conventional Topic Modelling model, to extract topics from the same Health News in Twitter dataset, using the same princeton-nlp/unsup-simcse-bert-base-uncased embedding.
The evaluation results on Health News in Twitter with 5 topics are:
num_topics: 5 td: 1.0 npmi: -0.267 cv: 0.401
and the example topics are:
Topic:
0: ['cancer', 'amp', 'drug', 'death', 'audio', 'test']
1: ['food', 'obesity', 'cigarette', 'smokers', 'link', 'smoking']
2: ['ebola', 'mers', 'liberia', 'vaccine', 'malaria', 'virus']
3: ['babies', 'brain', 'sperm', 'man', 'human', 'cell']
4: ['nhs', 'health', 'mental', 'care', 'staff', 'hospital']
We can see that some topics are not coherent, for instance, topics 0 and 3.
We summarise our findings on clustering-based topic modelling below.
Directly clustering high-quality embeddings can generate good topics.
Experiments show that high-quality embeddings are critical for clustering-based topic modelling. We experiment with different embeddings, including BERT, RoBERTa⁸, and SimCSE⁹, on three datasets of varying document lengths. Clustering RoBERTa embeddings achieves results similar to or worse than those of contextualized NTMs, demonstrating that embedding quality matters.
The recent DiffCSE can achieve slightly higher performance on some datasets!

The word selection method is vital.
Once we have a group of clustered documents, selecting representative topic words is vital for identifying the semantics of topics. Different from previously proposed methods, we capture global word importance and local term frequency within each cluster, and compare the four methods below:
- TFᵢ: term frequency within cluster i
- TFIDFᵢ: TF-IDF computed within each cluster
- TFIDF × TFᵢ: global TF-IDF weighted by the term frequency within cluster i
- TFIDF × IDFᵢ: global TF-IDF weighted by the inverse cluster frequency of each word
We find that TFIDF × IDFᵢ achieves significantly better results than all other methods. This indicates that TFIDF marks out the words important to each document in the entire corpus, while IDFᵢ penalizes words common to multiple clusters. Conversely, the other three methods ignore that frequent words in one cluster may also be prevalent in other clusters, so selecting such words leads to low topic diversity.
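As a rough formalisation of this description (our notation, not necessarily the paper's exact formulation), the score of a word $w$ in cluster $c_i$ is

$$\mathrm{score}_i(w) = \Big(\sum_{d \in c_i} \mathrm{TFIDF}(w, d)\Big) \times \mathrm{IDF}_i(w), \qquad \mathrm{IDF}_i(w) = \log\frac{|C|}{\lvert\{\, j : w \in c_j \,\}\rvert},$$

where $|C|$ is the number of clusters: the first factor aggregates global word importance within the cluster, and the second discounts words shared across many clusters.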

Embedding dimensionality negligibly affects topic quality.
We apply UMAP to reduce the dimensionality of the sentence embeddings before clustering. We find that reducing dimensionality before clustering has a negligible impact on performance, but can save runtime.
Example of Usage
As documented on GitHub, you can choose a word selection method from [tfidf_idfi, tfidf_tfi, tfidfi, tfi]. If you prefer not to reduce the embedding dimensionality using UMAP, simply set dim_size=-1. You can then train the model and get evaluation results and topics:
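A sketch of what training might look like (the class name and import path here are our assumptions based on the repo's README; check the GitHub repository for the exact interface):

```python
# Hypothetical import path; see the repo's README for the actual module layout
from baselines.cetopictm import CETopicTM

tm = CETopicTM(dataset=dataset,
               topic_model='cetopic',
               num_topics=5,
               dim_size=5,                 # set dim_size=-1 to skip UMAP reduction
               word_select_method='tfidf_idfi',
               embedding='princeton-nlp/unsup-simcse-bert-base-uncased',
               seed=42)

tm.train()
td_score, cv_score, npmi_score = tm.evaluate()
print(f'td: {td_score} npmi: {npmi_score} cv: {cv_score}')
topics = tm.get_topics()
```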
The expected output should look similar to the evaluation results and topics shown earlier.
Conclusion
In this article, we introduce a clustering-based method that can generate commendable topics as long as high-quality contextualized embeddings are used, together with an appropriate topic word selection method. Compared to neural topic models, clustering-based models are simpler, more efficient, and robust to various document lengths and topic numbers, making them a viable alternative in many situations.
[1]: David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
[2]: Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[3]: Bianchi, F., Terragni, S. and Hovy, D., 2020. Pre-training is a hot topic: Contextualized document embeddings improve topic coherence. arXiv preprint arXiv:2004.03974.
[4]: Bianchi, F., Terragni, S., Hovy, D., Nozza, D. and Fersini, E., 2020. Cross-lingual contextualized topic models with zero-shot learning. arXiv preprint arXiv:2004.07737.
[5]: Angelov, D., 2020. Top2vec: Distributed representations of topics. arXiv preprint arXiv:2008.09470.
[6]: Grootendorst, M., 2022. BERTopic: Neural Topic Modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
[7]: Sia, S., Dalmia, A. and Mielke, S.J., 2020. Tired of topic models? Clusters of pretrained word embeddings make for fast and good topics too! arXiv preprint arXiv:2004.14914.
[8]: Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V., 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
[9]: Gao, T., Yao, X. and Chen, D., 2021. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
[10]: Karami, A., Gangopadhyay, A., Zhou, B., & Kharrazi, H. (2017). Fuzzy approach topic discovery in health and medical corpora. International Journal of Fuzzy Systems, 1–12.
[11]: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12]: Zhang, Z., Fang, M., Chen, L., & Namazi-Rad, M. R. (2022). Is Neural Topic Modelling Better than Clustering? An Empirical Study on Clustering with Contextual Embeddings for Topics. arXiv preprint arXiv:2204.09874.