FuzzyTM: A Python package for Fuzzy Topic Models

Emil Rijcken
Towards Data Science
Feb 21, 2022



In my previous posts, I showed how to get started with the Python package OCTIS, a package for comparing and optimizing state-of-the-art topic modeling algorithms. The first post shows how to get started with OCTIS, and the second post focuses on model optimization. The reason my group and I started working with OCTIS is that we have developed a new topic modeling algorithm, FLSA-W [1], and wanted to see how it performs against the existing state of the art. In comparisons on various open datasets, we found that FLSA-W outperforms other models (e.g. LDA, ProdLDA, NeuralLDA, NMF and LSI) in most settings in terms of coherence (c_v), diversity and interpretability. I can’t share these results yet, as we have submitted this work to a conference and are awaiting acceptance.

In the meantime, we have also developed a Python package, ‘FuzzyTM’, that features FLSA-W and two other topic modeling algorithms based on fuzzy logic (FLSA and FLSA-V). This post briefly describes fuzzy topic models and the rationale behind FLSA-W, and then demonstrates how to get started with FuzzyTM (jump ahead to ‘Getting started with FuzzyTM’ if you want to start training a model right away). In future posts, I will explain in more detail how the various algorithms work and use OCTIS to compare them with existing algorithms.

Fuzzy Topic Models

Note that although there are fifty shades of topic modeling algorithms, they all return two matrices, P(W_i|T_k) and P(T_k|D_j): the probability of a word given a topic and the probability of a topic given a document, respectively.
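To make this concrete, here is a toy illustration (random numbers, not the output of any real model) of the two matrices, with M = 5 words, N = 4 documents and C = 2 topics:

import numpy as np

# Toy illustration: random column-stochastic matrices with the shapes a
# topic model returns (M = 5 words, N = 4 documents, C = 2 topics).
rng = np.random.default_rng(0)

pwgt = rng.random((5, 2))
pwgt /= pwgt.sum(axis=0)  # P(W_i|T_k): each topic column sums to 1

ptgd = rng.random((2, 4))
ptgd /= ptgd.sum(axis=0)  # P(T_k|D_j): each document column sums to 1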

In 2018, Fuzzy Latent Semantic Analysis (FLSA) was proposed [2] and shown to outperform LDA in terms of coherence. FLSA uses Bayes’ Theorem, Singular Value Decomposition (SVD) and matrix multiplication to find P(W_i|T_k) and P(T_k|D_j).

For topic modeling, we start with a corpus of texts. In Python, this is stored as a list of lists of strings, where each inner list represents a document and each string is a word in that document.

Let’s start by defining the following quantities:

M — the number of unique words in the data set

N — the number of documents in the data set

C — the number of topics

S — the number of SVD dimensions

i — word index, i ∈ {1,2,3,…,M}

j — document index, j ∈ {1,2,3,…,N}

k — topic index, k∈ {1,2,3,…,C}

Then, the steps to obtaining P(W_i|T_k) and P(T_k|D_j) in FLSA are the following:

  1. Get the local term weights (NxM) — a document-term matrix indicating how often word i appears in document j.
  2. Get the global term weights (NxM) — in this step, the presence of words in a document is related to their presence in the other documents.
  3. Obtain U from SVD on the global term weights (NxS) — SVD is used for dimensionality reduction; see this post for an intuitive explanation of SVD.
  4. Use fuzzy clustering on U to obtain P(T|D)^T (NxC) — the most common method is Fuzzy C-Means clustering, but various algorithms are available in FuzzyTM.
  5. Use matrix multiplication based on Bayes’ Theorem, using P(T|D)^T and P(D_j), to obtain P(W_i|T_k), as shown in the sketch after this list.
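To make these steps concrete, below is a minimal NumPy sketch of the pipeline. This is my own simplified illustration, not FuzzyTM’s actual implementation: the weighting scheme, the fuzzy c-means details and the exact normalizations in the package may differ.

import numpy as np

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    # Minimal fuzzy c-means: returns an (n_samples x c) membership matrix.
    rng = np.random.default_rng(seed)
    u = rng.random((X.shape[0], c))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(iters):
        um = u ** m
        centers = (um.T @ X) / um.T.sum(axis=1, keepdims=True)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        u = dist ** (-2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)
    return u

def flsa_sketch(corpus, num_topics=10, svd_dims=20):
    vocab = sorted({w for doc in corpus for w in doc})
    w2i = {w: i for i, w in enumerate(vocab)}
    N, M = len(corpus), len(vocab)

    # Step 1: local term weights (N x M) -- count of word i in document j.
    local = np.zeros((N, M))
    for j, doc in enumerate(corpus):
        for w in doc:
            local[j, w2i[w]] += 1.0

    # Step 2: global term weights (N x M); an idf-style weighting is shown
    # here, while FuzzyTM supports several weighting schemes.
    df = np.count_nonzero(local, axis=0)
    glob = local * np.log2(N / df)

    # Step 3: truncated SVD for dimensionality reduction; U is N x S.
    U, _, Vt = np.linalg.svd(glob, full_matrices=False)
    U = U[:, :svd_dims]

    # Step 4: fuzzy clustering of U's rows (documents) gives P(T|D)^T (N x C).
    ptgd_T = fuzzy_cmeans(U, num_topics)

    # Step 5: Bayes' Theorem via matrix multiplication.
    pd = local.sum(axis=1) / local.sum()                    # P(D_j)
    pt = ptgd_T.T @ pd                                      # P(T_k)
    pd_given_t = ptgd_T * pd[:, None] / pt[None, :]         # P(D_j|T_k)
    pw_given_d = local / local.sum(axis=1, keepdims=True)   # P(W_i|D_j)
    pwgt = pw_given_d.T @ pd_given_t                        # P(W_i|T_k), M x C
    return pwgt, ptgd_T.T                                   # and P(T_k|D_j), C x N

Note that the fuzzy clustering membership matrix has rows that sum to one, so each document receives a proper probability distribution over topics, which is exactly what P(T_k|D_j) requires.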

In FLSA, the SVD’s U matrix is used as the input for clustering, meaning that documents are clustered. Since topic models are often used to find the words corresponding to a topic, it makes more sense to cluster SVD’s V^T instead, since then words are clustered. This is exactly what FLSA-W does (now the ‘W’ hopefully makes sense): it clusters words instead of documents, as the short variation below illustrates.
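Reusing the names from the sketch above, the FLSA-W modification to step 4 is small; again, this is a simplified illustration rather than the package’s exact code:

# FLSA-W (sketch): cluster words instead of documents. Vt from the SVD
# above is S x M, so its transpose holds one S-dimensional embedding per
# word, and clustering it yields an M x C word-topic membership matrix.
word_emb = Vt[:svd_dims, :].T
word_memberships = fuzzy_cmeans(word_emb, num_topics)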

Getting started with FuzzyTM

FuzzyTM is built modularly: each step of an algorithm is a separate method in the parent class, and each algorithm is a child class that calls methods from the parent class. Novice practitioners can train the various topic models with minimal effort, while researchers can modify each step and add functionality, roughly as in the hypothetical skeleton below.
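As an illustration of that layout (these are not FuzzyTM’s actual class or method names, just a hypothetical sketch of the pattern):

class FuzzyTopicModelBase:
    # Parent class: one method per pipeline step.
    def local_term_weights(self, corpus): ...
    def global_term_weights(self, local): ...
    def project(self, glob): ...   # e.g. SVD
    def cluster(self, embeddings, num_topics): ...

class MyFLSAVariant(FuzzyTopicModelBase):
    # Child class: override only the step that differs, reuse the rest.
    def project(self, glob):
        ...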

Let’s get started by training a model on a dataset from the OCTIS package. Firstly, we install FuzzyTM:

pip install FuzzyTM

Secondly, we import the dataset:

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset('DBLP')
data = dataset._Dataset__corpus
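A side note: `_Dataset__corpus` reaches into a name-mangled private attribute, which works but may break across OCTIS versions. If your version exposes the public accessor, that is the safer route:

# If available in your OCTIS version, prefer the public accessor over
# the name-mangled private attribute used above.
data = dataset.get_corpus()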

Let’s see what this dataset looks like:

print(data[0:5])
>>> [['fast', 'cut', 'protocol', 'agent', 'coordination'],
['retrieval', 'base', 'class', 'svm'],
['semantic', 'annotation', 'personal', 'video', 'content', 'image'],
['semantic', 'repository', 'modeling', 'image', 'database'],
['global', 'local', 'scheme', 'imbalanced', 'point', 'matching']]

Now, we are ready to import FuzzyTM:

from FuzzyTM import FLSA_W

We initialize the model as follows (the default value of `num_words` is 20, but for clarity I use only ten words here):

flsaW = FLSA_W(input_file = data, num_topics=10, num_words=10)

Then, we obtain P(W_i|T_k) and P(T_k|D_j) as follows:

pwgt, ptgd = flsaW.get_matrices()
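A quick sanity check on what comes back (assuming NumPy arrays are returned; the exact orientation of the matrices may vary by version):

import numpy as np

print(np.shape(pwgt), np.shape(ptgd))
# If topics are the columns of pwgt, each column should sum to roughly 1;
# otherwise, check axis=1 instead.
print(np.sum(pwgt, axis=0)[:3])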

Now, we are ready to look at the topics:

topics = flsaW.show_topics(representation='words')
print(topics)
>>> [['machine', 'decision', 'set', 'evaluation', 'tree', 'performance', 'constraint', 'stream', 'process', 'pattern'], ['face', 'robust', 'tracking', 'error', 'code', 'filter', 'shape', 'detection', 'recognition', 'color'], ['generalization', 'neighbor', 'predict', 'sensitive', 'computation', 'topic', 'link', 'recursive', 'virtual', 'construction'], ['language', 'logic', 'data', 'web', 'mining', 'rule', 'processing', 'discovery', 'query', 'datum'], ['factorization', 'regularization', 'people', 'measurement', 'parametric', 'progressive', 'dimensionality', 'histogram', 'selective', 'correct'], ['active', 'spatial', 'optimal', 'view', 'level', 'modeling', 'combine', 'hierarchical', 'dimensional', 'space'], ['correspondence', 'calibration', 'compress', 'curve', 'geometry', 'track', 'background', 'appearance', 'deformable', 'light'], ['heuristic', 'computational', 'update','preference', 'qualitative', 'mechanism', 'engine', 'functional', 'join', 'relation'], ['graphic', 'configuration', 'hypothesis', 'walk', 'relaxation', 'family', 'composite', 'factor', 'string', 'pass'], ['theorem', 'independence', 'discourse', 'electronic', 'auction', 'composition', 'diagram', 'version', 'hard', 'create']]

From this output we can recognize some themes: the first topic seems to be about general machine learning, the second topic about image recognition and the fourth topic about natural language processing.

Now, let’s look at the evaluation metrics:

#Get coherence value
flsaW.get_coherence_value(input_file = data, topics = topics)
>>> 0.34180921613509696
#Get diversity score
flsaW.get_diversity_score(topics = topics)
>>> 1.0
#Get interpretability score
flsaW.get_interpretability_score(input_file = data, topics = topics)
>>> 0.34180921613509696

Since a topic model’s output consists of several topics, each a collection of words, the quality of a topic model should reflect both the quality of the words within each topic (intra-topic quality) and the diversity between different topics (inter-topic quality). The coherence score captures how well the words within each topic support each other, and the diversity score captures how distinct the topics are from one another (i.e., how little word overlap there is between topics). The interpretability score combines both metrics and is calculated as the product of coherence and diversity.
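The run above verifies this: 0.34180921613509696 × 1.0 = 0.34180921613509696, which is why the coherence and interpretability outputs coincide.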

From the results, we can see that FLSA-W achieves a perfect diversity score. This is no surprise, since it clusters words explicitly. Still, it is a big improvement over most existing algorithms.

Although a coherence score of 0.3418 may seem rather low, comparative experimental results will show that FLSA-W achieves higher coherence than other algorithms in most settings.

Conclusion

In this post, I briefly explained how the topic models FLSA and FLSA-W work. Then, I demonstrated that FLSA-W can be trained with only two lines of code and showed how the resulting topics can be analyzed. Beyond training topic models, FuzzyTM also contains a method for obtaining the topic embedding of new documents based on a trained topic model, which can be useful as a document embedding for downstream tasks such as text classification. In future posts, I will describe the algorithms in more detail and compare FLSA-W to existing algorithms. For more details, please see my GitHub page: https://github.com/ERijck/FuzzyTM.

[1] Rijcken, E., Scheepers, F., Mosteiro, P., Zervanou, K., Spruit, M., & Kaymak, U. (2021, December). A Comparative Study of Fuzzy Topic Models and LDA in terms of Interpretability. In 2021 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE.

[2] Karami, A., Gangopadhyay, A., Zhou, B., & Kharrazi, H. (2018). Fuzzy approach topic discovery in health and medical corpora. International Journal of Fuzzy Systems, 20(4), 1334–1345.
