
The Ultimate Guide to Clustering Algorithms and Topic Modeling

Part 2: A beginner's guide to the LDA model

In my previous post, I introduced clustering algorithms and discussed the K-Means algorithm in detail as the first part of this topic modeling series:

Part 1: A beginner’s guide to K-Means

Part 2: A beginner’s guide to LDA (this article)

Part 3: Use K-Means and LDA for topic modeling (coming soon)

We cannot discuss topic modeling without introducing the LDA model. LDA is short for Latent Dirichlet Allocation. It is a model primarily used to uncover the underlying set of topics in collections of textual data, with other applications in collaborative filtering, content-based image retrieval, bioinformatics, and more. In this article, I will discuss topic modeling in more detail and walk through the LDA model setup.


Topic Modeling and Terminology

Topic modeling is the task of finding the underlying topics in a set of textual data. Studying the topics helps researchers understand the hidden semantic structures in a text body. Knowing the topics is useful both for classifying existing text data and for generating new data. Before going further into detail, we need to specify some conventional terminology used in text analysis. Following Blei, Ng, and Jordan (2003) (full citation in the Reference section), the terminology is as follows:

  • A word is the basic unit of textual data and is indexed from a vocabulary set {1, …, V};
  • A document is a sequence of N words, defined as w = (w_1, w_2, …, w_N), where w_n is the nth word in the document;
  • A corpus is a collection of M documents, defined as D = (w_1, w_2, …, w_M);
  • A topic is a distribution over words. The LDA model treats each document in the corpus as a mixture of topics.
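
To make these terms concrete, here is a small toy illustration (my own example, not from the paper) of how a vocabulary, a document, and a corpus might be represented in Python:

```python
# Toy illustration of the terminology above (illustrative example, not real data).
vocabulary = {"sweet", "carrot", "apple", "green"}   # the vocabulary set {1, ..., V}

document = ["green", "apple", "sweet"]               # w = (w_1, ..., w_N), here N = 3

corpus = [                                           # D = (w_1, ..., w_M), here M = 3 documents
    ["green", "apple", "sweet"],
    ["carrot", "green"],
    ["apple", "apple", "sweet"],
]
```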

LDA model setup

Essentially, topic modeling is a text clustering problem. With the LDA model, the goal is to estimate two sets of distributions by studying the corpus:

  • the distribution of words in each topic
  • the distribution of topics in each document of the corpus

LDA is a three-level hierarchical Bayesian model. Rather than going deep into the mathematical detail, it is easier to demonstrate the model with an example.

The LDA model generates a topic distribution for each document in the corpus. For example, a document can be distributed over two topics: 40% in the topic "Fruit" and 60% in the topic "Vegetable":

Image by Author

Each topic is a distribution over all the words in the vocabulary set {"sweet," "carrot," "apple," "green"}. Depending on the topic, some words show up with higher probabilities. For example, intuitively, the word "green" would show up with a higher probability in the topic "Vegetable" than in the topic "Fruit."

Image by Author
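
To make the two example distributions concrete, here is how they might look as plain Python dictionaries. The 40%/60% document mixture comes from the example above, while the topic-word probabilities are made up purely for illustration:

```python
# The document-topic distribution from the example above.
doc_topic_dist = {"Fruit": 0.4, "Vegetable": 0.6}

# Each topic is a distribution over the whole vocabulary (each row sums to 1).
# These probabilities are hypothetical, chosen only to illustrate the idea.
topic_word_dist = {
    "Fruit":     {"sweet": 0.35, "carrot": 0.05, "apple": 0.45, "green": 0.15},
    "Vegetable": {"sweet": 0.10, "carrot": 0.40, "apple": 0.10, "green": 0.40},
}
```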

We need to define two latent variables that drive the text generation process from corpus to document to word. A latent variable is a variable we do not observe directly in the data. However, it reveals hidden structures in the data and can be beneficial in building the probabilistic model. The first latent variable is θ, the distribution of topics for each document (40% "Fruit," 60% "Vegetable"). The second latent variable is Z (Z ∈ {1, 2, …, T}), which represents the topic of each word. We can see the text generation process in the graph below:

Text Generation in LDA. Image by Author

For each document d, there is a topic distribution θ_d. Each word i in document d (w_di) is generated based on the topic distribution θ_d, the topic Z_di assigned to that word, and the word distribution of that topic. Suppose the first word in document d is "green." It is generated by first specifying a distribution over topics: 40% "Fruit" and 60% "Vegetable" (θ_d). Then for the first word, we sample the topic "Vegetable" (Z_d1). Finally, from the word distribution of the topic "Vegetable," we sample the word "green" (w_d1).
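
This generative story can be simulated in a few lines of Python. The sketch below is my own illustration: the Dirichlet parameter and topic-word probabilities are arbitrary, and it simulates text generation only, not the estimation of the model:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["sweet", "carrot", "apple", "green"]
topics = ["Fruit", "Vegetable"]

# Word distribution of each topic (each row sums to 1); numbers are illustrative.
topic_word = np.array([
    [0.35, 0.05, 0.45, 0.15],   # "Fruit"
    [0.10, 0.40, 0.10, 0.40],   # "Vegetable"
])

# Step 1: draw the document's topic distribution theta_d from a Dirichlet prior.
alpha = np.array([0.8, 0.8])            # illustrative Dirichlet parameter
theta_d = rng.dirichlet(alpha)

# Steps 2-3: for each word position n, sample a topic Z_dn from theta_d,
# then sample the word w_dn from that topic's word distribution.
N = 6                                   # document length
for n in range(N):
    z_dn = rng.choice(len(topics), p=theta_d)
    w_dn = rng.choice(vocab, p=topic_word[z_dn])
    print(f"word {n + 1}: topic = {topics[z_dn]}, word = {w_dn}")
```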

Mathematically, the process can be specified as the Bayesian equation below:
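
In the notation above, the per-document joint distribution, following Blei, Ng, and Jordan (2003), is:

P(θ_d, Z_d, w_d) = P(θ_d) · ∏_{n=1…N} P(Z_dn | θ_d) · P(w_dn | Z_dn)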

Combining the latent variables, the hierarchy can be viewed in the graph and steps below for better understanding:

Image by Author

By defining the probability distributions P(θ_d), P(Z_dn|θ_d), and P(w_dn|Z_dn), we can calculate the joint distribution P(w, Z, θ), which is the probability of observing the corpus we have.

The "Dirichlet" in LDA comes from the fact that θ_d follows a Dirichlet distribution. Given a corpus to train the model, we can use the EM algorithm for parameter estimation or MCMC for full Bayesian inference. This article will not cover the mathematical details of the model; if you are interested, refer to the paper by Blei, Ng, and Jordan (2003). Moreover, gensim is a common Python library for applying LDA to topic modeling. Refer to this document for more details; I will also demonstrate it in my next post when discussing the application.
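
As a preview of the next post, here is a minimal sketch of how an LDA model is typically trained with gensim. The toy documents, the number of topics, and the number of passes are all illustrative choices:

```python
from gensim import corpora, models

# Toy tokenized documents; in practice these come from a preprocessing pipeline.
documents = [
    ["sweet", "apple", "green", "apple"],
    ["carrot", "green", "sweet"],
    ["apple", "sweet", "sweet", "green"],
]

# Map each word to an integer id (the vocabulary set {1, ..., V}).
dictionary = corpora.Dictionary(documents)

# Represent each document as bag-of-words (word id, count) pairs.
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Fit LDA with two topics; num_topics and passes are illustrative.
lda = models.LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2, passes=10)

# The word distribution of each topic.
for topic_id, topic in lda.print_topics():
    print(topic_id, topic)

# The topic distribution (theta_d) of the first document.
print(lda.get_document_topics(bow_corpus[0]))
```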


That is all for this article. In the last article of the series, I will compare K-Means and LDA for topic modeling and show an example of applying both algorithms using Python libraries.

Thank you for reading! Here is the list of all my blog posts. Check them out if you are interested.

My Blog Posts Gallery


Reference

[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003), 993–1022.

