Latent Dirichlet Allocation (LDA): A guide to the probabilistic modelling approach for topic discovery

Implementation of Latent Dirichlet Allocation in Python

Awan-Ur-Rahman
Towards Data Science


Latent Dirichlet Allocation (LDA) is one of the most common algorithms in topic modelling. LDA was proposed by J. K. Pritchard, M. Stephens and P. Donnelly in 2000 and rediscovered by David M. Blei, Andrew Y. Ng and Michael I. Jordan in 2003. In this article, I will give you an idea of what topic modelling is, explain how LDA works and, finally, implement an LDA model.

What is Topic Modelling?

Topic Modelling is one of the most interesting fields in Machine Learning and Natural Language Processing. Topic Modelling means the extraction of abstract “topics” from a collection of documents. One of the primary applications of natural language processing is to understand what people are talking about in a large number of text documents, and it is really hard to read through all of these documents to extract or compile the topics. In such cases, topic modelling is used to extract this information from the documents. To understand the concept of topic modelling, let’s look at an example.

Suppose you are reading some articles in a newspaper, and in those articles the word “climate” appears more often than any other word. In a common-sense way, you can say that these articles are most probably about something related to the climate. Topic modelling does the same thing in a statistical way: it produces topics by clustering similar words. Here, two terms come up: one is “Topic Modelling” and the other is “Topic Classification”. Though they look similar, they are totally different processes. The first is an unsupervised machine learning technique and the second is a supervised one.
Let’s elaborate on the concept.

Topic classification often involves mutually exclusive classes, which means each document is labelled with one specific class. Topic modelling, on the other hand, is not mutually exclusive: the same document may involve many topics. Because topic modelling works on the basis of probability distributions, the same document can have a probability distribution spread across many topics.

For topic modelling, there are several existing algorithms that you can use: Non-Negative Matrix Factorization (NMF), Latent Semantic Analysis or Latent Semantic Indexing (LSA or LSI) and Latent Dirichlet Allocation (LDA) are some of them. In this article, we will talk about Latent Dirichlet Allocation, one of the most common algorithms for topic modelling.

Latent Dirichlet Allocation (LDA)

“The latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics.” — Wikipedia

Okay, Let’s try to understand this definition.

The basic idea of Latent Dirichlet allocation (LDA) is that documents are represented as random mixtures of various topics, and topics are represented as mixtures of different words. Now, suppose you need some articles related to animals, you have thousands of articles in front of you, and you really don’t know what these articles are about. Reading all of them just to find the ones related to animals is really cumbersome. Let’s see an example.

As an example, let’s consider we have four articles. Article no. 1 related to animal, article no. 2 related to genetic type, article no. 3 related to computer types and article no. 4 is a combination of animal and genetic type. As a human, you can easily differentiate these topics according to the words it contains. But what will you do if there are thousands of articles and each article has thousands of lines? The answer will be like -“If we can do this with the help of a computer then we should do so”. Yes, Computer can do so with the help of Latent Dirichlet allocation. Now we will try to understand how LDA works. First, we will see the graphical representation of LDA and then we will see the probability calculation formula.

The figure above is the graphical representation of LDA. In it, we can see six parameters:

α (alpha) and η (eta) — represent the Dirichlet distributions (they are the concentration parameters of the Dirichlet priors). A high alpha value indicates that each document is likely to contain most of the topics, whereas a lower alpha value indicates that a document is likely to contain only a few topics. In the same way, a higher η value indicates that each topic is likely to cover most of the words, whereas a lower η value indicates that a topic is likely to contain fewer words. (A short sampling sketch after this parameter list illustrates the effect of α.)

β (beta) and θ (theta) — represent multinomial distributions: θ is the per-document topic distribution and β is the per-topic word distribution.

z — represents the topic assignments of the words

w — represents the observed words
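To get a feel for how α shapes the document-topic distribution, here is a minimal sketch (my own illustration using NumPy, not code from the article) that draws document-topic mixtures from a Dirichlet with a low and a high concentration value:

import numpy as np

np.random.seed(0)
num_topics = 5

# Low alpha: each sampled mixture puts most of its mass on one or two topics
low_alpha_docs = np.random.dirichlet([0.1] * num_topics, size=3)

# High alpha: each sampled mixture spreads its mass fairly evenly over all topics
high_alpha_docs = np.random.dirichlet([10.0] * num_topics, size=3)

print(low_alpha_docs.round(2))
print(high_alpha_docs.round(2))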

The left side of the formula (written out just below) gives the probability of the whole collection of documents. On the right side there are four terms. The 1st and 3rd terms help us find the topics, while the 2nd and 4th help us find the words in the articles. The first two terms on the right side are Dirichlet distributions, and the remaining part is made up of multinomial distributions.
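Written out in the standard notation, with M documents, K topics and N_d words in document d, this is the joint distribution of smoothed LDA:

P(W, Z, \theta, \beta; \alpha, \eta) = \prod_{d=1}^{M} P(\theta_d; \alpha) \; \prod_{k=1}^{K} P(\beta_k; \eta) \; \prod_{d=1}^{M} \prod_{n=1}^{N_d} P(z_{d,n} \mid \theta_d) \, P(w_{d,n} \mid \beta_{z_{d,n}})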

Let’s assume, in the above figure, in the left triangle, the blue circles indicate different articles. Now if we distribute the articles over different topics, it will be distributed as shown in the right triangle. The blue circles will move to the corners of the triangle which depends on the percentage of its being that topic. This process is done by the first term of the right side of the formula. Now we use multinomial distribution to generate topics based on the percentage get from the first term.

After getting the topics, we find which words are most related to them. This is done by another Dirichlet distribution: the topics are distributed over the words in the same way.

We then use another multinomial distribution, whose parameters come from that Dirichlet, to generate the words that are most related to each topic. This process is repeated many times.

In this way, we find the words that are most related to each topic and can distribute the articles over those topics.
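To make this generative story concrete, here is a small sketch of the process LDA assumes. The two-topic vocabulary and the β values are made up purely for illustration; only the sampling steps (Dirichlet for the topic mixture, multinomials for the topic and the word) reflect the actual model:

import numpy as np

rng = np.random.default_rng(42)

vocab = ["dog", "cat", "lion", "gene", "dna", "cell"]
# Assumed per-topic word distributions (beta): one "animal" topic, one "genetics" topic
beta = np.array([
    [0.35, 0.35, 0.25, 0.02, 0.02, 0.01],  # topic 0: animals
    [0.01, 0.02, 0.02, 0.35, 0.30, 0.30],  # topic 1: genetics
])
alpha = [0.5, 0.5]  # Dirichlet prior over the two topics

def generate_document(num_words=8):
    theta = rng.dirichlet(alpha)            # document-topic mixture (Dirichlet)
    words = []
    for _ in range(num_words):
        z = rng.choice(len(beta), p=theta)  # topic for this word position (multinomial)
        w = rng.choice(vocab, p=beta[z])    # word drawn from that topic (multinomial)
        words.append(w)
    return theta, words

theta, words = generate_document()
print(theta.round(2), words)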

Implementation of LDA

You can find the code on GitHub. For implementing LDA you can use either gensim or scikit-learn; here, we will use gensim.

Load Data

For the implementation, I have used a Kaggle dataset. It contains information about 2,150 Kaggle datasets spread across 15 columns:

import pandas as pd

dataset = pd.read_csv('/content/drive/My Drive/topic modelling/voted-kaggle-dataset.csv')

Data Pre-Processing

For pre-processing, we first select the columns that are meaningful for this task and then remove the rows containing any missing values.
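The column selection itself is not shown in the original snippet; a minimal sketch, assuming we keep the Title, Subtitle, Description and Tags columns used later in the article:

# Keep only the text-bearing columns used in the rest of the article (assumed column names)
modified_dataset = dataset[['Title', 'Subtitle', 'Description', 'Tags']]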

modified_dataset = modified_dataset.dropna()

We then count the number of unique tags in the Tags column, as we will use this count as the number of topics for our model.

# The Tags column stores each dataset's tags as a single string (taken here from the full
# dataframe, hence the "nan" check below); convert() is assumed to split it into a list of tags
tag_dataset = dataset['Tags'].values

def convert(tag_string):
    return [tag.strip(" '\"[]") for tag in tag_string.split(',')]

unique_tag = []
for i in range(len(tag_dataset)):
    tag_string = str(tag_dataset[i])
    if tag_string != "nan":
        tag_word = convert(tag_string)
        for j in range(len(tag_word)):
            if tag_word[j] not in unique_tag:
                unique_tag.append(tag_word[j])
print(len(unique_tag))  # number of unique tags, used later as num_topics (329)

Removing digits and punctuation and converting the whole text to lower case makes the training task easier and increases the efficiency of the model.

import re
import string

# Translation table that strips all digits, plus the punctuation characters to remove
remove_digits = str.maketrans('', '', string.digits)
exclude = '[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]'
for column in ['Title', 'Subtitle', 'Description']:
    modified_dataset[column] = modified_dataset[column].map(lambda x: x.translate(remove_digits))
    modified_dataset[column] = modified_dataset[column].map(lambda x: re.sub(exclude, '', x))

Next, we tokenize our dataset and perform stemming.

import nltk
from nltk.stem import PorterStemmer

nltk.download('punkt')
# Tokenize the Description of every row
tokenized_dataframe = modified_dataset.apply(lambda row: nltk.word_tokenize(row['Description']), axis=1)
print(type(tokenized_dataframe))

ps = PorterStemmer()
def lemmatize_text(text):
    # Despite its name, this applies Porter stemming and keeps only tokens longer than 5 characters
    return [ps.stem(w) for w in text if len(w) > 5]

stemmed_dataset = tokenized_dataframe.apply(lemmatize_text)

Exploratory Data Analysis

By using a word cloud, we can verify whether our preprocessing was done correctly. A word cloud is an image made of words that together resemble a cloudy shape; the size of each word shows how often it appears in the text, i.e. its frequency.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join every stemmed token into a single space-separated string
dataset_words = ' '.join(word for tokens in stemmed_dataset for word in tokens)
print(type(dataset_words))

wordcloud = WordCloud(width=800, height=500,
                      background_color='white',
                      min_font_size=10).generate(dataset_words)

plt.figure(figsize=(5, 5), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Build Model

For the LDA model, we first need to build a dictionary of words in which each word is given a unique id. Then we need to create a corpus that maps each word id to its frequency in each document, i.e. pairs of the form (word_id, word_frequency).

import gensim

dictionary_of_words = gensim.corpora.Dictionary(stemmed_dataset)
word_corpus = [dictionary_of_words.doc2bow(tokens) for tokens in stemmed_dataset]
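To check what the corpus looks like, you can print the bag-of-words representation of one document; each pair is a (word_id, word_frequency) tuple (a quick inspection step I added, not part of the original snippet):

# First ten (word_id, word_frequency) pairs of the first document, plus the words they stand for
print(word_corpus[0][:10])
print([(dictionary_of_words[word_id], freq) for word_id, freq in word_corpus[0][:10]])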

Finally, train the model.

lda_model = gensim.models.ldamodel.LdaModel(corpus=word_corpus,
                                            id2word=dictionary_of_words,
                                            num_topics=329,  # the number of unique tags found above
                                            random_state=101,
                                            update_every=1,
                                            chunksize=300,
                                            passes=50,
                                            alpha='auto',
                                            per_word_topics=True)
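Once the model is trained, you can list the most probable words per topic with gensim's print_topics (shown here only as a quick sanity check; the exact words will depend on the run):

# Show the 10 most probable words for five of the topics
for topic_id, topic_terms in lda_model.print_topics(num_topics=5, num_words=10):
    print(topic_id, topic_terms)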

Coherence measures how semantically related the top words within a topic are; higher coherence generally means more interpretable topics.

from gensim.models import CoherenceModel

coherence_val = CoherenceModel(model=lda_model,
                               texts=stemmed_dataset,
                               dictionary=dictionary_of_words,
                               coherence='c_v').get_coherence()

print('Coherence Score: ', coherence_val)

Coherence value: 0.4
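One common way to use this score (a sketch of my own, not something done in the original article) is to train models with a few different topic counts, reusing the corpus and dictionary built above, and keep the model with the highest coherence. Re-training is expensive, so fewer passes are used here:

def coherence_for(num_topics):
    # Train a throw-away model with the given number of topics and score it with c_v coherence
    model = gensim.models.ldamodel.LdaModel(corpus=word_corpus,
                                            id2word=dictionary_of_words,
                                            num_topics=num_topics,
                                            random_state=101,
                                            passes=10)
    return CoherenceModel(model=model, texts=stemmed_dataset,
                          dictionary=dictionary_of_words,
                          coherence='c_v').get_coherence()

for k in [10, 50, 100, 200, 329]:
    print(k, coherence_for(k))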

Evaluation

# Topics assigned to the third document, sorted by probability (highest first)
for index, score in sorted(lda_model[word_corpus[2]][0], key=lambda tup: -1 * tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))

The topic at the top has the highest probability, and its top words suggest it is related to something like economics.

You can find all the code on GitHub.

References:

  1. Latent Dirichlet Allocation by David M. Blei, Andrew Y. Ng & Michael I. Jordan.
  2. Latent Dirichlet Allocation by Luis Serrano.
  3. Latent Dirichlet Allocation (algorithm) by ML Papers Explained — A.I. Socratic Circles — AISC.

And finally, thank you for reading. Any feedback is appreciated.
