Beginner’s Guide to LDA Topic Modelling with R

Identifying topics within unstructured text

Farren Tang
Towards Data Science



Nowadays many people want to start out with Natural Language Processing (NLP), yet they don't know where or how to start. It might be because there are too many "guides" and "readings" available that never quite tell you where to begin. This article aims to give readers a step-by-step guide on how to do topic modelling using Latent Dirichlet Allocation (LDA) analysis with R.

This technique is simple and works effectively on small datasets. Hence, I would suggest it for people who are trying out NLP and using topic modelling for the first time.

What is topic modelling? In layman's terms, topic modelling tries to find similar topics across different documents and to group different words together, such that each topic consists of words with similar meanings. An analogy that I often like to give: imagine a storybook torn into individual pages. After running a topic modelling algorithm, you should be able to come up with topics such that each topic consists of words from a single chapter. If all you need is a positive-or-negative label, you may simply use sentiment analysis instead.

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. (Wikipedia)

After this brief introduction to topic modelling, the rest of the article describes a step-by-step process. It is made up of four parts: loading the data, pre-processing the data, building the model, and visualising the words in each topic.

As mentioned above, I will be using the LDA model, a probabilistic model that assigns each word a probability score for the topic it most likely belongs to. I will skip the technical explanation of LDA, as there are many write-ups available. Not to worry: I will explain the terminology as I use it.

1. Loading of data

For simplicity, the dataset we will be using is the first 5,000 rows of the Twitter sentiment data from Kaggle. For our model, we do not need labelled data. All we need is a text column that we want to create topics from and a unique id column. The original data has 18 columns and 13,000 rows, but we will just be using the text and id columns.

library(data.table)
library(dplyr)

data <- fread("~/Sentiment.csv")
# keep only the text and id columns of the first 5,000 rows
data <- data %>% select(text, id) %>% head(5000)
Dataframe after selecting the relevant columns for analysis

2. Pre-processing

As we can observe from the text, many tweets contain irrelevant information, such as retweet markers (RT), Twitter handles, punctuation, stopwords (and, or, the, etc.) and numbers. These add unnecessary noise to our dataset, which we need to remove during the pre-processing stage.

library(tidytext)
library(tidyr)

# remove retweet markers and Twitter handles
data$text <- sub("RT.*:", "", data$text)
data$text <- sub("@.* ", "", data$text)
# break the text into individual tokens (one word per row)
text_cleaning_tokens <- data %>%
  tidytext::unnest_tokens(word, text)
# strip digits and punctuation from the tokens
text_cleaning_tokens$word <- gsub('[[:digit:]]+', '', text_cleaning_tokens$word)
text_cleaning_tokens$word <- gsub('[[:punct:]]+', '', text_cleaning_tokens$word)
# drop single-character tokens and stopwords
text_cleaning_tokens <- text_cleaning_tokens %>%
  filter(!(nchar(word) == 1)) %>%
  anti_join(stop_words)
tokens <- text_cleaning_tokens %>% filter(!(word == ""))
# reassemble the cleaned tokens into one text string per tweet id
tokens <- tokens %>% group_by(id) %>% mutate(ind = row_number()) %>%
  tidyr::spread(key = ind, value = word)
tokens[is.na(tokens)] <- ""
tokens <- tidyr::unite(tokens, text, -id, sep = " ")
tokens$text <- trimws(tokens$text)

3. Model Building

Finally here comes the fun part! Creating the model.

First you will have to create a DTM (document term matrix), which is a sparse matrix with your documents and terms as dimensions. When building the DTM, you can select how you want to tokenise your text, that is, whether to break a sentence into single words or two-word phrases. This will depend on how you want the LDA to read your words: you will need to ask yourself whether single words (unigrams) or two-word phrases (bigrams) make sense in your context. For instance, if your texts contain many phrases such as "failed executing" or "not appreciating", then you should let the algorithm use a window of up to 2 words; otherwise unigrams will work just fine. In our case, because it's Twitter sentiment data, we will go with a window size of 1–2 words and let the algorithm decide which phrases are important enough to concatenate together. We will also explore the term frequency matrix, which shows how many times each word/phrase occurs in the entire corpus of text. If a term occurs fewer than 2 times, we discard it, as it does not add any value to the algorithm, and discarding it helps reduce computation time as well.
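As a toy illustration of what the 1–2 word window means, here is how unigrams and bigrams are generated from one cleaned phrase (the tokens are hypothetical, not from the dataset):

```r
# hypothetical cleaned tokens from one tweet
words <- c("failed", "executing", "the", "plan")

# a window of 1 keeps the single words as-is
unigrams <- words

# a window of 2 additionally pairs each word with its neighbour,
# joined with "_" the way CreateDtm names its n-gram terms
bigrams <- paste(head(words, -1), tail(words, -1), sep = "_")
```

With a 1–2 word window, the DTM contains columns for both sets of terms, and rarely-occurring bigrams get pruned by the frequency filter below.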

library(textmineR)

#create DTM with a 1-2 word tokenisation window
dtm <- CreateDtm(tokens$text,
                 doc_names = tokens$id,
                 ngram_window = c(1, 2))
#explore the basic frequency
tf <- TermDocFreq(dtm = dtm)
original_tf <- tf %>% select(term, term_freq, doc_freq)
rownames(original_tf) <- 1:nrow(original_tf)
# Eliminate words appearing less than 2 times or in more than half of the
# documents
vocabulary <- tf$term[ tf$term_freq > 1 & tf$doc_freq < nrow(dtm) / 2 ]
dtm <- dtm[, vocabulary]

With your DTM, you can run the LDA algorithm for topic modelling. You will have to manually assign a number of topics k; the algorithm then calculates a coherence score to allow us to choose the best number of topics between 1 and k. What are coherence and the coherence score? Probabilistic coherence measures whether the words in the same topic make sense when they are put together. This gives us the quality of the topics being produced. The higher the score for a given k, the more closely related the words within each topic, and the more sense the topics make. For instance, compare {dog, talk, television, book} with {dog, ball, bark, bone}: the latter will yield a higher coherence score than the former, as its words are more closely related.

In our example, we try every k from 1 to 20, fit an LDA model for each, and plot the coherence scores. It's up to the analyst to define how many topics they want.

library(ggplot2)

k_list <- seq(1, 20, by = 1)
model_dir <- paste0("models_", digest::digest(vocabulary, algo = "sha1"))
if (!dir.exists(model_dir)) dir.create(model_dir)

# fit (or load a cached) LDA model for each candidate k
model_list <- TmParallelApply(X = k_list, FUN = function(k){
  filename = file.path(model_dir, paste0(k, "_topics.rda"))

  if (!file.exists(filename)) {
    m <- FitLdaModel(dtm = dtm, k = k, iterations = 500)
    m$k <- k
    m$coherence <- CalcProbCoherence(phi = m$phi, dtm = dtm, M = 5)
    save(m, file = filename)
  } else {
    load(filename)
  }

  m
}, export = c("dtm", "model_dir")) # export only needed for Windows machines

#model tuning
#choosing the best model by mean coherence
coherence_mat <- data.frame(k = sapply(model_list, function(x) nrow(x$phi)),
                            coherence = sapply(model_list, function(x) mean(x$coherence)),
                            stringsAsFactors = FALSE)
ggplot(coherence_mat, aes(x = k, y = coherence)) +
  geom_point() +
  geom_line(group = 1) +
  ggtitle("Best Topic by Coherence Score") + theme_minimal() +
  scale_x_continuous(breaks = seq(1, 20, 1)) + ylab("Coherence")

Upon plotting, we see that k = 12 gives us the highest coherence score. The coherence score here is rather low, and there will definitely be a need to tune the model, such as increasing k or using more text, to achieve better results. For explanation purposes, though, we will ignore that and simply go with the k that has the highest coherence score. After settling on the number of topics, we want to have a peek at the words within each topic. Each word/phrase in a topic is assigned a phi value, pr(word|topic): the probability of the word given the topic. We take the top 20 words by phi value in each topic; these top 20 terms then describe what the topic is about.
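To make the phi value concrete, here is a toy sketch (made-up numbers, two topics over a three-word vocabulary) of how top terms are read off the phi matrix, mirroring what GetTopTerms does:

```r
# hypothetical phi matrix: rows are topics, columns are words,
# and each row is a probability distribution over the vocabulary
phi <- rbind(
  t_1 = c(dog = 0.5, bark = 0.3, ball = 0.2),
  t_2 = c(dog = 0.1, bark = 0.1, ball = 0.8)
)
rowSums(phi)  # each topic's probabilities sum to 1

# the top term of each topic is the word with the highest pr(word|topic)
top_terms <- apply(phi, 1, function(p) names(sort(p, decreasing = TRUE))[1])
```

Here topic t_1 would be summarised by "dog" and topic t_2 by "ball"; GetTopTerms does the same ranking, but keeps the top M words per topic.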

model <- model_list[[which.max(coherence_mat$coherence)]]
# top 20 terms per topic, ranked by phi value
model$top_terms <- GetTopTerms(phi = model$phi, M = 20)
top20_wide <- as.data.frame(model$top_terms)
Preview of the top 10 words for the first 5 topics; the first word has the highest phi value

The above picture shows the first 5 of the 12 topics. The words are in descending order of phi value: the higher a word ranks, the more probable it is to belong to the topic. It seems like a couple of topics overlap. It's up to the analyst to decide whether to combine different topics by eyeballing them, or we can run a dendrogram to see which topics should be grouped together. The dendrogram uses Hellinger distance (the distance between two probability vectors) to decide how closely related two topics are. For instance, the dendrogram below suggests that there is greater similarity between topics 10 and 11.
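As a quick sketch of the distance itself (toy probability vectors, not real topics): the Hellinger distance between two discrete distributions p and q is sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2), which ranges from 0 (identical distributions) to 1 (no overlap at all):

```r
# two hypothetical topic-word distributions over a 3-word vocabulary
p <- c(0.7, 0.2, 0.1)
q <- c(0.1, 0.2, 0.7)

# Hellinger distance between two probability vectors
hellinger <- function(a, b) sqrt(sum((sqrt(a) - sqrt(b))^2)) / sqrt(2)

hellinger(p, p)  # identical topics: distance 0
hellinger(p, q)  # clearly different topics: about 0.52
```

CalcHellingerDist applies this pairwise to the rows of the phi matrix, giving a topic-by-topic distance matrix for the clustering below.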

model$topic_linguistic_dist <- CalcHellingerDist(model$phi)
model$hclust <- hclust(as.dist(model$topic_linguistic_dist), "ward.D")
# label each topic with its single most representative term
model$labels <- LabelTopics(assignments = model$theta > 0.05, dtm = dtm, M = 1)
model$hclust$labels <- paste(model$hclust$labels, model$labels[, 1])
plot(model$hclust)

4. Visualisation

We can create a word cloud to see the words belonging to a certain topic, sized by probability. Below represents topic 2: as 'gopdebate' is the most probable word in topic 2, it will be the largest word in the word cloud.

library(reshape2)
library(wordcloud)
library(RColorBrewer)

#visualising topics of words based on the max value of phi
set.seed(1234)
# long-format table of phi values (word, topic, value) built from the model,
# used to attach each top term's probability
allterms <- data.frame(t(model$phi))
allterms$word <- rownames(allterms)
rownames(allterms) <- 1:nrow(allterms)
allterms <- melt(allterms, id.vars = "word") %>%
  rename(topic = variable)

final_summary_words <- data.frame(top_terms = t(model$top_terms))
final_summary_words$topic <- rownames(final_summary_words)
rownames(final_summary_words) <- 1:nrow(final_summary_words)
final_summary_words <- final_summary_words %>% melt(id.vars = c("topic"))
final_summary_words <- final_summary_words %>% rename(word = value) %>% select(-variable)
final_summary_words <- left_join(final_summary_words, allterms)
final_summary_words <- final_summary_words %>% group_by(topic, word) %>%
  arrange(desc(value))
final_summary_words <- final_summary_words %>% group_by(topic, word) %>%
  filter(row_number() == 1) %>% ungroup() %>%
  tidyr::separate(topic, into = c("t", "topic")) %>% select(-t)
word_topic_freq <- left_join(final_summary_words, original_tf, by = c("word" = "term"))

# one word cloud per topic, all saved into a single pdf
pdf("cluster.pdf")
for (i in 1:length(unique(final_summary_words$topic))) {
  wordcloud(words = subset(final_summary_words, topic == i)$word,
            freq = subset(final_summary_words, topic == i)$value,
            min.freq = 1, max.words = 200, random.order = FALSE,
            rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
}
dev.off()
Word cloud for topic 2

5. Conclusion

We are done with this simple topic modelling exercise using LDA and visualisation with word clouds. You may refer to my GitHub for the full script and more details. This is not a full-fledged LDA tutorial, as there are other cool metrics available, but I hope this article provides you with a good guide on how to start topic modelling in R using LDA. I would also strongly suggest reading up on other kinds of algorithms too. I'm sure you will not get bored!

Feel free to drop me a message if you think that I am missing out on anything.

Happy topic modeling!

References:

  1. Wikipedia
  2. Tyler Doll, LDA Topic Modelling (2018)
  3. Thomas W. Jones, Topic Modelling (2019)
