NLP with LDA (Latent Dirichlet Allocation) and Text Clustering to improve classification

Abdul Qadir
Towards Data Science
8 min read · Dec 7, 2020


Photo by Romain Vignes on Unsplash

This post is part 2 of solving CareerVillage’s Kaggle challenge; however, it also serves as a general-purpose tutorial for the following three things:

  • Finding topics and keywords in texts using LDA
  • Using Spacy’s Semantic Similarity library to find similarities between texts
  • Using scikit-learn’s DBSCAN clustering algorithm for topic and keyword clustering

Problem Description

This section serves as a short reminder of what we are trying to do. CareerVillage, in its essence, is like Stackoverflow or Quora but for career questions. Users can post questions about any career, such as computer science, pharmacology, or aerospace engineering, and volunteer professionals do their best to answer them.

When a new question comes in, CareerVillage recommends that question to a specific professional who is suitable to answer it. In order to maximize the chance that a user’s question gets answered, CareerVillage needs to send the right question to the most apt professional. This is where our job comes in! We have to design a recommendation system that takes in a newly posted question and outputs the professionals who are most suitable to answer it.

Data Preparation

Before we perform topic modeling, we need to specify our goals: in what context do we need topic modeling? In part 3 of this series (where we build the model), we calculate four important features:

  • No. of common topics/tags between the question and the topics/tags the professional is following (follow_I)
  • No. of common topics/tags between the question and the topics/tags of the questions previously answered by the professional (prev_I).
  • Intersection over Interests (follow_IoI), which is the no. of common topics between the question and the professional’s interests, divided by the number of topics the professional is following.
  • Intersection over Union (follow_IoU), which is the no. of common topics between the question and the professional’s interests, divided by the total no. of topics in the question and the professional’s interests combined.

The idea behind these features is that the more topics a question shares with a professional’s interests, the stronger the case for sending that question to that professional.
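To make these definitions concrete, here is a minimal sketch of how the intersection features could be computed from plain Python sets; the tag lists are made up purely for illustration.

```python
# Toy example only: made-up tag sets, not the actual CareerVillage data.
question_tags = {"datascience", "statistics", "career"}
professional_tags = {"datascience", "python"}

follow_I = len(question_tags & professional_tags)               # common tags -> 1
follow_IoI = follow_I / len(professional_tags)                  # 1 / 2 = 0.5
follow_IoU = follow_I / len(question_tags | professional_tags)  # 1 / 4 = 0.25

print(follow_I, follow_IoI, follow_IoU)
```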

However, there are professionals who don’t follow any tags but have answered questions. Ideally, we would extract topics/tags from their answers and assign them to the professional. There’s also the case where questions are under-tagged by the user, e.g. ‘How hard is the daily life of a data scientist?’ tagged as #dailylife could also use the tag #datascience or #datascientist. Our goal is to find these missing tags and improve the quality of the features mentioned above.

Running the following (long) script downloads the data from a git repository, reads it, and preprocesses it for our model. I did not write the preprocessing code; it was written by one of my team members.
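The full script is too long to walk through here; as a rough sketch of its first stage (file paths and column names below are placeholders in the style of the Kaggle dataset, not necessarily the exact ones used in the notebook), the loading and merging looks something like this:

```python
import pandas as pd

# Placeholder file names; the real script pulls these from a git repository.
questions = pd.read_csv("data/questions.csv")          # question id, title, body
answers = pd.read_csv("data/answers.csv")              # answer body, author id, question id
professionals = pd.read_csv("data/professionals.csv")  # professional id, followed tags

# Pair each professional with the questions they have answered.
df = (answers
      .merge(questions, left_on="answers_question_id", right_on="questions_id")
      .merge(professionals, left_on="answers_author_id", right_on="professionals_id"))
```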

The first row of the processed dataset is below (excluding some features irrelevant to this article). You can see that the professional’s average answer score is 5.0, but they don’t seem to be following any tags, and there are no tags in ‘prev_q_tags’ either. Hence, even though the professional answered a question, those features show up empty because the question was not tagged and the professional is not following any tags.

Topic Modeling (LDA)

As you can see from the image above, we will need to find tags to fill in our feature values, and this is where LDA helps us. But first, what is LDA? A very basic explanation looks like this:

Imagine you have 2 documents, each covering 2 topics, i.e. 4 topics in total. We can represent each document using some topics, and each topic can be represented by some words. What LDA does is take all the words present in our documents and randomly assign them to each topic. So if we had 10 words, each topic would be a mixture of these 10 words, but some words will carry very little weight in some topics, e.g. the word ‘cat’ would have a very low weight in a topic about ‘software engineering’. At this point, we have 4 topics with 10 words each, and each word has a randomly assigned probability for each topic.

The next step for LDA is to iterate over these probabilities and improve them in such a way that we maximize the probability of generating our original documents using these topics. Remember, each document is a mixture of topics, and each topic is a mixture of weighted words. Our goal is to find the right mixture of weighted words for each topic, such that it maximizes our chances of regenerating the original 2 documents if we only had these 4 topics and their associated weighted words. I won’t go into the exact details of how it’s done, but if you are interested you can read this wonderful article.
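If you want to see these mixtures directly, here is a tiny toy example (not from the original notebook) using scikit-learn’s LatentDirichletAllocation on two made-up documents; the document-topic and topic-word matrices it produces are exactly the ‘mixtures’ described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "data scientist statistics machine learning data",
    "aircraft engine aerospace engineering design aircraft",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)       # word counts per document

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # each row: a document's mixture of topics
topic_word = lda.components_      # each row: a topic's weights over the vocabulary

print(doc_topic)
print(vec.get_feature_names_out())
print(topic_word)
```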

In order to fill in the missing tags, we need to iterate over each question, apply LDA to find new topics, and add these tags to the professional who answered that question. Similarly, we repeat these steps for each professional’s answers to find the topics/tags they are implicitly following.

Before we apply LDA, we need to ensure that our dataset is processed using natural language processing (NLP). For example, the above question ‘How hard is the daily life of a data scientist?’ has words that add no real ‘meaning’ to the sentence, e.g. ‘is’, ‘a’, ‘of’, ‘how’. The essence of the question can be captured in ‘hard daily life data scientist’. To process text in such a manner, I’ve added code below which removes punctuation, stop words, HTML tags, etc. You can call the nlp_pipeline() function, which performs all the processing in one aggregated step.
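A minimal sketch of what a function like nlp_pipeline() might do (here using NLTK’s stop word list; the exact implementation in the notebook may differ in details) looks like this:

```python
import re
import string

from nltk.corpus import stopwords  # assumes nltk.download("stopwords") has been run

STOPWORDS = set(stopwords.words("english"))

def nlp_pipeline(text: str) -> str:
    """Lowercase, strip HTML tags and punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)                                 # remove html tags
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]          # remove stop words
    return " ".join(tokens)

print(nlp_pipeline("How hard is the daily life of a data scientist?"))
# -> "hard daily life data scientist"
```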

Next, we perform LDA on each question and each answer using the function below, which performs the following steps:

  • Perform NLP on the text body.
  • Use CountVectorizer to turn our text into a matrix of token counts, i.e. count the number of instances of each token/word in our body of text.
  • Find one topic and two words per topic in our body of text. I chose these hyperparameters because each question is likely talking about only one topic, and we can use two words/tags to represent that topic.
  • Return the words/tags in that one topic only if they are present in the dataset of all the tags in CareerVillage’s database. This stops irrelevant tags from getting retrieved.
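A hedged sketch of those four steps is below; ‘all_tags’ stands in for the set of tags in CareerVillage’s database, and nlp_pipeline() is the cleaning function from the previous section (the notebook’s actual function may differ in details).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def find_topics(text, all_tags, n_topics=1, n_words=2):
    """Extract the top words of one LDA topic from a single piece of text,
    keeping only words that already exist as CareerVillage tags."""
    cleaned = nlp_pipeline(text)                      # step 1: NLP preprocessing
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform([cleaned])      # step 2: matrix of token counts

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)                                   # step 3: one topic, two words

    words = vectorizer.get_feature_names_out()
    tags = []
    for topic in lda.components_:
        top_idx = topic.argsort()[-n_words:]          # indices of the top-weighted words
        tags.extend(words[i] for i in top_idx)

    return [t for t in tags if t in all_tags]         # step 4: keep only known tags
```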

The code below then applies these functions to the questions and appends the new tags to the dataset. It takes around 30 minutes to run on Colab.
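As a rough sketch (column names here are illustrative, not necessarily the exact ones in the processed dataframe), the application step could look like this, reusing the find_topics() sketch above:

```python
def add_question_tags(row, all_tags):
    """Merge LDA-extracted tags into a question's existing tag list."""
    new_tags = find_topics(row["questions_body"], all_tags)
    existing = row["q_tags"] if isinstance(row["q_tags"], list) else []
    return sorted(set(existing) | set(new_tags))

df["q_tags"] = df.apply(lambda row: add_question_tags(row, all_tags), axis=1)
```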

Applying these functions to the body of answers is similar, and the code can be found in the notebook for this project.

Checking the first row of the dataset again gives us the following result! We have new tags from the questions (consulting and resume)!

Semantic Similarity using Spacy

In our next step, we are going to compute semantic similarity using a wonderful library called Spacy and its built-in similarity method.

Before I dive into the details, let me explain what semantic similarity is. Words like ‘engineering’ and ‘engineers’ are similar, and computers can quickly recognize that using something like the Levenshtein distance. However, what about words like ‘Queen’ and ‘Music’? Those two words are very different taken literally, or in terms of Levenshtein distance, but contextually they can be linked together through the music band Queen. Hence, semantic similarity also captures the context of our words.

Our goal is to group semantically similar tags/topics together so that we get a higher intersection score. For example, if a question has the tags ‘university’ and ‘education’ while the professional only follows the tag ‘education’, our intersection over union score will be 1/2. However, university and education entail roughly the same thing, no? Ideally, our intersection over union score should be 1, since the professional is practically following both topics.

Luckily for us, Spacy has a huge model that maps over 1 million words to a 300-dimensional vector space in which semantically similar words sit close together. All we have to do is calculate the distance between vectors to find how ‘similar’ two words are. You can use smaller models, but their vector mapping isn’t as good as the bigger one’s and they cover fewer words.

The code below is an example of running this model on the words ‘university’ and ‘education’. The resulting similarity score is ~0.663.

Semantic Similarity score between ‘education’ and ‘university’
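For reference, a minimal version of that check (assuming the large English model, en_core_web_lg, is the one being used) looks roughly like this:

```python
import spacy

# Assumes the large model has been installed:
#   python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

education = nlp("education")
university = nlp("university")

print(education.similarity(university))  # roughly 0.66 with this model
```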

In our next step, we have to cluster similar words together. To get the vector representations of our words, we simply apply Spacy’s model to all our tags. The code below does that.
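A minimal sketch of that step, reusing the ‘nlp’ model loaded above and assuming ‘all_tags’ is the list of unique tags gathered earlier:

```python
import numpy as np

# Each tag becomes a 300-dimensional vector; for multi-word tags, spaCy's
# Doc.vector is the average of its token vectors.
tag_vectors = np.array([nlp(tag).vector for tag in all_tags])
print(tag_vectors.shape)  # (number of tags, 300)
```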

DBSCAN for text clustering

Again, I will start this section by restating its goal. We want to group semantically similar tags together, and at the end of the last section, we converted all our tags to their respective vector representations using Spacy’s similarity model. Now, all we have to do is cluster similar vectors together using sklearn’s DBSCAN clustering algorithm, which performs clustering on vector arrays.

Unfortunately, the DBSCAN model does not have a built-in predict function which we can use to label new tags. However, the code below does that manually (courtesy of this stackoverflow question).
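A hedged sketch of both steps is below. The eps and min_samples values are illustrative placeholders rather than the notebook’s tuned values, and dbscan_predict() mirrors the manual trick from that answer: give a new point the label of its nearest core sample if it lies within eps, otherwise mark it as noise (-1).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative hyperparameters; they need tuning for the 300-d spaCy vectors.
dbscan = DBSCAN(eps=3.0, min_samples=2).fit(tag_vectors)
print(dbscan.labels_)  # one cluster label per tag, -1 for noise

def dbscan_predict(model, new_points):
    """Label new points by their nearest core sample (or -1 if none is within eps)."""
    labels = np.full(len(new_points), -1)
    core_points = model.components_                           # coordinates of core samples
    core_labels = model.labels_[model.core_sample_indices_]   # labels of those core samples
    for i, point in enumerate(new_points):
        dists = np.linalg.norm(core_points - point, axis=1)   # Euclidean, DBSCAN's default
        nearest = np.argmin(dists)
        if dists[nearest] < model.eps:
            labels[i] = core_labels[nearest]
    return labels
```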

Let’s test out our model on the words [‘university’, ‘colleges’, ‘education’, ‘courses’]. Ideally, we want all these to be in the same category.
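With the hypothetical dbscan_predict() sketch above, that check could look like this:

```python
test_words = ["university", "colleges", "education", "courses"]
test_vectors = np.array([nlp(word).vector for word in test_words])

print(dbscan_predict(dbscan, test_vectors))
# In the article's run, three of the four words land in the same cluster.
```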

Model’s class labels for the list of words above

It turns out the model put 3 of the 4 words in the same cluster, not bad!

From this point onwards, we are ready to group similar tags together in order to get higher feature values for our intersection features! We simply have to pass the ‘q_tags’ and ‘following_tags’ through this model before calculating our intersection feature values.

Summary

I hope this article served as a good tutorial for using Spacy, LDA, and the DBSCAN clustering algorithm. Although these techniques are applied to the CareerVillage Kaggle challenge here, they are just as applicable to other scenarios. In the next part, we will discuss how to use the methods covered in this article to finally build our model for the CareerVillage Kaggle challenge!
