LDA Topic Modeling for High Blood Pressure Drug Reviews

Topic modeling is a text analysis technique that uses unsupervised machine learning. Its goal is to discover hidden topics, the themes that run across a large collection of text documents. Each document in the corpus is made up of one or more topics. Topic modeling is well suited to document clustering and information retrieval from unstructured text.
Latent Dirichlet Allocation (LDA) is a topic modeling algorithm that clusters the text of each document in a corpus into a list of topics.
Here we apply LDA to a set of reviews of high blood pressure drugs and split them into topics.
Motivation
The objective is to draw insights from these reviews by analyzing the clustered topics:
- Explore the underlying topics that are common across all high blood pressure drugs in the reviews.
- Predict the rating for a given review text.
Let’s get started!
The Dataset
The dataset used in this analysis is a set of around 18,000 patient reviews of drugs used to treat high blood pressure, extracted from the WebMD Drug Reviews Dataset, which can be downloaded from Kaggle. The data provider built the dataset by scraping the WebMD site. The original dataset contains around 0.36 million unique reviews and is updated through March 2020.
import pandas as pd

# Read the data and keep only the high blood pressure reviews
drug_df = pd.read_csv('C:/Users/Johnny Phua/Metis/Project4/Drug Review/webmd.csv')
hbp_drug_df = drug_df[drug_df["Condition"] == 'High Blood Pressure']

# Take a peek at the data
print(len(hbp_drug_df))
hbp_drug_df.head()

Data Pre-processing
The following steps were performed:
- Lowercase the words and remove non-letter characters.
- Words with three or fewer characters are removed.
- Tokenization: split the text into words.
- All stopwords are removed.
- Words are lemmatized: inflected forms are reduced to a common base, e.g., verbs in the past and future tenses are changed to the present tense.
- Words are stemmed: words are reduced to their root form.
The functions below lowercase the words, remove non-letter characters, and remove words with three or fewer characters:
import re

def clean_non_alpha(text):
    # replace every non-letter character with a space
    return re.sub('[^a-zA-Z]', ' ', str(text))

def remove_short_word(text):
    # remove words with three or fewer characters
    return re.sub(r'\b\w{1,3}\b', '', str(text))

def convert_to_lower(text):
    # lowercase every uppercase letter
    return re.sub(r'([A-Z])', lambda m: m.group(1).lower(), str(text))
Tokenization
Tokenization is simply the process of splitting a string into a list of tokens. A token is a piece of a whole; in our context, a word is a token in a sentence.
from nltk.tokenize import RegexpTokenizer
def get_tokenize(text):
    """
    Tokenize text into a list of words.
    """
    tokenizer = RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(str(text))
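For example, a quick sanity check of the tokenizer on a short sentence:
get_tokenize("this drug lowered my blood pressure")
# ['this', 'drug', 'lowered', 'my', 'blood', 'pressure']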

Remove stopwords, Lemmatization, Stemming
The functions below do the jobs of removing stopwords, lemmatization, and stemming.
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

stop_words = stopwords.words('english')
lemma = WordNetLemmatizer()
stemmer = PorterStemmer()

def remove_stopwords(token):
    """
    Remove stopwords from a list of tokens.
    """
    return [item for item in token if item not in stop_words]

def clean_lemmatization(token):
    # reduce verbs to their base (present tense) form
    return [lemma.lemmatize(word=w, pos='v') for w in token]

def clean_stem(token):
    # reduce each word to its root form
    return [stemmer.stem(i) for i in token]
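
Before moving on, here is a minimal sketch of how the cleaning functions above can be chained into a single pipeline; the 'Reviews' column name is an assumption about the WebMD CSV, and reviews_text_processed and reviews are the names referenced later in this post.
def preprocess(text):
    # NOTE: this chaining is a sketch; the 'Reviews' column name is assumed
    text = clean_non_alpha(text)        # keep letters only
    text = convert_to_lower(text)       # lowercase
    text = remove_short_word(text)      # drop words with <= 3 characters
    tokens = get_tokenize(text)         # split into words
    tokens = remove_stopwords(tokens)   # drop stopwords
    tokens = clean_lemmatization(tokens)
    return clean_stem(tokens)

hbp_drug_df['reviews_text_processed'] = hbp_drug_df['Reviews'].apply(preprocess)
reviews = list(hbp_drug_df['reviews_text_processed'])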

The dataset is now clean and in the right format, so it can be used for LDA modeling.
Before we start training the model, however, we need to decide on the number of topics that makes them most interpretable.
1) How do we determine the number of topics (k) in a topic model?
One way to determine the optimal number of topics (k) for a topic model is to compare c_v coherence scores: the optimal number of topics is the one that produces the highest score. Coherence takes the most frequently occurring words in each of the generated topics, rates the semantic similarity between them (using either UCI or UMass for the pairwise calculations), and then takes the mean coherence score across all the topics in the model. (http://qpleple.com/topic-coherence-to-evaluate-topic-models/)
2) How do we evaluate and improve the interpretability of the model's results?
Once we choose the optimal number of topics, the next question is how best to evaluate and improve the interpretability of those topics. One approach is to visualize the results of our topic model to check that they make sense for our scenario. We can use the pyLDAvis tool to visualize the fit of our LDA model across topics and their top words.
Bag of Words
Create a dictionary from 'reviews' (the list of preprocessed token lists) that records each unique token and the number of times it appears in the training set.
from gensim import corpora
dictionary = corpora.Dictionary(reviews)
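As a quick sanity check, we can peek at the vocabulary size and a few entries of the token-to-id mapping gensim keeps internally:
print(len(dictionary))                        # vocabulary size
print(list(dictionary.token2id.items())[:5])  # a few (token, id) pairs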

Use gensim's built-in filter_extremes function to filter out tokens that appear in:
- fewer than 15 documents (absolute number), or
- more than 50% of the documents (a fraction of the total corpus size, not an absolute number),
- and, after the two steps above, keep only the 2,130 most frequent of the remaining tokens, i.e., those with 11 or more occurrences.
Checking the frequency distribution of the entire word dictionary (sorted by frequency in descending order) shows that the top 2,130 words/tokens each occur 11 or more times.
from nltk import FreqDist

# first get a list of all words
all_words = [word for item in list(hbp_drug_df['reviews_text_processed']) for word in item]

# use nltk's FreqDist to get a frequency distribution of all words
fdist = FreqDist(all_words)

# choose k and visually inspect the bottom 10 words of the top k
k = 2130
top_k_words = fdist.most_common(k)
top_k_words[-10:]

Apply the filter mentioned above to the gensim dictionary:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=2130)
Then, for each document, we create a bag-of-words representation that records which tokens appear and how many times each appears. Save this to 'doc_term_matrix'.
doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews]
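To sanity-check the result, the first document's (token id, count) pairs can be mapped back to readable words:
# inspect the bag-of-words of the first review in human-readable form
print([(dictionary[token_id], count) for token_id, count in doc_term_matrix[0]])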
Running LDA using Bag of Words
With the dictionary and doc_term_matrix ready, we can now train the LDA model using gensim.models.ldamodel.LdaModel.
At this point, however, the next question is: how many topics (k) should we use to train our model? We can use the c_v coherence score and the pyLDAvis visualization tool to decide on the optimal number of topics.
Calculate the c_v Coherence Score:
# Calculate the c_v coherence score for each k from 1 to 9
import numpy as np
from tqdm import tqdm
from gensim.models import LdaModel, CoherenceModel

coherenceList_cv = []
num_topics_list = np.arange(1, 10)
for num_topics in tqdm(num_topics_list):
    lda_model = LdaModel(corpus=doc_term_matrix, id2word=dictionary,
                         num_topics=num_topics, random_state=100,
                         update_every=1, chunksize=1000, passes=50,
                         alpha='auto', per_word_topics=True)
    cm_cv = CoherenceModel(model=lda_model, corpus=doc_term_matrix, texts=reviews,
                           dictionary=dictionary, coherence='c_v')
    coherenceList_cv.append(cm_cv.get_coherence())
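The score distribution discussed below can then be plotted with a few lines of matplotlib (a minimal sketch):
import matplotlib.pyplot as plt

# plot c_v coherence against the number of topics
plt.plot(num_topics_list, coherenceList_cv, marker='o')
plt.xlabel('Number of topics (k)')
plt.ylabel('c_v coherence score')
plt.show()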
Plotting the pyLDAvis:
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, doc_term_matrix, dictionary)
vis
What is topic coherence?
As a reminder, coherence takes the most frequently occurring words in each generated topic, rates the semantic similarity between them (using either UCI or UMass for the pairwise calculations), and then takes the mean coherence score across all the topics in the model.
As you can see from the c_v coherence score distribution below for different values of k, the numbers of topics k=3 and k=5 gave equally high scores.

However, if we plot the pyLDAvis, it is obvious that k=3 gives a more promising result than k=5. Notice that for k=5, topics 1, 2, and 3 heavily overlap one another, while for k=3 the topics are well separated into different quadrants.

The t-SNE plot also clearly shows that k=3 is much better than k=5: when k=5, three of the topics overlap one another.
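The t-SNE view can be reproduced with a sketch along these lines: embed each review's topic distribution in two dimensions and color each point by its dominant topic (lda_model is assumed here to be a model trained with the k under comparison).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# dense document-topic matrix: one row per review, one column per topic
topic_weights = np.zeros((len(doc_term_matrix), lda_model.num_topics))
for i, bow in enumerate(doc_term_matrix):
    for topic_id, weight in lda_model.get_document_topics(bow):
        topic_weights[i, topic_id] = weight

# project to 2-D and color each review by its dominant topic
embedding = TSNE(n_components=2, random_state=100).fit_transform(topic_weights)
plt.scatter(embedding[:, 0], embedding[:, 1], c=topic_weights.argmax(axis=1), s=5)
plt.show()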


Now we are ready to build the LDA model with k=3, letting the alpha and eta (beta) hyperparameters be learned automatically ('auto').
import gensim
# Creating the object for LDA model using gensim library
LDA = gensim.models.ldamodel.LdaModel

# Build LDA model with k=3
lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=3,
                random_state=100, chunksize=1000, passes=50, update_every=1,
                alpha='auto', eta='auto', per_word_topics=True)
You can see the top 10 keywords for each topic and the weight (importance) of each keyword using lda_model.show_topics():
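A call along these lines produces the output shown below:
# top 10 keywords per topic as (word, probability) pairs
for topic_id, keywords in lda_model.show_topics(num_topics=3, num_words=10, formatted=False):
    print(topic_id, keywords)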
0 [('day', 0.043550313), ('feel', 0.03116778), ('time', 0.030068435), ('get', 0.0252915), ('like', 0.020811914), ('dizzi', 0.017624224), ('make', 0.017559748), ('tire', 0.01706512), ('heart', 0.01668307), ('headach', 0.013266859)]
1 [('pressur', 0.046297483), ('effect', 0.045646794), ('blood', 0.044705234), ('side', 0.041772064), ('work', 0.027191896), ('year', 0.026943726), ('medic', 0.024851086), ('lower', 0.020396743), ('drug', 0.01844332), ('high', 0.018091053)]
2 [('cough', 0.038522985), ('pain', 0.021691399), ('drug', 0.02127608), ('caus', 0.017660549), ('sever', 0.01583576), ('go', 0.01558094), ('doctor', 0.01529741), ('stop', 0.015242598), ('swell', 0.015056929), ('week', 0.014600373)]
Word Cloud
Let’s take a look at the word cloud distribution of the model.
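A minimal sketch for generating one cloud per topic, assuming the wordcloud package and sizing each word by its LDA weight:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for topic_id, ax in enumerate(axes):
    # word -> probability for the top 30 keywords of this topic
    freqs = dict(lda_model.show_topic(topic_id, topn=30))
    ax.imshow(WordCloud(background_color='white').generate_from_frequencies(freqs))
    ax.set_title(f'Topic {topic_id}')
    ax.axis('off')
plt.show()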



Comparing against a list of common side effects of high blood pressure medication, we can see that topic 0 and topic 2 surface several of the side effects from the list below.
Some common side effects of high blood pressure medicines include:
- Cough
- Diarrhea or constipation
- Dizziness or lightheadedness
- Intense and sudden foot pain
- Feeling nervous
- Feeling tired, weak, drowsy, or a lack of energy
- Headache
- Nausea or vomiting
- Skin rash
- Weight loss or gain without trying
- Leg swelling
Draw Insights
Randomly reading patients' reviews from the dataset shows that patients in topic 0 and topic 2 experienced different types of side effects. Topic 2 reviews suggest that patients in this category experienced more serious side effects than those in topic 0, because topic 2 patients took actions such as stopping the drug and changing to another one. In contrast, reviews with topic 1 as the dominant topic show that the drug worked for those patients, who experienced only mild side effect symptoms.
Now we can label the topics as shown below:
Topic 0: Side effect type I – "dizziness", "tired", "headache"
Topic 1: Drugs works well
Topic 2: Side effect type II – "cough", "swell", "pain"
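To support this kind of per-review analysis, each review can be tagged with the topic carrying the highest probability; the dominant_topic helper and column below are names introduced here for illustration.
# label each review with its most probable topic (illustrative helper)
def dominant_topic(bow):
    return max(lda_model.get_document_topics(bow), key=lambda item: item[1])[0]

hbp_drug_df['dominant_topic'] = [dominant_topic(bow) for bow in doc_term_matrix]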



As seen in the three plots above, patients whose reviews cluster in topic 1 were mostly satisfied with the drugs, giving ratings of 3 and above, while patients in topics 0 and 2 were mostly unsatisfied, giving ratings of 2 and below.
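A sketch of how such rating plots could be produced from the dominant-topic labels above, assuming the WebMD rating lives in a 'Satisfaction' column:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for topic_id, ax in enumerate(axes):
    # rating counts for reviews whose dominant topic is topic_id
    # NOTE: the 'Satisfaction' column name is an assumption about the WebMD CSV
    ratings = hbp_drug_df.loc[hbp_drug_df['dominant_topic'] == topic_id, 'Satisfaction']
    ratings.value_counts().sort_index().plot(kind='bar', ax=ax)
    ax.set_title(f'Topic {topic_id}')
    ax.set_xlabel('Rating')
plt.show()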

As seen in the chart above, around 55% of the patients who take high blood pressure medication develop some kind of side effect symptom.
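Under the topic labels above, a rough way to arrive at such a figure is the share of reviews whose dominant topic is one of the two side effect topics:
# fraction of reviews dominated by a side effect topic (0 or 2)
side_effect_share = (hbp_drug_df['dominant_topic'] != 1).mean()
print(f'{side_effect_share:.0%} of reviews are dominated by a side effect topic')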