Topic Modeling using LDA

Quickly get the gist on over 30,000 Tweets!

Leah Pope
Towards Data Science


Photo by Jan Antonin Kolar on Unsplash

In my latest project, I explored the question, “What is the public sentiment in the United States on K-12 learning during the COVID-19 pandemic?”.

Using data collected from Twitter, Natural Language Processing, and Supervised Machine Learning, I created a text classifier to predict Tweets' sentiment on this topic.

Fantastic! I’m able to classify Tweets by Positive, Negative, or Neutral sentiment. But what are Twitter users actually talking about in these Tweets? Are there any patterns or trends that we might miss if we only focus on the sentiment? Do certain topics show up more often under one particular sentiment? There is still so much we can explore within the content of these Tweets.

With that in mind, I decided to perform Topic Modeling to complement the sentiment classifier. My opinion is that Topic Modeling is an improvement over a frequency-based ‘word cloud’ approach to understanding the content within a text corpus.

Don’t get me wrong; I love a nice word cloud for a snazzy presentation graphic. However, since my text (or should I say Tweet) corpus was so large (~30,000 Tweets) and so varied (US-wide and on the broad topic of K-12 Learning during the COVID pandemic), Topic Modeling was definitely the right choice. To make the Topic Modeling even more effective, I leveraged the fantastic interactive visualization tool, pyLDAvis.

Here are the steps I took to split the Tweets into separate Topics and conduct Exploratory Data Analysis.

  • Text Processing
  • Topic Modeling
  • Interactive Visualization

Text Processing

Before performing Topic Modeling using Latent Dirichlet Allocation (LDA), I’ll need to apply some text processing. When I trained the text classifier to recognize sentiment, punctuation and capitalization could actually be useful to the classifier, so I only did very light text processing to the corpus before training it. With Topic Modeling, however, it is essential to normalize the corpus text. Here is the text processing that I applied:

  • Change the Tweet text to lowercase (TweetTokenizer from nltk handles this)
  • Remove RT (Retweets) and # (hashtag symbol) (using a regex)
  • Remove URLs (using a regex)
  • Remove stopwords and punctuation from the text.
  • Perform Lemmatization on all words (using WordNetLemmatizer from nltk)

All text processing occurs in the clean_tokenize_lemmatize_tweet function.

import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

sw_and_punct = stopwords.words('english') + list(string.punctuation)

# TweetTokenizer will lowercase all text in the tweet, strip out
# usernames/handles, and reduce repeated chars in words
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
lemmatizer = WordNetLemmatizer()

def clean_tokenize_lemmatize_tweet(tweet):
    # remove urls
    tweet = re.sub(r'http\S+|www\S+|https\S+', '', tweet, flags=re.MULTILINE)
    # remove RT
    tweet = re.sub(r'^RT\s+', '', tweet)
    # remove the # symbol
    tweet = re.sub('#', '', tweet)
    # tokenize
    tokens = tweet_tokenizer.tokenize(tweet)
    # remove stopwords and punctuation
    tokens = [token for token in tokens if token not in sw_and_punct]
    # remove tokens that are only 1 char in length
    tokens = [token for token in tokens if len(token) > 1]
    # lemmatize
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized_tokens
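If you want to sanity-check the regex cleanup on its own, without pulling in the full nltk pipeline, you can exercise just the URL/RT/hashtag steps. This is a minimal sketch; the helper name strip_tweet_noise and the sample Tweet text are my own, for illustration only:

```python
import re

def strip_tweet_noise(text):
    """Apply only the regex cleanup steps: URLs, a leading RT, and '#' symbols."""
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'^RT\s+', '', text)
    text = re.sub('#', '', text)
    return text

raw = 'RT Remote learning update: https://t.co/abc123 #K12 #COVID'
print(strip_tweet_noise(raw))  # 'Remote learning update:  K12 COVID'
```

Note that the URL and hashtag symbols disappear but the hashtag words themselves survive — that’s deliberate, since tags like K12 carry real topic signal.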

Topic Modeling

Using sklearn’s LatentDirichletAllocation for Topic Modeling is rather straightforward. First, you’ll need to transform your corpus words into numbers by creating a Document Term Matrix (DTM). I used sklearn’s CountVectorizer to create a DTM. Check out the code below:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sample_corpus = ['This is the first sample document.',
                 'This document is the second sample document.',
                 'And this is the third one.',
                 'Is this the first document? I do think so!']

my_cv = CountVectorizer(tokenizer=clean_tokenize_lemmatize_tweet)
my_lda = LatentDirichletAllocation(n_components=5)

my_dtm = my_cv.fit_transform(sample_corpus)
my_lda.fit(my_dtm)  # may take a while with a large corpus ;)

Interactive Visualization

Now that LDA has done its magic, we can take a look at the Topics. You can plot these out; however, it is much more fun to use pyLDAvis to explore your corpus topics!

import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

pyLDAvis.sklearn.prepare(my_lda, my_dtm, my_cv)

With just those few lines of code, you’ll get an impressive and informative interactive visualization like the one in the image below!

Image provided by the author

If you want to see pyLDAvis in action, take a look at some of the topics from my corpus of “Learning during COVID” Tweets.
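One more step that pairs naturally with the sentiment classifier: you can assign each Tweet its dominant topic, since lda.transform returns a per-document distribution over topics and the argmax picks the most likely one. A minimal sketch with a toy corpus standing in for the real Tweets (variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ['schools reopening fall semester',
          'remote learning zoom classroom',
          'teachers schools masks safety']
cv = CountVectorizer()
dtm = cv.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

# transform() returns an (n_docs, n_topics) matrix of topic probabilities;
# each row sums to 1, and argmax picks the document's dominant topic
doc_topic = lda.transform(dtm)
dominant = doc_topic.argmax(axis=1)
print(dominant)  # one topic index per Tweet
```

With a dominant-topic label per Tweet, you can cross-tabulate topics against the Positive/Negative/Neutral sentiment labels and ask exactly the questions raised earlier — whether certain topics show up more often under one particular sentiment.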

Conclusion

By applying Topic Modeling to the corpus and adding in the pyLDAvis package to provide interactive visualization, we’ve opened the door for a topic deep dive into over 30,000 Tweets from across the United States!

If you’ve found this helpful, I encourage you to expand on this example in your own work. Please feel free to reach out with suggestions for improvement or let me know if this has helped you.
