
LDA Topic modeling with Tweets

Making sense of unstructured text

Photo by Kevin Grieve on Unsplash

Problem and Purpose

Having some experience with building NLP models for text classification, I’ve been thinking further about how to work with completely unstructured text data. With approximately 500 million tweets sent daily, Twitter is a great source of textual data. What are people tweeting about? Can these tweets be programmatically organized into topics? What are people saying about your product or company? Can patterns be discovered that may be useful for further analysis/modeling? Well, I guess we’ll find out.

For this project, I decided to use tweets pulled from Twitter’s API using the GetOldTweets3 Python library. Any notebooks used for this project will be available on my GitHub.

GetOldTweets3

import GetOldTweets3 as got
  • import the GetOldTweets3 library with the "got" alias for ease of use
  • There are plenty of parameters to play with, e.g. username, location, date. I chose to use "get tweets by query search" so that I could specify a particular topic.
tweetCriteria = (got.manager.TweetCriteria()
                     .setQuerySearch('anxiety')
                     .setSince("2018-06-30")
                     .setUntil("2020-06-30")
                     .setNear('New York')
                     .setMaxTweets(10000))
tweet_object = got.manager.TweetManager.getTweets(tweetCriteria)

I chose to look at a 2 year time period of tweets containing the word "anxiety" in the New York area, hoping to gain an understanding of what people may be tweeting about their anxiety. What’s causing it? Are people at risk of self-harm? Can we find specific topics of any value to potentially perform further modeling on?

After making the API call, I constructed a dataframe with the below snippet:
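A minimal sketch of that construction; the namedtuple below is a stand-in for the GetOldTweets3 tweet objects returned above, which expose each tweet's text as a .text attribute:

```python
import pandas as pd
from collections import namedtuple

# Stand-in for GetOldTweets3 tweet objects (each exposes a .text attribute)
Tweet = namedtuple('Tweet', ['text'])
tweet_object = [Tweet('feeling anxious today'), Tweet('anxiety is real')]

# Keep only the tweet text in a single-column dataframe
df = pd.DataFrame([t.text for t in tweet_object], columns=['text'])
print(df.shape)  # (2, 1)
```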

The main value of interest here is just the text. I won’t be interested in usernames, dates, etc. at the moment.

Preprocessing Text

Before I can really do anything useful with text data, it must be preprocessed to be a machine readable numeric input. I’ve covered the steps below in a previous blog, but I’ll list the general steps here.

  1. Remove URL’s
  2. Tokenize each tweet
  3. Remove stop words, punctuation, and lowercase all words
  4. Remove any remaining special characters
  5. Lemmatize text

Once the text is processed, it should look something like this:

['feel', 'like', 'cheer', 'slowly', 'morphed', 'scream']
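Those steps can be sketched roughly as follows. This is a simplified version: it uses a plain whitespace tokenizer and a tiny hand-rolled stop-word list for illustration, where the real pipeline would use NLTK's tokenizer, stop-word corpus, and WordNet lemmatizer (step 5 is omitted here):

```python
import re
import string

# Tiny illustrative stop-word list; the full pipeline would use nltk.corpus.stopwords
STOP_WORDS = {'a', 'an', 'the', 'i', 'is', 'to', 'and', 'of', 'my'}

def preprocess(tweet):
    tweet = re.sub(r'http\S+|www\.\S+', '', tweet)           # 1. remove URLs
    tokens = tweet.lower().split()                           # 2. tokenize and lowercase
    tokens = [t.strip(string.punctuation) for t in tokens]   # 3. strip punctuation
    tokens = [re.sub(r'[^a-z]', '', t) for t in tokens]      # 4. drop special characters
    return [t for t in tokens if t and t not in STOP_WORDS]  # drop stop words / empties

preprocess("Check https://t.co/abc I feel like screaming!!")
# ['check', 'feel', 'like', 'screaming']
```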

I can examine the full vocabulary of my tweet corpus:

99716 words total, with a vocabulary size of 13633. Max tweet length is 32 tokens.

EDA

I can then examine the length of each tweet with a countplot using seaborn and matplotlib:

Length of tweets in corpus

Considering this distribution is right-skewed, with a high concentration of tweets (documents) containing fewer than 5 words (tokens) apiece, I will be removing documents with fewer than 4 words, on the assumption that such short documents will not provide any meaningful context.

Let’s check out a wordcloud of the top 30 words. With the FreqDist class from nltk.probability and the WordCloud library, it’s actually quite easy.

First, I created a flat list of all tokens from each tweet. This list of all tokens is then passed to FreqDist() which will count the occurrence of each token, and create a tuple containing (word, frequency) for each token.

The output:

[('time', 854),  ('people', 774),  ('depression', 608),  ('much', 548),  ('know', 501),  ('stress', 469),  ('really', 449),  ('thing', 438),  ('need', 437),  ('attack', 428),  ('make', 407),  ('social', 396),  ('today', 389),  ('work', 354), ...]

Once I have the word frequencies I can then construct a dictionary to create the word cloud.

This dictionary can then be used with the .generate_from_frequencies method from WordCloud.

The wordcloud:

People, time, stress, depression are among the top 10 most frequently occurring words. Not surprising considering the tweets queried all contained the word "anxiety".

Creating a Bag Of Words

I’ll use Gensim’s Dictionary constructor to give each word in the tweet corpus a unique integer identifier.

The constructed dictionary should contain 13633 tokens (the size of corpus vocab).

{'applies': 0,  'exist': 1,  'go': 2,  'japanese': 3,  'learning': 4,  'list': 5,  'mean': 6,  'okay': 7,  'open': 8,  'people': 9, ....}

What is a bag of words? The simple explanation is that it is a count of how many times a word token (or term) occurs within a document (in this case, a tweet). The below code utilizes Gensim’s doc2bow method, taking the above dictionary as input.

The output will contain a vector for each tweet, in the form of (word id, frequency of word occurrence in document).

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 1)],
      ...]

Given that tweets are typically pretty short, with an average length of about 33 characters, we can see that this tweet contains 14 unique words, each occurring only once. With this bag of words, the modeling can begin!

Fitting the LDA Model

So, what is LDA anyway?

LDA, or Latent Dirichlet Allocation, is one of the most popular topic modeling algorithms around. LDA is a generative statistical model that explains observations via unobserved groups, accounting for why parts of the data are similar. LDA will take a corpus of documents as an input, assume that each document is a mixture of a small number of topics, and assume that each word is attributable to one of the document’s topics. For a deep dive, check out any of Dave Blei’s lectures, as well as this blog.

Gensim

Fitting an LDA model in Gensim is quite simple. As a starting point, I’ve fit a model with 5 topics and 10 passes. This is somewhat arbitrary at this phase. I’m looking to see if some general topics start to emerge as to why people are tweeting about anxiety, and discover an observable pattern.

Here’s the output:

[(0,
  '0.019*"work" + 0.015*"take" + 0.014*"stress" + 0.014*"need" + 0.014*"time" + 0.013*"much" + 0.011*"today" + 0.010*"sleep" + 0.009*"next" + 0.009*"watch"'),
 (1,
  '0.034*"know" + 0.027*"people" + 0.025*"depression" + 0.017*"nice" + 0.016*"social" + 0.016*"family" + 0.015*"please" + 0.015*"really" + 0.015*"fucking" + 0.015*"think"'),
 (2,
  '0.022*"attack" + 0.019*"much" + 0.018*"give" + 0.017*"really" + 0.014*"make" + 0.014*"know" + 0.014*"never" + 0.013*"would" + 0.013*"something" + 0.013*"people"'),
 (3,
  '0.018*"news" + 0.014*"respond" + 0.013*"good" + 0.013*"disorder" + 0.012*"fear" + 0.012*"hard" + 0.011*"trying" + 0.011*"still" + 0.010*"love" + 0.010*"literally"'),
 (4,
  '0.030*"people" + 0.020*"real" + 0.019*"body" + 0.019*"social" + 0.018*"find" + 0.018*"good" + 0.018*"depression" + 0.018*"want" + 0.016*"next" + 0.016*"actually"')]

Evaluation

In my quest to evaluate this model, which has really only just begun, I’ve found a few resources helpful, specifically Matti Lyra’s talk at PyData Berlin 2017 on Evaluating Topic Models. I’m not going to go deep into evaluation or optimization in this blog, as it would become much too long, but here are a few basic strategies for visualizing a topic model.

From Eyeballing

The output represents 5 topics, each consisting of its top keywords and the weight each keyword contributes to the topic.

While I can infer some level of sentiment from each topic, there is not a clear separation, or identifiable pattern for topic allocation. There is also quite a bit of crossover with common words.

  • Topic 0 appears to be about the struggles of working from home. Although if that’s the case, I’m surprised "covid", "covid-19", "coronavirus", etc. are not included here.
  • Topic 1 may be about social isolation?
  • Topic 2?
  • Topic 3 seems to be about media, and reactions to media
  • Topic 4 is not very different from topic 1

PyLDAvis

The PyLDAvis library is a great way to visualize topics from a topic model. I’m not going to attempt to explain it in great detail, but here are the docs for the library as well as the original research paper, which was presented at the 2014 ACL Workshop on Interactive Language Learning, Visualization, and Interfaces in Baltimore on June 27, 2014.

Visualizing a topic model with PyLDAvis is pretty straightforward. Pass in the model, bag of words, and id2word mapping, and there you have it.

Output:

On the left, the topics are plotted on a 2-dimensional plane representing the distance between each topic, while the horizontal bar chart on the right shows the words most relevant to each topic. The chart is interactive, allowing you to select a specific topic and view its related words, in hopes of inferring meaning from each topic.

Conclusions and Future Steps

While my present iteration of this tweet topic model is not clearly separating or clustering tweets in a way that I’d consider valuable for an extrinsic task, there are a few optimization strategies I’m going to look into moving forward.

  • Using n-grams to connect words that occur together commonly, such as "mental", and "health". A bi-gram would retain "mental_health" so that the model reads it as one token.
  • Additional filtering of commonly occurring words that aren’t providing any meaningful context.
  • Calculating a coherence score, and finding optimal parameters for k topics, number of passes, etc.
  • Collect a larger corpus of tweets, and experiment with a more diverse dataset.

Congratulations on making it to the end. Feel free to reach out to me on LinkedIn if you’d like to connect @ https://www.linkedin.com/in/rob-zifchak-82a551197/

