
Topic modeling and sentiment analysis on Twitter data using Spark

What are Twitter users from a certain region concerned about, and how do they react to certain issues?

Photo by Alexander Shatov on Unsplash

Introduction

This is a summary of a school project completed with my teammates Yishan, Xiaojing, Janice, and Ranjit. Twitter sentiment analysis is something of a "hello world" project for people new to Big Data analysis.

In this project, we wanted to know what Singapore users are concerned about on Twitter and how people react to certain topics. To answer this, we designed two pipelines for Twitter data analysis using Spark libraries: Spark SQL, Spark MLlib, and Spark Streaming.

The Architecture

Image by author

Pipeline 1: Batch processing and topic modelling

In this part, we used tweepy to extract tweets from the full-archive API and saved them to HDFS. We then used Spark SQL and Spark MLlib to process the tweets for topic modelling and saved the results for visualization with matplotlib or Tableau.

Pipeline 2: Stream processing and Sentiment Analysis

In this part, we used tweepy and the keywords from topic modelling to filter streaming tweets. We also created a TCP socket between tweepy and Spark; the socket forwards the tweets to Spark Streaming for processing. The results are then sent to a dashboard built with Flask.

Batch processing and topic modelling

Step 1: Batch ingestion of tweets from the Twitter API

Twitter recently upgraded its API from v1.1 to v2.0. To use the v2.0 API, we needed tweepy v4.0, which at the time was still under development on GitHub. Calling Twitter's native API directly would work too. To use the full-archive API, we needed an academic account. We applied for one for this school project, which granted us a quota of 10 million tweets per month and access to more query endpoints.

We set some criteria for the full-archive tweets: the tweet location had to be Singapore, the language English, and retweets were excluded. The start time, end time, and other returned fields can be adjusted; in this case, the time window was set to April 2021.
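A minimal sketch of the extraction step, assuming tweepy ≥ 4 and a bearer token from an Academic Research project (the time window and fields below mirror our criteria):

```python
import tweepy

# Assumption: bearer token from a Twitter Academic Research project
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# v2 full-archive query: tweets from Singapore, in English, excluding retweets
query = "place_country:SG lang:en -is:retweet"

tweets = []
for page in tweepy.Paginator(
    client.search_all_tweets,
    query=query,
    start_time="2021-04-01T00:00:00Z",
    end_time="2021-04-30T23:59:59Z",
    tweet_fields=["created_at", "geo", "entities"],
    max_results=500,  # the per-request maximum for the full-archive endpoint
):
    tweets.extend(page.data or [])

# The collected tweets are then written out as JSON files to HDFS
```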

For the tweets published in April, we were able to extract about 48 thousand tweets that met the criteria.

Step 2: Batch Processing

In this step, we need to do some preprocessing before modelling. We explored the JSON structure of the tweets and flattened the nested JSON into one record per row and one attribute per column. After flattening, we applied common preprocessing steps: removing non-text characters, tokenization, lemmatization, and stop-word removal.
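A minimal PySpark sketch of this step (the HDFS path and column names are assumptions; lemmatization needs an extra library such as spark-nlp and is omitted here):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

spark = SparkSession.builder.appName("TweetPreprocessing").getOrCreate()

# Read the raw tweets saved by the ingestion step (path is an assumption)
raw = spark.read.json("hdfs:///tweets/2021-04/*.json")

# Flatten the nested JSON: one tweet per row, one attribute per column
flat = raw.select(col("id"), col("created_at"), col("text"))

# Tokenize on non-word characters, which also drops most non-text content
tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")
tokens = tokenizer.transform(flat)

# Remove English stop words before topic modelling
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
cleaned = remover.transform(tokens)
```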

Step 3: Topic modelling with grid search

After the text cleaning and tokenization, we used LDA for topic modelling. As we were not sure of the optimal number of topics, we used grid search to find it. Using a simple elbow method, we found six topics in the April tweets.
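A sketch of the grid search with Spark MLlib's LDA, continuing from the `cleaned` DataFrame above (the range of k and the hyperparameters are illustrative):

```python
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

# Turn the filtered tokens into term-count vectors for LDA
cv = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=5000)
vectorized = cv.fit(cleaned).transform(cleaned)

# Grid search over the number of topics; perplexity gives the elbow plot
for k in range(2, 11):
    model = LDA(k=k, maxIter=20, featuresCol="features").fit(vectorized)
    print(k, model.logPerplexity(vectorized))
```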

Step 4: Create Bar chart race based on topics

After topic modelling, we had a topic distribution for each tweet and assigned each tweet to its dominant topic. We aggregated the number of tweets under each topic for each day, then used the bar_chart_race package to show the dynamic change of topics over April.
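A short sketch with the bar_chart_race package (the CSV layout is an assumption: one row per day, one column per topic, exported from Spark):

```python
import pandas as pd
import bar_chart_race as bcr

# Assumption: daily topic counts exported from Spark, one column per topic
df = pd.read_csv("daily_topic_counts.csv", index_col="date", parse_dates=True)

# Cumulative sums make the bars race upward over the month
bcr.bar_chart_race(df=df.cumsum(), filename="topics_april.mp4",
                   title="Tweet topics in April 2021")
```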

Stream processing and sentiment analysis

Step 1: Filter streaming tweets and send them to a TCP socket

Based on the first pipeline, we decided to focus on the topic of the Myanmar coup, so we used "Myanmar" as the keyword for the streaming filter. In this part we also created a TCP socket to receive the streaming data, which is eventually consumed by Spark Streaming.
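A sketch of the socket side, assuming tweepy ≥ 4.6, whose StreamingClient wraps the v2 filtered-stream endpoint (the host, port, and rule value are illustrative):

```python
import socket
import tweepy

class TweetForwarder(tweepy.StreamingClient):
    def __init__(self, bearer_token, conn):
        super().__init__(bearer_token)
        self.conn = conn

    def on_tweet(self, tweet):
        # Forward each tweet's text, newline-delimited, to Spark Streaming
        self.conn.sendall((tweet.text + "\n").encode("utf-8"))

# Wait for Spark Streaming to connect on localhost:5555
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("localhost", 5555))
server.listen(1)
conn, _ = server.accept()

stream = TweetForwarder("YOUR_BEARER_TOKEN", conn)
stream.add_rules(tweepy.StreamRule("Myanmar lang:en"))
stream.filter()
```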

Step 2: Tweet sentiment analysis using Spark Streaming

In this part, we used Spark Streaming to process real-time data. Spark Streaming is the older API, which uses a micro-batch approach: we define a batch interval that transforms the stream into a discretized stream (DStream). We set the batch interval to 2 seconds. Each tweet was labelled positive, neutral, or negative based on the sentiment polarity from the TextBlob package, and we counted the number of tweets in each category. We also counted the most-used words and hashtags and showed the geo-location of tweets for the past week. We used the past week's tweets rather than real-time tweets for geo-location because only 1–2% of tweets are geo-tagged.
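A minimal sketch of the DStream job, matching the 2-second batch interval (the polarity thresholds follow the usual TextBlob convention):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from textblob import TextBlob

def label_sentiment(text):
    # Positive polarity -> positive, negative -> negative, zero -> neutral
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

sc = SparkContext(appName="TweetSentiment")
ssc = StreamingContext(sc, 2)  # 2-second batch interval

# Consume the newline-delimited tweets from the TCP socket above
tweets = ssc.socketTextStream("localhost", 5555)
counts = (tweets.map(lambda t: (label_sentiment(t), 1))
                .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```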

Step 3: Flask dashboard

In this part, we used the MVC (Model-View-Controller) pattern to design the dashboard and defined how to display the data in each section of the template.
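A bare-bones sketch of the controller side in Flask (the template name, endpoint, and in-memory store are assumptions; in the real project the counts come from the Spark Streaming job):

```python
from flask import Flask, jsonify, render_template

app = Flask(__name__)

# Hypothetical in-memory store, updated by the Spark Streaming job
latest_counts = {"positive": 0, "neutral": 0, "negative": 0}

@app.route("/")
def dashboard():
    # The view: an HTML template that polls the JSON endpoint below
    return render_template("dashboard.html")

@app.route("/api/sentiment")
def sentiment():
    # The controller exposes the model's current state to the view
    return jsonify(latest_counts)
```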

Real-time dashboard

