What are Twitter users from a certain region concerned about, and how do they react to certain issues?

Introduction
This is a summary of a school project completed with my teammates Yishan, Xiaojing, Janice and Ranjit. Twitter sentiment analysis is something of a "hello world" project for people new to Big Data analysis.
In this project, we wanted to know what Singapore-based users are concerned about on Twitter and how people react to certain topics. To answer this, we designed two pipelines for Twitter data analysis using Spark libraries: Spark SQL, Spark MLlib and Spark Streaming.
The Architecture

Pipeline 1: Batch processing and topic modelling
In this part, we used tweepy to extract tweets from the full-archive API and saved them to HDFS. We then used Spark SQL and Spark MLlib to process the tweets for topic modelling, and saved the results for visualization with matplotlib or Tableau.
Pipeline 2: Stream processing and Sentiment Analysis
In this part, we used tweepy with keywords from the topic modelling step to filter streaming tweets. We also created a TCP socket between tweepy and Spark; the socket forwarded the tweets to Spark Streaming for processing. The results were then sent to a dashboard built with Flask.
Batch processing and topic modelling
Step 1: Batch ingestion of tweets from twitter API
Twitter recently upgraded its API from v1.1 to v2. To use the v2 API, we needed tweepy v4.0, which at the time was still in development on GitHub. Calling Twitter's native API directly would work too. The full-archive endpoint requires an academic account; we applied for one for this school project, which granted us a quota of 10 million tweets per month plus additional query endpoints.
We set some criteria for the full-archive tweets: the tweet location had to be Singapore, the language English, and retweets were excluded. The start time, end time and the fields returned can all be adjusted; in this case the window was set to April 2021.
For the tweets published in April, we were able to extract about 48,000 tweets that met the criteria.
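The ingestion step above can be sketched roughly as follows. This is a minimal sketch, assuming tweepy ≥ 4.0 and academic access to the full-archive endpoint; the bearer token, time window and page size are placeholders, and error handling is omitted.

```python
# Sketch of pulling April-2021 tweets from the v2 full-archive endpoint.

def build_query(country_code="SG", lang="en"):
    """Build a v2 search query: one country, one language, no retweets."""
    return f"place_country:{country_code} lang:{lang} -is:retweet"

def fetch_archive(bearer_token,
                  start="2021-04-01T00:00:00Z",
                  end="2021-04-30T23:59:59Z"):
    import tweepy  # tweepy >= 4.0 is needed for the v2 Client

    client = tweepy.Client(bearer_token=bearer_token,
                           wait_on_rate_limit=True)
    # search_all_tweets hits the full-archive endpoint (academic access only).
    for page in tweepy.Paginator(client.search_all_tweets,
                                 query=build_query(),
                                 start_time=start, end_time=end,
                                 max_results=500):
        for tweet in page.data or []:
            yield tweet.data  # raw dict, ready to dump as JSON to HDFS
```

Each yielded dict can be written out as one JSON line, which keeps the HDFS files easy to read back with Spark.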
Step 2: Batch Processing
In this step, we needed to preprocess the tweets before modelling. We explored the JSON structure of the tweets and flattened the nested JSON to one record per row and one attribute per column. After flattening, common preprocessing steps such as removing non-text characters, tokenization, lemmatization and stop-word removal were performed.
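The cleaning and tokenization logic can be illustrated with a plain Python function like the one below. This is a stand-in sketch, not our actual Spark job: the stop-word list is a tiny illustrative subset, lemmatization is omitted, and in the real pipeline we used the equivalent Spark MLlib transformers (e.g. Tokenizer and StopWordsRemover) on the flattened DataFrame.

```python
import re

# Small illustrative subset of a stop-word list, not the full list we used.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on"}

def clean_and_tokenize(text):
    """Lowercase, strip URLs and non-letters, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters only
    tokens = text.split()                      # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
```

A function like this could also be wrapped in a Spark UDF, though the built-in MLlib feature transformers avoid the serialization overhead.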
Step 3: Topic modelling with grid search
After the text cleaning and tokenization, we used LDA for topic modelling. As we were not sure of the optimal number of topics, we used a grid search to determine it. Using a simple elbow method, we found six topics in the April tweets.
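The grid search can be sketched as below, assuming a Spark DataFrame with a count-vector "features" column. The `pick_elbow` heuristic (stop when the relative drop in perplexity falls below a tolerance) is our own simplified stand-in for the elbow method; the candidate k values and tolerance are placeholders.

```python
def pick_elbow(perplexities, tol=0.02):
    """perplexities: list of (k, perplexity) pairs.
    Return the k after which the relative improvement falls below tol."""
    ks = sorted(perplexities)
    best_k = ks[0][0]
    for (k_prev, p_prev), (k, p) in zip(ks, ks[1:]):
        if (p_prev - p) / p_prev < tol:
            break
        best_k = k
    return best_k

def grid_search_lda(df, k_values=(2, 4, 6, 8, 10)):
    """df: Spark DataFrame with a 'features' count-vector column."""
    from pyspark.ml.clustering import LDA

    scores = []
    for k in k_values:
        model = LDA(k=k, maxIter=20, seed=42).fit(df)
        scores.append((k, model.logPerplexity(df)))  # lower is better
    return pick_elbow(scores)
```

In practice the elbow is usually confirmed by eye on a perplexity-vs-k plot rather than by a threshold alone.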
Step 4: Create Bar chart race based on topics
After the topic modelling, we obtained the topic distribution for each tweet and assigned each tweet to its dominant topic. We aggregated the number of tweets under each topic for each day, then used the bar_chart_race package to show the dynamic change of topics over April.
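The per-day aggregation and the race rendering can be sketched as follows. The `daily_topic_counts` helper is plain Python; `render_race` assumes pandas and the bar_chart_race package are installed, and the output filename and period length are placeholders.

```python
from collections import Counter, defaultdict

def daily_topic_counts(rows):
    """rows: iterable of (day, dominant_topic) pairs, one per tweet.
    Returns {day: {topic: count}} for the wide table the race needs."""
    table = defaultdict(Counter)
    for day, topic in rows:
        table[day][topic] += 1
    return {day: dict(c) for day, c in table.items()}

def render_race(table, out="topics_april.mp4"):
    import pandas as pd
    import bar_chart_race as bcr

    # Rows become days, columns become topics; missing topics count as 0.
    df = pd.DataFrame(table).T.fillna(0).sort_index()
    bcr.bar_chart_race(df=df, filename=out, period_length=500)
```

The same day/topic aggregation can of course be done with a Spark groupBy before collecting the small result to the driver.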
Stream processing and sentiment analysis
Step 1: Filtered streaming of tweets and send to TCP socket
From the first pipeline, we decided to focus on the topic of the Myanmar coup, so we used "Myanmar" as the keyword for the streaming filter. In this part we also created a TCP socket to receive the streaming data, which would eventually be consumed by Spark Streaming.
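The socket bridge can be sketched as below, assuming tweepy ≥ 4.0 (its StreamingClient handles the v2 filtered stream). The host, port and rule string are placeholders; newline framing is used so Spark's socketTextStream can split records.

```python
import socket

def encode_line(text):
    """Newline-delimited UTF-8 framing, one tweet per line."""
    return (text.replace("\n", " ") + "\n").encode("utf-8")

def serve_tweets(bearer_token, host="localhost", port=9999):
    import tweepy  # tweepy >= 4.0

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen(1)
    conn, _ = server.accept()  # block until Spark Streaming connects

    class Forwarder(tweepy.StreamingClient):
        def on_tweet(self, tweet):
            conn.sendall(encode_line(tweet.text))  # push to Spark

    stream = Forwarder(bearer_token)
    stream.add_rules(tweepy.StreamRule("Myanmar lang:en"))
    stream.filter()
```

Starting this script first and the Spark job second avoids a connection-refused error on the Spark side.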
Step 2: Tweet sentiment analysis using Spark Streaming
In this part, we used Spark Streaming to process the real-time data. Spark Streaming is an older API that uses a micro-batch approach: we define a batch interval, which turns the incoming stream into a discretized stream (DStream). We set the batch interval to 2 seconds. We labelled each tweet as positive, neutral or negative based on the sentiment polarity from the TextBlob package, then counted the number of tweets in each category. We also counted the most-used words and hashtags and showed the geo-location of tweets for the past week. We used the past week's tweets rather than real-time tweets for geo-location because only 1–2% of tweets are geo-tagged.
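The streaming sentiment count can be sketched as below. The 0.05 neutral band in `label_sentiment` is our own illustrative choice, not a TextBlob default, and the host/port must match the TCP socket script; the word, hashtag and geo-location counts are omitted for brevity.

```python
def label_sentiment(polarity, eps=0.05):
    """Map a TextBlob polarity in [-1, 1] to a coarse label.
    The eps neutral band is an assumption, not a TextBlob default."""
    if polarity > eps:
        return "positive"
    if polarity < -eps:
        return "negative"
    return "neutral"

def run_stream(host="localhost", port=9999):
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from textblob import TextBlob

    sc = SparkContext(appName="tweet-sentiment")
    ssc = StreamingContext(sc, batchDuration=2)  # 2-second micro-batches
    lines = ssc.socketTextStream(host, port)     # one tweet per line
    counts = (lines
              .map(lambda t: (label_sentiment(TextBlob(t).sentiment.polarity), 1))
              .reduceByKey(lambda a, b: a + b))  # per-batch label counts
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

Newer Spark versions favor Structured Streaming over DStreams, but the micro-batch idea is the same.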
Step 3: Flask dashboard
In this part, we used the MVC (Model-View-Controller) pattern to design the dashboard, defining how each section of the template displays the data.
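A minimal sketch of the controller side is below. The route names, payload keys and `get_latest` callback are assumptions for illustration, and `dashboard.html` stands in for our actual template: the model is whatever the streaming job last produced, the template is the view, and the Flask routes act as the controller.

```python
def dashboard_payload(counts, top_words, top_hashtags):
    """Shape the numbers handed to the view; key names are placeholders."""
    return {
        "sentiment": counts,
        "words": top_words[:10],
        "hashtags": top_hashtags[:10],
    }

def create_app(get_latest):
    """get_latest() returns the most recent payload from the stream job."""
    from flask import Flask, jsonify, render_template

    app = Flask(__name__)

    @app.route("/")
    def index():
        return render_template("dashboard.html")  # the view

    @app.route("/data")
    def data():
        return jsonify(get_latest())  # polled by the page's JavaScript

    return app
```

Having the page poll a small JSON endpoint keeps the template static while the numbers refresh with each micro-batch.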