Twitter Sentiment Analysis: A tale of Stream Processing

What if we could see the sentiments of people through the breadcrumbs they leave on Twitter?

Redouane Achouri
Towards Data Science


In this article, we will get our hands dirty building a micro-services architecture: we will set up a stream-processing pipeline that fetches tweets from Twitter’s public API, queues them into a Kafka topic, and digests them with Natural Language Processing to get the polarity of each tweet. The final result will then be visualized on a dashboard with all the fancy charts I can think of ;-)

Photo by Adam Jang on Unsplash

Quick Access Links

Technical Summary

  • Concepts: Natural Language Processing, Stream Processing, Data Visualization.
  • Technologies: Python v3.8, Kafka, InfluxDB, Grafana.
  • Deployment Platform: Google Cloud Platform.

Table of Contents

  • Introduction
  • Create a Twitter Developer account to access the API
  • Micro-services Architecture:
    1. Twitter API
    2. Producer
    3. ZooKeeper & Kafka
    4. Consumer
    5. Time-series Database
    6. Grafana Dashboard
  • Deep Dive into Pre-processing and Feature Extraction
    1. Tokenization
    2. Normalization
    3. De-noising or Noise Reduction
    4. Determining Word Density in the Dataset
  • Data Modeling
  • Conclusion

Introduction

Twitter is a place where people share their thoughts and emotions on current trends. It is also a strategic ground where all sorts of marketing operations happen, and where brands, including companies and public personalities, can gain generous insight into their audience’s perception by analyzing what people tweet about them.

With sentiment analysis, we can detect whether a piece of text conveys a positive or a negative message. This measure is called polarity, and it is a game-changer when it comes to gathering consumer feedback from a massive amount of data in a short stretch of time.

Analyzing sentiments can help getting feedback and finding strength and weakness points in the various aspects of a business —Source

To learn more about Sentiment Analysis and its applications, check out this excellent article on MonkeyLearn.

Without further ado, let’s dive into the technicalities!
Full source code available on GitHub: twitter-sentiment-analysis.

Create a Twitter Developer account to access the API

In order to start collecting all those tweets, we will need a Twitter Developer account. I will briefly describe the steps here, from creating an account to obtaining an OAuth2 Bearer Token for our first app.

  1. Head to the Twitter Developer page and apply for a developer account. The Standard APIs will do just fine. Feel free to use your actual Twitter account (if any), or create a new one.
  2. Our next step is to get access to the Twitter Labs — where all the fun happens. For this we’ll need to register a Dev Environment and create an app. Our app will have a set of credentials called the API key and API secret, which we must keep religiously secret because they are used to grant access to the app associated with our developer account.
  3. Once granted access to the Developer Labs, we can start using the Sampled Stream endpoint https://api.twitter.com/labs/1/tweets/stream/sample:

The Sampled Stream endpoint delivers a roughly 1% random sample of all Tweet data, in real time, through a streaming connection.

  4. The official API Sampled Stream documentation offers a solid basis to learn how to authenticate, and sample code snippets to start getting the flow of real-time tweets.

Micro-services Architecture

Architecture of this project, from left to right.

The architecture of this project is pretty straightforward: from the Twitter API to the Grafana dashboard, passing through the Kafka pipe and the producer/consumer pair.

All architectures have their weaknesses, and this one is no exception: network failures, memory issues, bugs, misformatted text, and so on. I try to tackle these problems by making the main components, such as the Producer and the Consumer, as resilient to failure as possible, with measures such as:

  • At startup, continuously poll the Kafka broker for a connection at intervals of a few seconds until the broker becomes available.
  • Run the main program in an infinite while True loop that catches exceptions, performs logging or attempts re-connection, and gracefully continues the program. A minimal sketch of this pattern follows.
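As a rough illustration of these two measures (not the project’s exact code), here is a minimal sketch assuming the kafka-python package; the broker address, retry interval, and topic are placeholders.

import logging
import time

from kafka import KafkaProducer
from kafka.errors import NoBrokersAvailable

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def connect_producer(bootstrap_servers="kafka:9093", retry_interval=5):
    # Poll the broker within intervals of a few seconds until it becomes available.
    while True:
        try:
            return KafkaProducer(bootstrap_servers=bootstrap_servers)
        except NoBrokersAvailable:
            logger.info("Kafka broker not available yet, retrying in %s seconds", retry_interval)
            time.sleep(retry_interval)

producer = connect_producer()

# Main loop: catch exceptions, log them, and gracefully continue the program.
while True:
    try:
        ...  # fetch the next tweet and publish it here
    except Exception:
        logger.exception("Unexpected error, continuing")
        time.sleep(1)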

N.B. Live deployment is done on a virtual machine in the Compute Engine of GCP — Google Cloud Platform.

Following is the description of each component:

Twitter API

  • Description: It’s an API endpoint provided by Twitter for free in order to get approximately 1% of the real-time tweets generated worldwide. Please refer to the previous part “Create a Twitter Developer account to access the API” for access details.
  • Endpoint: https://api.twitter.com/labs/1/tweets/stream/sample
  • Prerequisites: An API key and API secret that will be used to obtain an access token from the Twitter OAuth2 authentication endpoint https://api.twitter.com/oauth2/token (a minimal authentication sketch follows the sample below).
  • Sample: Here is what a basic tweet and its meta-data look like. The most important parts for us are id, lang (language), and text:
{
  "data": {
    "attachments": {"media_keys": ["xxx"]},
    "author_id": "xxx",
    "context_annotations": [
      {
        "domain": {"description": "xxx", "id": "123", "name": "xxx"},
        "entity": {"id": "xxx", "name": "xxx"}
      }
    ],
    "created_at": "1970-01-01T00:00:00.000Z",
    "entities": {
      "mentions": [{"end": 42, "start": 123, "username": "xxx"}],
      "urls": [{"end": 42, "start": 123, "url": "https://t.co/xxx"}]
    },
    "format": "detailed",
    "id": "123456789",
    "lang": "en",
    "possibly_sensitive": false,
    "referenced_tweets": [{"id": "123", "type": "retweeted"}],
    "source": "<a href='http://twitter.com/download/android' rel='nofollow'>Twitter for Android</a>",
    "stats": {"like_count": 0, "quote_count": 0, "reply_count": 0, "retweet_count": 0},
    "text": "here comes the tweet !!!"
  }
}
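As a rough sketch of that flow (not the project’s actual Producer code), the following shows how the API key and secret can be exchanged for a Bearer Token and how the Sampled Stream can be read with the requests package; error handling and reconnection are left out.

import json
import requests

API_KEY = "xxx"     # the app's API key, kept secret
API_SECRET = "xxx"  # the app's API secret, kept secret

# 1. Exchange the API key and secret for an OAuth2 Bearer Token.
response = requests.post(
    "https://api.twitter.com/oauth2/token",
    auth=(API_KEY, API_SECRET),
    data={"grant_type": "client_credentials"},
)
bearer_token = response.json()["access_token"]

# 2. Open a streaming connection to the Sampled Stream endpoint.
stream = requests.get(
    "https://api.twitter.com/labs/1/tweets/stream/sample",
    headers={"Authorization": f"Bearer {bearer_token}"},
    stream=True,
)

# 3. Each non-empty line is one payload shaped like the sample above.
for line in stream.iter_lines():
    if line:
        tweet = json.loads(line)["data"]
        print(tweet["id"], tweet["lang"], tweet["text"])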

Producer

  • Description: The producer, or publisher, gets the tweets from the Twitter API and sends — or rather, queues — the id and text of the tweets (English language only, and without the meta-data) into a Kafka topic, which will then be consumed down the line by the consumer. We also store the id and language (lang) of each tweet in a time-series database for visualization purposes (see Grafana Dashboard below).
  • Language: Python (version 3.8).
  • Implementation details: We create a class TweetsProducer that inherits from the kafka-python class KafkaProducer to facilitate the connection and interaction with our Kafka broker. This class holds a permanent connection to the Twitter API to fetch a continuous stream of tweets; a minimal sketch follows.
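A minimal sketch of this idea, assuming the kafka-python package; the topic name and serialization details are illustrative, not the repository’s exact code.

import json

from kafka import KafkaProducer


class TweetsProducer(KafkaProducer):
    """Queues the id and text of English tweets into a Kafka topic."""

    def __init__(self, topic, **kwargs):
        super().__init__(value_serializer=lambda v: json.dumps(v).encode("utf-8"), **kwargs)
        self.topic = topic

    def produce(self, tweet):
        # Keep English tweets only, and drop the meta-data we don't need downstream.
        if tweet.get("lang") == "en":
            self.send(self.topic, {"id": tweet["id"], "text": tweet["text"]})


# produce() would then be called for every tweet read from the streaming connection.
producer = TweetsProducer(topic="tweets", bootstrap_servers="kafka:9093")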

Zookeeper & Kafka

  • Description: The purpose of using Kafka is to have a queue where messages (tweets) can be safely stored while they await processing by a consumer, since processing can be slower than fetching from Twitter’s API. It acts as a FIFO (first-in, first-out) datastore.
    Zookeeper is used to manage the state and configuration of large distributed systems, but in our case it doesn’t require much attention as we have only one Kafka instance in our cluster.
  • Requirements: Docker images for Zookeeper:3.4.6 and Kafka:2.4.0 (based on Scala version 2.12).
  • Instantiation details: Getting my first dockerized Kafka instance up and running turned out to be no easy task, as it requires some inner- and outer-container networking configuration. After some bushwhacking through the official Kafka docs and Wurstmeister’s Kafka docs, I found that the following config suited this project best:
    KAFKA_ZOOKEEPER_CONNECT : Address and port of Zookeeper. Since Zookeeper and Kafka are on the same Docker network, we can set it as the Zookeeper service name and default port “zookeeper:2181”.
    KAFKA_LISTENERS : The list of addresses (0.0.0.0:9093 and 0.0.0.0:9092) and their associated listener names (INSIDE and OUTSIDE) on which the Kafka broker will listen for incoming connections. INSIDE refers to clients that are inside the same Docker network as our Kafka instance, e.g. the Producer and Consumer, and OUTSIDE refers to clients on the host machine, such as CLI tools for Kafka.
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP : Map between listener names and security protocols. We’ll keep it simple with “INSIDE:PLAINTEXT, OUTSIDE:PLAINTEXT”.
    KAFKA_ADVERTISED_LISTENERS : List of available addresses that point to the Kafka broker (will be sent to clients for their initial connection). We’ll set it as “INSIDE://kafka:9093, OUTSIDE://localhost:9092”.
    KAFKA_INTER_BROKER_LISTENER_NAME: Name of listener used for communication between brokers.

Full configuration of Kafka can be found in the docker-compose.yaml file of this project.
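To make the INSIDE/OUTSIDE split concrete, here is a small illustration of how clients on each side would connect, assuming kafka-python and the listener addresses configured above.

from kafka import KafkaConsumer, KafkaProducer

# A client inside the Docker network (e.g. the Producer service) reaches the
# broker through the INSIDE listener advertised as kafka:9093.
producer = KafkaProducer(bootstrap_servers="kafka:9093")

# A client on the host machine (e.g. a CLI tool or a local test) reaches the
# same broker through the OUTSIDE listener advertised as localhost:9092.
consumer = KafkaConsumer("tweets", bootstrap_servers="localhost:9092")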

Consumer

  • Description: The consumer simply loads the tweets one by one from Kafka, performs a pre-processing and feature extraction step, then infers the polarity of each tweet — positive or negative. This polarity is stored in the time-series database, along with the tweet’s ID, for visualization purposes (could also be further processed, or used to detect trends, …).
    The Machine Learning model used in the inferring — or prediction — step is generated and stored in a binary .pickle format beforehand.
  • Pre-processing and Feature Extraction: Please refer to detailed description in the part “Deep Dive into Pre-processing and Feature Extraction” below.
  • Language: Python (version 3.8).
  • Implementation details: We create a class TweetsConsumer that inherits from the Kafka-python class KafkaConsumer to facilitate the connection and interaction with our Kafka broker.
    After fetching the tweets comes the pre-processing and classification step. Once the polarity of each tweet is obtained, we store it in our time-series database for visualization purposes; a minimal sketch follows.
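A minimal sketch of this flow, assuming kafka-python and an NLTK-style classifier saved as classifier.pickle; preprocess and store_polarity are hypothetical helpers standing in for the pre-processing step and the InfluxDB write (see the next part).

import json
import pickle

from kafka import KafkaConsumer


class TweetsConsumer(KafkaConsumer):
    """Loads tweets from Kafka, infers their polarity, and stores the result."""

    def __init__(self, topic, model_path="classifier.pickle", **kwargs):
        super().__init__(topic,
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")),
                         **kwargs)
        with open(model_path, "rb") as f:
            self.classifier = pickle.load(f)

    def consume(self):
        for message in self:  # blocks, yielding messages one by one
            tweet = message.value
            features = preprocess(tweet["text"])    # hypothetical pre-processing helper
            polarity = self.classifier.classify(features)
            store_polarity(tweet["id"], polarity)   # hypothetical InfluxDB write helper


consumer = TweetsConsumer(topic="tweets", bootstrap_servers="kafka:9093")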

Time-series Database

  • Description: A time-series database is a data store that uses timestamps to index records. This means that for each new record, the DB engine associates a timestamp, which is the amount of time, with nanosecond precision, since the Unix epoch (00:00:00 UTC on January 1st, 1970).
    This is perfect for the problem at hand, which is to store time-dependent sequential data that is coming at a high rate, and that also requires quick retrieval as a time series.
    This DB will be accessed by Grafana to feed our analysis dashboard.
  • Technology: We use InfluxDB, an open-source time-series database that has become an industry standard in recent years; a minimal write sketch follows.
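A minimal sketch of the store_polarity helper mentioned above, assuming the influxdb package (the InfluxDB 1.x Python client); the host, database, and measurement names are placeholders.

from influxdb import InfluxDBClient

client = InfluxDBClient(host="influxdb", port=8086, database="tweets")

def store_polarity(tweet_id, polarity):
    # InfluxDB assigns the timestamp at write time when none is provided.
    client.write_points([{
        "measurement": "tweet_polarity",
        "tags": {"polarity": polarity},
        "fields": {"tweet_id": tweet_id},
    }])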

Grafana Dashboard

  • Link: grafana.redouaneachouri.com.
  • Description: Grafana is used by companies of all sizes worldwide to monitor anything that can change over time :) This goes from computer infrastructure to atmospheric measurements, and from factories to warehouse inventories.
    We use it to create a dashboard in order to analyze the polarity of tweets and its variations over time. We also want to analyze other aspects of the tweets, such as the distribution of languages across periods of the day or of the year.
  • Usage details: In order to create our dashboard, we need to plug Grafana to a data source, which is in our case InfluxDB, our time-series database. Once connected, we can start adding panels as needed, and for each panel we define a query to fetch the data that will be displayed.
Distribution of the polarity. We can see a symmetrical trend over time.

Full description of this architecture is available in the project’s docker-compose.yaml file.

Deep Dive into Pre-processing and Feature Extraction

Steps taken to reduce dimensionality, complexity, and noise in the data. This is required because the tweets 1) come as sequential multi-line text that can’t be processed by a simple model, and 2) contain irrelevant information, such as connecting words and URLs, that adds complexity and can bias the model.

For implementation details, please check the notebook here.

Tokenization

A text is a sequence of characters separated by whitespaces and punctuation that carries a meaning understandable by the human brain, but that makes no sense to a computer. However, we can split the text into smaller strings or words, which are called tokens, based on the whitespace and punctuation separators.

We use a pre-trained NLTK model available in the package Punkt, which takes into consideration titles such as Dr. and Mr., and periods in names such as J.Doe.
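A quick illustration with NLTK’s word_tokenize, which relies on the pre-trained Punkt model; the example sentence is made up.

import nltk
nltk.download("punkt")

from nltk.tokenize import word_tokenize

text = "Dr. Smith is meeting J.Doe tomorrow, isn't that great?"
# Splits the text into tokens while keeping abbreviations such as "Dr." intact.
print(word_tokenize(text))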

Normalization

Words in a text can take different forms. Verbs can be conjugated (“arise”, “arose”, “arisen”) and nouns can take feminine and plural forms (“author”, “authoress”, “authors”, “authoresses”). Normalization helps group together words that have the same meaning by bringing them to their canonical form.
There are two popular types of normalization:

  1. Stemming: In its simplest form, it consists of removing affixes from words and finding the stems in a lookup table.
  2. Lemmatization: It is a more advanced process as it tries to find the lemma, or dictionary form, of a word by inferring the meaning from the context of the sentence — e.g. “meeting” can have different meanings in “I’m preparing an important meeting with the clients” and “I’m meeting with clients tomorrow”.

Lemmatization is slower but more accurate than Stemming, and since this slowness is not a big impediment to our real-time processing, we can afford to use Lemmatization for normalization.
Before using a lemmatizer, however, we need to determine the context of each word in the text, and for this we use a part-of-speech tagger; a short sketch follows. Check this link for a full list of possible tags.
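A short sketch of this step with NLTK’s part-of-speech tagger and WordNet lemmatizer; the tag mapping shown is a simplified version that treats everything that is not a noun or a verb as an adjective.

import nltk
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize(tokens):
    lemmas = []
    for word, tag in pos_tag(tokens):
        # Map the Penn Treebank tag to the part of speech the lemmatizer expects.
        if tag.startswith("NN"):
            pos = "n"
        elif tag.startswith("VB"):
            pos = "v"
        else:
            pos = "a"
        lemmas.append(lemmatizer.lemmatize(word, pos))
    return lemmas

print(lemmatize(["I", "am", "meeting", "clients", "tomorrow"]))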

De-noising or Noise Reduction

Noise is all data that doesn’t add information but consumes time and resources, and could add bias to the model. The following is what we consider noise in this project:

  1. Stopwords: Most common words in a language, such as “a”, “the”, and “it”, generally don’t convey a meaning, unless otherwise specified.
  2. Hyperlinks: Twitter uses t.co to shorten hyperlinks, so the resulting URLs carry no useful information for our analysis.
  3. Mentions: Usernames and pages that start with a @.
  4. Punctuation: It adds context and meaning, but makes the text more complex to process. For simplicity, we’ll remove all punctuation.

We use Regular Expressions and dictionaries of English stopwords and punctuation to perform the filtering; a minimal sketch follows.
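A minimal sketch of the filtering, assuming NLTK’s English stopword list and the standard library’s punctuation set; the regular expressions are simplified.

import re
import string

import nltk
nltk.download("stopwords")

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def remove_noise(tokens):
    cleaned = []
    for token in tokens:
        token = re.sub(r"https?://t\.co/\S+", "", token)  # shortened t.co hyperlinks
        token = re.sub(r"^@\S+", "", token)               # @mentions
        if token and token not in string.punctuation and token.lower() not in STOPWORDS:
            cleaned.append(token.lower())
    return cleaned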

Determining Word Density in the Dataset

Taking a quick detour, we can find which words are the most associated with positive or negative sentiments in our tweets dataset:

  • Positive: Emojis :) and :-), thank, thanks, follow, love, and good.
  • Negative: Emojis :( and :-(, go, get, please, want, and miss.

N.B. The Word Density is implicitly calculated by our modeling algorithm, so we don’t need to include this as a part of the pre-processing.
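For reference, here is a quick way to inspect word density with NLTK’s FreqDist, assuming the cleaned tokens of all positive tweets have been flattened into a single list; the tokens shown are placeholders.

from nltk import FreqDist

all_positive_tokens = ["thank", "love", "follow", "thank", "good"]  # placeholder data
freq_dist = FreqDist(all_positive_tokens)
print(freq_dist.most_common(10))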

Data Modeling

For implementation details, please check the notebook here.

Here comes the part where we build a supervised learning model that can classify tweets into positive or negative, the two labels we have for our data.

For simplicity, and given the limited amount of data we have at our disposal (10k records), we will use a Naive Bayes Classifier; a minimal training sketch follows.
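A minimal training sketch with NLTK’s Naive Bayes classifier, assuming each tweet has already been turned into a dictionary of token features paired with its label; the placeholder dataset below stands in for the 10k labelled records.

import random

from nltk import NaiveBayesClassifier, classify

# Placeholder dataset: in the project, each record is (token features, label).
dataset = [({"thank": True, "follow": True}, "Positive"),
           ({"sad": True, ":(": True}, "Negative")] * 100

random.shuffle(dataset)
split = int(0.7 * len(dataset))
train_data, test_data = dataset[:split], dataset[split:]

classifier = NaiveBayesClassifier.train(train_data)

print("Training Accuracy:", classify.accuracy(classifier, train_data))
print("Testing Accuracy:", classify.accuracy(classifier, test_data))
classifier.show_most_informative_features(10)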

Here are the training and testing metrics (accuracy), and a table of the 10 most informative features. Each row represents the ratio between the occurrences of a word in positive vs. negative tweets.

Training Accuracy: 0.99957
Testing Accuracy: 0.995
Most Informative Features
:(             Negative : Positive = 2081.8 : 1.0
:)             Positive : Negative = 1656.3 : 1.0
sad            Negative : Positive =   23.6 : 1.0
sick           Negative : Positive =   19.2 : 1.0
arrive         Positive : Negative =   18.8 : 1.0
poor           Negative : Positive =   15.9 : 1.0
community      Positive : Negative =   15.5 : 1.0
x15            Negative : Positive =   13.2 : 1.0
idk            Negative : Positive =   12.5 : 1.0
unfortunately  Negative : Positive =   12.5 : 1.0

Improvement Suggestions

By testing our model on custom tweets, we can see that it fails at classifying sarcasm: it considers it a positive sentiment instead of a negative one.
This is because our model lacks the data complexity to discern more nuanced emotions. We can work on this aspect by using a more advanced classification algorithm that can cope with the complexity of a dataset containing nuanced emotions such as Joy, Excitement, Sadness, and Fear.

Conclusion

In this article, I provided you with my 2 cents on how to design and create a micro-services architecture that can fetch a continuous stream of tweets from Twitter’s real-time API, then pipe it through Kafka before processing it to extract the polarity of each tweet. We then saw how to store the result in a time-series database for further analysis and visualization.

We also saw how to pre-process and extract features from the data before building a classification model that can infer the polarity of each tweet.

As a final outcome, we concluded that in order to discriminate more nuanced emotions such as Joy and Sarcasm, we would need a more complex dataset and a classification algorithm that could cope with this complexity.

If you find this tutorial useful and you’d like to support the making of quality articles, consider buying me a coffee!

You can click on the “Follow” button to get my latest articles and posts!
