HANDS-ON TUTORIALS, MACHINE LEARNING PROJECT

How to Build a Twitter Sentiment Analysis System

Applying natural language processing for sentiment analysis

Ramya Vidiyala
12 min read · Nov 20, 2020


In the field of social media data analytics, one popular area of research is the sentiment analysis of Twitter data. Twitter is one of the most popular social media platforms in the world, with 330 million monthly active users and 500 million tweets sent each day. By carefully analyzing the sentiment of these tweets — whether they are positive, negative, or neutral, for example — we can learn a lot about how people feel about certain topics.

Understanding the sentiment of tweets is important for a variety of reasons: business marketing, politics, public behavior analysis, and information gathering are just a few examples. Sentiment analysis of Twitter data can help marketers understand the customer response to product launches and marketing campaigns, and it can also help political parties understand the public response to policy changes or announcements.

However, Twitter data analysis is no simple task. Roughly 6,000 tweets are posted every second. That’s a lot of Twitter data! And though it’s easy for humans to interpret the sentiment of a single tweet, human sentiment analysis simply doesn’t scale.

In this article, we’re going to look at building a scalable system for Twitter sentiment analysis, to help us better understand the role of machine learning in social media data analytics.

Problem: Identifying Negative Sentiment in Tweets

In this article, we’ll learn how to identify tweets with a negative sentiment. To do so, we’ll create a sentiment analyzer to classify positive and negative tweets in text format. Though we’ll be using our classifier for Twitter data analysis, it can also be used to analyze text data from other sources.

Overview of the system. (Image by author)

Through the course of the article, we are going to look at the dataset, various text processing and word embedding techniques, and then train a machine learning model on the processed data.

  • Twitter Sentiment Analysis Dataset
  • Text Processing
    A. Cleaning of raw text
    B. Tokenization
    C. Stemming
  • Word Embedding Techniques
    A. Bag of Words
    B. Term Frequency — Inverse Document Frequency
    C. Word2Vec
  • Model
  • Performance Metrics
  • Results
  • Summary

Twitter Sentiment Analysis Dataset

Let’s start with our Twitter data. We will use the open-source Twitter Tweets Data for Sentiment Analysis dataset. It contains 32,000 tweets, of which 2,000 contain negative sentiment.

The target variable for this dataset is ‘label’, which maps negative tweets to 1, and anything else to 0. Think of the target variable as what you’re trying to predict. For our machine learning problem, we’ll train a classification model on this data so it can predict the class of any new tweets we give it.

A snapshot of the data is presented in the image below.
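The original snapshot image isn’t reproduced here, but you can get the same view with a couple of lines of pandas. The file name below is a placeholder for wherever you saved the dataset; the ‘tweet’ and ‘label’ column names follow the description above.

import pandas as pd

# Placeholder path; point this at your copy of the dataset
tweets_data = pd.read_csv('twitter_sentiment.csv')

print(tweets_data[['label', 'tweet']].head())  # a quick snapshot of the data
print(tweets_data['label'].value_counts())     # 1 = negative sentiment, 0 = everything else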

Text Processing

Data usually comes from a variety of different sources and is often in a variety of different formats. For this reason, cleaning your raw data is an essential part of preparing your dataset. However, cleaning is not a simple process, as text data often contains redundant and/or repetitive words. This is especially true in Twitter sentiment analysis, so processing our text data is the first step towards our solution.

The fundamental steps involved in text processing are:

A. Cleaning of Raw Data
B. Tokenization
C. Stemming

A. Cleaning of Raw Data

This phase involves the deletion of words or characters that do not add value to the meaning of the text. Some of the standard cleaning steps are below:

  • Lowering case
  • Removal of mentions
  • Removal of special characters
  • Removal of stopwords
  • Removal of hyperlinks
  • Removal of numbers
  • Removal of whitespaces

Lowering Case

Lowering the case of text is essential for the following reasons:

  • The words ‘Tweet’, ‘TWEET’, and ‘tweet’ all add the same value to a sentence.
  • Lowering the case of all the words helps to reduce the dimensionality by decreasing the size of the vocabulary.

def to_lower(word):
    result = word.lower()
    return result

Removal of mentions

Mentions are very common in tweets. However, as they don’t add value for interpreting the sentiment of a tweet, we can remove them. Mentions always come in the form of ‘@mention’, so we can remove strings that start with ‘@’.

To achieve this on the entire dataset, we use the function below.

import re

def remove_mentions(word):
    result = re.sub(r"@\S+", "", word)
    return result

Removal of special characters

This text processing technique will help to treat words like ‘hurray’ and ‘hurray!’ in the same way. At this stage, we remove all punctuation marks.

import string

def remove_special_characters(word):
    result = word.translate(str.maketrans(dict.fromkeys(string.punctuation)))
    return result

Removal of stopwords

Stopwords are commonly occurring words in a language, such as ‘the’, ‘a’, ‘an’, and ‘is’. We can remove them here because they won’t provide any valuable information for our Twitter data analysis.

# ENGLISH_STOP_WORDS is one common source of English stopwords (from scikit-learn)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def remove_stop_words(words):
    result = [i for i in words if i not in ENGLISH_STOP_WORDS]
    return result

Removal of hyperlinks

Now we can remove URLs from the data. It’s not uncommon for tweets to contain URLs, but we won’t need to analyze them for our task.

def remove_hyperlink(word):
    return re.sub(r"http\S+", "", word)
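Taken together, the string-level cleaners can be chained into a single helper and applied to every tweet. This is a minimal sketch: the remove_numbers and remove_whitespace helpers aren’t shown in the original and are added here to cover the last two items in the checklist above, and stop-word removal is left until after tokenization since it operates on a list of tokens.

import re

def remove_numbers(word):
    # Not shown above; drops digits, per the cleaning checklist
    return re.sub(r"\d+", "", word)

def remove_whitespace(word):
    # Collapses runs of whitespace into single spaces
    return " ".join(word.split())

def clean_tweet(word):
    # Chain the string-level cleaning steps; mentions and hyperlinks are removed
    # before punctuation so their '@' and 'http' markers are still intact
    for step in (to_lower, remove_mentions, remove_hyperlink,
                 remove_special_characters, remove_numbers, remove_whitespace):
        word = step(word)
    return word

tweets_data['tweet'] = tweets_data['tweet'].apply(clean_tweet)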

B. Tokenization

Tokenization is the process of splitting text into smaller chunks, called tokens. Each token is an input to the machine learning algorithm as a feature. NLTK (Natural Language Toolkit) provides a utility function for tokenizing data.

from nltk.tokenize import word_tokenize
tweets_data['tweet'] = tweets_data['tweet'].apply(word_tokenize)
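Note that word_tokenize relies on NLTK’s ‘punkt’ tokenizer models; if you haven’t used NLTK before, a one-time download may be needed:

import nltk
nltk.download('punkt')  # one-time download of the tokenizer models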

C. Stemming

Stemming is the process of removing or replacing a word’s suffix to obtain its root, or base, form, called a ‘stem’. For example, the stem of the words ‘satisfied’, ‘satisfaction’, and ‘satisfying’ is ‘satisfy’, and all of them convey the same feeling.

The Porter stemmer is a widely used stemming technique, and nltk.stem provides it as the ‘PorterStemmer’ class.

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text])

tweets_data['tweet'] = tweets_data['tweet'].apply(lambda text: stem_words(text))

Overview of Text Processing. (Image by author)

Word Embedding Techniques

There is a huge amount of data in text format. Analyzing text data is an extremely complex task for a machine as it’s difficult for a machine to understand the semantics behind the text. At this stage, we’re going to process our text data into a machine-understandable format using word embedding.

Word embedding simply means converting text data into numerical values (vectors), so we can feed these vectors to a machine and analyze the data using the tools of linear algebra.

However, it’s important to note that some information can be lost in this transformation. The key is to strike a balance between converting the text and retaining the information it carries.

Here are two commonly used terminologies when it comes to this step.

  • Each text data point is called a Document
  • An entire set of documents is called a Corpus

Word embedding can be done using the following techniques:

  1. Bag of Words
  2. TF-IDF
  3. Word2Vec

Next, let’s explore each of the above techniques in more detail, then decide which to use for our Twitter sentiment analysis model.

A. Bag of Words

Bag of Words does a simple transformation of the document to a vector by using a dictionary of unique words. This is done in just two steps, outlined below.

Construction of Dictionary

Create a dictionary of all the unique words in the data corpus. Let the number of unique words in the corpus be ‘d’; each word is a dimension, so every document will be represented as a d-dimensional vector.

Construction of Vectors

For each document rᵢ, we create a vector vᵢ.

This d-dimensional vector vᵢ can be constructed in two ways:

  1. For each document, vᵢ records, for each word in the dictionary, the number of times that word appears in the document.
  2. For each document, vᵢ records, for each word in the dictionary:
  • 1 if the word exists in the document, or
  • 0 if the word doesn’t exist in the document

The second variant is known as a Binary Bag of Words.

Now we have a vector for each document and a dictionary of the unique words in the data corpus (a short code sketch follows this list). These vectors can be analyzed by:

  • plotting them in d-dimensional space, or
  • calculating the distance between vectors to measure similarity (the closer two vectors are, the more similar they are)
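As a concrete illustration on a toy corpus (not the tweet data), scikit-learn’s CountVectorizer builds exactly this kind of dictionary and count vector:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie", "not good movie", "very good very good"]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs)

# The dictionary of unique words (on scikit-learn < 1.0, use get_feature_names())
print(vectorizer.get_feature_names_out())
# One d-dimensional count vector per document
print(vectors.toarray())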

B. Term Frequency — Inverse Document Frequency

There are three elements here: word, document, corpus. Term Frequency — Inverse Document Frequency, or TF-IDF for short, uses the relationship between these elements to convert text data into vectors.

Term Frequency captures the relationship between a word and a document, whereas Inverse Document Frequency captures the relationship between a word and the corpus.

Calculating Term Frequency

Term frequency is the probability of the word wⱼ occurring in the document rᵢ. It is calculated as the number of times wⱼ appears in rᵢ divided by the total number of words in rᵢ.

The mathematical formula for calculating TF. (Image by author)

A high Term Frequency of a word in a review implies the word is frequent in that review; a low Term Frequency implies the word is rare in that review.

Calculating Inverse Document Frequency

Inverse Document Frequency (IDF) measures how rare a word is across the entire corpus. It is calculated as the logarithm of the total number of documents in the corpus divided by the number of documents containing the word.

The mathematical formula for calculating IDF. (Image by author)

Low Inverse Document Frequency implies the word is frequent in the corpus. High Inverse Document Frequency implies the word is rare in the corpus.

We use a logarithm instead of the raw inverse ratio because of scaling: the fraction of documents containing a word lies between 0 and 1, so its inverse can grow very large and bias the IDF. Taking the logarithm is a simple and widely accepted way to keep these values in a manageable range.

TF-IDF of a word in the review = TF(word, review) * IDF(word, corpus).

In the vector form of each document, we have the TF-IDF of each word. Converting a document into a vector using TF-IDF values is called TF-IDF vectorization.

TF-IDF vectorization gives high importance to words which are:

  • frequent in a document (from TF)
  • rare in the corpus (from IDF)
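In code, scikit-learn’s TfidfVectorizer performs this weighting directly. Below is a minimal sketch on the same toy corpus; note that scikit-learn’s exact TF and IDF formulas differ slightly from the textbook ones above (it applies smoothing and normalization by default).

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "not good movie", "very good very good"]

tfidf = TfidfVectorizer()
tfidf_vectors = tfidf.fit_transform(docs)

# The vocabulary (on scikit-learn < 1.0, use get_feature_names())
print(tfidf.get_feature_names_out())
# The TF-IDF weight of each word in each document
print(tfidf_vectors.toarray().round(2))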

C. Word2Vec

In Bag of Words and TF-IDF, we convert sentences into vectors. But in Word2Vec, we convert words into vectors. Hence the name, word2vec!

Word2Vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus assigned to a corresponding vector in the space. The positioning of word vectors is done in such a way that words with common contexts in the corpus are located closer in space.

For example, the vector from ‘man’ to ‘woman’ is roughly parallel to the vector from ‘king’ to ‘queen’.

Example of vectors and their representations using Word2vec. (Image by author)
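If you want to experiment with Word2Vec yourself, the gensim library provides a standard implementation. The sketch below uses a tiny made-up corpus purely for illustration; the parameters are arbitrary, and the vector_size keyword assumes gensim 4.x (older releases call it size).

from gensim.models import Word2Vec

# Each sentence is a list of tokens, e.g. the tokenized tweets from the earlier step
sentences = [["good", "movie"], ["bad", "movie"], ["very", "good", "film"]]

w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=2)

print(w2v.wv["movie"][:5])          # first few dimensions of the vector for 'movie'
print(w2v.wv.most_similar("good"))  # words whose vectors lie closest to 'good'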

When to use what?

When it comes to which embedding technique to use for a machine learning model, there is no obvious answer: it really depends on the use-case.

Bag of Words is commonly used for document classification applications where the occurrence of each word is used as a feature for training a classifier.

TF-IDF is used by search engines such as Google as a ranking factor for content.

Word2Vec is useful when an application needs richer semantic information about words, for example in document translation.

For our Twitter sentiment analysis, we’ll use Bag of Words as the word embedding technique. The scikit-learn library provides a ‘CountVectorizer’ class to perform Bag of Words; using it, we transform our processed data into vectors.

from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer(min_df=2, max_features=100000)
bow.fit(tweets_data['tweet'])
tweets_processed = bow.transform(tweets_data['tweet']).toarray()

Overview of Word embedding. (Image by author)

Model Fitting

Logistic Regression is a supervised machine learning classification algorithm that is widely used in internet applications. It is one of the simplest classification algorithms, yet highly effective. We’ll use it to estimate the probability that a tweet carries negative sentiment.

Using sklearn.linear_model, we can implement logistic regression. The model outputs the probability of an input belonging to each class, which lets us run sentiment analysis on new, unseen tweets.
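One step the snippet below assumes but doesn’t show is splitting the vectorized tweets into training and test sets. Here is a minimal sketch using scikit-learn’s train_test_split (the 80/20 split and random seed are assumptions, not from the original):

from sklearn.model_selection import train_test_split

tweets_train, tweets_test, target_train, target_test = train_test_split(
    tweets_processed,              # Bag of Words vectors from the previous step
    tweets_data['label'],          # the target variable
    test_size=0.2,                 # assumed 80/20 split
    random_state=42,
    stratify=tweets_data['label']  # keep the class imbalance consistent across splits
)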

import numpy as np
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(tweets_train, target_train)          # training the model
prediction = model.predict_proba(tweets_test)  # predicting on the test set

# if the predicted probability is greater than or equal to 0.3, label it 1, else 0
prediction_int = prediction[:, 1] >= 0.3
prediction_int = prediction_int.astype(int)

Performance Metrics for Twitter Sentiment Analysis

Now that we have a Twitter sentiment analysis model that can output a probability of a tweet belonging to a particular class, we need some way to judge its performance. Precision and recall are the two most widely used performance metrics for a classification model.

Precision is the fraction of retrieved instances that are relevant. It helps us understand the usefulness of the results.

Recall is the fraction of all relevant instances that are retrieved. It helps us understand the coverage of the results.

The F1 Score is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall).

For example, consider a search query that returns 30 pages, of which 20 are relevant, while 40 other relevant pages are not returned. In this case, precision is 20/30, recall is 20/60, and the F1 Score is therefore 4/9.
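You can verify the arithmetic directly:

precision = 20 / 30  # relevant results retrieved / all results retrieved
recall = 20 / 60     # relevant results retrieved / all relevant results
f1 = 2 * precision * recall / (precision + recall)
print(f1)            # 0.444..., i.e. 4/9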

Using F1-score as a performance metric for classification problems is a good choice.

from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

cf_matrix = confusion_matrix(target_test, prediction_int)
tn, fp, fn, tp = cf_matrix.ravel()
print("Precision: {:.2f}%".format(100 * precision_score(target_test, prediction_int)))
print("Recall: {:.2f}%".format(100 * recall_score(target_test, prediction_int)))
print("F1 Score: {:.2f}%".format(100 * f1_score(target_test, prediction_int)))

Results. (Image by author)
import seaborn as sns
import matplotlib.pyplot as plt

ax = plt.subplot()
# annot=True to annotate cells
sns.heatmap(cf_matrix, annot=True, ax=ax, cmap='Blues', fmt='')

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['Positive', 'Negative'])
ax.yaxis.set_ticklabels(['Positive', 'Negative'])

Heatmap of the confusion matrix. (Image by author)

Results

An F1 score of 73% is a solid result for a model built with traditional machine learning algorithms. However, there are ways to improve it further: we can use deep learning techniques (though these are computationally expensive), and we can respond to results and feedback by engineering additional features and correcting misspelled words.

Also, keep in mind that these results are based on our training data. When applying a sentiment analysis model to real-world data, we still have to actively monitor the model’s performance over time.

Summary: Tips for Twitter Sentiment Analysis

In this article, we learned various text processing and word embedding techniques and implemented a Twitter sentiment analysis classification model on processed data. Hopefully, this will give you an idea of how these social media data analytics systems work, and the sort of work required to prepare and deploy them.

The text processing techniques mentioned in this article are widely applied to text data. However, we don’t have to apply every technique every time. It’s important to choose the processing and embedding steps carefully based on the use case, as this choice plays an important role in the quality of the sentiment analysis.

In the world of social media data analytics, and especially with Twitter data analysis, it’s often important to have the support of a domain expert for each step of your process. Vocabulary on social networks is often unique to particular communities, and domain experts can help you to avoid data bias and improve the accuracy of your dataset and analysis.

That said, the concepts and techniques covered in this article can be applied to a variety of natural language processing problems. Beyond Twitter sentiment analysis, you can use similar techniques to build chatbots, text summarization, spam detection, and language translation models.

Thanks for reading! This article was originally posted here. If you would like to experiment with this custom dataset yourself, you can download the data and see the complete code on Github. If you’d like to experiment with other Twitter datasets, here’s a repository for a variety of different Twitter content.

I am going to write more beginner-friendly posts in the future too. Follow me on Medium to be notified about them. I welcome feedback and can be reached on Twitter at ramya_vidiyala and on LinkedIn at RamyaVidiyala. Happy learning!
