
Predicting Tweet Sentiment with Machine Learning

Sentiment analysis on Tweets

Photo by 🇨🇭 Claudio Schwarz | @purzlbaum on Unsplash

It’s undeniable that social media has become an extremely important platform for businesses looking to increase their online reach. Over the years, marketers have become increasingly aware of the importance of social media platforms such as Facebook, Twitter, and LinkedIn, and the role they play in business growth via word-of-mouth marketing.

Many are also noticing that social media is a great place to meet customers exactly where they are, and it is fair to say some businesses have a decent sense of humor…

Source: Twitter Feed

If we recall the purpose of social media, which I will generalize as connecting people from anywhere in the world, it makes sense why businesses have flocked to it: they are realising it is a great place for communication, whether that communication is good news, bad news, fake news, or an emergency occurring somewhere in the world.

Additionally, the ubiquity of smartphones has enabled people to announce these events in real time. If a customer finds a hair in their food, you can bet it is going to be on social media. As a result, an increasing number of agencies, organizations, businesses, and curious individuals have developed an interest in programmatically monitoring various social media websites for what matters to them.

A great example of this might be disaster relief organizations that want to detect the occurrence of a disaster. This is evidently a very important task, but also an extremely difficult one, since it is not always clear whether a person’s words are announcing a disaster or offering an exaggerated description of something attractive – in that respect, it’s fair to say that words can be quite ambiguous.

Source: Twitter Feed

The image above shows a tweet by a user stating they’d like to set "the sky ablaze". Taken literally, this tweet could describe a disaster, since it would mean the sky is burning fiercely. However, we know from the context provided in the tweet that the user means "ablaze" figuratively: they are saying the sky is very brightly coloured or lit up. This is pretty obvious to us, but not so to a computer.

Hence, it’s our job as Data Scientists (NLP experts) to come up with solutions that overcome, as best as possible, the various ambiguities of language, so the people we serve are not constantly bombarded with false claims of disasters due to language discrepancies.

On that note, we will do some sentiment analysis coupled with machine learning to detect the sentiment of disaster tweets from Twitter. The goal is to determine, based on the tweet, whether it describes a real disaster or not.

Note: For the followers of my posts, you’d recall that we have been working on Natural Language Processing together. If you are new, this series is simply the notes I have been documenting from the Natural Language Processing Specialization on Coursera. To save me having to explain what sentiment analysis is again, please review the Getting Started with Sentiment Analysis post.


The Data

We could have retrieved the data ourselves using Twitter’s API. The documentation states: "Twitter’s Developer Platform enables you to harness the power of Twitter’s open, global, real-time and historical communication network within your own applications. The platform provides tools, resources, data, and API products for you to integrate, and expand Twitter’s impact through research, solutions and more." Doing it this way would mean that we’d have to process and label the data ourselves, which is closer to a real-world scenario.

Fortunately, a large portion of the hard work has already been done: the data has been retrieved and stored as comma-separated values (CSV) files on Kaggle. Therefore, our job is just to download it to our local drive – visit the Real or Not? NLP with Disaster Tweets getting-started competition and download the files from the Data section.

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

The features we can expect to see in the data are as follows:

  • id – A unique identifier for each tweet.
  • text – The text of the tweet.
  • keyword – A keyword from the tweet (although this may be blank!).
  • location – The location the tweet was sent from (may also be blank).
  • target – Denotes whether a tweet is about a real disaster (1) or not (0); only present in train.csv and sample_submission.csv.

The evaluation metric is the F1 score – see the story below for more.

Confusion Matrix "Un-confused"
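For quick reference, the F1 score is the harmonic mean of precision and recall:

F1 = 2 · (precision · recall) / (precision + recall)

It balances false positives against false negatives, which matters here because the two classes are not perfectly balanced.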


Building the Model

All code can be accessed via GitHub.

kurtispykes/twitter-sentiment-analysis

Phase 1

When beginning a project, getting fast feedback is extremely important; therefore, I am not too worried about getting the best accuracy immediately. Instead, I want to find the best way to get instant feedback from the data and then begin to dive in deeper to distinguish what improvements could be made.

In order to pass this phase, I begin by splitting my data into train, development, and test sets. Since this data is from Kaggle, we already have a test set, so we only need a development set. However, instead of a holdout-based development set, I will use stratified k-fold cross-validation – for more on cross-validation, see my article below.

Cross-Validation

To help us out with our code, I’ve made a configuration file that we will use so we don’t hard-code anything in our scripts…
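The actual file lives in the repo; a minimal sketch might look something like this (the file names, paths, and constants here are illustrative assumptions, not the repo’s exact values):

```python
# config.py - a hypothetical configuration module; the real file is in the repo
TRAINING_FILE = "inputs/train.csv"              # raw Kaggle training data
TRAINING_FOLDS_FILE = "inputs/train_folds.csv"  # training data with fold labels
TEST_FILE = "inputs/test.csv"                   # Kaggle test data
SUBMISSION_FILE = "outputs/submission.csv"      # where inference output is written
N_FOLDS = 5                                     # number of stratified folds
SEED = 42                                       # for reproducible shuffling
```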

Now let’s create our folds.

This function reads the original training data we received from Kaggle, creates a new column, and populates all its values with -1. After that, we shuffle the data, create the folds by labelling which fold each instance belongs to, and finally save the result as a CSV file.
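A minimal sketch of such a function, assuming the hypothetical config module above and a fold column named kfold (both of which are my assumptions, not necessarily the repo’s exact names):

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

import config  # the hypothetical config module sketched earlier

def create_folds():
    df = pd.read_csv(config.TRAINING_FILE)
    df["kfold"] = -1  # new column, every value initialised to -1

    # shuffle the rows, then reset the index so positions line up
    df = df.sample(frac=1, random_state=config.SEED).reset_index(drop=True)

    # stratify on the target so each fold keeps the class balance
    skf = StratifiedKFold(n_splits=config.N_FOLDS)
    for fold, (_, valid_idx) in enumerate(skf.split(X=df, y=df["target"])):
        df.loc[valid_idx, "kfold"] = fold

    df.to_csv(config.TRAINING_FOLDS_FILE, index=False)

if __name__ == "__main__":
    create_folds()
```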

The process to create folds is pretty standard for me. I detail how I go about selecting my validation strategy in the cross-validation article, but more often than not, my go-to is stratified k-fold for classification problems – I even used it in my Using Machine Learning to Detect Fraud story.

Phase 2

The next phase of fast iteration is to train a model as quickly as possible, which means we aren’t going to worry too much about the best feature engineering practices or anything. We just want to train a simple model and do inference.

Essentially, we have 3 features (not including id): text, keyword, and location. The simplest way I could think of to deal with these is to simply append the keyword and location to the end of the text, as sketched below. After that, we do some basic processing of our text, convert the text to a vector, then build our model.
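A minimal sketch of that idea, assuming the fold-labelled file produced earlier and using the column names from the Kaggle data:

```python
import pandas as pd

# assumption: the fold-labelled file produced by create_folds above
df = pd.read_csv("inputs/train_folds.csv")

# keyword and location can be blank, so replace NaNs with empty strings,
# then tack both onto the end of the tweet text
df["text"] = (
    df["text"] + " " + df["keyword"].fillna("") + " " + df["location"].fillna("")
)
```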

To convert our text to a vector, we use CountVectorizer, which converts a collection of text documents to a matrix of token counts – see the documentation.
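A quick, self-contained illustration of what CountVectorizer does (the sample documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the sky is ablaze", "the forest is ablaze tonight"]  # toy examples

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the vocabulary learned from the corpus
print(counts.toarray())                    # token counts per document
```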

We will run our training script with 2 different models and, when we write our inference script, submit predictions from the model with the better average F1 across the folds. The two models we use are Logistic Regression and Naive Bayes.

Algorithms from Scratch: Logistic Regression

Algorithms From Scratch: Naive Bayes Classifier
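The real training script lives in the repo; a simplified sketch of the fold-by-fold loop might look something like this (the file path and the kfold column name are assumptions carried over from earlier):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

def run_fold(df, fold, model):
    # rows labelled with this fold are validation, the rest are training
    train_df = df[df["kfold"] != fold].reset_index(drop=True)
    valid_df = df[df["kfold"] == fold].reset_index(drop=True)

    # fit the vectorizer on the training fold only to avoid leakage
    vectorizer = CountVectorizer()
    x_train = vectorizer.fit_transform(train_df["text"])
    x_valid = vectorizer.transform(valid_df["text"])

    model.fit(x_train, train_df["target"])
    return f1_score(valid_df["target"], model.predict(x_valid))

# assumes keyword/location were already appended to text, as shown above
df = pd.read_csv("inputs/train_folds.csv")

for name, model in [
    ("logistic_regression", LogisticRegression(max_iter=1000)),
    ("naive_bayes", MultinomialNB()),
]:
    scores = [run_fold(df, fold, model) for fold in range(5)]
    print(name, sum(scores) / len(scores))
```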

As a reminder, the evaluation metric for this competition is the F1 score, and Naive Bayes achieved the better average across the folds.

Hence, we will use Naive Bayes for inference…
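Again, the repo holds the real script; a rough sketch of the inference step, under the same assumptions as above:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train = pd.read_csv("inputs/train_folds.csv")
test = pd.read_csv("inputs/test.csv")

# mirror the training-time preprocessing on both splits
for df in (train, test):
    df["text"] = (
        df["text"] + " " + df["keyword"].fillna("") + " " + df["location"].fillna("")
    )

# fit on all of the training data this time, not just four folds
vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(train["text"])
x_test = vectorizer.transform(test["text"])

model = MultinomialNB()
model.fit(x_train, train["target"])

# write predictions in the id/target format Kaggle expects
submission = pd.DataFrame({"id": test["id"], "target": model.predict(x_test)})
submission.to_csv("submission.csv", index=False)
```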

This outputs a CSV file that I will submit to Kaggle.

File submitted to Kaggle (Image by Author)

Two things to notice about this… First, there is a large variance between our average dev set scores and our test score, so we will have to do some tuning to our cross-validation technique. Second, our score is not so great; I am aiming for around the 85% mark on our F1 score.

In another post, I will discuss error analysis so we can begin to work on ways to improve our model.

Thanks for reading to the end. Let’s continue the conversation on LinkedIn…

Kurtis Pykes – Data Scientist – Freelance, self-employed | LinkedIn

