
Sentiment Analysis: Predicting Whether A Tweet Is About A Disaster

Exploring Natural Language Processing

Natural Language Processing Notes

Photo by Joshua Hoehne on Unsplash

Introduction

I often take a Machine Learning-first approach to analyzing data, but after exploring my GitHub, I realized that I didn’t have many exploration notebooks, especially in Natural Language Processing (NLP), which is the area I’m keen on focusing my attention on – more specifically, Conversational AI, for those interested.

7 Free Online Resources For NLP Lovers

Sentiment analysis tasks involve interpreting and classifying subjective data using techniques from Natural Language Processing (NLP) and Machine Learning (ML). With so much of the world’s data now in a digital format, many businesses are leveraging the power of sentiment analysis to gather more information on brand reputation, customers’ thoughts about various products, and much more.

Getting Started With Sentiment Analysis


The Problem Statement

I’ve been exploring a particular dataset with Machine Learning and NLP scripts for some time. However, I had never actually explored it with proper Exploratory Data Analysis (EDA) and data visualizations, so I thought it wise to home in on the same project and discover some insights about the data that I hadn’t known before.

Twitter has become an important communication channel in times of emergency. The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e. disaster relief organizations and news agencies). But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:

The author explicitly uses the word "ABLAZE" but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified. [Source: Kaggle – Natural Language Processing With Disaster Tweets]

Links to other write-ups done using this dataset:

Note: All code that was used to generate the visualizations seen in this article can be found on Github.

kurtispykes/twitter-sentiment-analysis


Data Breakdown

Typical of Kaggle, our datasets were already divided into train and test sets. The train set consists of 7613 rows and 5 columns, whereas the test set consists of 3263 rows and 4 columns – the missing column is the target, which is what we want to predict. Since we are only exploring the data, I’ve joined the train and test data together for further analysis.
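A minimal sketch of that join with pandas, assuming the standard Kaggle file names sit in a local `data/` directory:

```python
import pandas as pd

# Load the competition files (the file paths here are an assumption)
train = pd.read_csv("data/train.csv")   # 7613 rows, 5 columns
test = pd.read_csv("data/test.csv")     # 3263 rows, 4 columns (no target)

# Tag each row with its origin, then stack them for joint exploration
train["is_train"] = True
test["is_train"] = False
full = pd.concat([train, test], ignore_index=True, sort=False)
print(full.shape)
```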

Image By Author

Simple Exploratory Data Analysis (EDA)

The first thing I like to check when working on classification tasks is the target label because I want to know from the beginning whether we have an Imbalanced classification task on our hands – See Oversampling and Undersampling for more on imbalanced classification.

Image By Author

There is an imbalance toward the negative class (0 → the tweet is not about a disaster) in our dataset. This information is important to know because it could have a severe impact on the final classifier we create. For example, the classifier may be inclined to predict the predominant class, which in turn would give it high accuracy. Accuracy is a misleading metric for these types of problems because it does not reflect the decisions the classifier is actually making, so once deployed into a real-world environment the classifier would be ineffective. In our case, the imbalance isn’t too severe, yet it’s still something worth taking note of.
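To quantify that imbalance, a quick check of the class shares does the job (a small sketch, under the same file-path assumption as above):

```python
import pandas as pd

train = pd.read_csv("data/train.csv")

# Share of each class: 0 = not a disaster, 1 = disaster
class_share = train["target"].value_counts(normalize=True)
print(class_share)

# A classifier that always predicts the majority class already reaches this accuracy,
# which is exactly why accuracy alone is misleading here
print("Majority-class accuracy:", round(class_share.max(), 3))
```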

Next, I generally want to know how much data is missing from each of the features in our dataset. Many machine learning algorithms cannot deal with missing data, so we need a way to handle these instances before building our classifier (depending on which classifier we use) – Learn more about Handling Missing Data.
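One way to compute those missing-value percentages across both files (a sketch, same file-path assumption as before):

```python
import pandas as pd

# Combine train and test; the target column is naturally missing for the test rows
full = pd.concat(
    [pd.read_csv("data/train.csv"), pd.read_csv("data/test.csv")],
    ignore_index=True, sort=False,
)

# Percentage of missing values per column, highest first
missing_pct = full.isnull().mean().mul(100).round(2).sort_values(ascending=False)
print(missing_pct)
```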

Image By Author

33.45% of the values in the location column are missing across the training and test sets, which is quite significant. Since these are the locations of tweets, and users on Twitter have the option of turning off their location, I’m thinking that we could impute something to indicate that the location is missing. Alternatively, given the large number of missing locations, we could consider removing this feature from our pipeline entirely.

The top 10 locations in the tweets included both countries and cities, so I converted the cities to countries and calculated the number of tweets from each country.

Note: There are 4521 unique locations in the dataset; however, I only converted the cities that appeared in the top 10 locations to countries. Therefore, it is likely that there are still some cities throughout the rest of the dataset, which is something to consider if we keep this feature.
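As a rough illustration, the conversion can be done with a simple replacement dictionary; the city names below are hypothetical examples, not the actual top-10 entries:

```python
import pandas as pd

train = pd.read_csv("data/train.csv")

# Illustrative city-to-country mapping (hypothetical entries, not the real top 10)
city_to_country = {
    "New York": "USA",
    "London": "UK",
    "Mumbai": "India",
}

# Replace city-level locations with their country; everything else passes through unchanged
train["country"] = train["location"].replace(city_to_country)
print(train["country"].value_counts().head(10))
```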

Gif By Author

On the same topic of locations, I thought it would be interesting to find out whether there were certain locations where disaster tweets were more likely to appear. With this in mind, I created a visualization showing the percentage of tweets from each of the top 10 locations that were disasters.
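The underlying numbers for that chart come from a straightforward groupby (sketch, same assumptions as the earlier snippets):

```python
import pandas as pd

train = pd.read_csv("data/train.csv")

# For the ten most common locations, the share of tweets labelled as real disasters
top_locations = train["location"].value_counts().head(10).index
disaster_rate = (
    train[train["location"].isin(top_locations)]
    .groupby("location")["target"]
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(disaster_rate.round(1))
```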

Image By Author

A very important question that beginner Data Scientists don’t often ask – and I don’t mean to speak lowly of aspirants; I don’t blame any of you for not asking, since things like this are rarely taught in online courses and can really only be learned through hands-on work – is where on earth the data comes from, how it was collected, and so on. These kinds of queries about the data are very important for various reasons beyond the scope of this post, but I will definitely revisit them in the future.

Nonetheless, one of those reasons is that you want to validate the things you are told about the data by the collection source. For example, I know this data was created by a company called figure-eight, but I am unsure of how the data was created and labeled. One thing I can assure you of is that if humans are involved in the process, there is always room for error, since we get fatigued – of course, computers can make errors too, but hopefully you get my gist.

Since I don’t know how the data was collected or labeled, my first thought was to consider whether there are any duplicated tweets in our dataset. Here are the results:

  • 198 Duplicated tweets in the train and test data
  • 110 of those duplicates are in the training data
  • 88 are duplicate tweets from the test data

This had me worried. If there are duplicates in the data, then it’s possible that a human labeler gave identical tweets different labels. And that thought was right.
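A sketch of how duplicates and conflicting labels can be detected, assuming the same train.csv as before:

```python
import pandas as pd

train = pd.read_csv("data/train.csv")

# Tweets whose text appears more than once in the training data
print("Duplicated texts in train:", train.duplicated(subset="text").sum())

# Texts that appear with more than one distinct label – the worrying case
label_variety = train.groupby("text")["target"].nunique()
conflicting = label_variety[label_variety > 1].index
print("Texts with conflicting labels:", len(conflicting))
print(train[train["text"].isin(conflicting)][["text", "target"]].sort_values("text").head())
```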

Image By Author

Kaggle has an unseen dataset that it uses to evaluate where your position on the private leaderboard would be, and it’s possible that some labels there are incorrect too. In a business environment, you would want to relabel these manually; if they have a severe effect on your classifier’s results, then it is definitely something that requires adjusting. Nevertheless, as a safety precaution, I manually changed these labels to the correct class.

When dealing with text, it’s quite difficult to visualize it on traditional plots (bar plots and line plots, for example). Instead, we can use word clouds to visualize the common words that occur in our data. I used this method to visualize the keyword features that are most associated with disasters…

Image By Author

and those not associated with disasters…

Image By Author

The words displayed in much larger text are the ones that occur most frequently in the dataset. A naive assumption we could make about new instances is that any instance containing the keyword "outbreak" or "wreckage" is likely to be about a disaster, whereas ones with "armageddon" and "body" are likely to be non-disasters.
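For reference, word clouds like the two above can be generated with the wordcloud package; here is a minimal sketch for the disaster-related keywords (swap `target == 1` for `target == 0` to get the non-disaster cloud):

```python
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

train = pd.read_csv("data/train.csv")

# Join the keyword values of all disaster tweets into one string and draw the cloud
disaster_keywords = " ".join(train.loc[train["target"] == 1, "keyword"].dropna())
cloud = WordCloud(width=800, height=400, background_color="white").generate(disaster_keywords)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```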

There are also many other factors to consider. For instance, if someone is in the middle of a disaster, are they going to write longer tweets or shorter tweets, and will they use more punctuation or less? These ideas can be captured with simple statistics.
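A few of those statistics are easy to add as extra columns (a sketch, same train.csv assumption):

```python
import string
import pandas as pd

train = pd.read_csv("data/train.csv")

# Simple statistical features of the raw text: length, word count, and punctuation count
train["char_count"] = train["text"].str.len()
train["word_count"] = train["text"].str.split().str.len()
train["punct_count"] = train["text"].apply(lambda t: sum(ch in string.punctuation for ch in t))

# Averages per class hint at whether disaster tweets really are longer or more punctuated
print(train.groupby("target")[["char_count", "word_count", "punct_count"]].mean().round(2))
```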

Gif By Author

Making Predictions

To determine which model I was going to continue developing, I built various models, using only the text as input, and tested each one with 5-fold stratified cross-validation. The models I used were:

To convert the text from natural language to computer language – simply, numbers – I used 3 different methods:

  • Term Frequency – Inverse Document Frequency
  • Word Counts
  • Word2vec
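Roughly, these three conversions look like this (a sketch using scikit-learn and gensim 4; the word2vec settings are illustrative, not the ones used in the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from gensim.models import Word2Vec

texts = pd.read_csv("data/train.csv")["text"]

# TF-IDF and raw word counts are one-line transforms in scikit-learn
X_tfidf = TfidfVectorizer().fit_transform(texts)
X_counts = CountVectorizer().fit_transform(texts)

# For word2vec, train embeddings on the tokenised tweets and represent each tweet
# as the average of its word vectors (a simple, common pooling choice)
tokens = [t.lower().split() for t in texts]
w2v = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=1, seed=42)
X_w2v = np.array([
    np.mean([w2v.wv[w] for w in doc], axis=0) if doc else np.zeros(100)
    for doc in tokens
])

print(X_tfidf.shape, X_counts.shape, X_w2v.shape)
```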

After running each model through my cross-validation strategy, I took the average of its performance across all the folds and used this as the indicator for deciding which model to take forward for further analysis.
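The cross-validation loop itself is short; here is a sketch with scikit-learn, using TF-IDF plus logistic regression as a stand-in for whichever candidate model is being scored:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

train = pd.read_csv("data/train.csv")

# 5-fold stratified cross-validation of one candidate pipeline, scored with F1
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, train["text"], train["target"], scoring="f1", cv=cv)

print("F1 per fold:", scores.round(4))
print("Mean F1:", round(scores.mean(), 4))
```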

Image By Author

Image By Author

Realistically, I think this dataset is small enough not to require a recurrent neural network, but either way, without any tuning or feature engineering, and using only the text feature, we got a very decent average F1 score (the metric used for this competition) with the bidirectional LSTM – if you’re unfamiliar with the F1 score, you can learn more about it in my Confusion Matrix article.
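For completeness, the rough shape of a bidirectional LSTM classifier in Keras is sketched below; the vocabulary size, sequence length, and layer widths are illustrative assumptions, not the values used for the scores reported here:

```python
import tensorflow as tf

VOCAB_SIZE = 20_000  # assumed tokenizer vocabulary size
MAX_LEN = 50         # assumed padded tweet length
EMBED_DIM = 128

# Embedding -> bidirectional LSTM -> sigmoid output for the binary disaster label
model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```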

Evaluation

To evaluate the model, I split the full training data into 70% training data and 30% validation data so that I could build a single model and assess where the model was making errors. First of all, I noticed that my LSTM model was overfitting the training data quite badly:

  • Training F1 Score – 0.9019 (4 s.f.)
  • Validation F1 Score – 0.7444 (4 s.f.)
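The evaluation setup looks roughly like this (a sketch; TF-IDF plus logistic regression stands in for the LSTM so the snippet stays short and runnable – the scores above came from the LSTM itself):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

train = pd.read_csv("data/train.csv")

# 70/30 split, stratified so both sides keep the original class ratio
X_train, X_val, y_train, y_val = train_test_split(
    train["text"], train["target"],
    test_size=0.3, stratify=train["target"], random_state=42,
)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Comparing the two scores is the overfitting check described above
print("Train F1:", round(f1_score(y_train, model.predict(X_train)), 4))
print("Validation F1:", round(f1_score(y_val, model.predict(X_val)), 4))
```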

To reduce the overfitting, we could add some regularization as an immediate solution, but I will explore this further in another blog post.

From the confusion matrix, I realized that the model was having a hard time predicting the positive class, since it made 294 type II errors – meaning it predicted negative when the tweet was actually positive. These errors are also known as false negatives. This issue may link back to my initial concern about the imbalanced classes.
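Reusing the split and pipeline from the previous sketch, the false-negative count can be read straight off the confusion matrix:

```python
from sklearn.metrics import confusion_matrix

# Rows are the true classes, columns the predicted classes;
# ravel() unpacks the 2x2 matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_val, model.predict(X_val)).ravel()
print("False negatives (type II errors):", fn)
print("False positives (type I errors):", fp)
```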

Wrap Up

In this article, I’ve discussed my process for exploring text data from Twitter and how I go about building models to decide which one to use for further development. In future work, I plan to tackle the overfitting issue we’re seeing and build a very basic front end so that you can type in your own tweets and see whether the system considers them to be about a disaster or not.

Connect with me on LinkedIn and on Twitter to stay up to date with me and my posts about Artificial Intelligence, Data Science, and Freelancing.

