Game of Thrones Twitter Sentiment with Google Cloud Platform and Keras

An end-to-end pipeline with AI Platform, Apache Beam / Dataflow, BigQuery and Pub/Sub

Thomas Dehaene
Towards Data Science


The final season of Game of Thrones apparently raised a lot of eyebrows, so I wanted to dig into how people felt before, during and after the final episode by turning to the anything-but-soft-spoken Twitter community.

In this blogpost, we’ll look at how an end-to-end solution can be built to tackle this problem, using the technology stack available on Google Cloud Platform.

Let’s go!

GIF created by Blackbringer on GIFER (source)

The focus is on realising a fully working solution rather than on perfecting any single component of the pipeline, so each of the individual blocks can certainly be improved upon!

To keep it readable, I haven’t included all of the code here, but everything can be found, fully commented, in this GitHub repo.

The basic idea

The rough outline for the entire pipeline looks something like this:

general architecture (source: own creation)

Basically, what we’ll do is:

  1. Have a script running on a VM, scraping tweets on Game of Thrones
  2. Have a PubSub topic to publish messages to
  3. Have a served ML model to classify tweet sentiment
  4. Have an Apache Beam streaming pipeline pick up the tweets and classify them
  5. Output the classified tweets to BigQuery, to do analyses on

In the rest of the post, we’ll go over each of the components separately, before finishing with one big orchestra of harmonious pipelining bonanza!

We will be relying heavily on Google Cloud Platform, with the following components:

  • Compute Engine: to run the tweepy script on
  • Cloud PubSub: to buffer the tweets
  • Cloud Dataflow: managed Apache Beam runner
  • AI Platform: to serve our ML model via an API
  • BigQuery: to store our tweets in

1. Script on GCE to capture tweets

Google Compute Engine logo (source)

Capturing tweets related to several search terms can easily be done using the tweepy library, like so:
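
The original gist isn’t embedded here, but the idea looks roughly like the sketch below. It assumes the tweepy 3.x streaming API; the tracked keywords, environment variable names and the `publish_to_pubsub` helper (shown in the next snippet) are illustrative, not the exact ones from the repo.

```python
import os
import tweepy


class GotStreamListener(tweepy.StreamListener):
    """Forwards every incoming tweet to Pub/Sub."""

    def on_status(self, status):
        # skip plain retweets, keep only the text and timestamp
        if not status.text.startswith("RT @"):
            publish_to_pubsub(status.text, str(status.created_at))

    def on_error(self, status_code):
        # returning False on a 420 (rate limit) disconnects the stream
        return status_code != 420


# credentials are assumed to be available as environment variables
auth = tweepy.OAuthHandler(os.environ["TW_CONSUMER_KEY"], os.environ["TW_CONSUMER_SECRET"])
auth.set_access_token(os.environ["TW_ACCESS_TOKEN"], os.environ["TW_ACCESS_SECRET"])

stream = tweepy.Stream(auth=auth, listener=GotStreamListener())
stream.filter(track=["#GameOfThrones", "#GoTFinale", "Game of Thrones"], languages=["en"])
```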

To send it to Google Cloud PubSub, we can just use the client library:
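
A minimal sketch of that publishing step (this is the `publish_to_pubsub` helper used above); the project ID is a placeholder, and the topic is the ‘got_tweets’ topic created in the next section:

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# "my-gcp-project" is a placeholder for your own project ID
topic_path = publisher.topic_path("my-gcp-project", "got_tweets")


def publish_to_pubsub(text, created_at):
    """Serialize the tweet as JSON and push it onto the Pub/Sub topic."""
    payload = json.dumps({"text": text, "posted_at": created_at}).encode("utf-8")
    publisher.publish(topic_path, data=payload)
```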

So with this done, it’s as simple as:

  • Setting up a VM on Google Compute Engine (I’ve used a simple n1-standard-1)
  • Copying the script to a bucket on Google Cloud Storage
  • SSH’ing into the VM
  • Copying the script from the bucket to the VM
  • Installing Python 3 on the VM
  • Running the Python script

2. Cloud PubSub topic as message broker

Google Cloud PubSub (source)

Pub/Sub is a great piece of messaging middleware, which serves as the event ingestion and delivery system in your entire pipeline.

Especially in this case, where the tweets will potentially flow in much faster than the streaming pipeline can pick them up, it’s a great tool, given that ingestion and delivery are decoupled asynchronously.

Pub/Sub can also store the received messages for a number of days, so no worries if your downstream tasks struggle to keep up.

Creating a topic is extremely easy: just navigate to your GCP Console and go to the Pub/Sub menu:

PubSub console UI (source: own screen-capture)

From here on, just click the CREATE TOPIC button and fill in a name for your topic. For future reference, I’ve named mine ‘got_tweets’.

3. Served ML model on AI Platform

Google Cloud ML Engine (source)

For each tweet coming in, we want to determine if the sentiment expressed (presumably towards the episode) is positive or negative. This means we will have to:

  • look for a suitable dataset
  • train a machine learning model
  • serve this machine learning model

Dataset

When thinking about sentiment analysis, the ‘IMDB Movie Review’ dataset quickly comes to mind. For this specific purpose though, that classic seemed less suited, since we are dealing with tweets here.

Luckily, the Sentiment140 dataset, which contains 1.6 million labeled (positive and negative) tweets, seems to be perfectly suited for this case. More info, and the dataset, on this Kaggle page. Some examples:

sample from the Sentiment140 dataset

Preprocessing the text is done in a separate class, so that it can later be reused when calling the model:
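
The real class lives in the repo; a condensed sketch of what it does (class, method and parameter names here are illustrative) could look like this:

```python
import re
import pickle
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


class TextPreprocessor:
    """Cleans, tokenizes and pads raw tweets so that training and serving
    share exactly the same preprocessing logic."""

    def __init__(self, vocab_size=20000, max_len=50):
        self.max_len = max_len
        self.tokenizer = Tokenizer(num_words=vocab_size, oov_token="<unk>")

    def _clean(self, texts):
        # drop URLs, @-mentions and most punctuation, lowercase the rest
        return [re.sub(r"(https?://\S+|@\w+|[^a-z0-9' ])", " ", t.lower()) for t in texts]

    def fit(self, texts):
        self.tokenizer.fit_on_texts(self._clean(texts))

    def transform(self, texts):
        sequences = self.tokenizer.texts_to_sequences(self._clean(texts))
        return pad_sequences(sequences, maxlen=self.max_len)

    def save(self, path):
        # the fitted instance is pickled and shipped alongside the model
        with open(path, "wb") as f:
            pickle.dump(self, f)
```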

Model

For the classification model itself, I based the architecture on the famous 2014 Yoon Kim paper on Multichannel CNNs for Text Classification (source). For ease of development (and later deployment), I used Keras as the high-level API.

CNN architecture overview (source)

A CNN-based model provides the additional benefit that training was still feasible in a decent time on my little local workstation (NVidia GTX 1050Ti with 4GB of memory), whereas an RNN-based model (often used for sentiment classification) would have taken much longer to train.

We can try to give the model some extra zing by loading pretrained word embeddings. In this case, the GloVe Twitter embeddings (trained on 27B tokens) seemed like a good option!
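
A sketch of what such a multichannel CNN looks like in Keras; the filter counts, kernel sizes and embedding dimension below are illustrative choices, not necessarily the exact ones used in the notebook.

```python
from tensorflow.keras import layers, Model


def build_model(vocab_size, max_len, embedding_matrix, embedding_dim=100):
    """Multichannel CNN in the spirit of Yoon Kim (2014): parallel
    convolutions with different kernel sizes over the same embedding."""
    inputs = layers.Input(shape=(max_len,))
    embedded = layers.Embedding(vocab_size, embedding_dim,
                                weights=[embedding_matrix],  # pretrained GloVe vectors
                                trainable=True)(inputs)

    pooled = []
    for kernel_size in (3, 4, 5):
        conv = layers.Conv1D(128, kernel_size, activation="relu")(embedded)
        pooled.append(layers.GlobalMaxPooling1D()(conv))

    x = layers.concatenate(pooled)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```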

The full code can be found in this notebook.

We trained the model for 25 epochs, with two Keras callback mechanisms in place (a minimal sketch follows the list):

  • a callback to reduce the learning rate when the validation loss plateaus
  • a callback to stop training early when the validation loss hasn’t improved in a while, which caused training to stop after 10 epochs
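
In Keras terms, roughly the following; the exact patience, factor and batch-size values are assumptions:

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

# model, X_train and y_train come from the preprocessing and model steps above
callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2, verbose=1),
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True, verbose=1),
]

model.fit(X_train, y_train,
          validation_split=0.1,
          epochs=25,
          batch_size=1024,
          callbacks=callbacks)
```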

The training and testing curve can be seen here:

So we obtain an accuracy of about 82.5%.

Serving the model

AI Platform provides a managed, scalable, serving platform for Machine Learning models, with some nice benefits like versioning built into it.

Now for hosting, there’s one aspect of our model which makes it a bit less trivial to serve on AI Platform: we need to normalize, tokenize and index incoming text in exactly the same way as during training.

Still though, there are some options to choose from:

  • Wrap the tf.keras model in a TF model, and add a Hashtable layer to keep the state of the tokenization dict. More info here.
  • Go full-blown and implement a tf.transform preprocessing pipeline for your data. Great blog post about this here.
  • Implement the preprocessing later on, in the streaming pipeline itself.
  • Use the AI Platform Beta functionality of having a custom ModelPrediction class.

Given that there was neither the time nor the resources to go full-blown tf.transform, and that potentially overloading the streaming pipeline with additional preprocessing seemed like a bad choice, the last option looked like the way to go.

The outline looks like this:

overview of the serving architecture (source: own creation)

Custom prediction classes are easy enough to write; there’s a great blog post on them by the folks from Google here. Mine looks like this:
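
In outline it follows the AI Platform custom prediction interface (a `predict` method plus a `from_path` classmethod); the file names `model.h5` and `processor.pkl` below are assumptions, the exact implementation is in the repo:

```python
import os
import pickle
from tensorflow.keras.models import load_model


class CustomModelPrediction(object):
    """Applies the pickled preprocessor before feeding tweets to the Keras model."""

    def __init__(self, model, processor):
        self._model = model
        self._processor = processor

    def predict(self, instances, **kwargs):
        # instances is a list of raw tweet strings sent in the request body
        padded = self._processor.transform(instances)
        probabilities = self._model.predict(padded)
        return probabilities.tolist()

    @classmethod
    def from_path(cls, model_dir):
        # called once by AI Platform when the version is spun up
        model = load_model(os.path.join(model_dir, "model.h5"))
        with open(os.path.join(model_dir, "processor.pkl"), "rb") as f:
            processor = pickle.load(f)
        return cls(model, processor)
```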

To create a served AI platform model from this, we just need to:

  • package up the custom prediction class and the preprocessing .py file
  • upload this package, together with the persisted model and a pickled instance of the preprocessing class, to a bucket
  • from there on, create a model named whatever you want
  • in this model, create a new version based on the uploaded artifacts, using the gcloud beta functionality for custom prediction classes

4. An Apache Beam streaming pipeline

Cloud Dataflow (source)

Tweets come in in a streaming fashion; they form a literally unbounded dataset. A streaming pipeline therefore seems like the perfect tool to capture tweets from a Pub/Sub topic and process them.

We will use Apache Beam as the programming model, and run the pipeline on a Dataflow runner (a managed environment on Google Cloud for running Beam pipelines). Those of you who want to learn more about Apache Beam and its paradigm can do so on the website.

First, when streaming, we have to consider a windowing strategy. Here, we simply use a fixed window of 10 seconds.

Fixed windowing strategy (source)

Other strategies are possible as well, such as a sliding window, but that would likely imply extra calls to the hosted ML model. So fixed windowing seemed the easiest to get started with.

The main steps in our pipeline are as follows (a condensed sketch follows the list):

  • Pull in Pub/Sub messages in 10-second intervals
  • Batch them up in batches of 50 messages (not too big, or the body of the request will be too large)
  • Classify them by making calls to the hosted ML model
  • Write them to a BigQuery table
  • In parallel, compute the mean sentiment over each 10-second window and write it to a second BigQuery table
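
The sketch below shows the shape of such a pipeline; the project, topic, model and table names are placeholders, runner/project/temp-location flags are assumed to be passed on the command line, and the complete version with error handling lives in the repo.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window
from googleapiclient import discovery


class ClassifyTweets(beam.DoFn):
    """Sends a batch of tweets to the hosted AI Platform model."""

    def process(self, batch):
        # a production version would cache this client instead of rebuilding it
        service = discovery.build("ml", "v1", cache_discovery=False)
        name = "projects/my-gcp-project/models/got_sentiment"
        tweets = [json.loads(element) for element in batch]
        texts = [t["text"] for t in tweets]
        response = service.projects().predict(name=name, body={"instances": texts}).execute()
        for tweet, score in zip(tweets, response["predictions"]):
            # assumes the custom predictor returns one probability per tweet
            yield {"text": tweet["text"], "posted_at": tweet["posted_at"], "sentiment": float(score[0])}


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    classified = (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-gcp-project/topics/got_tweets")
        | "Window10s" >> beam.WindowInto(window.FixedWindows(10))
        | "Batch" >> beam.BatchElements(min_batch_size=10, max_batch_size=50)
        | "Classify" >> beam.ParDo(ClassifyTweets())
    )

    # branch 1: every classified tweet goes to its own (pre-created) table
    classified | "WriteTweets" >> beam.io.WriteToBigQuery(
        "my-gcp-project:got_dataset.tweets",
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)

    # branch 2: mean sentiment per 10-second window
    (classified
     | "ExtractScore" >> beam.Map(lambda row: row["sentiment"])
     | "MeanPerWindow" >> beam.CombineGlobally(beam.combiners.MeanCombineFn()).without_defaults()
     | "ToRow" >> beam.Map(lambda mean, win=beam.DoFn.WindowParam: {
           "window_end": win.end.to_utc_datetime().isoformat(),
           "mean_sentiment": mean})
     | "WriteMeans" >> beam.io.WriteToBigQuery(
           "my-gcp-project:got_dataset.mean_sentiment",
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```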

When running on Cloud Dataflow, it looks as follows:

screen capture of the created dataflow pipeline

The full code is a little long to paste here, but it can be found in full on my Github repo.

5. BigQuery tables to stream results into

BigQuery (source)

As stated before, we have two BigQuery tables to stream the results into:

  • One for the individual posts, with the sentiment label, to perhaps relabel them in the future and finetune our classifier
  • One for the mean predicted sentiment per 10-second window
screen capture of the created dataset and tables

You can just create these from the UI and specify a schema (which of course has to match the schema specified in your Beam pipeline job).
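
If you prefer code over the UI, the same thing can be done with the BigQuery client library; the dataset, table and field names below are placeholders that should match whatever your pipeline writes:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# table for the individual classified tweets
tweets_schema = [
    bigquery.SchemaField("text", "STRING"),
    bigquery.SchemaField("posted_at", "TIMESTAMP"),
    bigquery.SchemaField("sentiment", "FLOAT"),
]
client.create_table(bigquery.Table("my-gcp-project.got_dataset.tweets", schema=tweets_schema))

# table for the mean sentiment per 10-second window
mean_schema = [
    bigquery.SchemaField("window_end", "TIMESTAMP"),
    bigquery.SchemaField("mean_sentiment", "FLOAT"),
]
client.create_table(bigquery.Table("my-gcp-project.got_dataset.mean_sentiment", schema=mean_schema))
```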

The run

I ran the entire pipeline for a few hours, to capture the sentiment leading up to, during and after the episode.

Given that the volume of tweets could quickly become fairly large, it was also interesting to observe the scaling behaviour of all of the components:

  • AI Platform: a real MVP in this story, scales really well in the backend when the load increases, to try and keep response times stable:
Requests per second for calls to AI Platform model
Response times during the run, nice n’ stable
  • Cloud Dataflow: in hindsight, Java streaming feels a bit more solid than Python streaming. Autoscaling does not currently work for Python streaming pipelines, which caused the system delay to grow throughout the run:
System delay (in seconds, right-hand axis)
  • BigQuery: not a problem at all. BQ works with a streaming buffer and periodically offloads the data to the table itself. Post-run analysis was never an issue either.

In total, about 500,000 tweets were collected over a 7-hour period. Here are some examples, with their predicted sentiment (warning: spoiler alert!):


The results

Now as for the main question, we could frame it as:

What is the average sentiment expressed in the tweets, per minute, from one hour before the episode until one hour after it?

Simple enough with some SQL query magic (see the notebook in the repo, and the sketch after these notes), with a few remarks:

  • The scores were standardized to mean 0 and stddev 1
  • Both the moving average and raw mean sentiment are shown
  • Some key scenes from the show are mentioned
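
A sketch of what such a query might look like, run from Python with the BigQuery client; the table name, time window and smoothing parameters are placeholders, not the exact values from the notebook:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

query = """
    SELECT
      TIMESTAMP_TRUNC(posted_at, MINUTE) AS minute,
      AVG(sentiment) AS mean_sentiment
    FROM `my-gcp-project.got_dataset.tweets`
    WHERE posted_at BETWEEN '2019-05-20 00:00:00' AND '2019-05-20 04:00:00'
    GROUP BY minute
    ORDER BY minute
"""
df = client.query(query).to_dataframe()

# standardize to mean 0 / stddev 1 and add a moving average for the plot
df["standardized"] = (df["mean_sentiment"] - df["mean_sentiment"].mean()) / df["mean_sentiment"].std()
df["moving_avg"] = df["standardized"].rolling(window=5, center=True).mean()
```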

👉 So apparently, the community was very hostile towards GoT before the show, gradually putting down their pitchforks and torches towards the beginning of the episode.

👉 It could be argued that Bran being named king was well received; I too thought this was a very nice plot twist 😃!

👉 Another positive scene was when Brienne of Tarth was writing about Jaime in the book of knights.

👉 After the episode, the community seemed rather negative about it, changing its mind a little after about 45 minutes, before turning negative once again…

They ended up being rather negative about the episode, which seems to be reflected in the IMDb score of only 4.4 😮. One could argue that the episode never stood a chance: the community was already rather negative beforehand, so the sentiment started out with something of a negative bias.

Is this the ground truth though? Nobody knows for sure, but I’m quite happy with the results 👍.


So there we have it! An answer to our question, using the toolbox Google Cloud provides us. FYI: the total cost of the operation ended up being around $5, which I would say is fairly reasonable!
