
xkcd.com + Artificial Intelligence

This article shows you how we created an xkcd.com comic classifier using data science, deep learning, and elbow grease. We can predict the topic of a comic from its description.

xkcd is an excellent example of time well wasted.

My circle of friends has a huge nerd crush on Randall Munroe, the author of the xkcd comics and books like What If?. Mary Kate MacPherson took the initiative, scraped the transcripts for every comic, and used my patent analyzer code to turn the transcripts into embedding vectors. She then tossed the vectors and labels into TensorFlow and crunched the data down into clusters using t-SNE. Those of you who know xkcd.com will be well aware that the comics are numbered sequentially. I have a bet going that Randall will mention it when the comic number is the same as the current year (2018), so I have to publish this article quickly! As of writing this sentence, he is already at 2017!

Here is a video of how that dimensionality reduction looks, both in 3D and in 2D:

The output of the t-SNE dimensionality reduction is a set of points in 2D space, crunched down from the original 300-dimensional embedding space.

Mary Kate MacPherson then looked through the clusters in the 2D result to understand what they mean. The figure below shows us that the comics cluster by topic and entity.

This is our look into Randall Munroe’s brain.

Looking at the big picture, we can see that comics about food clump together, as do comics about Star Wars, trends, wordplay, kids, charts/diagrams, and so forth. As humans, we can identify small clusters of a dozen or fewer items (e.g. the kid cluster), but to do a really good job, deep learning wants to see lots of examples, so I will show you how to build an xkcd topic model below using fewer clusters. For context, xkcd.com has a total of about 2,000 comics.

Let’s have a look at how the comics cluster in the time domain:

Brighter dot = newer comic. Darker dot = older comic.

Here is a look at the original code:

Original patent application to vectors code… but comics are way more fun!

Applying t-SNE and GMM in Python, we get the following clusters:

Not super nice clumps. Let’s try again…

When we see clusters like the one above, it is not a good result. We want to see clumps where there is data, and empty holes between clusters. It reminds me of a quote I often hear from a friend of mine: "The future is already here – it’s just not very evenly distributed." And so when we see even distributions like this, it is usually a sign that we need to try again.
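For anyone who wants to follow along, here is a minimal sketch of that step using scikit-learn, assuming the 300-dimensional comic vectors are sitting in a NumPy array (the file name below is just a placeholder, and the t-SNE settings mirror the projector settings described later):

import numpy as np
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

# placeholder path: one row per comic, 300 columns of embedding values
vectors = np.load('comic_vectors.npy')

# crunch the 300 dimensions down to 2 (perplexity and learning rate match the projector settings)
points_2d = TSNE(n_components=2, perplexity=12, learning_rate=1.0, n_iter=10000).fit_transform(vectors)

# fit a 10-component Gaussian mixture model and assign a cluster label to each comic
cluster_labels = GaussianMixture(n_components=10, random_state=42).fit_predict(points_2d)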

Here is an example of the type of separation we would like to see:

This MNIST example looks way better than what we got from our Python code so far… Source: this online book.
The color of each dot shows the cluster membership for that dot. The separation is still not as nice as what we saw with the embedding projector…

I think I found the reason for the mismatch between TensorBoard and my Python code: I had the perplexity at 25 instead of 12. D’oh. I changed the 2D and 3D perplexity values and reran the code. Here is what we get:

Each color is a different cluster. It looks quite similar to the original non-working example… but a later version worked amazingly well, as you will see.
2D slice of the 3D t-SNE results. Clusters 4 and 6 are pretty mixed up.

I am pretty happy with the 2D t-SNE result, but GMM didn’t do a great segmentation job:

More separation than the last time.

Normalizing the 300 columns before t-SNE gave us this result:

Let’s just go for it…
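In code, that normalization step can be as simple as standardizing each column before running t-SNE (a sketch; the original code may have used a different scaler):

from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# scale each of the 300 columns to zero mean and unit variance, then re-run t-SNE
normalized = StandardScaler().fit_transform(vectors)
points_2d = TSNE(n_components=2, perplexity=12).fit_transform(normalized)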

Let’s step back from nerding out on t-SNE and try to understand the clusters, but first, here is the data used to do all this stuff…

Just to remind you, we have used comics to create a topic model. Here are the topic model vectors for each comic, and here are the topic labels (just integers) for each comic. I’ve organized everything into a GitHub repo for your coding pleasure.

Here are the cluster dataframes for each comic using the automated methods described above: 3D and 2D. And here is the notebook for applying t-SNE in 2D and 3D. Here are the cluster labels (just integers) for each comic (samples for 10 clusters) using the manual method described below.
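Once you grab those files from the repo, loading them into pandas takes a couple of lines (a sketch; the file names below are placeholders, so check the repo for the real ones):

import pandas as pd

# placeholder file names: swap in the actual files from the GitHub repo
vectors_df = pd.read_csv('xkcd_topic_vectors.csv')   # 300-dimensional topic vector per comic
labels_df = pd.read_csv('xkcd_cluster_labels.csv')   # integer cluster label per comic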

What does it all mean?

Here is our interpretation of the clusters for the 3D t-SNE data clustered by GMM:

Cluster examples from this 3D dataset are: ZERO (questions), ONE (tech), TWO (graphs), THREE (dates+places), FOUR (time), FIVE (thinking stuff), SIX (hats! yes, hats!), SEVEN (bad stuff), EIGHT (math), and NINE (interwebz).

And here is our interpretation of the 2D cluster data we created using the embedding projector by hand:

Cluster examples from this dataset are: ONE (time), TWO (philosophy), THREE (science), FOUR (perspective), FIVE (excited), SIX (Megan), SEVEN (computerwebs), EIGHT (hyphens), NINE (tech), and TEN (thinking stuff).

A special thank you to Mary Kate MacPherson for doing all the manual labor on the cluster labeling, and Mathieu Lemay for letting her do it.

To make your own cluster labels manually…

To get the clusters (labels) manually instead, we ran 10,000 iterations of t-SNE (learning rate = 1; perplexity = 12), and instead of using a clustering algorithm (k-means or GMM), we just eyeballed it and pulled out the IDs for each cluster. I absolutely LOVE that you, the reader, can head over to projector.tensorflow.org and completely replicate this experiment.

Here is a big-picture view of how to do this yourself the "manual" way:

This is the picture of the hand-made data we used.

Now let’s go through this step by step:

1) Load the data

Here is a link to an interactive version of the data in the embedding projector:

xkcd Embedding projector THIS-IS-SO-COOL

2) Run t-SNE. Stop when you feel like the clustering looks good.

3) Create clusters by hand by dragging with the annotation tool

This is the tool you want
Drag the tool to make a box over the points you want to label. We can see that the cluster includes samples such as 1834, 128, 307, 459, 89, and 477. They are about disaster, pain, disaster, angry exclamation… I see a pattern.
The points will show up in the list on the right hand side

4) Inspect the source of the list on the right, and copy out the HTML DIV of the list (I know: gross, and it has a limit of 100, so you don’t get all the data for big clusters)

5) Replace all the HTML with spaces in Notepad++. To do this, simply open the replace menu (CTRL+H) and replace this regular expression with a space character:

<[^>]+>

Now replace all double spaces with single spaces until there are only single spaces, and then replace all the single spaces with newlines. Chopping off the first space character, we now have a list of IDs for this cluster. Repeat for each cluster until you have covered all the clusters you care about. BAM. Done.
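If you would rather skip Notepad++, the same cleanup is a few lines of Python (a sketch; cluster.html is just wherever you pasted the copied DIV):

import re

# read the copied HTML for one cluster and strip every tag, leaving only the point labels
with open('cluster.html') as f:
    text = f.read()

text = re.sub(r'<[^>]+>', ' ', text)  # same regular expression as above
ids = text.split()                    # split() collapses the runs of spaces into the list of comic IDs
print(ids)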

You can merge all the clusters into one file either in Python or Excel. From here it is pretty easy to see how we got to building the pandas dataframe of the topic model.
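A quick sketch of that merge step in pandas (the column names are my own choice, and the IDs shown are just the ones from the screenshot above):

import pandas as pd

# one list of comic IDs per cluster, pulled out of the projector as described above
clusters = {
    0: [89, 128, 307, 459, 477, 1834],  # e.g. the disaster/pain cluster shown in the screenshot
    # 1: [...], and so on for each cluster
}

rows = [(comic_id, cluster) for cluster, ids in clusters.items() for comic_id in ids]
labels_df = pd.DataFrame(rows, columns=['comic_id', 'cluster']).sort_values('comic_id')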

Predict Stuff!

It would be really great to build a topic model of xkcd and predict what kind of topic a given comic is about.

We know the input (X) is the topic model vector for the comic and the ascending ID of the comic, and the prediction (Y) is the cluster label. So we know this is a categorical problem that will need a softmax, and an MLP (DNN) seems up for the job.
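Concretely, the setup might look something like this (a sketch; vectors holds the 300-dimensional topic vector per comic, labels_df holds the cluster labels built above, and the 80/20 split is an assumption):

import numpy as np
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical

# X = the topic vector plus the comic's ascending ID; Y = one-hot encoded cluster label
comic_ids = labels_df['comic_id'].values.reshape(-1, 1)
X = np.hstack([vectors, comic_ids])
y = to_categorical(labels_df['cluster'].values)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)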

Here is the network we used on the 3D t-SNE dataset (random_state=42!):

from keras.models import Sequential
from keras.layers import Dense, Dropout

# three hidden layers of 128 ReLU units, each followed by 50% dropout,
# then a softmax output with one unit per cluster
model = Sequential()
model.add(Dense(128, activation='relu', input_dim=X_train.shape[1]))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(y_train.shape[1], activation='softmax'))
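The compile, train, and evaluate steps are not shown above; a plausible version looks like this (the optimizer, batch size, and epoch count are my guesses, not necessarily what we actually used):

import numpy as np
from sklearn.metrics import confusion_matrix

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.1)

# score the held-out comics and print the confusion matrix over the 10 clusters
y_pred = model.predict(X_test)
print(confusion_matrix(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1)))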

And we got very good predictions from this network (89% accuracy):

Confusion Matrix
[[11  0  0  1  0  0  0  0  0  0]
 [ 0 27  0  0  0  0  1  2  1  1]
 [ 0  2 11  0  0  0  0  0  4  2]
 [ 0  1  0 16  0  0  0  0  0  0]
 [ 0  0  0  0 24  0  1  1  0  0]
 [ 1  0  0  1  0 25  0  0  0  0]
 [ 0  0  0  0  1  0 18  0  0  0]
 [ 0  0  0  0  0  0  0 12  0  0]
 [ 0  0  0  0  0  0  0  0 23  0]
 [ 0  2  0  0  0  0  0  0  0 13]]

If you do play around with this dataset it would be wonderful to hear about your results! I’m sure Randall will be pleased that we are all nerding out on his years of doodling.

Happy xkcd coding,

-Daniel

[email protected] ← Say hi. Lemay.ai 1(855)LEMAY-AI
