Did Melania Really Tweet That?

My models say uh-uh.

Jason Peterson
Towards Data Science

--

You are what you tweet. Wordclouds formed from the most significant words in the POTUS (orange) and FLOTUS (cyan) tweet corpora. (Want to make wordclouds like these? Use this code example that shows how to do masking and custom colorization.)

Melania is back! After 24 days out of the public eye, she attended a White House ceremony earlier this week, or at least her body double did. ;)

Her absence had stirred a lot of speculation in the media, which she addressed in a tweet on May 30th.

The mystery tweet.

The trouble was that the tweet didn’t quash speculation as to her whereabouts because it didn’t sound like her. Uncharacteristically angry in tone, it sounded a lot more like Trump. (Or at least it sounded more like Trump’s people than her people.)

It reminded me of the 90s mystery around who wrote the anonymous book Primary Colors, a thinly veiled work of fiction (too thin for some) set in the Clinton White House. A Vassar lit prof and self-styled “forensic linguist” and “stylometrician” made the case that the columnist Joe Klein wrote the book. Klein later admitted that he did.

That prof is circumspect about his exact methodologies for inferring authorship (and he’s been wrong as often as right), but it has something to do with tallying word frequencies. (Ring any bells?)

It all inspired me to conduct a quick Python experiment to try to answer the question of who really wrote the angry tweet above.

Step 1: Get the Data

Twlets.com makes this step very easy. You install a Chrome app, visit the page of the Twitter user whose tweets you want, and click the toolbar icon to download.

There’s a maximum number of tweets you can download before you hit a paywall, but it’s high enough that I was able to download all of the official POTUS and FLOTUS tweets.

A Snag: Data Imbalance

As of a couple days ago, Melania had only 307 tweets from her official FLOTUS account, versus Trump’s 3,259 tweets from his POTUS account.

Graphically (just for fun), that imbalance looks like this.

All POTUS/FLOTUS tweets plotted after TF-IDF vectorization and PCA dimension reduction (we’ll get to all that below).

That’s a lot of orange marbles for The Donald and too few cyan ones for Melania. A more than 10-to-1 imbalance.

An imbalance like this makes it hard to gauge the accuracy of a model in development. Given this imbalance, the simplest possible model just classifies every input as belonging to the majority class: Trump’s.

All of Melania’s tweets will be misclassified by this simple model, but it will still score roughly 91% accuracy (3,259 of the 3,566 tweets are Trump’s, or about 91.4%).

So the goal is a model that clearly beats that 91% baseline. I’m not sure we’ll get there in this little experiment, but that’s the target.

(Better would be to balance the data. Were I doing this again, and had I not already blown through my free tweet downloads via Twlets, I would mix some of Melania’s personal tweets in with her official tweets, so as to have as many Melania samples as we have Trump samples. I’ll leave that exercise to future researchers.)

Step 2: Clean and Divvy Up the Data

We need to remove special characters and capitalization from the tweets before vectorizing them. I just used the simple clean_str function found in many GitHub NLP repos, such as this one.
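For reference, here’s roughly what those clean_str variants do (this is my own condensed version, not code copied from the linked repo):

    import re

    def clean_str(string):
        # Keep letters, digits, and a little punctuation; space out the punctuation;
        # collapse runs of whitespace; lowercase everything.
        string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
        string = re.sub(r"\'s", " 's", string)
        string = re.sub(r"\'ve", " 've", string)
        string = re.sub(r"n\'t", " n't", string)
        string = re.sub(r"(\(|\)|,|!|\?)", r" \1 ", string)
        string = re.sub(r"\s{2,}", " ", string)
        return string.strip().lower()

    cleaned_tweets = [clean_str(t) for t in raw_tweets]  # raw_tweets is a list of downloaded tweet strings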

I used 85% of my data for training and the remaining 15% for testing (I didn’t use a dev set for this little experiment). That gave me a training set with 2,769 samples and a test set with 489 samples, each a labeled, randomized mix of Trump and Melania tweets.

We’ll withhold the tweet whose authorship is in question from both the training and test sets, and ask for a prediction on it at the very end of this exercise.
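A sketch of the split with sklearn (the variable names are mine; labels use 0 for Trump and 1 for Melania, matching the convention that shows up again at the end of the post):

    from sklearn.model_selection import train_test_split

    # `tweets` holds the cleaned tweet text, `labels` the 0/1 author labels.
    X_train, X_test, y_train, y_test = train_test_split(
        tweets, labels,
        test_size=0.15,    # 85% train / 15% test
        shuffle=True,
        random_state=42)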

Step 3: Vectorize the Data

We need to convert the words in these tweets into numbers so we can feed them to a classifier of some sort.

I tried two vectorization techniques, one old, one newer.

Term Frequency-Inverse Document Frequency (TF-IDF)

This technique dates from the 1950s. That Vassar prof from the 90s on the trail of Joe Klein would have had this technique at his disposal and almost certainly made use of it.

The Wikipedia definition for TF-IDF is easy enough to understand:

The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

So, the more times a word appears in a document, the higher its score. But that score gets discounted if the same word is also common throughout all the documents (the corpus). This weeds out very common words (articles, prepositions, to-be conjugations, etc.), leaving high scores for what should be the significant words in a given document.
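In symbols (this is the textbook version of the formula; sklearn’s default adds a little smoothing on top):

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the number of times term t appears in document d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t.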

This is the technique used to make the pretty wordclouds in the intro. Each word there represents a whole tweet from either Trump or Melania; it was chosen because it had the highest TF-IDF score of all the words in that tweet.

The sklearn library in Python makes TF-IDF vectorization dead simple. In a couple of lines we can turn the training set into a big sparse array with dimensions equal to number of samples × number of features. Each feature stands for a word that appears somewhere in the corpus, represented as a TF-IDF score.

For most documents, most features will be zero because the associated word appears nowhere in that document. Non-zero features represent words that do appear in the document, weighted according to how often they appear in the given document versus the larger corpus.
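Here’s a minimal sketch of that step (again, my own variable names, carrying over from the split above):

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer()                       # defaults: word tokens, l2-normalized scores
    X_train_tfidf = vectorizer.fit_transform(X_train)    # learn the vocabulary from the training tweets
    X_test_tfidf = vectorizer.transform(X_test)          # reuse that vocabulary on the test tweets

    print(X_train_tfidf.shape)   # (number of training tweets, number of distinct words), stored sparse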

These arrays are grist that any model can chew on. That model can be an older-style support vector machine (SVM) or a newer style neural network, a convolutional neural network (ConvNet) or a recurrent neural network (RNN).

Word Embedding Vectorization

TF-IDF is basically word tallying. It does not capture meaning. Word embeddings attempt to capture meaning.

They’re complicated to explain, but this post does a good job of it. The gist is that deep learning techniques can be used to digest a whole corpus of words into a low-dimensional embedded matrix. That matrix will be as wide as the number of words in the corpus, but reduced in height to 100 or 300, say. It will, given enough words, almost magically preserve the context in which words tend to get used.

If we query the model holding this matrix with a word, we get back its 100- or 300-dimensional vector. That is the embedding the model has learned for this particular word.

In the case of documents, such as our tweets, we can query the model for every word in the tweet and then do something as simple as averaging the vectors to create the final input.
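A sketch of that averaging, assuming a gensim KeyedVectors object (model.wv for a model we trained ourselves, or a pre-trained download):

    import numpy as np

    def tweet_to_vector(tweet, keyed_vectors):
        # Average the embeddings of every in-vocabulary word in the tweet.
        vectors = [keyed_vectors[w] for w in tweet.split() if w in keyed_vectors]
        if not vectors:                                   # no known words at all
            return np.zeros(keyed_vectors.vector_size)
        return np.mean(vectors, axis=0)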

Caveat: We Need a Lot of Words to Properly Learn Embeddings

The Gensim library makes it easy-peasy to learn embeddings in Python.

Trouble is, our training corpus is tiny (7,583 words). That means the model is going to be very dumb in the semantic sense.

If we ask the model for words similar to, say, “media” (yeah, you can do that with this sort of model!), it gives us back some not very semantically related results.

The cool thing, though, is that we can load pre-trained models that have been fed very large corpora (Wikipedia pages, say). If we do that, we do get back semantically related results.
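Both variants look roughly like this in gensim (4.x keyword names; the pre-trained model name is just one example of what the downloader offers):

    from gensim.models import Word2Vec
    import gensim.downloader as api

    # Variant 1: learn embeddings from our own tiny tweet corpus.
    sentences = [t.split() for t in X_train]             # gensim wants lists of tokens
    own_model = Word2Vec(sentences, vector_size=100, window=5, min_count=2)
    print(own_model.wv.most_similar('media'))             # semantically weak with so few words

    # Variant 2: load embeddings pre-trained on a much larger corpus.
    pretrained = api.load('glove-wiki-gigaword-100')      # e.g. GloVe vectors trained on Wikipedia + Gigaword
    print(pretrained.most_similar('media'))                # much more sensible neighbors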

Step 4: Train a Couple of Classification Models

We’ve got words now turned into numeric features. We can use those features to train any sort of classification model we like. I used sklearn’s stochastic gradient descent (SGD) classifier, because I wanted to do some super fast training, but, again, you could use something more advanced as well (a ConvNet, say).

Let’s train an SGD classifier on both the TF-IDF sparse matrices and the pre-trained word embedding vectors.
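Something like this, reusing the features and helpers from the earlier sketches (logistic loss lets us ask for probabilities later; older scikit-learn versions call it 'log' rather than 'log_loss'):

    from sklearn.linear_model import SGDClassifier
    import numpy as np

    # Model 1: TF-IDF features
    sgd_tfidf = SGDClassifier(loss='log_loss', max_iter=1000, random_state=42)
    sgd_tfidf.fit(X_train_tfidf, y_train)

    # Model 2: averaged pre-trained word-embedding features
    X_train_emb = np.array([tweet_to_vector(t, pretrained) for t in X_train])
    sgd_emb = SGDClassifier(loss='log_loss', max_iter=1000, random_state=42)
    sgd_emb.fit(X_train_emb, y_train)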

Step 5: Test the Classification Models

We’ve got a couple of trained classification models now. Let’s throw our test samples at them and see how they perform. Remember, we want better accuracy than our baseline of 91%.

TF-IDF-Based Classification

We TF-IDF vectorize the test samples, feed them to the SGD classifier, and then evaluate the predictions against the ground-truth labels.
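In code, assuming the objects from the training sketch above:

    from sklearn.metrics import accuracy_score

    y_pred = sgd_tfidf.predict(X_test_tfidf)
    print(accuracy_score(y_test, y_pred))   # fraction of test tweets attributed to the right author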

95.09%! Not amazing, but better than our baseline. And that’s with the old tech.

Word-Embedding-Based Classification

How about with the new tech that uses semantically smarter input features?

A mere 88.75%. Worse than our baseline in fact. The semantic context to the features didn’t help much with our classification challenge. And, if you glance around the Internet, you’ll see quite a few posts (like this one here) where old-skool TF-IDF beats the newer techniques.

Step 6: Answer Our Research Question: Did Melania Write that Tweet or Not?

We’ve got one classification model that beats our baseline accuracy and another that doesn’t. Let’s ask them both who they think wrote the tweet.

First, the TF-IDF-fed model . . .

We vectorize the single tweet in question and get back the prediction.

Note from the printout that the input tweet was labeled with a “1”, as were all Melania tweets.

But the classifier predicts a “0” label for it, meaning that it thinks Trump wrote it.

From the printed log probabilities, we can see it puts the chance that he did at 71.08%.
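The query itself is a one-liner once the tweet is vectorized (mystery_tweet here is just a placeholder for the text of the May 30th tweet):

    mystery_vec = vectorizer.transform([clean_str(mystery_tweet)])

    print(sgd_tfidf.predict(mystery_vec))         # 0 = Trump, 1 = Melania
    print(sgd_tfidf.predict_proba(mystery_vec))   # class probabilities (predict_log_proba for the log version)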

What about that lower-performing model? We don’t really trust it because of its lower-than-baseline accuracy, but let’s ask it anyway.

Again, it says Trump, this time with 90.74% certainty.

Conclusion

There are plenty of caveats to all this.

  • Our data are imbalanced, which perhaps makes it more likely that our models will predict Trump even for Melania inputs. (Time for a confusion matrix, but this is getting long enough already.)
  • Depending on the randomization of the training set and the hyper-parameters chosen for the classification model, I can occasionally get back predictions that Melania, not Trump, is the author of the tweet.
  • We’re trying to ascertain authorship based on a tiny sample of words (42 or so).

In short, none of this would hold up in a court of law. This is meant as fun for data geeks.

All we can say for sure is that a lot of people’s intuition (mine included) said that the tweet didn’t sound like a Melania tweet, and that one reasonably accurate model produces a data point in favor of these intuitions.
