Who’s Tweeting from the Oval Office?

Greg Rafferty
Towards Data Science
18 min read · Feb 17, 2018


Did Trump type out that tweet? Or was it an aide in Trump clothing?

Update: I’ve written a follow-up post to this one which details how I deployed my model by building a Twitter bot.

I’ve built a Twitter bot @whosintheoval which retweets each of Donald Trump’s tweets and offers a prediction for whether the tweet was written by Trump himself or by one of his aides. Be sure to follow the bot on Twitter and read on to learn how I built the model!

I’m Greg Rafferty, a data scientist in the Bay Area. You can check out the code for this project on my github. Feel free to contact me with any questions or feedback!

Motivation

On December 1st, 2017, Michael Flynn pleaded guilty to lying to the FBI. The next day, Trump’s personal Twitter account tweeted:

This was quite controversial because on February 14th of that year, the day after Flynn resigned, Trump had asked James Comey, then the director of the FBI, to back off any investigations of Flynn. If Trump knew at the time of his request to Comey that Flynn had indeed lied to the FBI, then Trump’s tweet could be seen as evidence that Trump attempted to obstruct justice. After several legal experts argued this point, Trump defended himself by claiming that his lawyer John Dowd wrote and posted the tweet. But did he really?

This post is split into four sections:

  • Background
  • Feature selection
  • Models
  • Results

The middle two sections (especially the Models section) get a bit technical, so if that doesn’t interest you and you’d just like to skip ahead to the results and see who actually posted the Flynn tweet, feel free to do so!

Background

Forensic text analysis was an early application of machine learning and has been used in cases ranging from identifying the Unabomber, to revealing J.K. Rowling as the true identity of the author Robert Galbraith, to determining the specific authors of each of the Federalist Papers. This project is an effort to use machine learning and these same techniques to identify tweets on @realDonaldTrump as written by Trump himself or by his staff while using his account. This task, however, is unique and particularly challenging due to the short nature of a tweet — there just isn’t much signal to pick up in such a short text. In the end, though, I did succeed, with almost 99% accuracy. Go ahead and follow my Twitter bot @whosintheoval to watch it post predictions in real time whenever Trump tweets.

The Data

Prior to March 26, 2017, Trump was tweeting using a Samsung Galaxy device while his staff were tweeting using an iPhone. From this information provided in the metadata of each tweet, we know whether it was Trump himself or his staff tweeting (see these links for some articles discussing this assumption). After March however, Trump switched to using an iPhone as well, so identification of the tweeter cannot come from the metadata alone and must be deduced from the content of the tweet.

I used Brendan Brown’s Trump Tweet Data Archive to collect all tweets from the beginning of Trump’s account in mid-2009 up until the end of 2017. This set consists of nearly 33,000 tweets. Even though I know from whose device a tweet originated, there is still some ambiguity around authorship: Trump is known to dictate tweets to assistants, so a tweet may have Trump’s characteristics but be posted from a non-Trump device, and he is also known (especially during the campaign) to have written tweets collaboratively with aides, making true authorship unclear.

From the beginning of Trump’s Twitter account, on May 4th, 2009, until he stopped using an Android device in early 2017, there are over 30,000 tweets of which I know (or at least have a good guess about) the author (crucially, the Flynn tweet doesn’t fall into this date range so I had my models make their best guess as to the true tweeter — more on this in the results section later in this article). These 30,000 tweets are fairly evenly split between Android / non-Android (47% / 53%) so class imbalance wasn’t an issue. This was my training data. Using several different techniques, I created almost 900 different features from this data which my models could use to predict the author.

Choosing Features

So many important decisions!

I looked at six broad categories of features to build my model:

  • Trump quirks
  • Style
  • Sentiment
  • Emotion
  • Word choice
  • Grammatical structure

Trump quirks

Data science can sometimes be more art than science. To start off my model, I first thought about how I as a human would identify a tweet as Trumpian. I then did my best to translate these “feelings” into rule-based code. Some obvious quirks, for example, that can identify if Trump himself is behind the keyboard are an abuse of ALL CAPITAL LETTERS in his tweets, Randomly Capitalizing Specific Words, or gratuitous! use of exclamation points!!!!!

In fact, one of the most influential features in my model was what I came to refer to as the quoted retweet. Trump, it seems, does not know how to retweet someone on Twitter. In the entire corpus of 33,000 tweets, there is only one single proper retweet that comes from an Android device. Instead, Trump copies someone else’s tweet, @mentions the user and surrounds the tweet in quotation marks, then posts it himself:

These are often, but not always, self-congratulatory tweets like this one, which is why, as you’ll see in my next post discussing results, Donald Trump tends to @mention himself a lot.
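
Here’s a minimal sketch of how quirk features like these might be extracted in code. The rules, thresholds, and feature names below are illustrative assumptions of mine, not the project’s exact implementation:

import re

def quirk_features(text):
    # Illustrative rule-based "Trump quirk" features; names and thresholds are my own choices
    words = text.split()
    return {
        # FULLY CAPITALIZED words (ignoring short tokens like "US" or "I")
        "all_caps_words": sum(1 for w in words if len(w) > 2 and w.isupper()),
        # Words Capitalized mid-sentence
        "capitalized_words": sum(1 for w in words[1:] if w[:1].isupper() and not w.isupper()),
        # gratuitous exclamation points
        "exclamation_points": text.count("!"),
        # the "quoted retweet": a tweet that opens with a quotation mark and an @mention
        "quoted_retweet": int(bool(re.match(r'^["“]@\w+', text.strip()))),
    }

print(quirk_features('"@example: Donald Trump is doing a GREAT job!"  Thank you!'))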

Style

Stylistic features are those which aren’t specific to Trump’s own personal style, but instead could be used to identify any Twitter user. These types of features include the average length of a tweet, of sentences, and of words. I also looked at how many times various punctuation marks are used (Trump hardly ever uses a semi-colon; his aides do quite a bit more often). The number of @mentions in a tweet, the number of #hashtags, and the number of URLs all turned out to be strongly predictive features. Finally, the day of the week and the time of the day in which the tweet was posted were quite revealing.
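
A rough sketch of these stylistic features follows; the exact set and the names here are assumptions on my part, not the model’s actual feature list:

import re
from datetime import datetime

def style_features(text, created_at):
    # Illustrative stylistic features: lengths, punctuation counts, and posting time
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    return {
        "tweet_length": len(text),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "semicolons": text.count(";"),
        "mentions": len(re.findall(r"@\w+", text)),
        "hashtags": len(re.findall(r"#\w+", text)),
        "urls": len(re.findall(r"https?://\S+", text)),
        "hour_posted": created_at.hour,
        "day_of_week": created_at.weekday(),
    }

print(style_features("Join me live; details here: https://example.com #MAGA", datetime(2017, 3, 1, 14, 30)))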

Sentiment

I used C.J. Hutto’s VADER package to extract the sentiment of each tweet. VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner (because, I suppose, VADSR sounded silly?), is a lexicon and rule-based tool that is specifically tuned to social media. Given a string of text, it outputs a decimal between 0 and 1 for each of negativity, positivity, and neutrality, as well as a compound score from -1 to 1 which is an aggregate measure.

A complete description of the development, validation, and evaluation of the VADER package can be read in this paper, but the gist is that the package’s authors first constructed a list of lexical features (or, “words and phrases” in simple English) correlated with sentiment and then combined the list with some rules that describe how the grammatical structure of a phrase will intensify or diminish the sentiment. When tested against individual human raters, VADER outperforms them, with an accuracy of 96% compared to the humans’ 84%.
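
Using VADER takes only a few lines; here’s a minimal example with the standalone vaderSentiment package (the same analyzer also ships with NLTK):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# Returns a dict with 'neg', 'neu', and 'pos' scores (each between 0 and 1)
# plus an aggregate 'compound' score between -1 and 1
print(analyzer.polarity_scores("Sorry losers and haters, but my I.Q. is one of the highest!"))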

Emotion

The National Research Council of Canada created a lexicon of over 14,000 words, each scored as either associated or not associated with each of two sentiments (negative, positive) and eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust). They kindly provided me access to the lexicon, and I wrote up a Python script which looped over each word in a tweet, looked it up in the lexicon, and output whichever emotions the word was associated with. Each tweet was then assigned a score for each emotion corresponding to how many words associated with that emotion it contained.
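
In code, the scoring step looks roughly like this. I’m assuming the lexicon has already been parsed into a dictionary mapping each word to its set of associated emotions; the toy dictionary in the example is mine, not the NRC’s actual entries:

from collections import Counter

EMOTIONS = ("anger", "anticipation", "disgust", "fear", "joy",
            "sadness", "surprise", "trust", "negative", "positive")

def emotion_scores(text, emolex):
    # `emolex` maps a word to the set of emotions/sentiments it is associated with
    counts = Counter()
    for word in text.lower().split():
        counts.update(emolex.get(word.strip('.,!?":;'), set()))
    return {emotion: counts[emotion] for emotion in EMOTIONS}

print(emotion_scores("What a disgrace!", {"disgrace": {"anger", "disgust", "negative"}}))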

Word choice

To analyze word choice, I used a technique called tf–idf, which stands for Term Frequency — Inverse Document Frequency. It’s basically a measure of how descriptive and unique a word is to a document. Let’s say you want to group some news articles together so you can recommend similar articles to a reader. You set your computer up to read each article and one of them features the word “baseball” 10 times. That must be a pretty significant word in the article! That’s the Term Frequency part.

But now, that same article also has the word “said” 8 times. That seems to also be a pretty significant word. But we humans know otherwise; we know that if several articles mention “baseball,” they’re probably about the same topic, but if several articles mention “said,” that doesn’t tell us much about the articles’ similarity. So we then look at all of the articles in the collection and count how many of them have the words “baseball” and “said.” Out of, say, 1000 articles, only 30 have the word “baseball” but 870 have the word “said.” So we take the inverse of that count — 1/30 and 1/870 — and multiply that by the Term Frequency — 10 and 8. This is the Inverse Document Frequency part. So the word “baseball” gets a score of 10/30 = 0.333 and the word “said” gets a score of 8/870 = .009. We do this for every word in every document and, in a nutshell, look which articles have the same high-value words. This is tf–idf.

In order to reduce the computing needs of my model, I only looked at unigrams (single words) instead of bigrams and trigrams (tf–idf handles these small phrases exactly the same way it would handle a single word). Each n-gram requires exponentially more processing time and I figured that “Crooked Hillary” or “Lyin’ Ted Cruz” would still be picked up by the words “crooked” and “lyin’” on their own. I also ignored words that came up in over 99% of the tweets (known as corpus-specific stop words) and fewer than 1% of the tweets. I heavily used Python’s scikit-learn package throughout this project and that includes their tf–idf implementation.
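
Here’s roughly what that setup looks like with scikit-learn’s TfidfVectorizer; the toy corpus and the exact parameter values are placeholders, not the project’s actual settings:

from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams only; drop corpus-specific stop words (terms in more than 99% of tweets)
# as well as very rare terms (fewer than 1% of tweets)
vectorizer = TfidfVectorizer(ngram_range=(1, 1), max_df=0.99, min_df=0.01)

tweets = ["Crooked Hillary said nothing", "Lyin' Ted Cruz said plenty", "MAKE AMERICA GREAT AGAIN!"]  # toy corpus
X_tfidf = vectorizer.fit_transform(tweets)
print(X_tfidf.shape)  # (number of tweets, number of retained unigrams)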

Grammatical structure

One of the main challenges of using natural language processing on current events is that the events change over time. While the phrases “Crooked Hillary” and “Lyin’ Ted Cruz” came up a lot during Trump’s presidential campaign, they’re all but absent in current tweets. I wanted to capture a more basic form of Trump’s tweets, so I converted each tweet to a part-of-speech representation using the Natural Language Toolkit.

This essentially converts each word into its part of speech, staying aware of the word’s role in the sentence so as to differentiate the noun “insult” in the sentence “‘Crooked Hillary’ is used as an insult when Trump refers to his political opponent” from the same word used as a verb in the sentence “You insult the political process by reducing it to childish name-calling.”

This changes the phrase “I had to fire General Flynn because he lied to the Vice President and the FBI” to its more basic part-of-speech form as “PRP VBD TO VB NNP NNP IN PRP VBD TO DT NNP NNP CC DT NNP”, using the Penn part of speech tags (PRP = personal pronoun, VBD = verb, past tense, TO = to, VB = verb, base form, NNP = singular proper noun, etc). Using the same tf–idf process as before, but this time ignoring unigrams and focusing instead on bigrams and trigrams, I could extract a more general way either Trump or his aides tweet.
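
With NLTK, that conversion is just a tokenize-and-tag step (resource names can vary a bit between NLTK versions):

import nltk
nltk.download("punkt", quiet=True)                       # tokenizer models
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger models

sentence = "I had to fire General Flynn because he lied to the Vice President and the FBI"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(" ".join(tag for _, tag in tagged))
# roughly: PRP VBD TO VB NNP NNP IN PRP VBD TO DT NNP NNP CC DT NNP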

Lastly, I used the Stanford Named Entity Recognition (NER) Tagger to replace all names with “PERSON”, all locations with “LOCATION”, and all organizations with “ORGANIZATION.” This was yet another attempt to generalize the tweets away from specifics that might change over time. This NER step was by far the most computationally expensive part of processing these tweets, and if I were to do this project again, I would seriously consider a less state-of-the-art NER tagger that does not rely on an advanced statistical learning algorithm, which would speed up processing time significantly. You’ve been warned!
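
For reference, here’s roughly how the Stanford tagger can be called through NLTK’s wrapper; the model and jar paths are placeholders for wherever you’ve downloaded the Stanford NER distribution, and it needs Java installed:

from nltk import word_tokenize
from nltk.tag import StanfordNERTagger

# Placeholder paths to a local copy of the Stanford NER model and jar
st = StanfordNERTagger("english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar")

def generalize_entities(text):
    # Replace each token the tagger recognizes with its entity type; keep everything else
    return " ".join(tag if tag != "O" else word for word, tag in st.tag(word_tokenize(text)))

print(generalize_entities("Donald Trump spoke with James Comey at the FBI in Washington"))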

How Did the Models Do?

They did well, very well

To begin, as is standard in the field, I split my data into an 80% training set and 20% testing set. I set aside the testing set until I was satisfied that all of my models were as accurate as possible, and then sent the testing set through them to get the performance measures I’ll be reporting here.

Feature importances

One of the more important tasks was to sort my features in order of their influence on the outcomes of the models. To do this, I used scikit-learn’s RidgeClassifier. Ridge regression is linear regression with an L2 regularization penalty controlled by a factor, alpha (the RidgeClassifier simply applies it to classification by treating the class labels as regression targets). At alpha = 0, ridge regression is the same as unregularized regression; as alpha grows, the coefficients are shrunk toward zero, with the least-influential features driven to (effectively) zero first, removing them from the model; at higher alpha levels, many more features are removed. I iterated over increasing alpha levels, dropping out features one by one, until none remained.

As you can see in the plot above, at an alpha level just above 10²², the first (least influential) feature drops out. In the range of 10²⁵, feature dropout rapidly increases, leaving only the most influential features left at alpha levels above 10²⁶.
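
Here’s a sketch of that dropout-order idea. Because ridge shrinks coefficients toward zero rather than setting them exactly to zero, a small tolerance stands in for “dropped”; the alpha grid, the tolerance, and the stand-in data are my assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

def dropout_order(X, y, alphas=np.logspace(0, 26, 53), tol=1e-6):
    # Sweep alpha upward and record the alpha at which each feature's coefficient
    # first becomes negligible; features that survive the longest are the most influential
    dropped_at = {}
    for alpha in alphas:
        coefs = RidgeClassifier(alpha=alpha).fit(X, y).coef_.ravel()
        for i, coef in enumerate(coefs):
            if abs(coef) < tol and i not in dropped_at:
                dropped_at[i] = alpha
    return sorted(dropped_at, key=dropped_at.get, reverse=True)  # most influential first

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # stand-in for the real features
print(dropout_order(X, y)[:5])  # indices of the five most influential stand-in features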

The individual models

In total, I built 9 models: Gaussian Naive Bayes, Multinomial Naive Bayes, K Nearest Neighbors, Logistic Regression, Support Vector Classifier, Support Vector Machine and the ensemble methods of AdaBoost, Gradient Boosting, and Random Forest. Each model was carefully tuned using 10-fold cross validation on the training data alone, and evaluated on the test data.

Cross validation is an effective technique for training these models without biasing them too much towards the specific data they’re being trained on; in other words, allowing them to generalize to unseen data much better. In 10-fold cross validation, the data is split into 10 equally sized groups, groups 1–10. In the first training iteration, the model is trained on groups 1–9 and tested on group 10. The process repeats, but this time it is training on groups 1–8 and 10, and tested on group 9. This training step is repeated 10 times in total, so each group is withheld from the training set one time and used as an unseen test set. Finally, the combination of model parameters which had the best average performance across all 10 folds is the set of parameters to use in the final model.
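
In scikit-learn, that tuning loop amounts to a grid search with cv=10. Here’s a sketch for one of the models; the stand-in data and the parameter grid are placeholders, not the project’s actual settings:

from sklearn.datasets import make_classification   # stand-in for the real ~900-feature matrix
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 10-fold cross-validated grid search on the training data only; the grid itself is illustrative
grid = GridSearchCV(GradientBoostingClassifier(),
                    param_grid={"n_estimators": [100, 300], "max_depth": [2, 3]},
                    cv=10, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))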

The algorithms behind these models are all very fascinating; they each have their own strengths and weaknesses, with different balances along the bias-variance tradeoff and sometimes vastly different processing times (training naive Bayes, for instance, took fractions of a second, whereas the support vector classifier and the gradient boosting methods each took an entire weekend to perform a grid search). If you’re interested in learning more, I would start with the Wikipedia entries for these models.

Furthermore, using those feature importances generated above, I trained each model on a subset of the total of almost 900 features. Naive Bayes, for instance, performed best with only the top 5 features, whereas both boosting models were happiest when crunching through the top 300. This is partly due to the curse of dimensionality: the fact that in higher-dimensional space, two points which seem to be near each other (when imagined in our 3-dimensional minds) can actually be very, very far apart indeed. In particular, the k-nearest neighbors model (knn) is highly sensitive to too many dimensions, so I also applied principal component analysis (PCA) to the data fed into this model.

PCA is a technique which can both reduce dimensionality and eliminate any collinearity between the features. If you can imagine a set of vectors in higher-dimensional space, PCA will twist and massage these vectors so that each and every one of them is perpendicular to all of the others. If these vectors represent features, then by forcing them all to be orthogonal, we’ve also ensured that no collinearity exists between them. This will vastly improve the predictions of a model such as knn, and can allow us to reduce the number of features sent to the model without reducing the amount of information. In short, this enabled me to get much better performance out of my knn model.
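
In practice, that combination can be wrapped up as a pipeline; here’s a sketch (keeping 95% of the variance and using 5 neighbors are illustrative choices, not necessarily the final model’s settings):

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize, project onto orthogonal principal components, then classify with knn
knn_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),            # keep enough components to explain 95% of the variance
    KNeighborsClassifier(n_neighbors=5),
)
# Usage: knn_model.fit(X_train, y_train); knn_model.score(X_test, y_test)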

The ensemble

Lastly, I created two different ensembles of these models. The first one was a simple majority vote: with an odd number of models and a binary output, there can never be a tie, so I simply added up all of the predictions for Trump and all of the predictions for an aide, and offered whichever was greater as my final prediction. My second ensemble was a bit more sophisticated: I took the results of those first nine models and fed them into a new decision tree. This final model had near-perfect accuracy on my test set.
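
Here’s a sketch of both ensembles, assuming base_models is the list of nine fitted classifiers (the meta-tree’s depth is an arbitrary choice on my part):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def majority_vote(base_models, X):
    # Each model votes 0 (aide) or 1 (Trump); the majority wins
    votes = np.array([model.predict(X) for model in base_models])  # shape: (9, n_tweets)
    return (votes.sum(axis=0) > len(base_models) / 2).astype(int)

def fit_stacked_tree(base_models, X_train, y_train):
    # Use the base models' predictions as input features for a second-level decision tree
    meta_features = np.column_stack([model.predict(X_train) for model in base_models])
    return DecisionTreeClassifier(max_depth=3).fit(meta_features, y_train)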

And now, finally, the results…

Results

As you can see, the gradient boosting model and random forest performed best, with an error rate of only 1 out of 20.

The other models performed less well individually, but contributed a great deal to the final ensemble. The decision tree that I built from the results of the first set of 9 models had an accuracy score of over 99%!

If you’re unclear what all those measures are, here’s a brief explanation. Accuracy is the most intuitive of these measures: it is simply the number of guesses that were correct divided by the total number of guesses, i.e., out of all my guesses, how many were correct? Precision answers the question, out of all tweets I guessed to be Trump, how many actually were Trump? Recall is almost-but-not-quite the opposite of precision; it answers the question, out of all tweets that actually were written by Trump, how many did I get right? F1 score is a blend of precision and recall, technically the harmonic mean (a type of average) of the two. It is not nearly as intuitive to understand as accuracy, but when the class imbalance is large, F1 score is a much better measure than accuracy. In the case of this tweet data, though, my classes were very well balanced, which is why all of the measures are more-or-less equal in the above chart. If this is at all confusing to you, or you’d just like to learn more, here is an excellent blog post about these measures.
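
If you’d rather see the measures computed than defined, scikit-learn has one function per measure; the labels and predictions below are just a toy example:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0]  # 1 = Trump, 0 = aide (toy labels)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # toy predictions
print("accuracy: ", accuracy_score(y_true, y_pred))   # correct guesses / all guesses
print("precision:", precision_score(y_true, y_pred))  # of predicted-Trump tweets, how many were Trump
print("recall:   ", recall_score(y_true, y_pred))     # of actual Trump tweets, how many were caught
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall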

So what characterizes a Trump tweet?

  • Quoted retweet
  • @mentions
  • Between 10pm and 10am
  • Surprise, anger, negativity, disgust, joy, sadness, fear
  • Exclamation points
  • Fully capitalized words
  • @realDonaldTrump

As I expected, the quoted retweet I described in the feature selection section of this post was highly predictive of a Trump tweet. So were @mentions of other users. Trump often tweets during the night and early morning, and on weekends. He displays surprise, anger, negativity, disgust, … in fact all of the emotions, not just the negative ones emphasized so much in the press. He does indeed use exclamation points and fully capitalized words more than is grammatically necessary. And lastly, he mentions himself an awful lot.

His aides, on the other hand, post tweets characterized by:

  • True retweets
  • The word “via”
  • Between 10am and 4pm
  • Semicolons
  • Periods
  • URLs
  • @BarackObama

If a tweet is a proper retweet, you can bet confidently it was posted by an aide. Interestingly, the word “via” came up a lot in aides’ tweets — they often would quote an article or image and attribute it with that word. Predictably, they tweet during the workday and not very often outside of it. Their grammar is more sophisticated, with better sentence structure and punctuation, and they post URLs to other sources very frequently. Interestingly, if Barack Obama’s Twitter username is mentioned in a tweet, it’s usually an aide. Trump would mention him by name, but not by @mention.

With regards to the parts-of-speech tags, Trump’s most frequent combination is NN PRP VBP, or a noun, personal pronoun, and verb. These tweets frequently take the form of an @mention followed by “I thank…” or “I have…” Aides often write NNP NNP NNP, three proper nouns in a row, which is often the name of an organization. They also use #hashtags following text whereas Trump uses #hashtags following an @mention.

I was a bit disappointed that the parts-of-speech tags weren’t more significant to the model. I knew that the specific vocabulary in a tweet would change over time and so I wanted to capture more grammatical structure which I reasoned would be more constant. However, the main challenge of this project is the short nature of a tweet and this did greatly reduce the amount of grammatical signal my models could pick up. What this means for my model is that although it has an almost perfect accuracy rate on historical tweets, that accuracy drops off a bit on current tweets.

Additionally, three features which were highly predictive on historical tweets were tweet length, number of times favorited, and number of times retweeted. However, I had to drop all three of these features and retrain my model for deployment on real-time tweets. For the latter two, favorite count and retweet count, the reason is obvious: I’m trying to predict the author immediately after the tweet is posted, before it has been favorited or retweeted. Tweet length, however, was dropped for a different reason. In all 33,000 tweets in my training data, Twitter had limited the character count to 140, but only recently has Twitter increased this count to 280. This means all of the training on this feature had to be thrown away.

A little game

So with those characteristics in mind, let’s play a little game. I’ll offer a tweet and I invite you to guess the author.

Is it Trump or one of his aides?

Don’t scroll down too far, because the answer will be right below! Here’s the first one; who wrote this, Trump or an aide?

This is a bit easy. What do you see? There’s that word “via,” highly indicative of an aide tweet. It includes a link, again another telltale sign of an aide. It’s posted in the middle of the day (I scraped this tweet from California, so the timestamp is 3 hours behind Washington DC), and it’s very formal and unemotional: all signs of an aide.

And yes, you’re correct, that was posted by an aide! OK, here’s another one:

Is that Trump or an aide? Again, let’s go over it together. This tweet contains more emotion than the other, that’s usually a Trump sign. There’s that exclamation point: another Trumpian touch. Remember to add 3 hours to the timestamp; that puts it at 7:30pm, after the workday has ended. With that in mind, we can confidently guess that this was written by…

Trump! Yep, correct again!

The Flynn Tweet

So, this is the big one, the tweet that started this whole project:

Now, this tweet came after March 26, 2017, which if you remember from earlier is the date after which there are no labels to identify the true tweeter. All we’ve got to go on is my model. In truth, this is a difficult tweet to guess. It contains the words “lied,” “guilty,” “shame,” and “hide.” Those are all very emotionally charged words — possibly indicating Trump as the author. But it’s also somewhat formal; the grammar is well composed and it contains some longer-than-average words: those are signs of an aide. It was tweeted around midday, also suggesting an aide. But it’s very personal, suggesting Trump. So what did the models say? Here’s the raw output:

rf [ 0.23884372  0.76115628]
ab [ 0.49269671 0.50730329]
gb [ 0.1271846 0.8728154]
knn [ 0.71428571 0.28571429]
nb [ 0.11928973 0.88071027]
gnb [ 0.9265792 0.0734208]
lr [ 0.35540594 0.64459406]
rf [1]
ab [1]
gb [1]
knn [0]
nb [1]
gnb [0]
svc [1]
svm [0]
lr [1]
([1], [ 0.15384615, 0.84615385])

That “rf” at the top, that’s the random forest. It predicted a 1, or Trump, with 76% probability (the first seven rows show probabilities of first an aide and then Trump; the next nine rows show the prediction: 0 for aide, 1 for Trump). “ab” is AdaBoost, which also predicted Trump, but with only 51% to 49% probability — not very confident at all. The gradient boosting model was more confident, 87% likelihood it was Trump. KNN however disagreed: 71% probability the tweet was written by an aide. The multinomial naive Bayes predicted Trump, but the Gaussian naive Bayes predicted an aide. There was also disagreement in the two support vector machine models: SVC predicted Trump and SVM predicted an aide (due to the way these models are created, they cannot output a probability estimation, which is why they’re absent in the top half of the results). Logistic regression was a bit on the fence with 64% probability of Trump and 36% probability of an aide. That’s 6 models for Trump, 3 for an aide.

In reality, after spending weeks reading over and analyzing thousands of Trump tweets, I think this tweet is one of the best examples of a collaboratively written tweet. Topically and emotionally, it’s 100% Trumpian. But stylistically and grammatically, it appears to have come from an aide. In my opinion, Trump probably worked together with Dowd to craft the tweet: Trump told Dowd what he wanted to say and how he wanted to say it, and Dowd composed the actual tweet. That’s my best guess.

This just goes to show that these models aren’t perfect (there’s a lot of disagreement among them) and that a tweet contains very little information for machine learning to train on. My final ensemble, the decision tree, which was over 99% accurate on my testing set, did offer a final prediction of Trump, with 85% probability (that’s the last line in the output above). So that’s what we’ll go with: Trump. Not John Dowd, his lawyer. As for the claim that Dowd wrote the tweet and not Trump, we can only assume that it’s:
