The Face Behind the Handle: Using neural networks to distinguish Donald Trump’s tweeting habits
Nearly every monumental declaration has had a face to it. In our age, we have a Twitter handle.
Throughout history, Presidents have built platforms for themselves to announce that major changes were coming. Franklin D. Roosevelt used his Fireside Chats to address the nation as the US entered World War II, Ronald Reagan demanded the fall of the Berlin Wall in his iconic “Tear down this wall” speech, and through a televised broadcast, John F. Kennedy announced that humanity’s dream of going to the moon would become reality. These moments hallmark world history, showing that an iconic message from an iconic person never truly fades.
However, times have changed:
While yes, this is a monumental event for global relations, it is also President Trump’s personal Twitter — his primary means of communication to the world.
From recognizing Israel’s sovereignty in Golan Heights to announcing $200 billion in new tariffs against China, these actions on Twitter have proven time and time again to have major impacts worldwide. In fact, J.P. Morgan even created a ‘Volfefe’ index to track volatility in US bond markets caused by tweets from @realDonaldTrump.
In the iconic televised events like the one where JFK brilliantly declares that yes, we ARE going to the moon, you can clearly see a high school AP Lang teacher’s checklist: What literary devices does he use? How does he present himself to the crowd? What is his ethos? (I apologize if I gave you a flashback)
However, there is one question that is almost never talked about but is all too vital: WHO IS SPEAKING?
Just because it comes from @realDonaldTrump does NOT mean that it is from (yes I’m doing this) the real Donald Trump. Don’t you think it would be important to know who is actually behind the wheel of the world’s most powerful Twitter account? Well, fortunately, there’s a theory.
A popular (conspiracy?) theory on the source of Donald Trump’s tweets is that his more diplomatic tweets are composed by staffers while he personally pens more divisive content. In order to identify who really composes his tweets, we trained several neural networks based on the existing theory.
Before 25 March 2017, tweets from @realDonaldTrump were posted from both an iPhone and an Android device. A popular theory held that Trump’s staffers posted tweets from the iPhone while Trump himself tweeted from the Android. Although Trump switched to an iPhone in 2017, we decided to test this theory and see if we could determine the source of each tweet based on its text.
Here’s a decent example of two differently labeled tweets:
Quite the contrast isn’t it?
What we did
In order to classify tweets as staff-tweeted or Trump-tweeted, we used a catalogue of tweets archived by the Trump Twitter Archive and separated them into two sets, pre- and post- March 2017. Within the pre-March 2017 set, we filtered out tweets that were posted from other sources like Twitter for Web and used the Android and iPhone tweets to build our ground truth dataset.
Here we have a plot of tweet frequency by device type. In general, tweets from the Android device (Trump himself) are less frequent than those from the iPhone (his staff members).
Inspecting the distribution plot, we note a large peak around 140 characters from both devices, more pronounced for the Android. This can be attributed to the character limit on tweets: it had been 140 characters since Twitter launched in 2006, largely influenced by the 160-character limit for SMS messages. In late 2017, however, the limit was doubled to 280 characters, which is evident in the rightmost peak for iPhone tweets in the distribution plot. These two plots speak to the apparent bias in our dataset, which is heavily skewed towards iPhones, with almost three times as many data points as well as a heavier distribution past the old 140-character limit.
Preparing Data
As with any neural network, some preprocessing is in order:
Tokenizing & Formatting
Tokenization maps each word to an index in a dictionary. To prepare the data to be passed into the neural network, the tweets are first tokenized using Keras’s Tokenizer module. Once the training and test sets were tokenized, we had to decide on a way to keep the input size consistent. Truncating tweets to a fixed length would lose data, so we chose to pad them instead. We padded the input to a length of 65 words, since the longest tweet we found was 50 words and we wanted a comfortable margin for future tweets.
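A minimal sketch of this step, using Keras’s Tokenizer and pad_sequences (the two example tweets and the padding side are illustrative assumptions):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 65  # pad length chosen above

tweets = [
    "MAKE AMERICA GREAT AGAIN!",
    "Join me live for a discussion on tax reform.",
]

tokenizer = Tokenizer()              # maps each word to an integer index
tokenizer.fit_on_texts(tweets)
sequences = tokenizer.texts_to_sequences(tweets)

# Pad every sequence with zeros up to 65 tokens so the input size is consistent
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")
```

Every row of `padded` now has exactly 65 entries, with index 0 reserved for padding.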
Embedding
To construct our word embeddings, we used a pretrained GloVe (Global Vectors for Word Representation) model we found on Kaggle. GloVe represents each word as a dense vector, so the relationship between any two words can be scored, for example by cosine similarity on a scale of -1 to 1. This is a very common method for generating the weights of a word embedding layer. First, we created a dictionary of the words that appeared within our dataset, then created an embedding matrix by parsing those words and their embedding values out of the GloVe text file.
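The parsing step might look like the following sketch; `load_glove_index` and `build_embedding_matrix` are hypothetical helper names, and the GloVe file format ("word v1 v2 …", one word per line) is the standard one:

```python
import numpy as np

def load_glove_index(lines):
    """Parse GloVe-format lines ("word v1 v2 ...") into a dict of word vectors."""
    index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        index[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return index

def build_embedding_matrix(word_index, glove_index, dim):
    """Row i holds the GloVe vector for the word the tokenizer assigned index i.
    Words missing from GloVe (and index 0, reserved for padding) stay all-zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        vec = glove_index.get(word)
        if vec is not None:
            matrix[i] = vec
    return matrix
```

In practice the lines would come from the downloaded GloVe text file and `word_index` from the fitted Keras tokenizer.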
Balancing
As seen previously, our data is highly skewed: there are significantly more tweets from iPhones than from Androids. Training on this skewed dataset risks a biased model that over-predicts iPhone and is still correct most of the time. We tackled this problem by training our models on balanced batches, using undersampling: drawing a smaller random sample of iPhone tweets so each batch contains an equal number of examples from both classes. Being exposed to balanced training data helps our models make more reliable predictions.
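One way to sketch the undersampling step (the `undersample` helper is illustrative, not our exact code):

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly drop majority-class rows so every class keeps only as many
    examples as the smallest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid blocks of a single class
    return X[keep], y[keep]
```

Applied to our data, this trims the iPhone tweets down to the number of Android tweets before batching.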
Train/Test Split
After all the data has been scraped into the JSON file, the text and source of each tweet are passed into sklearn’s train_test_split in order to generate a training and validation set.
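A self-contained sketch of the split; the toy arrays, the 80/20 split, and the stratification are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: rows of `padded` are padded token sequences,
# `labels` mark the device (0 = iPhone/staff, 1 = Android/Trump)
padded = np.arange(100 * 65).reshape(100, 65)
labels = np.array([0] * 50 + [1] * 50)

X_train, X_val, y_train, y_val = train_test_split(
    padded, labels, test_size=0.2, random_state=42, stratify=labels)
```

Stratifying keeps the class ratio identical in both splits, which matters for a skewed dataset like ours.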
Modeling
We selected five different architectures to train our dataset on. We first looked at a basic feed forward network to set as a baseline, then looked at four other models that are popular in NLP. This wasn’t done to fine-tune hyperparameters, but rather to explore various popular architectures which could later be expanded into more elaborate and precise models.
The first layer of each network is the embedding layer, which takes the padded 65-word sequences as input and maps each token to its 100-dimensional GloVe vector. The layer’s weights are pre-instantiated from the GloVe text file and frozen during training; using a pre-made word embeddings file like this is a simple application of transfer learning. This first layer is standard in each model regardless of the following layers.
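A sketch of that shared first layer, assuming the embedding matrix has already been built (the random matrix here is a stand-in for the real GloVe one, and the vocabulary size is illustrative):

```python
import numpy as np
from tensorflow.keras.layers import Embedding

MAX_LEN = 65
embedding_matrix = np.random.rand(24413, 100).astype("float32")  # stand-in for the GloVe matrix

# trainable=False freezes the pretrained vectors: only the layers above them learn
embedding = Embedding(input_dim=embedding_matrix.shape[0],
                      output_dim=embedding_matrix.shape[1],
                      trainable=False)
embedding.build((None, MAX_LEN))
embedding.set_weights([embedding_matrix])

out = embedding(np.array([[1, 2, 3]]))  # each token index becomes a 100-dim vector
```

This matches the model summaries below, where the embedding’s 2,441,300 parameters are all non-trainable.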
When training each model, we set our test set as the validation data using the built-in validation_data parameter in Keras. We measure the validation accuracy each epoch and train until it stops improving for 50 consecutive epochs.
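This stopping rule can be expressed with Keras’s built-in EarlyStopping callback (our loop did the equivalent by hand; the epoch count in the commented fit call is an assumption):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Halt training once validation accuracy fails to improve for 50 epochs,
# and roll back to the best-performing weights
early_stop = EarlyStopping(monitor="val_accuracy", mode="max",
                           patience=50, restore_best_weights=True)

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=500, callbacks=[early_stop])
```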
If you want to follow along, check out our GitHub page!
Feed Forward
We tried a feed forward model first because we wanted to see how a basic model would perform compared to some more complex ones.
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 65, 100) 2441300
_________________________________________________________________
flatten_1 (Flatten) (None, 6500) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 6501
=================================================================
Total params: 2,447,801
Trainable params: 6,501
Non-trainable params: 2,441,300
Test Score: 0.4719
Test Accuracy: 0.7909
1D Convolutional Neural Network (CNN)
A 1D CNN scans over a sequence of tokenized words. The filter length determines how many words are looked at in a single convolution.
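The 64,128 parameters of the conv layer in the summary that follows imply a filter length of 5 (5 × 100 × 128 + 128), so each convolution looks at five consecutive words and a 65-word tweet yields 65 − 5 + 1 = 61 output positions:

```python
import numpy as np
from tensorflow.keras.layers import Conv1D

conv = Conv1D(filters=128, kernel_size=5, activation="relu")
embedded = np.zeros((1, 65, 100), dtype="float32")  # one padded tweet of GloVe vectors
out = conv(embedded)                                # 61 positions x 128 filters
```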
Model: "sequential_8"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_8 (Embedding) (None, 65, 100) 2441300
_________________________________________________________________
conv1d_4 (Conv1D) (None, 61, 128) 64128
_________________________________________________________________
global_max_pooling1d_4 (Glob (None, 128) 0
_________________________________________________________________
dense_15 (Dense) (None, 128) 16512
_________________________________________________________________
activation_6 (Activation) (None, 128) 0
_________________________________________________________________
dense_16 (Dense) (None, 1) 129
=================================================================
Total params: 2,522,069
Trainable params: 80,769
Non-trainable params: 2,441,300
_________________________________________________________________
Test Score: 0.8156
Test Accuracy: 0.7650
Long Short Term Memory (LSTM)
An LSTM is a recurrent neural network that is good at finding patterns in sequences of words in both a smaller (short-term) and a larger (long-term) context.
Model: "sequential_10"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_10 (Embedding) (None, 65, 100) 2441300
_________________________________________________________________
lstm_3 (LSTM) (None, 128) 117248
_________________________________________________________________
dense_20 (Dense) (None, 128) 16512
_________________________________________________________________
activation_8 (Activation) (None, 128) 0
_________________________________________________________________
dense_21 (Dense) (None, 1) 129
=================================================================
Total params: 2,575,189
Trainable params: 133,889
Non-trainable params: 2,441,300
_________________________________________________________________
Test Score: 0.5700
Test Accuracy: 0.7748
Bidirectional LSTM
This LSTM variant works both forwards and backwards, evaluating the sequence of words in reverse order as well.
Model: "sequential_13"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_13 (Embedding) (None, 65, 100) 2441300
_________________________________________________________________
bidirectional_2 (Bidirection (None, 256) 234496
_________________________________________________________________
dense_26 (Dense) (None, 128) 32896
_________________________________________________________________
activation_11 (Activation) (None, 128) 0
_________________________________________________________________
dense_27 (Dense) (None, 1) 129
=================================================================
Total params: 2,708,821
Trainable params: 267,521
Non-trainable params: 2,441,300
_________________________________________________________________
Test Score: 2.1170
Test Accuracy: 0.7552
Gated Recurrent Units (GRU)
The GRU is a newer recurrent structure similar to the LSTM, with a simpler gating mechanism and fewer parameters.
Model: "sequential_16"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_16 (Embedding) (None, 65, 100) 2441300
_________________________________________________________________
gru_1 (GRU) (None, 128) 87936
_________________________________________________________________
dense_30 (Dense) (None, 128) 16512
_________________________________________________________________
activation_12 (Activation) (None, 128) 0
_________________________________________________________________
dense_31 (Dense) (None, 1) 129
=================================================================
Total params: 2,545,877
Trainable params: 104,577
Non-trainable params: 2,441,300
_________________________________________________________________
Test Score: 0.6943
Test Accuracy: 0.2448
What we found
The differences in validation accuracy between the models are very small. Although our feed forward network had the best validation accuracy (79.1%), the LSTM model was less than two percentage points behind (77.5%).
LSTM predicts: Not Trump
LSTM predicts: Not Trump
LSTM predicts: Not Trump
LSTM predicts: Trump
After testing on a few random tweets, we decided to try our LSTM model on all of our post-March 2017 tweets to see if the ratio of Trump-Staff tweets changed over time. This is what we found:
Just for comparison’s sake, this is what the graph looked like before March 2017:
Concluding Reflections
Based on our results, we found that the feed forward network performed the best, which could be due to the simple nature of Trump’s tweets. Ideally, our model should take capitalization and punctuation into consideration, but because of the constraints of our embedding file, we weren’t able to do so. If time permitted, we could have trained our own embeddings file on only Trump’s tweets instead of using a pre-trained one. Despite these limitations, we found our results to be rather insightful.
So next time you see a tweet like this,
[[0.1183652 0.8849327]]
# 88.49% Not Trump
… maybe take it with a grain of salt. Maybe it isn’t actually the president of the United States making major declarations about foreign policy.
Curious about our data? Want to play around with it yourself? Check out our GitHub repo.
Special thanks to Dr. Ulf Aslak for his guidance ❤