The Face Behind the Handle — Using neural networks to distinguish Donald Trump’s tweeting habits

Nearly every monumental declaration has had a face to it. In our age, we have a Twitter handle.

Nathan Siu
9 min read · Dec 3, 2019


Derek Albosta, Caleb Grenko, Woo Seok Jung, Nathan Siu

Throughout history, presidents have built platforms for themselves to announce that major changes were coming. Franklin D. Roosevelt used his personal Fireside Chats to announce war on Japan and the US entrance into World War II, Ronald Reagan called for a reunified Germany in his iconic speech at the Berlin Wall, and through a televised broadcast, John F. Kennedy announced that humanity’s dream of going to the moon would become reality. These moments are hallmarks of world history, showing that an iconic message from an iconic person never truly dies.

Three iconic presidential speeches throughout history. Left: FDR’s fireside chat announcing war on Japan. Middle: Ronald Reagan’s “Tear down this wall” speech at the Berlin Wall. Right: JFK giving his moonshot speech. All thumbnails taken from the YouTube videos linked above.

However, times have changed:

While yes, this is a monumental event for global relations, it is also President Trump’s personal Twitter — his primary means of communication to the world.

From recognizing Israeli sovereignty over the Golan Heights to announcing $200 billion in new tariffs against China, these actions on Twitter have proven time and time again to have major impacts worldwide. In fact, J.P. Morgan even created a ‘Volfefe’ index to track the volatility in US bond markets caused by tweets from @realDonaldTrump.

In iconic televised events like the one where JFK brilliantly declares that yes, we ARE going to the moon, you can clearly see a high school AP Lang teacher’s checklist: What literary devices does he use? How does he present himself to the crowd? What is his ethos? (I apologize if I gave you a flashback.)

However, there is one question that is almost never talked about but is all too vital: WHO IS SPEAKING?

Just because it comes from @realDonaldTrump does NOT mean that it is from (yes I’m doing this) the real Donald Trump. Don’t you think it would be important to know who is actually behind the wheel of the world’s most powerful Twitter account? Well, fortunately, there’s a theory.

A popular (conspiracy?) theory about the source of Donald Trump’s tweets is that his more diplomatic tweets are composed by staffers, while he personally pens the more divisive content. To identify who really composes his tweets, we trained several neural networks around this theory.

Before 25 March 2017, tweets from @realDonaldTrump were posted from both an iPhone and an Android device. The theory holds that Trump’s staffers posted tweets from an iPhone while Trump himself tweeted from his Android. Although Trump switched to an iPhone in 2017, we decided to test this theory and see if we could determine the source of each tweet from its text alone.

Here’s a decent example of two differently labeled tweets:

Twitter for Android
Twitter for iPhone

Quite the contrast isn’t it?

What we did

Screenshot of the Trump Twitter Archive

To classify tweets as staff-tweeted or Trump-tweeted, we used the catalogue of tweets archived by the Trump Twitter Archive and separated them into two sets: pre- and post-March 2017. Within the pre-March 2017 set, we filtered out tweets posted from other sources like Twitter for Web and used the Android and iPhone tweets to build our ground-truth dataset.

Here we have a plot of tweet frequency by device type, and we find that, in general, tweets from the Android device (Trump himself) are less frequent than those from an iPhone (his staff members).

Inspecting the distribution plot, we note the big peak around 140 characters for both devices, though more pronounced for Android. This can be attributed to the character limit on tweets: it was kept at 140 characters per tweet from Twitter’s founding in 2006, largely influenced by the 160-character limit for SMS messages. In November 2017, however, the limit was doubled to 280 characters, which is evident in the rightmost peak for iPhone tweets in the distribution plot. These two plots speak to the apparent bias in our dataset: it is heavily skewed towards iPhones, with almost three times as many data points and a heavier distribution past the old 140-character limit.

Preparing Data

As with any neural network, some preprocessing is in order:

Tokenizing & Formatting

Tokenization maps each word to an index in a dictionary. To prepare the data for the neural network, the tweets are first tokenized using Keras’s Tokenizer module. Once the training and test sets were tokenized, we had to keep the input size consistent. Cropping tweets to a fixed length would discard data, so we padded them instead, to a length of 65 words: the longest tweet we found was 50 words, and we wanted a comfortable margin for future tweets.
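A minimal sketch of this step, assuming train_texts and test_texts are lists of tweet strings (the variable names are illustrative, not the original code):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 65  # pad length: the longest observed tweet was 50 words

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts)  # builds the word -> index dictionary

X_train = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=MAX_LEN)
X_test = pad_sequences(tokenizer.texts_to_sequences(test_texts), maxlen=MAX_LEN)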

Embedding

To construct our word embeddings, we found a pre-trained global vectors (GloVe) word model on Kaggle. GloVe represents each word as a dense vector learned from co-occurrence statistics, so the relationship between two words can be scored, via cosine similarity, on a scale from -1 to 1. This is a very common way to generate the weights for a word-embedding layer. First, we created a dictionary of the words that appeared in our dataset, then built an embedding matrix by parsing those words and their vectors out of the GloVe text file.
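A sketch of that parsing step; the file name glove.6B.100d.txt and the 100-dimensional vectors are assumptions based on the embedding shapes in the model summaries below:

import numpy as np

EMBEDDING_DIM = 100

# parse the GloVe file into a word -> vector lookup
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# rows align with the tokenizer's word indices; index 0 is reserved for padding
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:  # words missing from GloVe keep zero vectors
        embedding_matrix[i] = vector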

Balancing

As seen previously, our data is highly skewed: there are significantly more tweets from iPhones than from Androids. Training on this skewed dataset risks a biased model that over-predicts iPhone and is still correct most of the time. We tackled this problem by training our models on balanced batches, using undersampling: drawing a smaller random sample from the iPhone tweets so each class is equally represented. This helps our models make more reliable predictions, since they are exposed to a balanced set of training data.
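A minimal undersampling sketch (illustrative, not the authors’ exact code), assuming X_train and y_train are the padded training sequences and integer labels, with 0 = Android and 1 = iPhone (the encoding is an assumption):

import numpy as np

rng = np.random.RandomState(42)  # fixed seed for reproducibility

android_idx = np.where(y_train == 0)[0]
# randomly downsample the majority class (iPhone) to the Android count
iphone_idx = rng.choice(np.where(y_train == 1)[0],
                        size=len(android_idx), replace=False)

# shuffle the combined indices so batches stay mixed
balanced = rng.permutation(np.concatenate([android_idx, iphone_idx]))
X_bal, y_bal = X_train[balanced], y_train[balanced]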

Train/Test Split

After all the data has been scraped into the JSON file, the text and source of each tweet are passed into sklearn’s train_test_split to generate a training and validation set.
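A sketch of the split, assuming texts and labels are parallel lists pulled from the archive JSON (the 80/20 ratio is an assumption):

from sklearn.model_selection import train_test_split

train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)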

Modeling

We selected five different architectures to train on our dataset. We first set up a basic feed-forward network as a baseline, then looked at four other models that are popular in NLP. The goal wasn’t to fine-tune hyperparameters, but to explore various popular architectures that could later be expanded into more elaborate and precise models.

The first layer of each network is the embedding layer, which takes the padded word vector (length 65) as input. Its weights are pre-instantiated from the GloVe text file; using a pre-made word-embeddings file like this is an application of transfer learning. This first layer is the same in every model, regardless of the layers that follow.
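A sketch of that shared first layer, building on the snippets above; freezing it (trainable=False) matches the “non-trainable params” counts in the summaries below:

from keras.layers import Embedding

embedding_layer = Embedding(input_dim=vocab_size,
                            output_dim=EMBEDDING_DIM,  # 100-d GloVe vectors
                            weights=[embedding_matrix],
                            input_length=MAX_LEN,      # padded length of 65
                            trainable=False)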

When training each model, we set our test set as the validation data using the built-in validation_data parameter in Keras. We monitor the validation accuracy each epoch and train until it stops improving for 50 epochs.
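Roughly, for each model (the epoch cap and batch size are assumptions; the 50-epoch patience is from the text):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_acc', patience=50,
                           restore_best_weights=True)
model.fit(X_bal, y_bal,
          validation_data=(X_test, y_test),
          epochs=1000, batch_size=64,
          callbacks=[early_stop])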

If you want to follow along, check out our GitHub page!

Feed Forward

We tried a feed-forward model first to see how a basic model would perform compared to the more complex ones.
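A minimal sketch of the architecture implied by the summary below (the sigmoid output and the Adam/binary cross-entropy compile settings are assumptions):

from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential([
    embedding_layer,                 # frozen GloVe embeddings from above
    Flatten(),                       # (65, 100) -> 6500 features
    Dense(1, activation='sigmoid'),  # binary Android-vs-iPhone output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])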

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 65, 100) 2441300
_________________________________________________________________
flatten_1 (Flatten) (None, 6500) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 6501
=================================================================
Total params: 2,447,801
Trainable params: 6,501
Non-trainable params: 2,441,300
Test Score: 0.4719
Test Accuracy: 0.7909

1D Convolutional Neural Network (CNN)

A 1D CNN scans over a sequence of tokenized words. The filter length determines how many words are looked at in a single convolution.
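The summary below implies a filter length of 5, since the 65-word input shrinks to 61 positions (65 - 5 + 1); a sketch under that assumption, with the activation type also assumed:

from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense, Activation

model = Sequential([
    embedding_layer,
    Conv1D(filters=128, kernel_size=5),  # each filter spans 5 consecutive words
    GlobalMaxPooling1D(),                # keep each filter's strongest response
    Dense(128),
    Activation('relu'),
    Dense(1, activation='sigmoid'),
])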

Model: "sequential_8"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_8 (Embedding) (None, 65, 100) 2441300
_________________________________________________________________
conv1d_4 (Conv1D) (None, 61, 128) 64128
_________________________________________________________________
global_max_pooling1d_4 (Glob (None, 128) 0
_________________________________________________________________
dense_15 (Dense) (None, 128) 16512
_________________________________________________________________
activation_6 (Activation) (None, 128) 0
_________________________________________________________________
dense_16 (Dense) (None, 1) 129
=================================================================
Total params: 2,522,069
Trainable params: 80,769
Non-trainable params: 2,441,300
_________________________________________________________________
Test Score: 0.8156
Test Accuracy: 0.7650

Long Short Term Memory (LSTM)

An LSTM is a recurrent neural network that is good at finding patterns in sequences of words over both smaller (short-term) and larger (long-term) contexts.
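A sketch matching the summary below (activation type assumed, as before):

from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

model = Sequential([
    embedding_layer,
    LSTM(128),           # final hidden state summarizes the whole tweet
    Dense(128),
    Activation('relu'),
    Dense(1, activation='sigmoid'),
])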

Model: "sequential_10"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_10 (Embedding) (None, 65, 100) 2441300
_________________________________________________________________
lstm_3 (LSTM) (None, 128) 117248
_________________________________________________________________
dense_20 (Dense) (None, 128) 16512
_________________________________________________________________
activation_8 (Activation) (None, 128) 0
_________________________________________________________________
dense_21 (Dense) (None, 1) 129
=================================================================
Total params: 2,575,189
Trainable params: 133,889
Non-trainable params: 2,441,300
_________________________________________________________________
Test Score: 0.5700
Test Accuracy: 0.7748

Bidirectional LSTM

This LSTM variant works both forwards and backwards, evaluating the sequence of words in reverse order as well.
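A sketch; the Bidirectional wrapper runs one 128-unit LSTM in each direction and concatenates their outputs, hence the 256-wide layer in the summary below:

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense, Activation

model = Sequential([
    embedding_layer,
    Bidirectional(LSTM(128)),  # forward and backward passes, concatenated
    Dense(128),
    Activation('relu'),
    Dense(1, activation='sigmoid'),
])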

Model: "sequential_13"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_13 (Embedding) (None, 65, 100) 2441300
_________________________________________________________________
bidirectional_2 (Bidirection (None, 256) 234496
_________________________________________________________________
dense_26 (Dense) (None, 128) 32896
_________________________________________________________________
activation_11 (Activation) (None, 128) 0
_________________________________________________________________
dense_27 (Dense) (None, 1) 129
=================================================================
Total params: 2,708,821
Trainable params: 267,521
Non-trainable params: 2,441,300
_________________________________________________________________
Test Score: 2.1170
Test Accuracy: 0.7552

Gated Recurrent Units (GRU)

The GRU is a newer recurrent structure similar to the LSTM, but with a simpler gating mechanism and fewer parameters per unit.
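A sketch; reset_after=False matches the parameter count in the summary below (newer Keras versions default to a slightly different GRU formulation):

from keras.models import Sequential
from keras.layers import GRU, Dense, Activation

model = Sequential([
    embedding_layer,
    GRU(128, reset_after=False),  # three gates' worth of weights vs. the LSTM's four
    Dense(128),
    Activation('relu'),
    Dense(1, activation='sigmoid'),
])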

Model: "sequential_16"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_16 (Embedding) (None, 65, 100) 2441300
_________________________________________________________________
gru_1 (GRU) (None, 128) 87936
_________________________________________________________________
dense_30 (Dense) (None, 128) 16512
_________________________________________________________________
activation_12 (Activation) (None, 128) 0
_________________________________________________________________
dense_31 (Dense) (None, 1) 129
=================================================================
Total params: 2,545,877
Trainable params: 104,577
Non-trainable params: 2,441,300
_________________________________________________________________
Test Score: 0.6943
Test Accuracy: 0.2448

What we found

Aside from the GRU, which performed far worse, the differences in validation accuracy between the models are very small. Although our feed-forward network had the best validation accuracy (79.1%), the LSTM model was only about two percentage points behind (77.5%).

LSTM predicts: Not Trump

LSTM predicts: Not Trump

LSTM predicts: Not Trump

LSTM predicts: Trump
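Predictions like the ones above come from running a new tweet through the same preprocessing pipeline and the trained model; a minimal sketch reusing the tokenizer and pad_sequences from earlier (which probability maps to which class depends on how the labels were encoded, so the threshold logic here is an assumption):

def classify(tweet_text):
    seq = pad_sequences(tokenizer.texts_to_sequences([tweet_text]),
                        maxlen=MAX_LEN)
    prob = model.predict(seq)[0][0]  # assumed: probability of the iPhone class
    return 'Not Trump' if prob >= 0.5 else 'Trump'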

After testing on a few random tweets, we decided to run our LSTM model over all of the post-March 2017 tweets to see if the ratio of Trump to staff tweets changed over time. This is what we found:

Just for comparison’s sake, this is what the graph looked like before March 2017:

Concluding Reflections

Based on our results, the feed-forward network performed the best, which could be due to the simple nature of Trump’s tweets. Ideally, our model would take capitalization and punctuation into consideration, but the constraints of our embedding file prevented this. Had time permitted, we could have trained our own embeddings on Trump’s tweets alone instead of using a pre-trained file. Despite these limitations, we found our results rather insightful.

So next time you see a tweet like this,

[[0.1183652 0.8849327]]
# 88.49% Not Trump

… maybe take it with a grain of salt. Maybe it isn’t actually the president of the United States making major declarations about foreign policy.

Curious about our data? Want to play around with it yourself? Check out our GitHub repo.
