
Trump’s Twitter Network

Social Media: The Challenge in Taking Action

Credit: Charles Deluvio

Why is it so hard to predict the content that will cause real-world harm?

You might have read that and thought: no it isn’t! If you leave content from the Proud Boys on Parler, then of course they’ll storm the Capitol.

Unfortunately, it’s not that simple. Roughly 500 million tweets are produced every day, so it’s incredibly difficult to avoid the debilitating problem of information overload, where there is simply too much material to sift through.

Let me spell that out a bit more with an example. The Proud Boys are just one far-right group in America, of which there are many more. Even if the Proud Boys had been removed from Parler and every other social media platform, they’d still be just one needle in an almost endless haystack of far-right groups creating content on social media.

So I wanted to explore whether there is a way of confronting this problem through data science; I wanted to see if there was a way to separate the signal from all the noise.

Hypothesis: how does Trump’s interlocutor change his choice of words?

One way to test this, I thought, would be to limit the dataset to the most powerful person in the world: US President Donald J. Trump.

To investigate this, I came up with the hypothesis statement above, which in plainer English you could understand as follows:

Is Trump more or less likely to incite a mob to break into the Capitol if he’s been retweeting Sean Hannity for the last week? Or if he’s been retweeting Breitbart a lot and has descended into a Twitter war with Nancy Pelosi?

Data Sources

To explore this, I made a series of Twitter API calls to enrich the already very valuable dataset at the Trump Twitter Archive.

This was particularly important since that dataset didn’t include the URLs that were or weren’t present in Trump’s tweets, which was exactly the sort of feature I was interested in!

Here’s the code I wrote to achieve that:
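In outline, it looked something like the sketch below rather than the verbatim snippet: it assumes the Twitter v2 tweet-lookup endpoint, and enrich_tweets and the TWITTER_BEARER_TOKEN environment variable are illustrative names.

```python
import os

import requests

# Credentials come from your Twitter developer account (see below).
BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]


def enrich_tweets(tweet_ids):
    """Fetch URLs, quote counts, and reply counts for up to 100 tweet IDs."""
    resp = requests.get(
        "https://api.twitter.com/2/tweets",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params={
            "ids": ",".join(tweet_ids),
            "tweet.fields": "entities,public_metrics",
        },
    )
    resp.raise_for_status()
    enriched = {}
    for tweet in resp.json().get("data", []):
        urls = [u["expanded_url"] for u in tweet.get("entities", {}).get("urls", [])]
        metrics = tweet["public_metrics"]
        enriched[tweet["id"]] = {
            "urls": urls,
            "quote_count": metrics["quote_count"],
            "reply_count": metrics["reply_count"],
        }
    return enriched
```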

For this to work, you’ll need a Twitter developer account so you can get API credentials.

I’ve also only requested extra information on URLs, quote counts, and reply counts, but there’s a fair bit more that you can retrieve.

The other dataset I worked with on this project came from hatespeechdata.com, which was a little more straightforward: it consisted of posts on Reddit and Gab that users had classified as hateful or not.

The point of this was to provide a dataset independent of Trump, upon which I could train a model to classify speech as hateful or not, and then apply that model to give Trump’s tweets a hatefulness score.

Once that was out of the way, I could get to work on building a model that would try to predict this hatefulness score and see which features were important in doing so.

Building a Hate Speech Policy Violation Classifier

There’s a fair bit of code behind this, which you can see here in full; what follows is a quick summary of the main steps.

The hate speech dataset contained 90% hateful samples and 10% non-hateful samples, so I had to do a few things to mitigate this class imbalance.

One was to pick appropriate metrics that could account for this; in addition to looking at root mean squared error, I also assessed the model on its F1 score, which combines precision and recall to account for the rates of false positives and false negatives.

Formal definitions of Precision, Recall, and an F1 Score
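For reference, with TP, FP, and FN denoting true positives, false positives, and false negatives, the standard definitions are:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$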

The other approach I took here was to randomly oversample the non-hateful data; that was fairly straightforward, as you can see from the code below.

Here, the training data consisted of a y_train array containing 0 or 1 values for whether the speech was hateful and a sparse matrix containing the vectorized natural language data and features.
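A minimal sketch of that step, assuming X_train is the sparse feature matrix and y_train the label array, using imbalanced-learn’s RandomOverSampler:

```python
from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class (non-hateful) rows until the classes balance.
ros = RandomOverSampler(random_state=42)
X_train_balanced, y_train_balanced = ros.fit_resample(X_train, y_train)
```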

I tried out a few models for this using a TfidfVectorizer, since I imagined the presence of swear words or other abusive language would be more significant than general word count.

Out of these, a Support Vector Machine model performed the best but was very slow to load. A Multinomial Naive Bayes model performed badly, which is perhaps to be expected since features in hateful language are unlikely to be conditionally independent of one another – much of the data consisted of multiple posts in the same thread.

So in the end I opted for a simple logistic regression model that gave an F1 score of 0.91 and performed well on some random out-of-sample data too.

Here’s a code snippet on that:
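In outline, it looked something like this sketch, where texts_train, texts_test, y_train, and y_test are illustrative names for the raw posts and their labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# TF-IDF upweights distinctive terms (slurs, abusive language) relative to
# words that are common across the whole corpus.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(texts_train, y_train)
print(f"F1: {f1_score(y_test, model.predict(texts_test)):.2f}")
```

(In the project itself the oversampled training set was already vectorized, so the vectorizer sat outside a pipeline like this, but the modelling idea is the same.)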

And the output it generated…

However, when I applied this model to classify Trump’s tweets as hateful or not I experienced some significant data drift.

Trump’s tweets, incendiary though they are, do not actually use racial slurs or swear words as frequently as non-political figures of hate do on Gab or Reddit.

So after some significant manual testing, I concluded that, when applied to Trump’s tweets, my model picked up on Trump’s divisiveness. It scored a tweet as more likely to be hateful when it contained phrases like ‘North Korea’, ‘immigration’, or ‘Make America Great Again’.

Why? Because these phrases frequently co-occurred with hate speech in the Gab and Reddit dataset used to train the model.

To give you an example:

Democrats are far more concerned with Illegal Immigrants than they are with our great Military or Safety at our dangerous Southern Border

According to my model, this speech was more than 70% likely to be hateful.

This is my 500th Day in Office and we have accomplished a lot – many believe more than any President in his first 500 days

While this tweet was less than 30% likely to be hateful.

Of course, neither tweet is in fact hateful, but you can see how the former is more divisive because it’s associated with the kind of rhetoric that frequently drives hate speech.

Modelling Trump’s Divisiveness

So, I could now add this hatefulness score to my dataset of Trump’s tweets and thereby create a target variable.

However, there was still a lot of work to be done! Apart from the usual data cleaning and natural language pre-processing, I also had to extract some key features: did Trump’s tweet contain a retweet? If so, of whom? Did it contain a URL? If it did, what was the root domain?

Let me just pause for a second on the domain: if I want to understand whether Trump becomes more divisive after talking to Breitbart all day on Twitter, then I need to encode a URL like https://www.breitbart.com/article-1 with the same value as https://www.breitbart.com/article-2, since they are both from Breitbart.
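Extracting the root domain is simple enough with the standard library; here’s a small sketch (root_domain is an illustrative helper name, and removeprefix needs Python 3.9+):

```python
from urllib.parse import urlparse


def root_domain(url: str) -> str:
    """Reduce an article URL to its root domain, e.g. 'breitbart.com'."""
    netloc = urlparse(url).netloc.lower()
    return netloc.removeprefix("www.")


root_domain("https://www.breitbart.com/article-1")  # -> 'breitbart.com'
```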

This was all a bit fiddly, but that’s what made it interesting. Have a look at this notebook to see what it looked like in practice:

After cleaning up some more of the data, I used a regression tree to determine which features were predictors of Trump’s divisiveness, measuring this with an R² score.

Why an R² score? Because I wanted to see which features predicted Trump’s divisiveness. At this point I wasn’t too concerned with accuracy, just with seeing whether who Trump retweets or cites affects how divisive he gets.
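A minimal sketch of that step, assuming X_train and X_test hold the encoded features (retweet target, root domain, and so on), y_train and y_test the hatefulness scores, and feature_names the column labels:

```python
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(max_depth=5, random_state=42)
tree.fit(X_train, y_train)
print(f"R2: {r2_score(y_test, tree.predict(X_test)):.2f}")

# Rank features by how much they reduce the tree's prediction error.
ranked = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[:10])
```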

For context, here’s a formal definition of an R² score applied in the context of my project:
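$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ is the true divisiveness score of tweet $i$, $\hat{y}_i$ is the model’s prediction, and $\bar{y}$ is the mean divisiveness score. A model that always predicts the mean scores 0; a perfect model scores 1.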

It was pretty hard to predict Trump’s divisiveness, even with ensemble models like random forests, but I did get an R² score of 0.2, which was at least an improvement on a baseline R² score of 0, i.e. the score you’d get if you predicted just the mean divisiveness score.

However, much more interestingly, these models all showed that one feature was overwhelmingly more important than the others. Was it Sean Hannity? Or Nancy Pelosi? Fox News?

No! It was time…

Trump Over Time

Trump’s divisiveness over time. The y-axis denotes Trump’s mean overall divisiveness score and the x-axis the quarter of each year. In plain English: this shows you how divisive Trump was in each quarter.

This graph is a time series that shows the exponentially weighted mean of Trump’s quarterly divisiveness score. Plotting it like this, as opposed to plotting the raw data, smooths out a lot of the noise and reveals some of the trends in the data.

What’s fascinating about this is how much it tracks Trump’s political career and election cycles.

The first uptick in divisiveness comes in 2011, when Trump kicks off his political career by speaking at the Conservative Political Action Conference.

From then on, it also broadly tracks the election cycles in 2012 and 2016 before spiking upwards in 2020 around the Black Lives Matter protests and the COVID-19 pandemic.

Here’s the code behind all this:
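In outline (a sketch rather than the exact notebook code, assuming a DataFrame df indexed by tweet timestamp with a divisiveness column):

```python
import matplotlib.pyplot as plt
import pandas as pd

# df: one row per tweet, indexed by timestamp, with a 'divisiveness' column.
quarterly = df["divisiveness"].resample("Q").mean()
smoothed = quarterly.ewm(span=4).mean()  # exponentially weighted mean

ax = smoothed.plot()
ax.set_xlabel("Quarter")
ax.set_ylabel("Mean divisiveness score")
ax.set_title("Trump's divisiveness over time")
plt.show()
```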

Conclusions

So, what does this say about preventing real-world harm on social media, and how might machine learning and data science assist?

Well, the first thing to say is that my hypothesis was wrong. It doesn’t look like Trump’s interlocutor affects his choice of words; overwhelmingly, the time period in which he is tweeting does.

But I think this helps arrive at a much more powerful solution.

As you can see from the graph above, Trump’s divisive language is incredibly time-bound. That may seem obvious, but it actually provides a very useful way to filter the dataset.

If we know that Trump’s rhetoric in 2020 was similarly divisive to that in 2012, this is helpful because we already know what happened in 2012. If we know that these periods are similar based on a metric we care about – how divisive the president of the United States is – then we can explore that period in the data and see what happened there.

Did anyone Trump tweeted at go on to do something terrible? Did a publication that Trump consistently retweeted spread terrible misinformation, and is it still active?

If the answer to those questions is yes, then we can take action knowing there’s strong evidence to support our decision. The value of this information is hard to overstate: it’s the power to act before it’s too late.

