Data for Change

The Askeladden Algorithm

Using Machine Learning to Understand Election Interference

Anna Jacobson
Towards Data Science
11 min read · Sep 16, 2020


By Laura Pintos, Ramiro Cadavid, and Anna Jacobson

Image via Shutterstock (obtained with Standard License)

Part 1 | Introduction

Overview

In February 2019, as part of special counsel Robert Mueller’s investigation of the Russian government’s efforts to interfere in the 2016 presidential election, the United States Department of Justice charged 13 Russian nationals with illegally meddling in American political processes. The defendants worked for a well-funded, Russian state-backed “troll factory” called the Internet Research Agency (IRA), which reportedly had 400 employees (known as “trolls”) working 12-hour shifts in a nondescript office building in St. Petersburg. The IRA ran a sophisticated, coordinated campaign to spread disinformation and sow discord in American politics via social media, including Facebook, Instagram, and Twitter.

Twitter identified and suspended thousands of these malicious accounts, deleting millions of the trolls’ tweets from public view on the platform. While news outlets have published samples, it was difficult to understand the full scale and scope of the IRA’s efforts, as well as the details of its strategy and tactics. In the words of Alina Polyakova, a foreign policy fellow at the Brookings Institution:

“Wiping the content doesn’t wipe out the damage caused, and it prevents us from learning about how to be better prepared for such attacks in the future.”

To address this problem, and “in line with our principles of transparency and to improve public understanding of alleged foreign influence campaigns,” in late 2018 Twitter made publicly available archives of Tweets and media that it believed resulted from potentially state-backed information operations.

According to a December 2018 United States Senate Select Committee on Intelligence (SSCI) briefing, among the suspended IRA troll accounts there were approximately 109 accounts masquerading as news organizations, including U.S. local news organizations. Our goal for this project was to develop a machine learning algorithm that could be implemented to strengthen Twitter against attempted media manipulation by organizations like the IRA, as well as against other activities that violate Twitter’s Terms of Service. Specifically, we wanted an algorithm that could predict these “fake news” troll tweets. The algorithm that we created is named after Askeladden, a boy who outwits and defeats trolls in Norwegian folklore.

Motivation

Except within the narrowly defined boundaries of misuse laid out in the Twitter Rules, the company does not take responsibility for either the veracity or the intent of its users’ posts. This is almost certainly due to the practical challenge of monitoring and moderating a huge and constantly increasing volume of content. However, the choice is also likely ideological; Twitter affirms throughout its policies its support for freedom of expression and open dialogue. There are many individual “real” Twitter users who post content similar to that of the IRA trolls, possibly with similar intent. For the most part, these users are in compliance with the Twitter User Agreement. However, we believe that the IRA trolls are materially different from “real” trolls and should not be allowed the same freedoms.

The same SSCI report referenced above tells us that the IRA had clearly defined and coordinated intent to influence the 2016 election, exhibiting a strong and consistent preference for Donald Trump and posting negative content about a wide range of his Republican primary rivals, including Ted Cruz, Marco Rubio, Lindsey Graham, John McCain, and Ben Carson, as well as his general election opponent Hillary Clinton. Troll tweets employed deliberate voter suppression tactics that included malicious misdirection, candidate support redirection, and voter turnout depression. That the IRA was directed by a hostile foreign government and its actions carried out by foreign nationals gives additional cause for concern. The central tenet of democracy — that a people should be able to select for themselves the leaders who can best govern and meet their political needs — is violated by any foreign election interference. Covert, duplicitous, and deliberately disinformative foreign interference such as the IRA’s is a particular affront to an institution that is founded on public trust.

The “fake news” trolls, who purported to be legitimate news organizations, perpetrated an especially egregious fraud through their impersonation and manipulation of the media. Exploiting the fragmentation of the modern media environment, the “fake news” trolls weaponized one of democracy’s key pillars against it. In addition, the reach of these accounts far surpassed that of most individual accounts. Prior to their suspension, the 44 US “fake news” troll Twitter accounts had amassed 660,335 followers among them, an average of roughly 15,000 followers each (versus the average Twitter user with fewer than 1,000 followers). Many of these accounts behaved similarly, posting links to articles and local content dozens of times per day. Many other legitimate users, including several high-profile Trump campaign members (Donald Trump Jr., Eric Trump, Kellyanne Conway, Brad Parscale, and Michael Flynn), linked or reposted material from these accounts, legitimizing their content and amplifying their influence far beyond their own followers. For these reasons, we felt that the “fake news” trolls merited the special attention of our project.

Perhaps the most convincing motivation for this project is that the trolls’ threat remains; the SSCI report states that there is evidence of continued interference operations across social media platforms. In order to prevent the interference that we now know took place during the 2016 election from happening again in future elections, measures must be taken to proactively protect the United States’ democratic institutions.

Part 2 | Data Sourcing & Exploration

“Fake News” Troll Tweets

Twitter’s collection of IRA datasets includes all public, undeleted tweets and media for 3,613 accounts that Twitter believes are connected to the IRA. Tweets deleted by these users prior to their suspension (which are not included in these datasets) comprise less than 1% of their overall activity.

For our project, we decided to focus on the _tweets dataset, which includes 8,768,633 unique tweets from May 2009 to June 2018. The dataset includes 31 variables, among them the tweet identification number, the user identification number (anonymized for users that had fewer than 5,000 followers at the time of suspension), the Twitter handle of the user (the same as the userid for anonymized users), the language of the tweet, the text of the tweet (with mentions of anonymized accounts replaced by the anonymized userid), and the time when the tweet was published. After filtering for English-language tweets only, there are just under 3 million unique tweets in the dataset (2,997,181), from 3,077 unique user accounts.

Figure 1: First ten tweets in the English-language tweet training data.

We defined our period of interest as 2016–2018 in order to align with the timeframe of our real news dataset (see below). We also felt that this timeframe represented an important period of escalating political divisiveness in the run-up to and aftermath of the 2016 presidential election.

In order to segregate the “fake news” trolls from other trolls, we created a subset based on user screen names containing the words ‘Daily’, ‘New’, ‘Today’, and ‘Online’. This yielded 296,949 unique tweets from 33 unique user accounts with screen names such as TodayNYCity, ChicagoDailyNew, and KansasDailyNews.
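This screen-name filter can be sketched with pandas. The rows below are invented toy examples; the column name `user_screen_name` is an assumption based on the schema of Twitter’s published election-integrity datasets:

```python
import pandas as pd

# Toy rows standing in for the IRA tweets dataset.
tweets = pd.DataFrame({
    "user_screen_name": ["TodayNYCity", "ChicagoDailyNew", "WorldOfHashtags",
                         "KansasDailyNews", "OnlineCleveland"],
    "tweet_text": ["t1", "t2", "t3", "t4", "t5"],
})

# Keep only accounts whose screen name contains one of the
# news-style keywords used to flag "fake news" trolls.
keywords = ["Daily", "New", "Today", "Online"]
pattern = "|".join(keywords)
fake_news = tweets[tweets["user_screen_name"].str.contains(pattern)]
```

With this toy frame, `WorldOfHashtags` is excluded while the four news-style handles are retained; on the real data the same filter yields the 33 accounts described above.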

Real News Tweets

Harvard Dataverse published a dataset containing the tweet IDs of 39,695,156 tweets collected from the Twitter accounts of approximately 4,500 news outlets (i.e., accounts of media organizations intended to disseminate news). The media organizations included a wide spectrum of outlets, from local U.S. newspapers to foreign television stations. The tweets were collected between August 4, 2016 and July 20, 2018 via the Twitter API.

Twitter’s Developer Policy (with which compliance is required in exchange for keys to the Twitter API) places limits on the sharing of datasets: those who share datasets of tweets may publicly share only the IDs of the tweets, not the tweets themselves. Thus, this dataset contained only the tweets’ IDs, from which we retrieved the complete tweets via the Twitter API. We selected a variety of English-language news outlets across the ideological spectrum, including Politico, Fox News, CNN, The Economist, and MSNBC. In total, we included 153,188 unique tweets from 49 unique user accounts.

All the News

For this analysis, we randomly sampled the “fake news” troll tweets dataset to get the same number of tweets as our real news tweet dataset. We concatenated the troll tweets and the real news tweets, resulting in a combined dataset of 306,376 tweets equally balanced between the two classes. We included the text of each tweet and its category (‘real’ or ‘troll’). No other identifying information was included.
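The balancing step can be sketched as follows, with tiny toy frames standing in for the 296,949 troll tweets and 153,188 real news tweets:

```python
import pandas as pd

# Toy stand-ins for the two tweet sets.
troll = pd.DataFrame({"text": [f"troll tweet {i}" for i in range(10)]})
real = pd.DataFrame({"text": [f"real tweet {i}" for i in range(6)]})

# Randomly downsample the larger (troll) set to the size of the
# real-news set, label both classes, and concatenate.
troll_sample = troll.sample(n=len(real), random_state=0).copy()
troll_sample["category"] = "troll"
real = real.copy()
real["category"] = "real"
combined = pd.concat([troll_sample, real], ignore_index=True)
```

The result is a single frame with equal numbers of ‘real’ and ‘troll’ rows, mirroring the 306,376-tweet balanced dataset.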

Feature Extraction

Using CountVectorizer, we extracted 470,051 unique features from our dataset. After “https” and “co”, by far the most common features because they appear in every shortened t.co link, many of the overall most common features are standard English stop words such as “to”, “the”, “in”, and “of”. Surprisingly, the feature “trump” occurs much more frequently in real news tweets than in troll tweets, while the feature “news” occurs much more frequently in troll tweets than in real news tweets.

One interesting observation about our extracted features is that standard stop words tend to occur much more frequently in the real news tweets than in the troll tweets. These words are typically not considered useful in Natural Language Processing (NLP) because of their ubiquity and lack of association with specific subject matter. However, in this particular analysis, we are comparing a class in which the text was presumably mostly written by native English speakers to a class in which the text, though in English, was presumably mostly written by native Russian speakers. The omission of stop words, such as the definite article “the”, the prepositions “to” and “of”, and the verb “is”, is characteristic of many non-native English speakers. For instance, the Russian language has no articles, so native Russian speakers can have difficulty with this concept when learning English and often omit stop words such as “the”. The use of prepositions in Russian is very different from their use in English, which can lead to errors such as confusion between the stop words “on” and “at” and omission of the stop word “for”. Russian also omits the copula (linking verb) in the present tense, which can lead to the omission of stop words such as “is” in English.

Setting aside stop words, we see another interesting difference: of the top fifty most frequent features for each class, there are three notable non-stop words that occur in the real news features but not in the fake news features: “president”, “house”, and “white”. However, there are many more non-stop words that occur in the fake news features only, including words that could be used in sensational contexts, such as “police”, “shooting”, “killed”, “fire”, and “crash”. This may point to an imbalance of national news versus local news in our dataset. Alternatively, it may be a characteristic of different tweeting styles, where troll tweets might be more overtly trying to grab the reader’s attention.

Part 3 | Modeling

Borrowing concepts from sentiment analysis, we tested a number of different models including Bernoulli Naive Bayes, Logistic Regression, Ensemble (Logistic Regression, Linear SVC, Bernoulli Naive Bayes, Ridge, and Passive Aggressive), Random Forest, and Doc2Vec, using both CountVectorizer and TfidfVectorizer. Although our Ensemble model performed slightly better, we selected the Logistic Regression CountVectorizer bigram model for further development based on its parsimony and high accuracy.

Figure 2: Model accuracy results.
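The selected model can be sketched as a scikit-learn pipeline. The corpus and labels below are invented toy examples, not tweets from the actual dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy corpus standing in for the 306,376-tweet training set.
texts = [
    "trump signs the bill", "white house press briefing",
    "senate votes on the bill",
    "breaking news police shooting", "local news fire and crash",
    "news police chase downtown",
]
labels = ["real", "real", "real", "troll", "troll", "troll"]

# ngram_range=(1, 2) includes both unigrams and bigrams, matching
# the selected CountVectorizer bigram configuration.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
```

On the real data, this same pipeline is fit on the tweet text and ‘real’/‘troll’ labels described in Part 2.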

We explored various optimizations, including preprocessing the data with stop-word removal and a number of other customized preprocessing steps, and fine-tuning the model’s hyperparameters: the regularization strength C, min_df, and max_features. In the end, however, the original, un-optimized model using un-preprocessed data provided the best performance.
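The tuning described above can be sketched as a scikit-learn grid search over the same three hyperparameters; the toy corpus and grid values are illustrative assumptions, not the values actually searched:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy corpus; real tuning used the full balanced dataset.
texts = [
    "president signs the bill today", "white house press briefing today",
    "senate votes on the bill",
    "breaking news police shooting", "local news fire and crash",
    "news police chase shooting",
]
labels = ["real", "real", "real", "troll", "troll", "troll"]

pipe = Pipeline([
    ("vec", CountVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The three hyperparameters tuned: regularization strength C,
# minimum document frequency, and vocabulary size cap.
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "vec__min_df": [1, 2],
    "vec__max_features": [None, 5000],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(texts, labels)
```

`search.best_params_` then exposes the winning combination; in our case no tuned configuration beat the defaults.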

Part 4 | Model Interpretation

Observing the statistical predictors of “fake news” troll tweets gives us some insights into their tactics.

Camouflage

Many troll tweets appear innocuous, focused on seemingly uncontroversial topics. Eight of the 12 “trolliest” tweets — those most strongly predicted by our model to be troll tweets — are about sports.

Figure 3: The top 12 “trolliest” tweets. Sports-related tweets are shown in blue.
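Ranking tweets by “trolliness” amounts to sorting by the predicted probability of the troll class. A sketch with an invented toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy training corpus.
texts = [
    "trump signs the bill", "white house press briefing",
    "senate votes on the bill",
    "breaking news police shooting", "local news fire and crash",
    "news police chase downtown",
]
labels = ["real", "real", "real", "troll", "troll", "troll"]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# Rank unseen tweets from most to least troll-like.
unseen = ["senate passes the bill", "breaking news shooting downtown"]
troll_col = list(model.classes_).index("troll")
troll_prob = model.predict_proba(unseen)[:, troll_col]
trolliest = [unseen[i] for i in np.argsort(troll_prob)[::-1]]
```

Applied to the full test set, the head of this ranking is what Figure 3 reports.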

Confusion

Troll tweets often sound like real news tweets. Of the 20 most confused — or incorrectly predicted — tweets, 80% are troll tweets misclassified as real news.

Figure 4: Sample of “confused” tweets.
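Confused tweets can be surfaced by comparing predictions against known labels on held-out data. In this toy sketch, a hypothetical troll tweet composed entirely of “real news” vocabulary fools the model:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy training corpus.
texts = [
    "trump signs the bill", "white house responds to senate",
    "senate votes on the bill",
    "breaking news police shooting", "local news fire and crash",
    "news police chase downtown",
]
labels = ["real", "real", "real", "troll", "troll", "troll"]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# A troll tweet written to sound like real news: every word here
# appears only in the "real" training tweets, so the model is fooled.
heldout = ["white house senate bill"]
heldout_labels = ["troll"]
preds = model.predict(heldout)
confused = [t for t, p, y in zip(heldout, preds, heldout_labels) if p != y]
```

Collecting such disagreements over the full test set and sorting by prediction confidence yields the “most confused” tweets of Figure 4.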

Feature Variation

The top fifty most predictive features of a troll tweet are seemingly varied. However, upon closer examination, patterns emerge.

Figure 5: The top 50 most predictive features of troll tweets.

Generic Words

The first is the use of generic words, such as politics, news, and local. It turns out that real news tweets don’t need these kinds of labels to tell people what they are; CNN, for instance, doesn’t need to label its tweets as news, because you know CNN. The trolls, on the other hand, do use these labels, which is a logical move if you’re trying to convince someone that you are something you actually are not.

Figure 6: Generic words among the top 50 most predictive features of troll tweets.

Political Slogans

A second pattern is political slogans, both pro-Trump and anti-Hillary. This is unsurprising; as we mentioned earlier, the IRA is now known to have tried to influence the 2016 election in Trump’s favor. It also makes sense that legitimate news organizations would not use these phrases.

Figure 7: Political slogans among the top 50 most predictive features of troll tweets.

Syria

More surprising is the third pattern, the prevalence of words related to Syria. Perhaps this is because the trolls were actively trying to influence events in Syria — or perhaps it is because real American news coverage of Syria has been quite sparse. Alternatively, these tweets may have been intended for a European audience, since Syria was a more controversial issue in Europe at the time.

Figure 8: Words related to Syria among the top 50 most predictive features of troll tweets.

Part 5 | Conclusion

The evidence gathered since the 2016 election shows irrefutably that Russia waged a coordinated, professionalized campaign to subvert the integrity of the media and to weaken the democracy of the United States. In August 2020, American intelligence officials publicly announced that this interference continues, either in an attempt to help President Trump win a second term or simply to erode confidence in the American electoral system.

The statement from the National Counterintelligence and Security Center did not provide details about Russia’s tactics in the run-up to the 2020 election, merely describing them as “spread[ing] disinformation in the U.S. that is designed to undermine confidence in our democratic process”. But what we do know is that the types of tactics employed by the IRA in 2016 can only influence elections if voters are misled by the trolls’ disinformation. It is our profound hope that projects such as the Askeladden Algorithm can help raise Americans’ awareness of how hostile foreign governments might try to manipulate them and inspire vigilance in sourcing and fact-checking the information we use to decide our votes.

This project was created in the Master of Information and Data Science program at the UC Berkeley School of Information.
