How We Audited Twitter’s Timeline Curation Algorithm

Jack Bandy
Towards Data Science
5 min read · Apr 20, 2021


Following The Markup’s example, I split this blog into the main findings and this “show your work” piece.

We tested Twitter’s algorithm by creating a group of “puppet” accounts, then comparing their “latest tweets” chronological timelines to their “top tweets” algorithmic timelines.

This piece summarizes the technical details from my forthcoming paper auditing Twitter’s timeline curation algorithm. The main findings are in the companion blog post and the full details are in the research paper, but here I will summarize the following:

  • 🧦 How we set up “sock puppet” accounts to emulate typical users
  • 🦾 How we ran automated timeline collection
  • 🦠 How we clustered COVID-19 tweets by topic
  • 🟦 🟥 How we generated partisan labels for accounts
  • 💬 Other frequently-asked questions

🧦 Setting up Sock Puppets

Sock-puppet auditing involves emulating real-world archetypal accounts with automated “puppet” accounts. Here are the four steps we used to find archetypal accounts on Twitter:

  1. Define an initial pool of potential users (in our case, all accounts that followed U.S. congresspeople on Twitter)
  2. Detect communities within the pool of users (using the Louvain algorithm)
  3. Select one archetypal user from each community (based on degree centrality)
  4. Validate that the archetypal users are not bots (in our case, using Botometer)

In February 2020, I collected all accounts that followed U.S. congresspeople on Twitter, for a total of 20 million accounts. Due to Twitter API constraints, we used a random sample of 10,000 accounts for the next steps.

After detecting communities in these 10,000 random accounts, we took each community to be its own network, then selected the most central user in the community based on normalized degree centrality. The goal was to select a user that followed many of the same accounts as other users, without just selecting an account that followed a ton of users.
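
For the curious, here is a minimal sketch of steps 2 and 3 using networkx. It assumes an undirected graph `G` whose nodes are the 10,000 sampled accounts and whose edges reflect shared followings; the exact graph construction and parameter choices are described in the paper, so treat this as an illustration of the approach rather than the actual analysis code.

```python
import networkx as nx

def find_archetypes(G, seed=0):
    """Sketch of steps 2-3: Louvain communities, then the most central user in each."""
    # Step 2: detect communities with the Louvain algorithm
    # (networkx >= 2.8 ships louvain_communities natively)
    communities = nx.community.louvain_communities(G, seed=seed)

    archetypes = []
    for community in communities:
        # Step 3: treat each community as its own network and pick the
        # account with the highest normalized degree centrality
        subgraph = G.subgraph(community)
        centrality = nx.degree_centrality(subgraph)
        archetypes.append(max(centrality, key=centrality.get))
    return archetypes
```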

Once we validated the archetypal users were not bots, we set up a puppet for each one, and followed all the same accounts as the archetype. Half the puppets emulated users from left-leaning communities (for example, “Logan Left”), and half emulated users from right-leaning communities (for example, “Rebecca Right”).

After setting up these accounts, it was time to collect their timelines.

🦾 Automated Timeline Collection

I wrote an automated Python script that used Selenium to visit Twitter, log in, and collect the timelines. All puppets collected timelines twice per day for one month, each time collecting the first 50 tweets in the chronological timeline and the first 50 tweets in the algorithmic timeline.
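
For reference, here is a rough sketch of what such a collection loop could look like, not the exact script. The CSS selector, scroll distance, and timing are illustrative assumptions (Twitter’s markup changes frequently), and the real script also handled logging in, switching between the “Latest Tweets” and “Home” views, waits, and error cases.

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

def collect_timeline(driver, n_tweets=50, max_scrolls=20):
    """Scroll the currently loaded timeline and return the first n tweets' text."""
    tweets = []
    for _ in range(max_scrolls):
        # individual tweets are wrapped in <article> elements in Twitter's web UI
        elements = driver.find_elements(By.CSS_SELECTOR, "article")
        tweets = [e.text for e in elements]
        if len(tweets) >= n_tweets:
            break
        driver.execute_script("window.scrollBy(0, 2000);")
        time.sleep(2)  # give newly loaded tweets time to render
    return tweets[:n_tweets]

driver = webdriver.Chrome()
driver.get("https://twitter.com/home")
# ...log in as the puppet account, then call collect_timeline() once for the
# chronological timeline and once for the algorithmic timeline...
```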

Our data collection did fail once on April 11, when Twitter requested verification for one of the puppet accounts. We excluded that data point from our analysis.

🦠 Clustering COVID-19 Tweets

We used a method called topic modeling to cluster all the tweets we collected. The algorithm we used is called GSDMM (Gibbs Sampling Dirichlet Multinomial Mixture), a modification of the standard LDA approach that is better suited for short documents like tweets.

The GSDMM output (after fine-tuning the parameters) was 134 clusters of tweets, which I manually inspected and labeled, finding four large clusters related to COVID-19. Here is what they looked like:

Table 5 from the paper, detailing the four clusters of tweets we analyzed
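
If you want to try GSDMM yourself, here is a minimal sketch using the open-source gsdmm package (github.com/rwalk/gsdmm). The K, alpha, beta, and n_iters values below are placeholders rather than the fine-tuned settings from the paper; GSDMM treats K as an upper bound and typically converges to far fewer populated clusters.

```python
from gsdmm import MovieGroupProcess

def cluster_tweets(docs, k_upper_bound=200, n_iters=30):
    """docs: a list of tokenized tweets, e.g. [["covid", "vaccine", ...], ...]"""
    vocab_size = len({token for doc in docs for token in doc})
    # K is an upper bound on the number of clusters; GSDMM empties unused ones
    mgp = MovieGroupProcess(K=k_upper_bound, alpha=0.1, beta=0.1, n_iters=n_iters)
    labels = mgp.fit(docs, vocab_size)  # one cluster label per tweet
    return labels
```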

🟦 🟥 Generating Partisan Labels

Labeling partisanship is a difficult and complicated task, and there are many different methods for doing so. In our case, we were most interested in exposure, so we labeled account partisanship based on which communities followed (and thus would be exposed to) a given account.

As some examples, our scoring system labeled Ben Shapiro and Jim Jordan as “influencers in right-leaning communities” because the left-leaning communities rarely followed their Twitter accounts. For similar reasons, Pete Buttigieg and Kamala Harris were labeled “influencers in left-leaning communities,” as right-leaning communities rarely followed them.

Notably, accounts like Barack Obama, Donald Trump, and Hillary Clinton were labeled as “bipartisan” in our scoring system. This is because they were commonly followed in both left-leaning communities and right-leaning communities. Because our scoring system measures whether right-leaning and left-leaning communities follow an account, rather than the account’s own content, the labels do not necessarily reflect the politics of these people.
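
As a simplified illustration of that logic (not the paper’s exact scoring rules), here is a sketch that assigns a label from follow shares; the 10% threshold is a placeholder of mine, not the cutoff used in the study.

```python
def label_account(left_follow_share, right_follow_share, threshold=0.10):
    """Label an account by who follows it (exposure), not by its own content.

    left_follow_share / right_follow_share: fraction of users in left-leaning /
    right-leaning communities who follow the account.
    """
    followed_on_left = left_follow_share >= threshold
    followed_on_right = right_follow_share >= threshold
    if followed_on_left and followed_on_right:
        return "bipartisan"  # e.g. Obama, Trump, Clinton in our data
    if followed_on_left:
        return "influencer in left-leaning communities"
    if followed_on_right:
        return "influencer in right-leaning communities"
    return "not labeled"
```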

💬 FAQ

Here are responses for some questions I often receive (or anticipate receiving) about the research.

Why did you only make eight puppet accounts?

The community detection algorithm yielded eight major communities, and we created one puppet per community.

Eight accounts is too small a sample size, right?

The effects we observed, including fewer external links, lots of suggested tweets, and increased source diversity, were consistent across the eight puppets. In other words, we have eight accounts that provided evidence of the effects, and zero accounts that provided counter-evidence.

As I suspect many will point out, eight accounts is not enough to capture macro-level patterns on Twitter. But that was not the goal of the study. Rather, it was more like what The Markup calls “digital shoe-leather reporting,” kind of like getting the view from 10 feet rather than the view from 10,000 feet.

Even with the view from 10 feet, we present empirical evidence for a number of important patterns. Understanding how these patterns generalize to macro-level patterns would require a different approach and more resources.

Did the puppets like or click on tweets?

No, the puppets simply scrolled through their timelines twice per day. Twitter does personalize based on likes, clicks, and other behavior, so this is a limitation of our study.

Why not look at all tweets instead of just the first 50?

We did originally use all tweets from followed accounts as the baseline (similar to the “potential from network” baseline in the Facebook algorithm study by Bakshy et al.). However, the metrics of interest (e.g. external link rate) were indistinguishable between this baseline and the baseline comprising the most recent 50 tweets in the chronological timeline. Also, it made for a more appropriate comparison to look at 50 chronologically-curated tweets and 50 algorithmically-curated tweets.

Why did you only collect the first 50 tweets instead of 100 or more?

Whether people are looking at search results, advertisements, or social media posts, they tend to engage most with items at the top. For example, in a study analyzing the Facebook algorithm, Bakshy et al. reported that the click rate for the first item in the News Feed was around 20%, but for the 40th item it dropped to about 5%. This phenomenon is known as “position bias.” We decided to focus our study on the critical window of the first 50 tweets in the timeline, which users are much more likely to engage with.

I have another question…

Ask away in the comments or send me an email!


PhD student studying AI, ethics, and media. Trying to share things I learn in plain English. 🐦 @jackbandy