Learn how to easily hydrate tweets

Using the Hydrator app and twarc tool by DocNow

Aruna Pisharody
Towards Data Science


1. Introduction

The growing use of social media as a communication tool has led to social media research gaining significant traction in recent years. Twitter is one such popular platform, offering both an easily accessible API for gathering data and numerous tools for analyzing the collected data. Demographics, sentiment, social trends: the types of information that can be gleaned from Twitter seem almost endless!

As with any data analysis, the first step is to collect relevant data (in our case, tweets). However, it is not always feasible to collect the tweets you need yourself (e.g., if you need historical tweets). Thankfully, given Twitter’s popularity, you may well find a suitable Twitter dataset online. Having said that, there is a catch: Twitter restricts the redistribution of Twitter content to third parties, so the datasets available online consist only of the tweet IDs of relevant tweets (so-called dehydrated tweets). We then have to hydrate these tweet IDs to obtain the actual tweet content.
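
To make this concrete, a dehydrated dataset is nothing more than a plain-text file with one numeric tweet ID per line; the IDs below are invented purely for illustration:

1212345678901234567
1212345678901234568
1212345678901235000

Hydrating sends these IDs back to the Twitter API, which returns the full JSON object (text, author, timestamps, and so on) for every tweet that is still publicly available; deleted or protected tweets are simply skipped.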

Restricted Use Cases for Twitter APIs (For more details, visit https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases)

I was looking into how to go about hydrating such datasets and came upon two tools provided by Documenting the Now (DocNow) for this purpose: a desktop application, Hydrator, and a command-line tool, twarc. I thought I’d write down my experience using these tools here in case it helps others looking to do the same.

2. Getting Started

Step 1: Find a dataset to work on!

I’ll be using the dataset provided by Chen et al. [1] for this article. If you only wish to work with this particular dataset, the authors already provide a script for hydrating the tweets they collected (written by Ed Summers). Since my main goal is to get familiar with the hydration tools themselves, I won’t be using their script. To get started, I decided to use only the tweet IDs for January 2020 from their dataset. Note: if you wish to download only specific folders from a GitHub repository, you can check out DownGit. However, always remember to cite the repository you’re using!

Step 2: Create a Twitter Developer Account

Working on anything related to Twitter requires an account on the Twitter Developer Portal. After signing up on the portal, you can create a new “Application” specifying the purpose of your research. All you have to do then is note down your consumer (API) key, consumer (API) secret, access token, access token secret, and bearer token (optional).

Step 3: Let’s get the files ready!

The Twitter dataset we’ll be using for this article follows the naming convention “coronavirus-tweet-id-YEAR-MONTH-DATE-HOUR”, and the tweet IDs we are interested in are distributed among 242 files! Therefore, we first need to combine all these tweet IDs into a single file. Since I’m using Windows, this can be done by simply typing copy *.txt merged_tweet_ids.txt in the command prompt. Looking at the merged file, we have around 11 million tweet IDs just for the month of January 2020!
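
If you are not on Windows (or prefer a platform-independent route), a short Python sketch does the same job. Note that the folder name january_2020 below is an assumption; point it at wherever you placed the downloaded ID files:

import glob

# gather every per-hour tweet ID file and append its contents to one merged file
with open('merged_tweet_ids.txt', 'w') as merged:
    for path in sorted(glob.glob('january_2020/*.txt')):
        with open(path) as f:
            merged.write(f.read())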

Now, we’re all set to start hydrating!

3. Hydrating Tweets

Tool 1: DocNow/Hydrator

Let’s begin by looking into the desktop application. You can download the latest release of the Hydrator app from the DocNow/hydrator GitHub releases page.

Hydrator app launch screen (image by author)

As you can see from the app’s launch screen, we’ll need to link our Twitter account to the app to retrieve tweet content. Clicking on “Link Twitter Account” opens a page in your browser, as shown in the image below. You just have to grant the Hydrator app permission to obtain a PIN. Enter this PIN in the app and you’re all set to start hydrating your tweet ID dataset (see images below for details).

Hydrator screen after clicking on “Link Twitter Account” (image by author)
Granting permissions to Hydrator (image by author)

Next, you’ll have to click on the “Add” tab in the app and select your tweet ID file. Once the file has been verified (you can also confirm the number of tweet IDs read by the app), you can optionally add information about this dataset (see image below). All that’s left now is to click on “Add Dataset”!

Adding a new Tweet IDs dataset (image by author)

And we’re all set! Now, all we have to do is click on “Start” (you can stop hydrating at any time) and the app will begin hydrating the tweet IDs. Note: all the hydrated tweets are stored in a .jsonl file that you specify when you click “Start” for the first time.

Start/Stop hydrating tweets (image by author)
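
Since the output is a .jsonl file (one JSON object per line), you can easily peek at the hydrated tweets with a few lines of Python. This is a minimal sketch; the file name hydrated_tweets.jsonl is an assumption, and the exact fields available (e.g., full_text vs. text) depend on your API settings:

import json

# each line in a .jsonl file is one complete tweet object
with open('hydrated_tweets.jsonl') as f:
    for i, line in enumerate(f):
        tweet = json.loads(line)
        # 'full_text' is present in extended mode; fall back to 'text' otherwise
        print(tweet.get('full_text', tweet.get('text')))
        if i == 4:  # preview only the first five tweets
            break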

Tool 2: DocNow/twarc

Let’s now move on to the command line tool, twarc. I’m working on a Windows PC and using Anaconda for this purpose. Let’s begin by creating a virtual environment for our project (and avoid any future package conflicts!).

First, let us navigate to our project folder using cd path. Next, we create a virtual environment inside this directory with the command conda create --name env_name python=3.10. Here, I have also specified the Python version to be installed in the environment (you can skip this, and Anaconda will install a default version of Python). We can now activate our virtual environment with conda activate env_name. Once the environment has been activated, you will see the environment name in parentheses, as shown in the image below.

Activated conda environment ‘twitter_env’ (image by author)

Next, let us install twarc inside our environment. I used pip to install the twarc package (pip install twarc). Now, there are two ways to go about hydrating tweets using twarc.

  1. You can use twarc directly from the command line: in this case, you’ll need to first configure the twarc tool with your credentials (type twarc configure; see figure below). Once configured, hydrating tweets is as simple as typing twarc hydrate merged_tweet_ids.txt > hydrated_tweets.jsonl.
twarc configure (image by author)

  2. You can use twarc as a library: for this method, you create an instance of Twarc as shown below.

from twarc import Twarc

t_inst = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

I prefer using the second method because:

(a) You can simply put your credentials in a .env file, which is safer (and, frankly, easier) than entering your credentials via the command line. All you have to do is create a .env file in your directory. A sample .env file in our case would look like this:

BEARER_TOKEN=BEARER_TOKEN
CONSUMER_KEY=CONSUMER_KEY
CONSUMER_SECRET=CONSUMER_SECRET
ACCESS_TOKEN=ACCESS_TOKEN
ACCESS_TOKEN_SECRET=ACCESS_TOKEN_SECRET

(b) You can add your hydrated tweets directly into a database, making it easier for you to explore the data.

If you also wish to follow the second method, here is a sample script that you can use:

# import the required libraries
import os
from dotenv import load_dotenv
from twarc import Twarc
from pymongo import MongoClient

# load your environment variables from the .env file
BASEDIR = os.path.abspath(os.path.dirname(__file__))
load_dotenv(os.path.join(BASEDIR, '.env'))

# read the credentials from the environment and pass them to the twarc instance
# (the names must match the keys in your .env file)
t_inst = Twarc(os.getenv("CONSUMER_KEY"),
               os.getenv("CONSUMER_SECRET"),
               os.getenv("ACCESS_TOKEN"),
               os.getenv("ACCESS_TOKEN_SECRET"))

# connect to the local MongoDB server once, outside the loop
# (Database: twitter_db, Collection: covid_tweets)
client = MongoClient('localhost', 27017)
db = client.twitter_db

# start hydrating tweets and storing them in the database
num_tweets = 0
for tweet in t_inst.hydrate(open('merged_tweet_ids.txt')):
    db.covid_tweets.insert_one(tweet)
    num_tweets += 1
    if num_tweets % 10 == 0:
        print('Number of tweets added into database: {}'.format(num_tweets))
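
While the script is running (or after it finishes), you can check on the collection directly. Here is a quick sketch, assuming MongoDB is running locally on the default port with the database and collection names used above:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.twitter_db

# count the tweets stored so far
print(db.covid_tweets.count_documents({}))

# fetch one stored tweet to inspect its structure
print(db.covid_tweets.find_one())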

4. My thoughts

If you simply need to hydrate a tweet ID dataset into JSON, the Hydrator app is the most straightforward approach. The option to start and stop hydrating at your convenience is a definite advantage as well. However, if you want more flexibility (e.g., working with databases) or wish to do more than simply hydrate tweet IDs (e.g., filter or search for tweets using the Twitter API), twarc appears to be the way to go. Additionally, twarc offers many other utilities (creating word clouds, tweet walls, and so on) that would be rather interesting to explore.

Of course, this is my first time using twarc, so I’m no authority on using the tool effectively. You can check out these links for more comprehensive guides on using twarc:

  1. https://scholarslab.github.io/learn-twarc/
  2. https://github.com/alblaine/twarc-tutorial
  3. https://ucsb-collaboratory.github.io/twarc/

And that’s a wrap for this article! We’re all set to start analyzing our hydrated tweets now.

Thanks for reading, and happy learning! :)

Reference:

[1] Chen, E., Lerman, K., & Ferrara, E. (2020). Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set. JMIR Public Health and Surveillance, 6(2):e19273. DOI: 10.2196/19273. PMID: 32427106.
