
How to Extract Data from Twitter through Cloud Computing

By using Python and the Tweepy library.

Photo by dole777 on Unsplash

Twitter is a great source of information for aspiring Data Analysts and Data Scientists. It is also a place of debate and polemics. Moreover, Twitter is a good resource for text data: it has an API, the credentials are easy to acquire, and there are a number of Python libraries available to help make calls to Twitter’s API.

A good way to build up your portfolio is to get your hands on natural language processing projects, but, as with every project, the first step is getting hold of the data and then saving it on your local computer or in the cloud, for instance by using Amazon Web Services (AWS).

Overall, the resources available for setting up this pipeline are great, but they have a few quirks that can easily cause headaches. In this article, I will show you how to set up a simple pipeline and help you navigate through it while avoiding those headaches. Consider this article your pregame Tylenol!

Sounds good, right? So hang on with me.

Over the next steps, I will demonstrate how to build the pipeline below: first collecting the data using Twitter’s API, then launching an EC2 instance, connecting it to Kinesis, and finally transferring the data into your S3 bucket (the data lake) on AWS by using a Python script.

Pipeline from Twitter to S3

Are you ready to master all of these?

Step 1: Set up your developer account on Twitter and get your credentials

Get yourself an account: a Twitter user account is required to have a "developer" account, so either register a new account or log in to an existing one; it is a piece of cake.

Create an app: register an app at https://developer.twitter.com/en/apps. After you set it up, go to "Keys and Tokens" to generate an access token and access token secret. Along with the consumer API keys (already generated), these are the credentials we will use to communicate with the Twitter API, and they are also what the Python script uses. I recommend saving the credentials in a text file, since we will need them in later stages.
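If you save them as simple KEY=value lines in a text file, a few lines of Python are enough to read them back later. This is just a sketch: the file name twitter_credentials.txt and the format are assumptions for illustration, not anything Twitter prescribes.

# Read Twitter credentials saved as KEY=value lines in a local text file.
# The file name and format are illustrative only.
creds = {}
with open("twitter_credentials.txt") as f:
    for line in f:
        if "=" in line:
            key, value = line.strip().split("=", 1)
            creds[key] = value

consumer_key = creds["CONSUMER_KEY"]
consumer_secret = creds["CONSUMER_SECRET"]
access_token = creds["ACCESS_TOKEN"]
access_token_secret = creds["ACCESS_TOKEN_SECRET"]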

Step 2: Create an AWS account

In this project, you will use three different AWS services to set up the pipeline: an EC2 instance, Kinesis (the data stream) and S3 (the data lake). You need to create an account and register your credit card, since Amazon charges based on your usage of the services. Regarding the cost, no worries: the services used in this project are inexpensive; in my project, I am spending less than CAN$5–6/month. If you need to use other services, I recommend checking their cost beforehand. It is always good practice to look at your bill once in a while so you don’t get surprised by it, and don’t forget to terminate a service or instance once you no longer need it.

First, in the EC2 management console, create a key pair and download the private key file, which you will use to connect to the EC2 instance. Also, set up the security group properly to allow SSH access from your IP in the next steps.

Next, go through the seven steps to launch an EC2 instance. In my project I used a simple t2.micro, which is enough to extract tweets; you only need one instance, and you can leave the storage at 8 GiB. If you want more details on how to set up an EC2 instance, I suggest the good article from Connor Leech below. Note that during your first year, an EC2 t2.micro instance is free up to a certain threshold.

How to Launch an Amazon EC2 Instance

An EC2 instance is a small virtual server in the cloud; in this project, your instance collects the data by running the Python script that extracts the data from Twitter (I will explain this below).

S3 is the data lake of this pipeline: for the newbies reading this article, it is the storage service where the raw data collected by the EC2 instance lands. In my project, I decided to leave my S3 bucket public in order to connect other software, such as Apache Spark, and reduce access issues. I am aware that this is not the best practice, since other people can access your data, but it is acceptable here because I am not working with sensitive data.

Amazon Kinesis is AWS’s highly scalable and durable real-time data delivery service; in this project, a Kinesis Data Firehose delivery stream acts as the "pipe" between the EC2 instance and S3. Therefore, during the setup of this service, you need to link it to your S3 bucket (the destination) and indicate how you want the collected data to be saved.

Once all the services are ready and linked, we’re ready to move to the next step.

Step 3: Understand the API and Tweepy

The documentation for the Twitter API is reasonably good, but it can be a bit confusing for beginners in the field. I had a few issues understanding how the API works, and I hope your mind will be clearer after reading this section.

You will find most of the information to conduct this project on the link below:

Tweet updates

Twitter’s API allows you to extract different data fields containing the tweet content, usernames, location, date and number of followers of each user. In this project, I will extract all these fields.

In order to extract the tweets, we’ll use the Tweepy package, a Python library that makes it easy to work with the Twitter API. First, you need to set up the API object with the authentication credentials that you previously created and stored in a text file. Calls to this object return Tweepy model class instances containing the data and some helper methods.

API Reference – tweepy 3.5.0 documentation

Photo by Luca Bravo on Unsplash

The first step in the script consists of loading the different Python libraries you will use: tweepy, json, boto3 and time. Then, you enter your Twitter credentials.
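As a minimal sketch, the top of the script can look like this (the placeholder strings need to be replaced with the keys generated in Step 1, or read from the text file you saved there):

# Libraries used by the script
import json
import time

import boto3
import tweepy

# Twitter credentials from Step 1 (placeholders, not real keys)
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"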

Next comes the code to parse the tweets and export them into a JSON/text file. The script is organized into two main parts.

The first part is called the listener. It is in this block that we extract the data from the API and print it. Here is the information the script will record for this project:

  • Username
  • User Screen Name
  • Tweet content
  • User followers count
  • User location
  • Geolocation
  • Tweet Time

Depending on the project you conduct, I encourage you to adapt this section to your own needs (for instance, by removing unwanted fields).

One curious aspect I learned during this project is that Twitter initially limited its users to tweets of only 140 characters, and in November 2017 it raised the limit to 280 characters.

To adapt the API to this change, Twitter added a new field called "extended_tweet", where the full content of a long tweet is delivered. So, if a tweet is longer than 140 characters and you want the whole text, you need to extract this field, whereas if the tweet fits within 140 characters you extract the original field, named "text". Not quite intuitive, right?

Now let’s go back to the script. To extract the extended tweet when present, I created an if/elif condition: if the payload has an "extended_tweet" field (which means the tweet is longer than 140 characters), extract its full text; otherwise, extract the standard "text" field.
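Here is a minimal sketch of what such a listener can look like. It assumes Tweepy 3.x (tweepy.StreamListener) and a boto3 Kinesis Data Firehose client passed in from outside; the field names come from the raw tweet JSON, and pushing each record to the delivery stream from inside the listener is only one possible way to wire it up, not the exact code from the full script linked below.

import json
import tweepy

class TweetListener(tweepy.StreamListener):
    """Collects the fields listed above and pushes each tweet to Firehose."""

    def __init__(self, firehose_client, delivery_stream_name, api=None):
        super().__init__(api)
        self.firehose = firehose_client
        self.stream_name = delivery_stream_name

    def on_data(self, raw_data):
        tweet = json.loads(raw_data)

        # Tweets longer than 140 characters arrive truncated; the full text
        # is delivered in the "extended_tweet" field instead of "text".
        if "extended_tweet" in tweet:
            text = tweet["extended_tweet"]["full_text"]
        elif "text" in tweet:
            text = tweet["text"]
        else:
            return True  # not a tweet payload (e.g. a delete notice), skip it

        record = {
            "user_name": tweet["user"]["name"],
            "user_screen_name": tweet["user"]["screen_name"],
            "text": text,
            "followers_count": tweet["user"]["followers_count"],
            "location": tweet["user"]["location"],
            "geo": tweet["geo"],
            "created_at": tweet["created_at"],
        }
        print(record)

        # Send the record to the Kinesis Data Firehose delivery stream as
        # newline-delimited JSON, which then lands in S3 as configured in Step 2.
        self.firehose.put_record(
            DeliveryStreamName=self.stream_name,
            Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
        )
        return True

    def on_error(self, status_code):
        # Disconnect if Twitter signals rate limiting.
        if status_code == 420:
            return False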

The second part handles authentication and the connection to the Twitter Streaming API; below that, you enter your AWS credentials (please note that hard-coding credentials is not recommended) and the name of the Kinesis delivery stream you set up in the previous step.
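A minimal sketch of this second part, reusing the credential variables and the TweetListener class from the sketches above (the region and delivery-stream name are placeholders to replace with your own):

import boto3
import tweepy

# Authenticate against the Twitter API with the credentials from Step 1
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Hard-coding AWS keys here works but is not recommended; attaching an IAM role
# to the EC2 instance lets boto3 pick up credentials automatically instead.
firehose = boto3.client(
    "firehose",
    region_name="us-east-1",
    aws_access_key_id="YOUR_AWS_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_AWS_SECRET_ACCESS_KEY",
)

# Wire the listener to the delivery stream created in Step 2
listener = TweetListener(firehose, "your-delivery-stream-name")
stream = tweepy.Stream(auth=auth, listener=listener)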

Then, you need to enter which keyword you want to track and collect, and in which language. In my project, I decided to follow tweets containing #giletsJaunes in English and in French. You are able to track multiple hashtags.
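Continuing the sketch above, the tracking itself is a single call; the hashtag and languages below are just the ones from my project, so swap in your own:

# Collect tweets containing the hashtag, in English and French
stream.filter(track=["#giletsjaunes"], languages=["en", "fr"])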

Here is the link to the full script.

Step 4: Run your code on AWS by using screen

Now that the pipeline and the Python script are ready, you can run the script on the EC2 instance you created in Step 2.

Start by logging in to the EC2 instance with the command below:
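For example, assuming an Amazon Linux instance, a key file named my-key.pem and your instance’s public DNS name (both placeholders):

ssh -i my-key.pem ec2-user@<your-ec2-public-dns>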

Then, install a few packages on your EC2 instance (python3 and tweepy, plus boto3, which the script also imports) so the Python script runs smoothly. Next, upload the Python script from your local machine to the EC2 instance with the command below in your terminal:
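For example, assuming Amazon Linux again: the first two commands run on the instance, while the scp command runs from your local machine (key file and DNS name are placeholders).

sudo yum install -y python3
pip3 install --user tweepy boto3
scp -i my-key.pem my_script.py ec2-user@<your-ec2-public-dns>:~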

Since you need to run the script for a long period of time, you will probably want to open a "screen" session on your EC2 instance to run the job in the background. This allows you to close your terminal while the script keeps running unattended.

Here are the main commands to use screen:
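A minimal set of commands (the session name "twitter" is just an example):

screen -S twitter        # start a new named session
# press Ctrl+A, then D   # detach and leave it running in the background
screen -ls               # list running sessions
screen -r twitter        # reattach to the session later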

So, once your "screen" is set up you will just need to run your script, by entering the command below:

python3 my_script.py

Once you enter it, you should see all the tweets being collected by your EC2 instance, which will be saved in your S3.

Step 5: Monitor your pipeline

By logging in to the AWS console, you will be able to see the activity of your pipeline. A cool feature is the "Monitoring" tab in Kinesis.

Incoming Records in Kinesis Firehose Delivery Streams

In this plot, you can see the tweet collection activity over a period of time. In my project, it looks like the hashtag #giletsjaunes is more active on Saturday and Sunday than during the week. You can also check the files created in your S3 data lake by downloading them to your local machine and looking at the saved tweets.

Now you are ready to move on to the next step and analyze the data on your local machine, with another AWS service, or with Apache Spark.

