Scraping Twitter data using Python for NLP

Learn how to build your own dataset from Twitter with a few lines of code.

Anshaj Khare
Towards Data Science

Photo by Joshua Aragon on Unsplash

“In God we trust. All others must bring data.” — W. Edwards Deming

If you’re starting out in the incredible field of NLP, you’ll want to get your hands dirty with real textual data that you can use to play around with the concepts you’ve learned. Twitter is an excellent source of such data. In this post, I’ll present a scraper that you can use to collect tweets on the topics you’re interested in, so you can get all nerdy once you’ve obtained your dataset.

I’ve used the twitter_scraper library for this. I’ll go over how to install and use it, and also suggest a way to make the entire process faster using parallelization.

Installation

The library can be installed with pip3 using the following command:

pip3 install twitter_scraper

Creating a list of keywords

The next task is to create a list of keywords to use for scraping Twitter. You’ll be searching Twitter for these keywords, so it is important to make a comprehensive list of the topics you’re interested in.
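As a minimal sketch, the keyword list can be a plain Python list. The keywords below are placeholders, and `clean_keywords` is a hypothetical helper (not part of any library) that strips whitespace and drops empty or duplicate entries while preserving order.

```python
def clean_keywords(raw_keywords):
    """Strip whitespace and drop empty or case-insensitive duplicate entries, keeping order."""
    seen = set()
    cleaned = []
    for keyword in raw_keywords:
        keyword = keyword.strip()
        key = keyword.lower()
        if keyword and key not in seen:
            seen.add(key)
            cleaned.append(keyword)
    return cleaned


# Placeholder topics; replace them with the ones you care about.
keywords = clean_keywords([
    "machine learning",
    "deep learning ",
    "NLP",
    "nlp",
    "",
    "data science",
])
print(keywords)
```

Cleaning the list up front avoids scraping the same topic twice under slightly different spellings.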

Scraping tweets for one keyword

Before scraping every keyword, we’ll run the scraper with a single keyword and print the fields available on each returned tweet. Iterating over the returned object gives us the following fields:

  1. Tweet ID
  2. Is a retweet or not
  3. Time of the tweet
  4. Text of the tweet
  5. Replies to the tweet
  6. Total retweets
  7. Likes to the tweet
  8. Entries in the tweet
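The field list above can be sketched as follows. This assumes that each tweet comes back as a dictionary with the keys `tweetId`, `isRetweet`, `time`, `text`, `replies`, `retweets`, `likes` and `entries`, and that the library exposes a `get_tweets(query, pages=...)` generator. Both are assumptions based on the library's README and may differ between versions. The sample tweet is made up so that the snippet runs without network access.

```python
from datetime import datetime

# Assumed key names; check them against the version of the library you install.
FIELDS = ["tweetId", "isRetweet", "time", "text", "replies", "retweets", "likes", "entries"]


def extract_fields(tweet):
    """Return only the fields we care about; missing keys become None."""
    return {field: tweet.get(field) for field in FIELDS}


# A made-up tweet standing in for one item yielded by the scraper.
sample_tweet = {
    "tweetId": "1234567890",
    "isRetweet": False,
    "time": datetime(2019, 6, 1, 12, 30),
    "text": "Learning NLP with real data!",
    "replies": 2,
    "retweets": 5,
    "likes": 17,
    "entries": {"hashtags": ["#nlp"], "urls": [], "photos": [], "videos": []},
    "extraKey": "ignored",
}

for field, value in extract_fields(sample_tweet).items():
    print(f"{field}: {value}")

# With the real library the loop would look roughly like this (assumed API):
#   from twitter_scraper import get_tweets
#   for tweet in get_tweets("machine learning", pages=1):
#       print(extract_fields(tweet))
```

Using `dict.get` rather than direct indexing keeps the loop from crashing if a tweet lacks one of the expected keys.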

Running the code sequentially for all keywords

Now that we’ve decided what kind of data we want to store from our object, we’ll run our program sequentially to obtain the tweets of topics we’re interested in. We’ll do this using our familiar for loop to go over each keyword one by one and store the successful results.
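A minimal sketch of that sequential loop is shown here. The scraper is passed in as a `fetch` callable so the example runs offline; `fake_fetch` is a hypothetical stand-in, and in real use you would pass the library's scraping function instead (assuming it accepts a keyword and a `pages` argument).

```python
def scrape_sequentially(keywords, fetch, pages=1):
    """Scrape each keyword in turn; return (results, failed_keywords)."""
    results = {}
    failed = []
    for keyword in keywords:
        try:
            # Materialise the iterator so errors surface inside the try block.
            results[keyword] = list(fetch(keyword, pages=pages))
        except Exception as error:  # scrapers fail in many ways; keep going
            print(f"Skipping {keyword!r}: {error}")
            failed.append(keyword)
    return results, failed


def fake_fetch(keyword, pages=1):
    """Hypothetical stand-in for the real scraper so the sketch runs offline."""
    if keyword == "broken":
        raise ValueError("simulated scraping failure")
    for page in range(pages):
        yield {"text": f"tweet about {keyword} (page {page + 1})"}


results, failed = scrape_sequentially(["nlp", "broken", "data science"], fake_fetch, pages=2)
print({keyword: len(tweets) for keyword, tweets in results.items()})
print(failed)
```

Only keywords that finish without an error end up in `results`; the rest are collected in `failed` so they can be retried later.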

Running the code in parallel

From the Python documentation:

Multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine.

First, we’ll implement a function to scrape the data.
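One way such a function might look is sketched below. The `fetch` parameter and the `fake_fetch` stub are hypothetical and exist only to keep the example runnable without network access; the tweet keys (`tweetId`, `text`, `likes`) are the same assumed names as in the field list earlier.

```python
def scrape_keyword(keyword, fetch, pages=1):
    """Scrape one keyword and return flat records; keeps whatever was collected before a failure."""
    records = []
    try:
        for tweet in fetch(keyword, pages=pages):
            records.append({
                "keyword": keyword,
                "tweet_id": tweet.get("tweetId"),
                "text": tweet.get("text"),
                "likes": tweet.get("likes"),
            })
    except Exception as error:
        # Report the problem but do not let one keyword stop the whole run.
        print(f"Failed to scrape {keyword!r}: {error}")
    return records


def fake_fetch(keyword, pages=1):
    """Hypothetical stand-in for the real scraper so the sketch runs offline."""
    for page in range(pages):
        yield {"tweetId": f"{keyword}-{page}", "text": f"tweet about {keyword}", "likes": page}


records = scrape_keyword("nlp", fake_fetch, pages=2)
print(records)
```

Tagging every record with its keyword makes it easy to merge the results of many keywords into one dataset later.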

Next, we’ll create subprocesses to run our code in parallel.
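A minimal sketch of this step, using `multiprocessing.Pool` from the standard library, could look like the following. The worker takes a single argument because `Pool.map` passes one item per call. `fake_fetch` is again a hypothetical offline stand-in for the real scraper, and the pool size of 4 is an arbitrary choice, not a recommendation.

```python
from multiprocessing import Pool


def fake_fetch(keyword, pages=1):
    """Hypothetical stand-in for the real scraper so the sketch runs offline."""
    return [{"text": f"tweet about {keyword} (page {page + 1})"} for page in range(pages)]


def scrape_keyword(keyword):
    """Worker: a module-level function so it can be sent to subprocesses."""
    try:
        return keyword, list(fake_fetch(keyword, pages=2))
    except Exception as error:
        print(f"Failed to scrape {keyword!r}: {error}")
        return keyword, []


def scrape_in_parallel(keywords, processes=4):
    """Distribute the keywords over a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # map() blocks until every keyword is done and preserves input order.
        return dict(pool.map(scrape_keyword, keywords))


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing this module.
    results = scrape_in_parallel(["nlp", "machine learning", "data science"])
    print({keyword: len(tweets) for keyword, tweets in results.items()})
```

Each keyword is handled by whichever worker is free, so the total run time depends on the number of processes and on how long the slowest keywords take.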

In this run, parallel execution cut the processing time to almost a quarter of the sequential run. You can use the same approach for similar tasks to make your Python code much faster.

I’m a data scientist and a coder. I use data for finding insights and for building smart data products.