Simple Twitter analytics with twitter-nlp-toolkit

Eric Schibli
Towards Data Science
4 min read · Jul 7, 2020


Twitter is one of the richest sources of data for both business analysis and academic or pedagogical natural language processing; many of the top datasets on Kaggle were collected on Twitter, one of the most popular text embeddings was trained on Twitter data, and nearly every company actively monitors Twitter.

There are a number of reasons for this: Twitter has an accessible API, and its hashtag system makes data collection and sorting easy. Twitter is also an incredibly polarized and reactionary platform; when something is “trending on Twitter,” it is usually a product announcement blowing people’s minds, a new release being panned by critics, or somebody saying something inflammatory. This makes Twitter an extremely powerful source of information on public sentiment. My collaborators and I noticed that we were often tasked with collecting and analyzing Twitter data, so we decided to build a user-friendly package for doing so — twitter-nlp-toolkit. Here, I will show you how to use it by walking through a visualization I produced earlier this year, showing the impact of several of Elon Musk’s inflammatory comments on Twitter.

To begin, you will need to register for a Twitter API key, install the package, and install spaCy’s en_core_web_sm models. You can do the first easily here — you will simply be asked to provide a quick description of what you are doing that requires one — and the second two are even easier: simply run pip install twitter-nlp-toolkit and python -m spacy download en_core_web_sm in your terminal. twitter-nlp-toolkit requires tensorflow ≥2 and scikit-learn ≥0.22, so you may want to install it in a new virtual environment.

Listening

The Twitter API’s primary offering is a listener, which allows you to stream tweets to disk in real time. The search logic is documented here. First, save your Twitter API key to disk as keys.key in .json format in your working directory, like this:
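The key file is just the credentials Twitter issues you, saved as .json. The exact field names expected may depend on the package version, but the standard Twitter API credential names look like this (placeholders shown, not real keys):

```json
{
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET"
}
```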

Then you can build your listener. This will listen for tweets containing “Musk” and/or “Tesla”. (If you want to require both keywords, just set target_words = ['Musk Tesla'].)

This will continuously stream all incoming tweets containing ‘Musk’ and/or ‘Tesla’ to musk_tweets.json. This file can get quite large and contains a lot of likely extraneous information, so disk usage should be monitored. For example, this tweet:
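The keyword matching follows the Twitter streaming API’s standard “track” logic: separate list entries are OR’d together, while space-separated words inside a single entry must all appear. A self-contained sketch of that logic — an illustration only, not the toolkit’s internal code:

```python
def matches(tweet_text, target_words):
    """Return True if the tweet matches any target phrase.

    Each element of target_words is a space-separated set of words that
    must ALL appear (AND); separate elements are OR'd together.
    """
    text = tweet_text.lower()
    return any(
        all(word in text for word in phrase.lower().split())
        for phrase in target_words
    )

# 'Musk' OR 'Tesla' — matches
print(matches("Tesla stock is soaring", ["Musk", "Tesla"]))  # True
# 'Musk' AND 'Tesla' required in the same tweet — no match
print(matches("Tesla stock is soaring", ["Musk Tesla"]))     # False
```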

“My opinion of @elonmusk is now zero, kelvin.” — @AltPublicLands

is saved as over 10 kB of information:

Yikes. The parser will convert a .json file containing tweets to a much more manageable .csv file:
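Conceptually, the parsing step flattens each raw tweet object down to a handful of useful columns. A minimal sketch using only the standard library — the column choice here is illustrative, and the toolkit’s parser extracts more fields than this:

```python
import csv
import json

def parse_tweets(json_path, csv_path):
    """Flatten raw tweet JSON objects (one per line) into a small CSV."""
    with open(json_path) as f_in, open(csv_path, "w", newline="") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(["created_at", "user", "location", "text"])
        for line in f_in:
            tweet = json.loads(line)
            writer.writerow([
                tweet.get("created_at", ""),
                tweet.get("user", {}).get("screen_name", ""),
                tweet.get("user", {}).get("location", ""),  # free-text, often empty
                tweet.get("text", ""),
            ])
```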

Producing parsed_musk_tweets.csv:

(Because we don’t have a commercial Twitter API key, we don’t get to see the user’s exact location.)

We can also grab the last 200 or so of Musk’s own tweets using the bulk downloader tool:

This will produce tweets already parsed, in .csv format.
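The “200 or so” limit comes from the Twitter user-timeline endpoint, which returns at most 200 tweets per request; pulling a longer history means paging backwards with the max_id parameter. The paging loop can be sketched against a stubbed endpoint like this (fetch_timeline stands in for the real API call):

```python
def fetch_timeline(tweets, count=200, max_id=None):
    """Stub for the user-timeline endpoint: newest-first, 200 per page."""
    page = tweets if max_id is None else [t for t in tweets if t["id"] <= max_id]
    return page[:count]

def download_all(tweets):
    """Page backwards through a timeline using max_id."""
    collected, max_id = [], None
    while True:
        page = fetch_timeline(tweets, max_id=max_id)
        if not page:
            break
        collected.extend(page)
        max_id = page[-1]["id"] - 1  # continue below the oldest tweet seen
    return collected

timeline = [{"id": i} for i in range(500, 0, -1)]  # 500 fake tweets, newest first
print(len(download_all(timeline)))  # 500
```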

Sentiment Analysis

twitter_nlp_toolkit also includes a sentiment analysis package that can estimate the positivity or negativity of a tweet:

Code for sentiment classification

The predict() function produces binary predictions — 0 for negative, 1 for positive — while predict_proba() produces continuous predictions.
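Conceptually, the binary labels are just the continuous scores thresholded at 0.5; the exact behaviour may differ, but this mirrors the usual scikit-learn convention:

```python
import numpy as np

# hypothetical continuous scores from predict_proba(): 0 = negative, 1 = positive
scores = np.array([0.08, 0.47, 0.62, 0.91])

# binary labels as predict() would report them, assuming a 0.5 cutoff
labels = (scores >= 0.5).astype(int)
print(labels)  # [0 0 1 1]
```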

Our model realized that wasn’t a compliment

Plotting tweet volume and average sentiment as a function of time in Tableau produced the following visualization, showing that while Twitter users were somewhat alarmed by tweets downplaying the severity of COVID-19, they were much more alarmed by a tweet about Tesla’s high stock price.

Tweet volume, coloured by average sentiment, over time. Contrast was enhanced for visibility.

The user account location can be leveraged to produce an estimated geographic map as well. Here I used Tableau’s geocoding service; however, Tableau was able to interpret fewer than 20% of the user-provided locations, and some of these interpretations were implausible or inaccurate. Google Maps would likely do a much better job, but is rate-limited to about 2000 requests per day at the free tier:

It should also be noted here that the sentiment analysis is only about 85% accurate in our testing, and may be less accurate out in the wild; our sentiment analysis tools are only trained on text content using the Sentiment140 dataset, which is quite old. While we are working to improve the models, they should probably only be used for observing trends rather than categorizing individual tweets unless they can be fine-tuned on domain data.
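For trend analysis, averaging scores over time windows smooths out per-tweet errors: with roughly 85% accuracy, the mean sentiment of thousands of tweets per day is far more reliable than any single label. A sketch with pandas — the column names are hypothetical, not the toolkit’s output schema:

```python
import pandas as pd

# hypothetical parsed tweets with continuous sentiment scores
df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2020-03-06 09:00", "2020-03-06 15:00",
        "2020-03-07 10:00", "2020-03-07 18:00",
    ]),
    "sentiment": [0.9, 0.7, 0.2, 0.4],
})

# daily tweet volume and mean sentiment, ready to plot in Tableau
daily = df.set_index("created_at").resample("D")["sentiment"].agg(["count", "mean"])
print(daily)
```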

Feature Wishlist

The package is still under active development, but the functionality shown here should not break. Current priorities include improving the sentiment analysis models, improving the efficiency and customizability of the tweet parsing and language preprocessing, and integrating the Google Geocoding API.

All code in the package, including some more examples, is hosted on GitHub here. Readers are encouraged to contact the developers with questions or feature requests.
