
How to monitor a political movement on Twitter using AWS

From #giletsjaunes on Twitter to a dashboard on Kibana

In a previous article published earlier this year, How to Extract Data from Twitter through Cloud Computing, I explained how to set up a pipeline in AWS (Amazon Web Services) from the Twitter API to a data lake (S3). The aim of this new article is to cover the different services used to create a near-live dashboard and why I use them. It requires basic knowledge of IT and AWS.

Feel free to contact me if you have any questions. Let’s start now!

The Pipeline

First, here is the full pipeline I will cover in this article.

The Pipeline – From Twitter to Kibana – Diagram by Author

Twitter API

The data must come from a "data source", which here is the Twitter API (Application Programming Interface). To access the API, the first step is to create a developer account on the platform here, and for the API to authenticate me as a user, I use an access token and a secret token.
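As a rough illustration, the authentication step might look like this with the tweepy library (tweepy is an assumption here, the official API can also be called directly over HTTP, and the credential values are placeholders from the developer portal):

```python
import tweepy

# Placeholder credentials copied from the Twitter developer portal
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
ACCESS_TOKEN = "your-access-token"
ACCESS_TOKEN_SECRET = "your-access-token-secret"

# OAuth 1a: the consumer keys identify the app,
# the access tokens identify the user it acts on behalf of
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)
print(api.verify_credentials().screen_name)  # sanity check: prints your handle
```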

Photo by ev on Unsplash

EC2 instance

The service that I virtually plug into Twitter is an EC2 instance. Using a Python script and the tokens, I capture all the tweets containing #giletsjaunes in this case. The benefits of EC2 are that capacity can be scaled up or down depending on the Twitter API activity, and it is also very easy to set up and access through a terminal on a local machine, like a desktop. For this case, I use the smallest EC2 instance available, the t2.micro, which is free under the AWS free tier.
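A minimal sketch of what the capture script running on the EC2 instance could look like, assuming tweepy's streaming API (class names differ slightly between tweepy versions, and `auth` is the handler from the previous snippet):

```python
import json
import tweepy

class HashtagListener(tweepy.StreamListener):
    """Receives every tweet matching the tracked terms."""

    def on_data(self, raw_data):
        tweet = json.loads(raw_data)
        # In the real pipeline the tweet would be forwarded to Kinesis here
        print(tweet.get("text", ""))
        return True

    def on_error(self, status_code):
        # Stop streaming if Twitter starts rate limiting us
        return status_code != 420

stream = tweepy.Stream(auth=auth, listener=HashtagListener())
stream.filter(track=["#giletsjaunes"])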

Kinesis – Bring the pipe

Photo by Quinten de Graaf on Unsplash

Once tweets are collected by the EC2 instance, I need to store them somewhere, as EC2 is not designed to store data. Thus, I use a data lake: S3 in AWS. To move the data from the EC2 instance, I use another service called Amazon Kinesis Data Streams. This service acts as a pipe and delivers the data to S3. When setting up Kinesis, I also decide how I would like the data to be organized and saved into S3.
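To give an idea of the producer side, a tweet can be pushed into a Kinesis stream with boto3 roughly like this (the stream name and region are placeholders; the delivery to S3 itself is handled by the Kinesis configuration, not by this code):

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def send_tweet(tweet: dict) -> None:
    """Push one tweet into the Kinesis stream feeding the S3 data lake."""
    kinesis.put_record(
        StreamName="giletsjaunes-stream",        # placeholder stream name
        Data=json.dumps(tweet).encode("utf-8"),
        PartitionKey=str(tweet.get("id", "0")),  # spreads records across shards
    )
```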

S3 – The Data Lake

Now moving to the next service. As mentioned before, the tweets are stored in an S3 bucket; a data lake is a centralized repository. In S3, I can store structured or unstructured data, and there are no capacity limits. I can store all kinds of data (websites, mobile apps, social media, corporate applications) without careful design or needing to know which questions I will want to answer in the future, and the price is very low. S3 was perfect for my usage: when I first started this project, I was not sure how I would use the data I was collecting or which services to use next, so keeping everything in S3 as a backup is great, as I can now run different projects (real-time analysis vs. historical analysis).

For example, in the future I could use the data stored in S3 to run a Glue job to structure it, then analyze it with Athena and publish the results into another dashboard (or the same one). Isn't that a good idea?
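As a purely hypothetical sketch of that idea: once a Glue crawler had catalogued the tweets, they could be queried from Athena through boto3 (the database, table, and output location below are made up for the example):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Count tweets per language from the (hypothetical) Glue table over the S3 data
response = athena.start_query_execution(
    QueryString="""
        SELECT lang, count(*) AS tweets
        FROM tweets_giletsjaunes
        GROUP BY lang
        ORDER BY tweets DESC
    """,
    QueryExecutionContext={"Database": "twitter_datalake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])  # the results land in the output bucket
```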

Photo by Matthew Fournier on Unsplash

AWS Lambda – The powerful Python script

Next, to get the data into a dashboard, I use AWS Lambda. This service picks up each tweet arriving in the S3 bucket and moves it to the next service, ElasticSearch. AWS Lambda runs two Python scripts that I have written. The service performs the following:

  • It is triggered once a new file arrives in S3 and copies the data into the S3 bucket I set up as a target.
  • It parses the data and sends it to ElasticSearch following a specific structure, so that it can be used there.

At first, I had a lot of issues understanding how to use this service and the associated Python scripts; I even took a course fully focused on AWS Lambda, but with the help of the internet I managed to make it run. This service is entirely based on the script(s) the user provides, so AWS Lambda can be used for many purposes. You can check the scripts used here.
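Since the actual scripts are in the repository linked above, here is only a simplified, hypothetical version of what such a handler can look like: triggered by an S3 event, it reads the new file and indexes each tweet into ElasticSearch. The endpoint, index name, and field choices are placeholders, and I use the generic requests library (which needs to be packaged with the function) rather than a dedicated ES client:

```python
import json
import boto3
import requests

s3 = boto3.client("s3")

# Placeholder values for the example
ES_ENDPOINT = "https://my-es-domain.us-east-1.es.amazonaws.com"
ES_INDEX = "tweets"

def lambda_handler(event, context):
    """Triggered by an S3 event: read the new file and index each tweet into ES."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Each line of the file is assumed to be one JSON-encoded tweet
        for line in body.strip().splitlines():
            tweet = json.loads(line)
            doc = {
                "id": tweet["id_str"],
                "text": tweet.get("text", ""),
                "user": tweet["user"]["screen_name"],
                "location": tweet["user"].get("location"),
                "timestamp": tweet["timestamp_ms"],
            }
            requests.post(
                f"{ES_ENDPOINT}/{ES_INDEX}/_doc/{doc['id']}",
                json=doc,
                timeout=10,
            )
    return {"statusCode": 200}
```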

AWS Lambda, as you can see, is flexible, you can customize it to your needs, and the price is really low. For this service, AWS charges your account based on the number of invocations and their execution time.

ElasticSearch – "Open source" search and analytics

ElasticSearch, or ES, is the last service I use in this project. Originally a search engine, ES is able to analyze petabytes of data and is much easier to use than Apache Spark. You can use ES for log analytics, application monitoring, or security monitoring; for example, Adobe uses ES to track error rates and Expedia for price optimization. This service is managed by AWS, so there is no maintenance to perform. However, it is not a "serverless" service, meaning that I decide which cluster I want to use.

Photo by Andrew Neel on Unsplash

For the visualization, I use Kibana, a tool "sitting" on top of ES. As this service is powerful and can ingest tons of data, it is not cheap: AWS charges for the cluster even when no data is going through, so watch out for your bill! Do not forget to terminate clusters you are not using.

Kibana – The dashboard

Now that the pipeline is ready, the data flows into Kibana and is ready to be used. The first stage is to create an index pattern to tell Kibana how the data is structured, then create meaningful visualizations and add them to a dashboard.
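For illustration, a minimal mapping for the tweet index might look like the sketch below (the field names mirror the Lambda example above and are assumptions; the index pattern itself is then created in the Kibana UI):

```python
import requests

ES_ENDPOINT = "https://my-es-domain.us-east-1.es.amazonaws.com"  # placeholder

# Minimal mapping: a date field for the time filter and keyword
# fields so users and locations can be aggregated in visualizations
mapping = {
    "mappings": {
        "properties": {
            "timestamp": {"type": "date", "format": "epoch_millis"},
            "text": {"type": "text"},
            "user": {"type": "keyword"},
            "location": {"type": "keyword"},
        }
    }
}

requests.put(f"{ES_ENDPOINT}/tweets", json=mapping, timeout=10)
```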

As I hadn't used Kibana before, I took a short course on Udemy for $15 CAN. Note that most Udemy courses drop to this price every once in a while; put them on your wishlist and wait for the discount. Here is the course if you are interested: Data Visualization with Kibana.

Let's check the data from last week (Dec 7 – Dec 14). The first plot shows the number of tweets per hour. As I would expect, #giletsjaunes is used during the day and more on the weekend, when I guess people have more free time to tweet.

The second plot, on the right, is a word cloud of user locations. Twitter users can indeed set a location on their profile. I see here that the #giletsjaunes hashtag is used first in "France" and then in "Paris". As the location is free text, some users entered exotic places like "France ALLEZ L'OM" or "La Réole 33190", all French users I think =).

(Left) – Hourly count of tweets in the last week, screenshot by Author – (Right) Cloud of words of the locations, screenshot by Author

Moving on to the next plots, I can look at the data from the last two weeks to observe a trend.

The plot on the left shows the daily count, with each color representing a location. Looking at today's data (December 13, 2020), I see that "Belgium" has been really active, and I found out that someone there tweeted more than 100 times today.

The plot on the right shows the top 10 users during this period, along with their number of followers. The first user is DessaigneLamber (193 tweets), followed by 4Tchat (113 tweets) and Sylvinfo (79 tweets). 4Tchat is the user that tweeted from Belgium, and if you go to its profile, you see that it is a Twitter bot.

(Left) – Daily count per location, screenshot by Author – (Right) Top 10 users, screenshot by Author

As you now understand, AWS Lambda and ElasticSearch are two powerful services. I hope this post gave you an overview of how to set up a dashboard, which services are needed, and why I used them. This is the first time I have set up a fully automated pipeline; at first it seems pretty simple, but using AWS Lambda was a bit of a struggle. In case you wonder about the cost: under the free tier, this pipeline will cost you around $6 CAN/month, and around $40 CAN/month with a regular account; ElasticSearch represents 90% of this cost.

I hope you had some fun reading this article. If you made it this far, you can also check the GitHub repo and try to run this pipeline yourself, and enjoy working with big data and building pipelines.
