Building a network graph from Twitter data

Writing Java apps to collect Twitter data and visualise it as a graph.

Mananai Saengsuwan
Towards Data Science

--

In this article, we will build a small data science project. We collect data from Twitter because it has an enormous amount of data and makes it easy to access through its API. We prefer Java because it’s a compiled language with a strong concurrency library. Finally, we summarise the data using Gephi, an open-source graph platform.

Graph generated by using Gephi from sample tweets on 9/14/20. Language was TH

We need the following to do the project:

  • Java IDE. Our choice is Eclipse.
  • Twitter4j libraries. Get the jar files and the tutorial from here.
  • Twitter developer account. We need this in order to call the Twitter API. There are a few resources that explain how to get access.
  • Any JDBC-compliant database. We use SQLite. It’s very lightweight: no software installation required, no daemon process; just copy the SQLite jar file into the project. However, there are some limitations that require workarounds.
  • Gephi which is an open source graph tool. Download it from here.

By the way, readers could use whatever language or platform they like, such as Python or Node.js. Our sample code is in Java.

The following are the steps to build the Twitter network graph:

  • Collect tweets and users and save them into a database.
  • Retrieve users’ friends. From the list of users saved in the previous step, get their friends and save them into tables.
  • Filter for the data we’d like to see in the graph.
  • Export the data to CSV files.
  • Import the CSV files into Gephi and do some formatting and layout. We will get a Twitter social graph.

Collect tweets and users

For the first step, we collect sample tweets and then write those to tables. To do this:

  • Create a Twitter stream object and sample the stream. The API provides a random subset of all tweets.
  • For each tweet received, submit a callable task to an executor service. The task will do database operations and/or further processing.

Below is the code:
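A minimal sketch of this setup, assuming Twitter4J’s TwitterStream and StatusAdapter APIs (credentials picked up from twitter4j.properties); saveTweet is a hypothetical persistence helper:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

public class TweetCollector {

    public static void main(String[] args) {
        // SQLite allows only one writer at a time, so a single-thread
        // executor serialises all database work.
        ExecutorService dbExecutor = Executors.newSingleThreadExecutor();

        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // Hand the tweet off immediately so the stream thread
                // never blocks on the database.
                dbExecutor.submit(() -> {
                    saveTweet(status); // hypothetical persistence helper
                    return null;
                });
            }
        });
        stream.sample(); // random subset of all public tweets
    }

    static void saveTweet(Status status) {
        // insert the tweet and its user into the tables
    }
}
```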

The code in the callable task saves tweets and related objects, such as users, to tables. By using an executor service, we decouple tweet reception from the database work, so even if tweets sometimes arrive faster than the database can process them, our app will not miss anything. Also, since we use an SQLite database, and only one write can happen at a single moment, the executor service must be a single-thread executor. The following is a part of the code of the task:
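A sketch of what the task body might look like, assuming a shared JDBC Connection and a Tweet table alongside the User table shown later; INSERT OR IGNORE is SQLite’s way of skipping duplicate rows:

```java
// Inside the callable task. `conn` is the shared JDBC connection to SQLite.
void saveTweet(Connection conn, Status status) throws SQLException {
    twitter4j.User author = status.getUser();
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT OR IGNORE INTO User (id, name, screen_name) VALUES (?, ?, ?)")) {
        ps.setLong(1, author.getId());
        ps.setString(2, author.getName());
        ps.setString(3, author.getScreenName());
        ps.executeUpdate();
    }
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT OR IGNORE INTO Tweet (id, user_id, text) VALUES (?, ?, ?)")) {
        ps.setLong(1, status.getId());
        ps.setLong(2, author.getId());
        ps.setString(3, status.getText());
        ps.executeUpdate();
    }
}
```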

Retrieve users’ friends

From the previous step, we have a list of users whose friends we’d like to know. The Twitter API returns the friend IDs of a specified user, but no more than 5,000 IDs per request, so we need to call it multiple times if a user has more friends than that. Also, Twitter has rate limits: it allows only 15 requests per 15-minute window, i.e. basically 1 request per minute.

So far we only have friend IDs; we need other API calls to convert those user IDs into user objects. Twitter provides an API for this, too. Each request can query up to 100 user IDs, and the rate limit is 300 requests per 15-minute window, i.e. 20 requests per minute.

More details of Twitter API and rate limits here.

To handle the rate limits effectively, we will have two threads. The first thread invokes the friend IDs query; the second does the user lookup. The friend finder thread passes user IDs to the user lookup thread via a blocking queue. Basically, we use the producer-consumer pattern here.

Friends Lookup Thread

The following code is a partial listing of FriendsLookupRunnable.
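A hedged sketch of getFriendIds, assuming Twitter4J’s cursored getFriendsIDs API; twitter and waitForRateLimit are assumed members of the runnable, with waitForRateLimit enforcing the one-minute gap:

```java
// Fetch all friend IDs of one user, paging through the cursored API.
List<Long> getFriendIds(long userId) throws InterruptedException {
    List<Long> friendIds = new ArrayList<>();
    long cursor = -1; // Twitter's "start from the beginning" cursor
    while (true) {
        try {
            waitForRateLimit(); // sleep so calls are >= 1 minute apart
            IDs ids = twitter.getFriendsIDs(userId, cursor);
            for (long id : ids.getIDs()) {
                friendIds.add(id);
            }
            if (!ids.hasNext()) break;
            cursor = ids.getNextCursor();
        } catch (TwitterException e) {
            if (e.exceededRateLimitation()) continue; // rare: just retry
            break; // e.g. unauthorised (401) for protected users: skip
        }
    }
    return friendIds;
}
```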

Some key points:

  • The run method of this runnable polls a user ID from a blocking queue of user IDs to process.
  • For each ID, it calls the getFriendIds method. This method returns a list of friend IDs. Each user ID and friend ID pair is inserted into the User_Friend table.
  • The resulting friend IDs are also put into another blocking queue, to be retrieved and processed by the other thread.
  • The getFriendIds method keeps track of when it was last called and ensures there is enough delay between calls (1 minute) by using Thread.sleep().
  • Even so, in very rare cases a rate-limit-exceeded exception still occurs. So we catch TwitterException and check the exception status code; if the rate limit was indeed exceeded, we simply retry the query.
  • There are some other exceptions. For instance, when a user is protected, the Twitter API returns an unauthorised error.
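The delay bookkeeping described in the bullets above can be factored into a tiny reusable throttle; a minimal sketch (the class name is ours):

```java
// Guarantees at least `delayMillis` between successive await() calls.
public class Throttle {
    private final long delayMillis;
    private long lastCall = 0;

    public Throttle(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    public synchronized void await() throws InterruptedException {
        long wait = lastCall + delayMillis - System.currentTimeMillis();
        if (wait > 0) Thread.sleep(wait);
        lastCall = System.currentTimeMillis();
    }
}
```

The friends lookup thread would use new Throttle(60_000); the users lookup thread, new Throttle(3_000).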

The following is the command to create User_Friend table which stores result of the 1st thread:

CREATE TABLE User_Friend (
    user_id   INT (8),
    friend_id INT (8),
    PRIMARY KEY (user_id, friend_id)
);

Users Lookup Thread

The following code is the UsersLookupRunnable class.
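A hedged sketch of the lookup step, assuming Twitter4J’s lookupUsers API; twitter, waitForRateLimit, and insertUser are assumed members of the runnable:

```java
// Look up one batch of at most 100 user IDs and persist the results.
void lookupAndSave(long[] batch) throws TwitterException, InterruptedException {
    waitForRateLimit(); // sleep so calls are >= 3 seconds apart
    ResponseList<User> users = twitter.lookupUsers(batch);
    for (User user : users) {
        insertUser(user); // INSERT into the User table
    }
}
```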

Here are some key points:

  • In the run method, there is a while loop to retrieve user IDs from the queue. It then calls the lookupUsers method to do the actual lookup.
  • Since the Twitter lookupUsers API can handle no more than 100 user IDs at a time, we chop the user ID array into arrays of 100 or fewer elements before calling the Twitter API.
  • The lookupUsers method keeps track of when it was last called and ensures there is enough delay between calls (3 seconds) by using Thread.sleep().
  • The method returns a list of users, which is inserted into the User table. The structure of the table mirrors the Twitter User object.
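The chopping into batches of at most 100 IDs can be sketched as a small helper (names are ours):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Chop an ID array into chunks of at most `size` elements, so each
// chunk fits into a single lookupUsers call (limit: 100 IDs).
public class IdChunker {
    public static List<long[]> chunk(long[] ids, int size) {
        List<long[]> chunks = new ArrayList<>();
        for (int from = 0; from < ids.length; from += size) {
            int to = Math.min(from + size, ids.length);
            chunks.add(Arrays.copyOfRange(ids, from, to));
        }
        return chunks;
    }
}
```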

The following is the command to create User table which stores the result of the 2nd thread:

CREATE TABLE User (
    id              INT (8)       PRIMARY KEY,
    name            VARCHAR (100),
    screen_name     VARCHAR (100),
    description     VARCHAR (255),
    email           VARCHAR (50),
    favorites_count INT,
    followers_count INT,
    friends_count   INT,
    statuses_count  INT,
    lang            VARCHAR (10),
    location        VARCHAR (255),
    url             VARCHAR (255),
    imageurl        VARCHAR (255),
    is_protected    INT (1),
    is_verified     INT (1),
    created         VARCHAR (20),
    last_modified   VARCHAR (20)
);

The main method does the following:

  • Set up the database connection.
  • Create two blocking queues.
  • Prepare the user IDs list and add it to the first blocking queue.
  • Create the two runnables and two threads.
  • Start the two threads.
  • Add a shutdown hook, so that when the process gets killed, both threads are interrupted.
  • Wait until both threads finish.
  • Clean up the database.

The code should look something like this:
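A skeleton of the main method under the assumptions above; the two worker bodies are stubs standing in for FriendsLookupRunnable and UsersLookupRunnable:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class FriendGraphMain {

    public static void main(String[] args) throws Exception {
        // 1. Set up the database connection (SQLite via JDBC), omitted here.

        // 2. Two blocking queues: seed user IDs in, friend IDs out.
        BlockingQueue<Long> userIdQueue = new LinkedBlockingQueue<>();
        BlockingQueue<Long> friendIdQueue = new LinkedBlockingQueue<>();

        // 3. Seed the first queue with the users collected earlier.
        for (long id : new long[] {1001L, 1002L}) {
            userIdQueue.put(id);
        }

        // 4./5. Create and start the two worker threads. In the real app
        // these run FriendsLookupRunnable and UsersLookupRunnable.
        Thread friendFinder = new Thread(() -> {
            // drain userIdQueue, query friend IDs, fill friendIdQueue
        });
        Thread userLookup = new Thread(() -> {
            // drain friendIdQueue, look up users, insert into User table
        });
        friendFinder.start();
        userLookup.start();

        // 6. Shutdown hook: if the process is killed, interrupt both
        // threads so they can stop cleanly.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            friendFinder.interrupt();
            userLookup.interrupt();
        }));

        // 7. Wait for both threads to finish.
        friendFinder.join();
        userLookup.join();

        // 8. Clean up / close the database connection, omitted here.
    }
}
```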

Filter the data (Optional)

Sometimes, we’d like to see only part of the whole data. This is quite simple to do because the data is already in SQL tables. Let’s say we’d like to see how the top 100 users with the most followers in our sample tweets follow each other. Here is what to do:

  • Create tables for storing the results. Below are the SQL statements used:
CREATE TABLE Graph_Friend_Edge (
    Source INT,
    Target INT
);
CREATE TABLE Graph_Friend_Node (
    id    INT PRIMARY KEY,
    label VARCHAR (50),
    name  VARCHAR (100)
);
  • Populate the edge table with only the top users. The following is the SQL:
insert into graph_friend_edge (source, target)
select user_id, friend_id from user_friend
join user u1 on friend_id = u1.id
join user u2 on user_id = u2.id
where user_id in
    (select friend_id from user_friend
     group by friend_id order by count(*) desc limit 100)
and friend_id in
    (select friend_id from user_friend
     group by friend_id order by count(*) desc limit 100);
  • Then, populate the node table with this SQL:
insert into graph_friend_node (id, label, name)
select n.id, u.screen_name, u.name
from
    (select source id from graph_friend_edge
     union
     select target id from graph_friend_edge) n
join user u on n.id = u.id;

Export data to CSV files

This part is simple. Use a database tool to export the data to CSV files.

  • Export the user_friend table to an edge CSV file.
  • Export the user table to a node CSV file.

If you filtered in the previous step, export the graph_friend_edge and graph_friend_node tables instead.

Create a network graph

Gephi is an open-source graph analysis and visualisation tool. There are many Gephi tutorials out there; take a look here. For a tutorial on importing CSV files, see here.

The following is an overview of the steps in our project:

  • Open Gephi. Create a new project.
  • Import the edge and node CSV files. The initial graph may look unimpressive, like this:
Top 100 users’ friends graph before applying a layout, from sample tweets on 9/14/20. Language was TH

We need to show node labels, configure node sizes and colours, and apply a layout:

  • Enable node labels
  • Configure node size and label size proportional to in-degree (number of incoming edges)
  • Set the layout to “ForceAtlas2” and run it.
  • Run a community detection algorithm.
  • Set node colour according to Modularity Class. This colours each node according to the community it was detected in.

After these are done, the graph looks more meaningful:

  • User’s screen names show up as node labels.
  • Nodes with more followers in this group appear bigger.
  • Edges (arrow lines) represent follow relationships.
  • Nodes with the same colours are in the same communities according to the graph algorithm.
Top 100 users’ friends graph generated by using Gephi from sample tweets on 9/14/20. Language was TH

Conclusions

We have built Java applications to collect tweet, user, and friend data from Twitter and put it into a relational database. We filtered the data, then imported it into Gephi, a graph platform and visualisation tool, and produced a social network graph.

This is a very small part of what we could do with the Twitter data. Gephi has a lot more to offer, and there are more graph analytics platforms out there. Neo4j, for example, would allow us to store the data in its database and run graph algorithms on it.
