Building a network graph from Twitter data
Writing Java apps to collect Twitter data and visualising it as a graph.
In this article, we will build a data science project. We collect data from Twitter because it holds an enormous amount of data and provides an API to access it. We use Java because it is a compiled language with a strong concurrency library. Finally, we visualise the data using Gephi, an open-source graph platform.
We need the following for this project:
- Java IDE. Our choice is Eclipse.
- Twitter4j libraries. Get the jar files and the tutorial from here.
- Twitter developer account. We need this in order to call the Twitter API. There are a few resources that explain how to get access.
- Any JDBC-compliant database. We use SQLite. It’s very lightweight: no software installation required, no daemon process, just copy the SQLite JDBC jar file into the project. However, there are some limitations that require workarounds.
- Gephi which is an open source graph tool. Download it from here.
By the way, readers could use whatever language or platform they like, such as Python or Node.js. Our sample code is in Java.
The following are the steps to build the Twitter network graph:
- Collect tweets and users and save them into a database.
- Retrieve users’ friends. From the list of users saved in the previous step, get their friends and save them into tables.
- Filter for the data we’d like to see in the graph.
- Export the data to CSV files.
- Import the CSV files into Gephi and apply some formatting and layout. We will get a Twitter social graph.
Collect tweets and users
For the first step, we collect sample tweets and then write them to tables. To do this:
- Create a twitter stream object. Sample the stream. The API provides a random subset of all tweets.
- For each tweet received, submit a callable task to an executor service. The task will do database operations and/or further processing.
Below is the code:
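The original snippet is not included here; below is a minimal sketch using Twitter4j. The database file name and the SaveTweetTask callable (which performs the inserts) are assumptions of this sketch.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

public class TweetCollector {
    public static void main(String[] args) throws Exception {
        // SQLite allows only one writer at a time, so the executor must be single-threaded
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Connection conn = DriverManager.getConnection("jdbc:sqlite:twitter.db");

        // OAuth credentials are read from twitter4j.properties on the classpath
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // Hand off each tweet to the executor so stream handling never blocks on the database
                executor.submit(new SaveTweetTask(status, conn));
            }
        });
        stream.sample(); // the sample endpoint delivers a random subset of all public tweets
    }
}
```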
The code in the callable task saves tweets and related objects, such as users, to tables. By using an executor service, we decouple tweet processing from database-related tasks. Even if tweets sometimes arrive faster than the database can process them, our app will not miss anything. Also, since we use an SQLite database and only one write can happen at a single moment, the executor service must be a single-thread executor. The following is part of the code of the task:
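The original listing is missing; here is one possible shape of the task, using JDBC. The Tweet table and its columns are assumptions of this sketch (the article only shows the User and User_Friend schemas).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.concurrent.Callable;

import twitter4j.Status;
import twitter4j.User;

public class SaveTweetTask implements Callable<Void> {
    private final Status status;
    private final Connection conn; // shared connection; safe because the executor is single-threaded

    public SaveTweetTask(Status status, Connection conn) {
        this.status = status;
        this.conn = conn;
    }

    @Override
    public Void call() throws Exception {
        User user = status.getUser();
        // Upsert the author first so the tweet row has a matching user
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT OR IGNORE INTO User (id, name, screen_name) VALUES (?, ?, ?)")) {
            ps.setLong(1, user.getId());
            ps.setString(2, user.getName());
            ps.setString(3, user.getScreenName());
            ps.executeUpdate();
        }
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO Tweet (id, user_id, text) VALUES (?, ?, ?)")) {
            ps.setLong(1, status.getId());
            ps.setLong(2, user.getId());
            ps.setString(3, status.getText());
            ps.executeUpdate();
        }
        return null;
    }
}
```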
Retrieve users’ friends
From the previous step, we have a list of users whose friends we’d like to know. The Twitter API returns the friend IDs of a specified user, but no more than 5,000 IDs per request. We need to call it multiple times if a user has more friends than that. Also, Twitter has rate limits: it allows only 15 requests per 15-minute window, basically 1 request per minute.
So, we get friend IDs. We then need other API calls to convert those user IDs into user objects. Twitter provides an API for this too. Each request can query up to 100 user IDs, and the rate limit is 300 requests per 15-minute window, i.e. 20 requests per minute.
More details of Twitter API and rate limits here.
To handle rate limits effectively, we will have two threads. The first thread will query friend IDs. The second thread will do the user lookups. The friend finder thread passes user IDs to the user lookup thread via a blocking queue. Basically, we use the producer-consumer pattern here.
Friends Lookup Thread
The following code is a partial listing of FriendsLookupRunnable.
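The original listing is missing; below is a sketch of the producer thread under the assumptions of the earlier snippets (field names, queue wiring, and table names are assumptions).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

import twitter4j.IDs;
import twitter4j.Twitter;
import twitter4j.TwitterException;

public class FriendsLookupRunnable implements Runnable {
    private static final long DELAY_MS = 60_000; // 1 request per minute (15 per 15-minute window)

    private final Twitter twitter;
    private final Connection conn;
    private final BlockingQueue<Long> userIdQueue; // users whose friends we fetch
    private final BlockingQueue<Long> lookupQueue; // friend IDs handed to the user-lookup thread
    private long lastCall = 0;

    public FriendsLookupRunnable(Twitter twitter, Connection conn,
            BlockingQueue<Long> userIdQueue, BlockingQueue<Long> lookupQueue) {
        this.twitter = twitter;
        this.conn = conn;
        this.userIdQueue = userIdQueue;
        this.lookupQueue = lookupQueue;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                long userId = userIdQueue.take();
                for (long friendId : getFriendIds(userId)) {
                    insertUserFriend(userId, friendId);
                    lookupQueue.put(friendId); // hand off to the user-lookup thread
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // the shutdown hook interrupts us
        }
    }

    private List<Long> getFriendIds(long userId) throws InterruptedException {
        List<Long> friendIds = new ArrayList<>();
        long cursor = -1; // cursor-based paging, up to 5,000 IDs per page
        do {
            throttle();
            try {
                IDs ids = twitter.getFriendsIDs(userId, cursor);
                for (long id : ids.getIDs()) {
                    friendIds.add(id);
                }
                cursor = ids.hasNext() ? ids.getNextCursor() : 0;
            } catch (TwitterException e) {
                if (e.exceededRateLimitation()) {
                    continue; // very rare even with the delay; just retry this page
                }
                break; // e.g. unauthorised error for a protected user
            }
        } while (cursor != 0);
        return friendIds;
    }

    private void throttle() throws InterruptedException {
        // Ensure at least DELAY_MS between consecutive API calls
        long wait = lastCall + DELAY_MS - System.currentTimeMillis();
        if (wait > 0) {
            Thread.sleep(wait);
        }
        lastCall = System.currentTimeMillis();
    }

    private void insertUserFriend(long userId, long friendId) {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT OR IGNORE INTO User_Friend (user_id, friend_id) VALUES (?, ?)")) {
            ps.setLong(1, userId);
            ps.setLong(2, friendId);
            ps.executeUpdate();
        } catch (java.sql.SQLException e) {
            e.printStackTrace();
        }
    }
}
```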
Some key points:
- The run method of this runnable polls a user ID from a blocking queue of user IDs to process.
- For each ID, call the getFriendIds method. This method returns a list of friend IDs. Each user ID and friend ID pair is inserted into the User_Friend table.
- The resulting friend IDs are also put into another blocking queue, from which the other thread retrieves them for processing.
- The getFriendIds method keeps track of when it was last called and ensures there is enough delay between calls (1 minute) by using Thread.sleep().
- Even so, there are very rare cases where a rate-limit-exceeded exception occurs. So, we catch TwitterException and check the exception’s status code. If the rate limit was indeed exceeded, we just retry the query.
- There are some other exceptions as well. For instance, when a user is protected, the Twitter API returns an unauthorised error.
The following is the command to create User_Friend table which stores result of the 1st thread:
CREATE TABLE User_Friend ( user_id INT (8), friend_id INT (8), PRIMARY KEY (user_id,friend_id));
Users Lookup Thread
The following code is the UsersLookupRunnable class.
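The original listing is missing; below is a sketch of the consumer thread, matching the producer sketch above. The batching strategy (flush on 100 IDs or on a short idle timeout) and the User columns inserted are assumptions.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

import twitter4j.ResponseList;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.User;

public class UsersLookupRunnable implements Runnable {
    private static final long DELAY_MS = 3_000; // 20 requests per minute (300 per 15-minute window)
    private static final int BATCH_SIZE = 100;  // lookupUsers accepts at most 100 IDs per call

    private final Twitter twitter;
    private final Connection conn;
    private final BlockingQueue<Long> lookupQueue;
    private long lastCall = 0;

    public UsersLookupRunnable(Twitter twitter, Connection conn, BlockingQueue<Long> lookupQueue) {
        this.twitter = twitter;
        this.conn = conn;
        this.lookupQueue = lookupQueue;
    }

    @Override
    public void run() {
        List<Long> batch = new ArrayList<>();
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // Collect up to 100 IDs; a short poll timeout lets partial batches flush too
                Long id = lookupQueue.poll(10, TimeUnit.SECONDS);
                if (id != null) {
                    batch.add(id);
                }
                if (batch.size() >= BATCH_SIZE || (id == null && !batch.isEmpty())) {
                    lookupAndSave(batch);
                    batch.clear();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void lookupAndSave(List<Long> ids) throws InterruptedException {
        long[] idArray = new long[ids.size()];
        for (int i = 0; i < idArray.length; i++) {
            idArray[i] = ids.get(i);
        }
        throttle();
        try {
            ResponseList<User> users = twitter.lookupUsers(idArray);
            for (User u : users) {
                insertUser(u);
            }
        } catch (TwitterException e) {
            e.printStackTrace();
        }
    }

    private void throttle() throws InterruptedException {
        // Ensure at least DELAY_MS (3 seconds) between consecutive API calls
        long wait = lastCall + DELAY_MS - System.currentTimeMillis();
        if (wait > 0) {
            Thread.sleep(wait);
        }
        lastCall = System.currentTimeMillis();
    }

    private void insertUser(User u) {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT OR IGNORE INTO User (id, name, screen_name, followers_count, friends_count) "
                + "VALUES (?, ?, ?, ?, ?)")) {
            ps.setLong(1, u.getId());
            ps.setString(2, u.getName());
            ps.setString(3, u.getScreenName());
            ps.setInt(4, u.getFollowersCount());
            ps.setInt(5, u.getFriendsCount());
            ps.executeUpdate();
        } catch (java.sql.SQLException e) {
            e.printStackTrace();
        }
    }
}
```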
Here are some key points:
- In the run method, there is a while loop that retrieves user IDs from the queue. It then calls the lookupUsers method to do the actual lookup.
- Since the Twitter lookupUsers API can handle no more than 100 user IDs at a time, we chop the user ID array into arrays of 100 or fewer elements before calling the Twitter API.
- The lookupUsers method keeps track of when it was last called and ensures there is enough delay between calls (3 seconds) by using Thread.sleep().
- The method returns a list of users, which is inserted into the User table. The structure of the table mirrors the Twitter User interface.
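The chopping step above can be isolated as a small pure-Java helper; here is a minimal sketch (the class and method names are our own):

```java
import java.util.ArrayList;
import java.util.List;

public class IdChunker {
    // Split a list of user IDs into long[] batches of at most batchSize elements
    // (100 for the Twitter lookupUsers API)
    public static List<long[]> chunk(List<Long> ids, int batchSize) {
        List<long[]> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            long[] batch = new long[end - start];
            for (int i = start; i < end; i++) {
                batch[i - start] = ids.get(i);
            }
            batches.add(batch);
        }
        return batches;
    }
}
```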
The following is the command to create User table which stores the result of the 2nd thread:
CREATE TABLE User ( id INT (8) PRIMARY KEY, name VARCHAR (100), screen_name VARCHAR (100), description VARCHAR (255), email VARCHAR (50), favorites_count INT, followers_count INT, friends_count INT, statuses_count INT, lang VARCHAR (10), location VARCHAR (255), url VARCHAR (255), imageurl VARCHAR (255), is_protected INT (1), is_verified INT (1), created VARCHAR (20), last_modified VARCHAR (20));
The main method does the following:
- Set up the database connection.
- Create two blocking queues.
- Prepare the user IDs list and add it to the first blocking queue.
- Create two runnables and two threads.
- Start the two threads.
- Add a shutdown hook, so that when the process gets killed, both threads are interrupted.
- Wait until both threads finish.
- Clean up the database.
The code should look something like this:
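The original listing is missing; the steps above can be sketched as follows, under the same assumptions as the earlier snippets (database file name, runnable constructors, and the query used to seed the first queue).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import twitter4j.Twitter;
import twitter4j.TwitterFactory;

public class FriendsCollector {
    public static void main(String[] args) throws Exception {
        // 1. Set up the database connection and a Twitter client
        Connection conn = DriverManager.getConnection("jdbc:sqlite:twitter.db");
        Twitter twitter = TwitterFactory.getSingleton();

        // 2. Create two blocking queues
        BlockingQueue<Long> userIdQueue = new LinkedBlockingQueue<>();
        BlockingQueue<Long> lookupQueue = new LinkedBlockingQueue<>();

        // 3. Seed the first queue with the user IDs collected earlier
        for (long id : loadUserIds(conn)) {
            userIdQueue.put(id);
        }

        // 4-5. Create two runnables and two threads, then start them
        Thread friendsThread = new Thread(
                new FriendsLookupRunnable(twitter, conn, userIdQueue, lookupQueue));
        Thread usersThread = new Thread(
                new UsersLookupRunnable(twitter, conn, lookupQueue));
        friendsThread.start();
        usersThread.start();

        // 6. Interrupt both threads when the process gets killed
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            friendsThread.interrupt();
            usersThread.interrupt();
        }));

        // 7-8. Wait for both threads, then clean up the database
        friendsThread.join();
        usersThread.join();
        conn.close();
    }

    private static List<Long> loadUserIds(Connection conn) throws SQLException {
        // Select the user IDs gathered in the tweet-collection step
        List<Long> ids = new ArrayList<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id FROM User")) {
            while (rs.next()) {
                ids.add(rs.getLong(1));
            }
        }
        return ids;
    }
}
```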
Filter the data (Optional)
Sometimes, we’d like to see only part of the whole data. This is quite simple to do because the data is already in SQL tables. Let’s say we’d like to see how the top 100 users with the most followers in our sample tweets follow each other. Here is what to do:
- Create tables for storing the results. Below are the SQL statements used:
CREATE TABLE Graph_Friend_Edge ( Source INT, Target INT );
CREATE TABLE Graph_Friend_Node ( id INT PRIMARY KEY, label VARCHAR (50), name VARCHAR (100));
- Populate the edge table with only top users. The following is the SQL:
insert into graph_friend_edge(source, target)
select user_id, friend_id
from user_friend
join user u1 on friend_id = u1.id
join user u2 on user_id = u2.id
where user_id in (select friend_id from user_friend
                  group by friend_id order by count(*) desc limit 100)
and friend_id in (select friend_id from user_friend
                  group by friend_id order by count(*) desc limit 100);
- Then, populate the node table with this SQL:
insert into graph_friend_node(id, label, name)
select n.id, u.screen_name, u.name
from (select source id from graph_friend_edge
      union
      select target id from graph_friend_edge) n
join user u on n.id = u.id;
Export data to CSV files
This part is simple. Use a database tool to export the data to CSV files.
- Export the edge table (user_friend, or graph_friend_edge if you filtered) to an edge CSV file.
- Export the node table (user, or graph_friend_node if you filtered) to a node CSV file.
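If you use the SQLite command-line shell, the export can be scripted; a sketch, assuming the database file is named twitter.db and output file names of our own choosing:

```shell
# -header writes column names as the first row; -csv switches output to CSV
sqlite3 -header -csv twitter.db "SELECT * FROM User_Friend;" > edges.csv
sqlite3 -header -csv twitter.db "SELECT * FROM User;" > nodes.csv
```

Gephi’s spreadsheet importer expects Source/Target columns for edges and Id/Label for nodes, so aliasing columns in the SELECT (e.g. `SELECT user_id AS Source, friend_id AS Target`) saves a renaming step later.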
Create a network graph
Gephi is an open-source graph analysis and visualisation tool. There are many Gephi tutorials out there; take a look here. For a tutorial on importing CSV files, look here.
Here is an overview of the steps in our project:
- Open Gephi. Create a new project.
- Import the edge and node CSV files. The initial graph may look unimpressive, like this:
We need to show node labels, configure node sizes and colours, and apply a layout:
- Enable node labels
- Configure node size and label size proportional to in-degree (number of incoming edges)
- Set the layout to “ForceAtlas2” and run it.
- Run a community detection algorithm.
- Set node colour according to Modularity Class. This colours each node according to its detected community.
After these steps are done, the graph looks more meaningful:
- Users’ screen names show up as node labels.
- Nodes with more followers within this group appear bigger.
- Edges (the arrow lines) represent follow relationships.
- Nodes with the same colour belong to the same community, according to the community detection algorithm.
Conclusions
We have built Java applications to collect tweet, user and friend data from Twitter and store it in a relational database. We did some filtering on the data, then imported it into Gephi, a graph platform and visualisation tool, and produced a social network graph.
This is only a small part of what we could do with the Twitter data. Gephi has a lot more to offer. There are also other graph analytics platforms out there; Neo4j, for example, lets us store the data in its own database and run graph algorithms on it.