Twitchverse: Constructing a Twitch Knowledge Graph in Neo4j

Learn how to design and construct a knowledge graph in Neo4j that describes the Twitch universe

Tomaz Bratanic
Towards Data Science

I was inspired by the post Insights from Visualizing Public Data on Twitch, whose author uses Gephi to perform graph analysis on the Twitch network and visualize the results. Twitch is an online platform that allows users to share their content via live stream. Twitch streamers broadcast their gameplay or activity by sharing their screen with fans who can hear and watch them live. I wondered what kind of analysis we could do if we stored the network information in a graph database instead of Gephi. This blog post will show you how to design and construct a knowledge graph in Neo4j. The data will be fetched via the official Twitch API.

Environment setup

You will need credentials for the Twitch API to follow this blog post. If you have a Twitch account, the easiest way to get credentials is the Twitch Token Generator site, which provides both the access token and the client id. Once you have completed this step, you should have the client id and the access token ready (remember, an access token is not a client secret).

Next, you will need access to a Neo4j database instance. The most straightforward solution is to use the Neo4j Sandbox, a free cloud instance of a Neo4j database. If you choose this route, select a blank Sandbox project. You can also install Neo4j Desktop locally if you wish; in that case, make sure the APOC plugin is installed, as the import queries in this post depend on it.
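The import queries later in this post read the client id from a Cypher parameter named $client_id. A minimal way to set it, assuming you run the queries in Neo4j Browser, is the :param command. The value below is a placeholder for your own client id.

```cypher
:param client_id => "<your client id>"
```

Once set, the parameter stays available to later queries in the same Browser session.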

Import information about current top viewed streams

To begin, we will import information about current live streams from the Twitch API. The API documentation is available here. We can fetch the data about the 1000 most viewed streams that are currently live. The API response contains the following information:

  • streamer name and id
  • the game they are playing
  • the language of the stream

Think about how we should store this information as a graph. Which information do we use as nodes and which as relationships? A few facts help with the decision: a user can play many games and use many languages over time. Also, a streamer can behave like a regular user and subscribe to other streamers, talk in their chat, and so on. More often than not, it takes more than a single iteration of the graph modeling process to get it right. The following graph model is my second iteration.

Graph model to store information about streamers. Image by the author.

I have used the arrows application to draw all the diagrams in this blog post.

As you can see, a streamer is just a Twitch user who also broadcasts their content. I have decided to use a User label for all users in the Twitch network and add a secondary Stream label to users who also stream. We know how many followers a streamer has, where to find the stream, and when the user was created. We store this additional information as node properties.

The language and the game played are stored as categorical nodes. A categorical node is used to store information about variables that have a limited number of values. In our example, the games played and the languages can only take a limited number of values. A user can have one or many relationships to game nodes. Maybe on Friday they prefer to play Valorant, and on Sunday they like to play Poker. Our graph model ignores the time component of this information. We also don’t store the number of times a streamer has played a game as a relationship weight. We ignore these two data points because we would have to optimize the data collection process to distill this information.

Before continuing, make sure to define the unique constraints in Neo4j to optimize the performance of the import queries.

CREATE CONSTRAINT ON (s:Stream) ASSERT s.name IS UNIQUE;
CREATE CONSTRAINT ON (u:User) ASSERT u.name IS UNIQUE;
CREATE CONSTRAINT ON (g:Game) ASSERT g.name IS UNIQUE;
CREATE CONSTRAINT ON (l:Language) ASSERT l.name IS UNIQUE;
CREATE CONSTRAINT ON (t:Team) ASSERT t.id IS UNIQUE;
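
The syntax above is the one accepted by the Neo4j version available when this post was written. On newer releases (Neo4j 4.4 and later), the same constraints can be written in the FOR ... REQUIRE form. The following is a sketch of the equivalent statements. The constraint names are my own choice for illustration and not part of the original setup.

```cypher
// Equivalent constraints in the newer FOR ... REQUIRE syntax.
// The constraint names (stream_name, user_name, ...) are arbitrary and chosen for this sketch.
CREATE CONSTRAINT stream_name IF NOT EXISTS FOR (s:Stream) REQUIRE s.name IS UNIQUE;
CREATE CONSTRAINT user_name IF NOT EXISTS FOR (u:User) REQUIRE u.name IS UNIQUE;
CREATE CONSTRAINT game_name IF NOT EXISTS FOR (g:Game) REQUIRE g.name IS UNIQUE;
CREATE CONSTRAINT language_name IF NOT EXISTS FOR (l:Language) REQUIRE l.name IS UNIQUE;
CREATE CONSTRAINT team_id IF NOT EXISTS FOR (t:Team) REQUIRE t.id IS UNIQUE;
// List the constraints to confirm they exist.
SHOW CONSTRAINTS;
```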

I have been working with Neo4j databases for about five years now. After about a month or two, I learned that the APOC library is a mandatory plugin to use in combination with the Neo4j database. It features many utility functions that can help you solve your problem in no time. My favorite APOC procedure from the start was, and still is, apoc.load.json. It allows you to open JSON files and, more importantly, fetch data from any API endpoint that returns JSON. Using only Cypher, you can scrape various API endpoints and construct a knowledge graph without any external tools. How awesome is that! It also supports custom headers and payloads in the requests. Learn more about the apoc.load.json procedure in the documentation.

To import the information about the streamers, keep your Twitch API client_id at hand. The streams endpoint supports pagination and allows the export of up to 1000 active streams, with a limit of 100 streams per request. The endpoint has an offset parameter that can be used for pagination. To perform ten requests with an increasing offset, we use the following UNWIND statement:

UNWIND range(0, 900, 100) as offset

This statement will perform a request for every value of the offset. The range function is similar to the range function in Python, except that the end value is inclusive. We are telling it we want to create a list that starts at 0 and ends at 900 with a step of 100.
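
As a quick check, the pagination values can be inspected on their own before any request is made. The following sketch only builds the offsets and the matching request URLs, using the URL pattern of the import query in this section, and returns them without calling the API.

```cypher
// Build the ten pagination offsets and the URL for each request.
// Nothing is fetched here; the query only returns the generated values.
UNWIND range(0, 900, 100) AS offset
RETURN offset,
       "https://api.twitch.tv/kraken/streams/?limit=100&offset=" + toString(offset) AS url
```

The query returns one row per offset, which makes it easy to confirm that exactly ten requests will be issued.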

If we put the pagination, the requests, and the storage of the response into a single Cypher statement, we end up with the following query:

WITH $client_id as client_id
//prepare pagination
UNWIND range(0,900,100) as offset
//Make an API request
WITH "https://api.twitch.tv/kraken/streams/?limit=100&offset=" + toString(offset) AS url, client_id
CALL apoc.load.jsonParams(url,{Accept: "application/vnd.twitchtv.v5+json", `Client-ID`:client_id},null) YIELD value
//Iterate over results in the response
UNWIND value.streams as stream
//Store streamer information
MERGE (s:User{name:stream.channel.name})
SET s.followers = stream.channel.followers,
s.url = stream.channel.url,
s.createdAt = datetime(stream.channel.created_at),
s:Stream,
s.id = stream.channel.`_id`
//Store game information
MERGE (g:Game{name:stream.game})
MERGE (s)-[:PLAYS]->(g)
//Store language information
MERGE (l:Language{name:stream.channel.language})
MERGE (s)-[:HAS_LANGUAGE]->(l);

Now, you should have information about up to 1000 streamers in your Neo4j database. To examine the graph, we can take a look at a single streamer in the database. Run the following Cypher query to get the game and language information for a single streamer:

MATCH (s:Stream)
WITH s LIMIT 1
MATCH p=()<-[:HAS_LANGUAGE]-(s)-[:PLAYS]->()
RETURN p

You can visualize the results in Neo4j Browser. You should see something similar to the following image.

Network visualization of a single streamer, its language, and the game they play. Image by the author.
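
Besides looking at a single streamer, an aggregate query gives a quick sanity check of the import. The following is a small sketch that relies only on the labels and relationship types defined above. The result depends entirely on which streams were live at import time.

```cypher
// Count the imported streamers per game and show the five most common games.
MATCH (s:Stream)-[:PLAYS]->(g:Game)
RETURN g.name AS game, count(s) AS streamers
ORDER BY streamers DESC
LIMIT 5
```

Swapping the pattern for (s:Stream)-[:HAS_LANGUAGE]->(l:Language) gives the same overview per language.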

Import information about chatters

On Twitch, users can interact with streamers by typing comments in the chat. Luckily for us, Twitch has an API endpoint that allows us to retrieve information about the chatters of a specific stream. This API endpoint does not require any authorization. If you want to learn who the chatters in the botezlive stream are, you can access this information by opening the following link:

http://tmi.twitch.tv/group/user/botezlive/chatters

The information about the chatters is stored in three separate arrays, indicating whether the chatter is a VIP of the stream, a moderator of the stream, or just a regular user. The response also contains global_mod and admin arrays, but they have always been empty as far as I have seen, so we will ignore them.
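
To get a feel for this structure without calling the API, the response can be mimicked with a literal map. This is only a sketch: the user names are made up, and the three array names are the ones read by the import query later in this section.

```cypher
// Hypothetical stand-in for the "value" returned by apoc.load.json.
// The user names are invented; only the array names mirror the import query.
WITH {chatters: {vips: ["alice"], moderators: ["bob"], viewers: ["carol", "dave"]}} AS value
// Iterate over the role arrays, then over the names inside each array.
UNWIND keys(value.chatters) AS role
UNWIND value.chatters[role] AS name
RETURN role, name
```

Each returned row pairs a user name with the array it came from, which is the distinction we want to keep in the graph.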

Before we import the information about chatters, let’s think a little about how we should define the graph model. From the API response, we have learned that a chatter can be a moderator, a VIP, or a regular user. We want to keep this distinction in our knowledge graph, so I have used a different relationship type for each of the three roles.

Graph model of chatter information. Image by the author.

My second favorite APOC procedure is the apoc.periodic.iterate procedure. It allows us to batch the transactions. This is very useful when dealing with larger data structures. In our case, a single streamer can have thousands of chatters, and if we retrieve the chatter information for 1000 streamers, we can be dealing with lots of data. The apoc.periodic.iterate procedure takes in two Cypher statements with an optional configuration map. The first Cypher statement returns a list of data we want to iterate over. The second statement takes information from the first Cypher statement and usually stores the information into Neo4j. In the configuration map, we can define the batch size. Batch size indicates how many iterations should be added into a single transaction. Learn more about APOC batching in the documentation.
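
The two-statement structure can be sketched in isolation as follows. This is a minimal illustration and not one of the import queries: the checked property is a hypothetical example that is not part of the graph model in this post, and the batch size of 1000 is an arbitrary choice.

```cypher
// Sketch of the structure of apoc.periodic.iterate.
// The "checked" property is hypothetical and only used for illustration.
CALL apoc.periodic.iterate(
  // First statement: returns the items to iterate over
  'MATCH (u:User) RETURN u',
  // Second statement: runs for each returned item, grouped into batches
  'SET u.checked = true',
  // Configuration map: 1000 items per transaction
  {batchSize: 1000}
)
YIELD batches, total, failedBatches, errorMessages
RETURN batches, total, failedBatches, errorMessages
```

The returned columns show how many batches ran and whether any of them failed. If you try the sketch, MATCH (u:User) REMOVE u.checked cleans up the property afterwards.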

If we put it all together, we can return all the streamers in the first Cypher statement of the apoc.periodic.iterate procedure. In the second statement, we create a request to the Twitch API endpoint and store the results. I have used the batchSize parameter with a value of 1 so that each request, together with the storage of its response, runs in a separate transaction.

// Import mods/vip/chatters for each stream
CALL apoc.periodic.iterate(
// Return all stream nodes
'MATCH (s:Stream) RETURN s',
'WITH s, "http://tmi.twitch.tv/group/user/" + s.name + "/chatters" as url
//Fetch chatter information
CALL apoc.load.json(url) YIELD value
WITH s, value.chatters as chatters
// Store information about vips
FOREACH (vip in chatters.vips |
MERGE (u:User{name:vip})
MERGE (u)-[:VIP]->(s))
//Store information about moderators
FOREACH (mod in chatters.moderators |
MERGE (u:User{name:mod})
MERGE (u)-[:MODERATOR]->(s))
//Store information about regular users
FOREACH (chatter in chatters.viewers |
MERGE (u:User{name:chatter})
MERGE (u)-[:CHATTER]->(s))',
{batchSize:1})

You can repeat this query in combination with fetching the top 1000 active streamers to collect more information about the Twitch chatter network. To examine if the information has been stored correctly, you can execute the following Cypher query:

MATCH (s:Stream)
WITH s LIMIT 1
MATCH p=(s)<-[:MODERATOR|VIP|CHATTER]-()
RETURN p LIMIT 25

If you visualize the results of this query in Neo4j Browser, you should see something similar to the following visualization:

Results of importing the chatter network. Image by the author.
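
A complementary check is to count how many relationships of each type were created. The sketch below uses only the relationship types from the import query above. The counts will vary with the time of the import.

```cypher
// Count the chatter relationships per role across all imported streams.
MATCH (:User)-[r:MODERATOR|VIP|CHATTER]->(:Stream)
RETURN type(r) AS role, count(*) AS relationships
ORDER BY relationships DESC
```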

Import detailed information about streamers

There is a separate Twitch API endpoint that we can use to fetch more detailed information about each streamer, such as the lifetime total view count. The API endpoint reference is available at this link. We have already learned how to combine the apoc.load.json and apoc.periodic.iterate procedures to retrieve information from an API endpoint.

How should we store the additional description and total historical view count information? They are not categorical variables, and there is only one value of each per streamer. I have decided it makes the most sense to store them as node properties.

Storing total view count and description as node properties. Image by the author.

As mentioned, we combine apoc.periodic.iterate and apoc.load.json procedures to fetch this information from the Twitch API endpoint.

CALL apoc.periodic.iterate(
'MATCH (s:Stream) RETURN s',
'WITH s,
"https://api.twitch.tv/helix/users?login=" + s.name as url,
"Bearer <access token>" as auth, $client_id as client_id
CALL apoc.load.jsonParams(url,
{Authorization: auth, `Client-ID`:client_id},null)
YIELD value
SET s.id = value.data[0].id,
s.total_view_count = value.data[0].view_count,
s.createdAt = datetime(value.data[0].created_at),
s.description = value.data[0].description',
{batchSize:1, params:{client_id:$client_id}})
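
To confirm that the new properties were stored, the streamers can be ranked by their total view count. The following sketch uses only the property names set by the query above. The aliases in the RETURN clause are my own.

```cypher
// Show the five streamers with the highest lifetime view count.
// Streamers whose request failed have no total_view_count and are skipped.
MATCH (s:Stream)
WHERE s.total_view_count IS NOT NULL
RETURN s.name AS streamer,
       s.total_view_count AS total_views,
       s.description AS description
ORDER BY total_views DESC
LIMIT 5
```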

Import streamer teams information

Each streamer can belong to zero, one, or many teams on Twitch. This is the last information we will import in this blog post. I’ll let you think a little bit about how we should store the information about which teams a streamer belongs to. As a hint, it falls into the categorical variable category: there is a limited number of teams on Twitch.

Graph model of Twitch team information. Image by the author.

You can see the repeating pattern in the graph modeling process. One thing to note, which wasn’t explicitly mentioned, is that you should use relationship types that are as specific as possible. You want to avoid generic relationship types like HAS, especially if they are used in many different scenarios.

Image by the author.

By now, you should be familiar with the apoc.periodic.iterate and apoc.load.json procedures. Again, we will use the same Cypher query structure as before to retrieve the data from the Twitch API endpoint. Here, only the endpoint URL and the way we store the response change. Another nice side effect of using apoc.periodic.iterate with a batchSize of 1 is that a failed API request does not terminate the Cypher query. Neo4j is an ACID database that waits until the whole transaction is successful before committing data. Because each API request runs in its own transaction, a single failed request out of a thousand only loses the data for that request instead of preventing everything from being stored. For example, this endpoint returns an error for about 1–2 percent of requests. As we use a batchSize of 1, we can simply ignore those errors.

CALL apoc.periodic.iterate(
'MATCH (s:Stream)
WHERE s.id IS NOT NULL AND NOT (s)-[:HAS_TEAM]->()
RETURN s',
'WITH $client_id as client_id,
"Bearer <access token>" as auth, s
WITH s,
"https://api.twitch.tv/helix/teams/channel?broadcaster_id=" + toString(s.id) as url,
client_id,
auth
CALL apoc.load.jsonParams(url,
{Authorization: auth, `Client-ID`:client_id},null)
YIELD value
WITH s, value.data as data
WHERE data IS NOT NULL
UNWIND data as team
MERGE (t:Team{id:team.id})
ON CREATE SET t.name = team.team_display_name,
t.createdAt = datetime(replace(trim(split(team.created_at,"+")[0]), " ", "T"))
MERGE (s)-[:HAS_TEAM]->(t)',
{batchSize:1, params:{client_id:$client_id}})
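
The nested string functions used for t.createdAt can be tried out on a literal value. In the sketch below, the raw timestamp is an assumed example of the format that the expression is written for, not real API output.

```cypher
// The raw value is an assumed example of the timestamp format, not real API output.
WITH "2021-05-07 10:15:30 +0000 UTC" AS raw
// Drop everything from the "+" onwards and trim the trailing space.
WITH raw, trim(split(raw, "+")[0]) AS withoutOffset
// Replace the space between date and time with "T" so datetime() can parse it.
RETURN raw,
       withoutOffset,
       datetime(replace(withoutOffset, " ", "T")) AS createdAt
```

Running the steps separately like this makes it easier to adjust the expression if the endpoint returns the timestamp in a different shape.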

Summary

I hope you have learned how to effectively scrape API endpoints in Neo4j with the help of APOC procedures. Step by step, we have imported additional information into our graph and ended up with a knowledge graph that describes the Twitch universe.

Twitch knowledge graph schema. Image by the author.
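
If you want to compare your own database with this schema, Neo4j can report the schema it actually contains. The following sketch uses a built-in procedure and a plain count per label combination. What it returns depends on which import steps you have run.

```cypher
// Return the node labels and relationship types present in the database as a small graph.
CALL db.schema.visualization();
// Count the nodes per label combination, for example User versus User plus Stream.
MATCH (n)
RETURN labels(n) AS labels, count(*) AS nodes
ORDER BY nodes DESC;
```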

In my next blog posts, I will demonstrate how to analyze this graph using the Cypher query language and graph algorithms. Stay tuned!

P.S. If you want to play with the Twitch graph without having to import the data yourself, I have prepared a Neo4j database dump that contains 10 million nodes and 20 million relationships. The data was scraped between the 7th and 9th of May 2021.

Data explorer. Turn everything into a graph. Author of Graph Algorithms for Data Science, published by Manning. http://mng.bz/GGVN