Twitchverse: A network analysis of Twitch universe using Neo4j Graph Data Science

Learn through a practical example how to use graph theory and algorithms to gain valuable insights from connected data

Tomaz Bratanic
Towards Data Science

--

Graph data science focuses on analyzing the connections and relationships in data to gain valuable insights. Every day, massive amounts of data are generated, but the connections between data points are often overlooked in data analysis. With the rise of Graph Data Science tools, the ability to analyze connections is no longer limited to huge technology companies like Google. In this blog post, I will show how to set up the Neo4j Graph Data Science environment on your computer and walk you through your (first) network analysis. We will be using the Twitch network dataset. In my previous blog post, I showed how to fetch the information from the official Twitch API and store it in the Neo4j graph database. You can either fetch the information directly from the Twitch API or load the Neo4j database dump I have prepared. The data in the database dump was scraped between the 7th and the 10th of May 2021.

Neo4j Graph Data Science environment

First of all, you need to download and install the Neo4j Desktop application. After opening the Neo4j Desktop application, you should see the following screen:

Neo4j Desktop application. Image by the author.

To follow this blog post and get the same results, you should load the database dump I have prepared. Download it from this link, and then add it to the Neo4j Desktop environment as shown in the picture below. A quick note: the database dump is 1.6 GB and contains 10 million nodes, so make sure you have enough disk space available.

Add a file to Neo4j Desktop Project. Image by the author.

As shown in the image, click on the Add button and select the File option. It should take a couple of seconds and then the file name should appear under the File tab. To restore a database dump, hover over the file and click on the three dots in the right corner. Next, select the Create new DBMS from dump option.

Create new DBMS from a database dump. Image by the author.

In the next screen, you need to define the password and select the Neo4j database version. I suggest you always use the latest database version available. At the moment, version 4.2.6 is the latest.

Select password and database version. Image by the author.

The database restoration should only take a couple of seconds. Now we only need to install the APOC and the Graph Data Science plugins before we can start with the network analysis. Luckily for us, we can install both plugins with just a couple of clicks.

Install APOC and GDS plugins. Image by the author.

To install the plugins, click on the created database instance under the Project tab. After you select the created database by clicking on it, a new tab should appear on the right side. Select the Plugins tab and install both the APOC and the Graph Data Science libraries by clicking the Install button.

Last but not least, it is advisable to increase the heap memory when dealing with slightly larger graphs like the one we are working with. You can define the heap allocation by clicking on the three dots of the created database and selecting Settings.

Access setting for a database instance in Neo4j Desktop. Image by the author.

In the settings, search for the following two lines and increase the heap memory allocation.

dbms.memory.heap.initial_size=4G
dbms.memory.heap.max_size=4G

The default heap memory allocation is 1GB and in this example, I have increased it to 4GB. I suggest you do the same to avoid OOM errors when executing some of the graph algorithms.

Now that you have loaded the database dump and installed the necessary plugins, you can go ahead and start the database instance. I have prepared a Jupyter notebook with all the Cypher queries that will be executed in this blog post, but you can also follow along by copying the Cypher queries into the Neo4j Browser environment.

Twitch network schema

We will begin with a short recap of the Twitch network model.

Graph schema of Twitch network. Image by the author.

The Twitch social network is composed of users. A small percentage of those users broadcast their gameplay or activities through live streams. In our graph schema, users who do live streams are tagged with a secondary label Stream. We know which teams they belong to, which games they play on stream, and in which language they present their content. We also know how many followers they had at the moment of scraping, the all-time historical view count, and when they created their user account. On the other hand, we know which users engaged in the streamer’s chat. We can distinguish whether the user who chatted in the stream was a regular user (CHATTER relationship), a moderator of the stream (MODERATOR relationship), or a VIP of the stream (VIP relationship).

Remember, the data in the database dump was scraped between the 7th and 10th of May 2021, so only streamers who streamed on that weekend will show up in the database. Similarly, only users who chatted on that weekend will be present in the network.

We will first count the nodes in the database by using the apoc.meta.stats procedure. Execute the following Cypher query to examine the count of nodes by labels.

CALL apoc.meta.stats()
YIELD labels

Results

Results of the apoc.meta.stats procedure. Image by the author.

We can observe that almost all of the nodes in our graph are users. There are 10.5 million users, and only around 6000 of them are streamers. These streamers have played 594 games and broadcast in 29 different languages.

Exploratory graph analysis

Before digging into graph algorithms, I like to get acquainted with the graph at hand first. We will perform an exploratory graph analysis by executing a couple of Cypher queries to learn more about the network. All bar charts in this blog post were created with the Seaborn library, and the accompanying Jupyter notebook contains all the code to help you follow along. To begin, we will retrieve the top ten streamers by all-time historical view count.

MATCH (u:Stream)
WHERE exists(u.total_view_count)
RETURN u.name as streamer,
u.total_view_count as total_view_count
ORDER BY total_view_count DESC
LIMIT 10;

Results

Top ten all-time view count streamers. Image by the author.

Before I get attacked by Twitch domain experts, I would like to point out that we are only analyzing Twitch streamers who live-streamed between the 7th and the 10th of May. There might be other streamers with higher all-time view counts. The top three streamers by all-time view count in our graph seem to be teams of streamers rather than individuals. I am not familiar with these streams. I only know that esl_csgo streams Counter-Strike: Global Offensive pretty much 24/7. Next, we will investigate the top ten streamers with the highest follower count.

MATCH (u:Stream)
WHERE exists(u.followers)
RETURN u.name as streamer,
u.followers as followers
ORDER BY followers DESC
LIMIT 10;

Results

Top ten followed streams. Image by the author.

Interestingly, we get a completely different group of streamers than in the previous query. The only exception is Shroud, who is fourth in the all-time view count category and second in follower count. Each follower receives a notification when a stream goes live. It is almost unimaginable that 9 million people receive a notification when Rubius goes live, or almost 8 million people when Pokimane starts streaming. I am curious to learn how long the streams have existed. We will aggregate the count of streams by their user creation date.

MATCH (u:Stream)
WHERE exists(u.createdAt)
RETURN u.createdAt.year as year,
count(*) as countOfNewStreamers
ORDER BY year;

Results

Count of the created stream by year. Image by the author.

Here I need to add another disclaimer. We are aggregating by the date when the user account was created and not by the date when they first started streaming. Unfortunately, we don’t have the date of the first stream. Judging by the results, at least some of the streamers already have a ten-year career on Twitch, which is another quite shocking fact, at least to me. We can move away from the streamers and investigate which games are played by the most streamers. Note that a streamer can play more than a single game, so they might be counted under multiple games. Our data was collected over a single weekend, so a streamer might prefer to play Valorant on Friday and Poker on Sunday.

MATCH (g:Game)
RETURN g.name as game,
size((g)<--()) as number_of_streamers
ORDER BY number_of_streamers DESC
LIMIT 10

Results

Games played by most streamers. Image by the author.

By far, the most popular is the Just Chatting category. Its popularity is probably due to the fact that it is used for all streams that don’t play a specific video game. This might include anything from cooking shows to “In real life” streams, where the streamer is walking around the world with a camera in hand. Otherwise, it seems that Resident Evil Village, GTA V, and League of Legends are the games played by the most streamers.

Most of the streamers belong to a team or two. We can investigate which teams have the highest member count along with all the games that the team members broadcast.

MATCH (t:Team)
WITH t, size((t)<--()) as count_of_members
ORDER BY count_of_members DESC LIMIT 10
MATCH (t)<-[:HAS_TEAM]-(member)-[:PLAYS]->(game)
RETURN t.name as team,
count_of_members,
collect(distinct member.name) as members,
collect(distinct game.name) as games

Results

The G Fuel team had 64 members who streamed during the weekend between the 7th and 10th of May 2021. Its members cover 34 different game categories on Twitch. Before we dive into network analysis, let’s examine the users that are VIPs of the most streams.

MATCH (u:User)
RETURN u.name as user, size((u)-[:VIP]->()) as number_of_vips
ORDER BY number_of_vips DESC LIMIT 10;

Results

Users with the highest count of VIPs. Image by the author.

I would venture that Nightbot and Supibot are both bots. Having bots as VIPs seems a bit weird. I always thought that bots were moderators of the chat with the ability to remove messages with links and such. We can also examine which users are moderating for the highest count of streams.

MATCH (u:User)
RETURN u.name as user, size((u)-[:MODERATOR]->()) as number_of_mods
ORDER BY number_of_mods DESC LIMIT 10;

Results

Users with the highest count of moderator roles on different streams. Image by the author.

As expected, all top ten users with the highest count of moderator roles are most likely bots, with Nightbot being the most popular.

User network analysis

You might remember that only around 6000 out of 10 million users in our graph are streamers. If we put that into perspective, only 0.06% of users are streamers. The other users in our database are users who have chatted on other streamers’ broadcasts. So far, we have entirely ignored them in our exploratory graph analysis. That is, until now. Graph data science focuses on analyzing connections between entities in a network.
The time is right to put on our graph data science hats and examine relationships between users in the network. As a brief reminder of the graph structure, let’s visualize a small subgraph of the user network.

MATCH (s:Stream)
WITH s LIMIT 1
CALL {
WITH s
MATCH p=(s)--(:Stream)
RETURN p
LIMIT 5
UNION
WITH s
MATCH p=(s)--(:User)
RETURN p
LIMIT 5
}
RETURN p

Results

Example user network of a single streamer. Image by the author.

As mentioned in the previous blog post, streamers behave like regular users. They can moderate other broadcasts, can engage in their chat, or earn VIP status. This was the reason that drove the graph model decision to represent streamers as regular users with a secondary label Stream. We don’t want to have two nodes in the graph represent a single real-world entity.

To start, we will evaluate the node degree distribution. Node degree is simply the count of relationships each node has. Here, we are dealing with a directed network, as the relationship direction holds semantic value. If Aimful is a moderator of the Itsbigchase stream, it doesn’t automatically mean that Itsbigchase is a moderator of the Aimful stream. When dealing with directed networks, you can split the node degree distribution into in-degree, where you count incoming relationships, and out-degree, where you count outgoing relationships. First, we will examine the out-degree distribution.

MATCH (u:User)
WITH u, size((u)-[:CHATTER|VIP|MODERATOR]->()) as node_outdegree
RETURN node_outdegree, count(*) as count_of_users
ORDER BY node_outdegree ASC

Results

Out-degree distribution of the user network. Image by the author.

A log-log plot visualizes the out-degree distribution; both axes have a logarithmic scale. Around six million of the roughly ten million users have exactly one outgoing relationship, meaning they have chatted in only a single stream. The vast majority of users have fewer than ten outgoing connections. A couple of users have more than 100 outgoing links, but I would venture a guess that they are most likely bots. Some mathematicians might say that the out-degree follows a power-law distribution. Now, let’s look at the in-degree distribution. We already know that only streamers will have an in-degree higher than 0, because only users who broadcast their streams can have other users engage in their chat. Consequently, around 99.94% of users have an in-degree value of 0. You could almost treat this network as a bipartite graph, except that there are also relationships between streamers. We will visualize the in-degree distribution only for streamers.

MATCH (u:Stream)
WITH u, size((u)<-[:CHATTER|VIP|MODERATOR]-()) as node_indegree
RETURN node_indegree, count(*) as count_of_users
ORDER BY node_indegree ASC

Results

In-degree distribution of the user network. Image by the author.

It is pretty interesting to observe that the in-degree distribution also appears to follow a power-law distribution. Most streamers had fewer than 1000 active chatters on the weekend, while some streamers had more than 15000 active chatters.
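To make the in-degree and out-degree definitions concrete, here is a minimal sketch in plain Python that computes both distributions for a tiny edge list. The node names and relationships are made up for illustration and are not taken from the Twitch dataset; the function name is also my own.

```python
from collections import Counter

# Hypothetical toy relationships (source, target); not real Twitch data.
edges = [
    ("alice", "streamer_a"),
    ("bob", "streamer_a"),
    ("bob", "streamer_b"),
    ("carol", "streamer_a"),
    ("streamer_b", "streamer_a"),
]


def degree_distributions(edges):
    """Return (out_distribution, in_distribution).

    Each distribution maps a degree value to the number of nodes
    that have exactly that many outgoing (or incoming) relationships.
    """
    out_degree = Counter()
    in_degree = Counter()
    nodes = set()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
        nodes.update((source, target))
    # A Counter returns 0 for missing keys, so nodes without outgoing
    # (or incoming) relationships are counted under degree 0.
    out_distribution = Counter(out_degree[node] for node in nodes)
    in_distribution = Counter(in_degree[node] for node in nodes)
    return out_distribution, in_distribution


out_distribution, in_distribution = degree_distributions(edges)
for degree in sorted(out_distribution):
    print("out-degree", degree, "->", out_distribution[degree], "nodes")
for degree in sorted(in_distribution):
    print("in-degree", degree, "->", in_distribution[degree], "nodes")
```

Even in this toy graph, most nodes have an in-degree of 0 and a single node collects most of the incoming relationships, which is the same skew we observed in the real network.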

Graph Data Science library

Now it is time to execute a couple of graph algorithms. The Neo4j Graph Data Science library (GDS) features more than 50 graph algorithms, ranging from centrality to community detection and node embedding algorithms. First, we need to refresh how the GDS library works.

In order to run the algorithms as efficiently as possible, the Neo4j Graph Data Science library uses a specialized in-memory graph format to represent the graph data. It is therefore necessary to load the graph data from the Neo4j database into an in memory graph catalog. The amount of data loaded can be controlled by so called graph projections, which also allow, for example, filtering on node labels and relationship types, among other options.

Quote copied from the official documentation.

The GDS library executes graph algorithms on a specialized in-memory graph format to improve the performance and scale of graph algorithms. Before we can execute any graph algorithm, we have to project a view of our stored graph to the in-memory graph format, using either native or Cypher projections. We can filter which parts of the graph we want to project, as there is no need to add nodes and relationships that will not be used as an input to a graph algorithm. We will begin by projecting all User and Stream nodes and the possible relationships between them, which are CHATTER, MODERATOR, and VIP.

CALL gds.graph.create('twitch', 
['User', 'Stream'],
['CHATTER', 'VIP', 'MODERATOR'])
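If the idea of a projection feels abstract, the following plain-Python sketch mimics the filtering step on a toy graph. It only illustrates the concept of keeping selected node labels and relationship types; it is not how GDS builds its in-memory format, and all node names and the `project` helper are hypothetical.

```python
# Hypothetical stored graph: each node maps to its set of labels.
node_labels = {
    "streamer_a": {"User", "Stream"},
    "viewer_1": {"User"},
    "Chess": {"Game"},
}

# Hypothetical relationships as (source, type, target) triples.
relationships = [
    ("viewer_1", "CHATTER", "streamer_a"),
    ("streamer_a", "PLAYS", "Chess"),
]


def project(node_labels, relationships, labels, relationship_types):
    """Keep nodes that carry at least one requested label, and keep
    relationships of a requested type whose endpoints were both kept."""
    wanted_labels = set(labels)
    wanted_types = set(relationship_types)
    nodes = {
        node for node, own_labels in node_labels.items()
        if own_labels & wanted_labels
    }
    kept = [
        (source, rel_type, target)
        for source, rel_type, target in relationships
        if rel_type in wanted_types and source in nodes and target in nodes
    ]
    return nodes, kept


projected_nodes, projected_rels = project(
    node_labels,
    relationships,
    labels=["User", "Stream"],
    relationship_types=["CHATTER", "VIP", "MODERATOR"],
)
print(sorted(projected_nodes))
print(projected_rels)
```

The Game node and the PLAYS relationship are left out of the projection, just as they are left out of the 'twitch' in-memory graph created above.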

Weakly Connected Components

The Weakly connected components algorithm (WCC) is used to find disparate islands or components of nodes within a given network. A node can reach all the other nodes in the same component when you disregard the relationship direction.

Visualized weakly connected components in a sample graph. Image by the author.

Nodes Thomas, Amy, and Michael form a weakly connected component. As mentioned, the relationship direction is ignored by the WCC algorithm, effectively treating the network as undirected.
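For intuition, here is a small sketch of how weakly connected components can be computed with a union-find structure. This is a textbook approach written for illustration and not necessarily what GDS does internally. Thomas, Amy, and Michael come from the sample graph above, but the exact relationships between them and the three extra nodes are my own assumptions.

```python
def weakly_connected_components(nodes, edges):
    """Group nodes into components, ignoring relationship direction."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in edges:
        root_source, root_target = find(source), find(target)
        if root_source != root_target:
            # Direction is irrelevant: either endpoint joins the other's set.
            parent[root_source] = root_target

    components = {}
    for node in nodes:
        components.setdefault(find(node), set()).add(node)
    # Largest component first.
    return sorted(components.values(), key=len, reverse=True)


# Assumed toy graph: only the first three names appear in the article.
nodes = ["Thomas", "Amy", "Michael", "Bridget", "Charles", "Doug"]
edges = [("Thomas", "Amy"), ("Michael", "Amy"), ("Bridget", "Charles")]

components = weakly_connected_components(nodes, edges)
for component in components:
    print(sorted(component))
```

Note that Thomas and Michael both point to Amy and neither can reach the other by following relationship direction, yet all three end up in the same component because direction is ignored. Doug has no relationships and forms a single-node component.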

Use the following Cypher query to execute a Weakly-Connected Components algorithm on the Twitch network we have previously projected in memory. The stats method of the algorithm is used when we are interested in only high-level statistics of algorithm results.

CALL gds.wcc.stats('twitch')
YIELD componentCount, componentDistribution

Results

Results of the WCC algorithm on the whole projected user graph. Image by the author.

It is very interesting to see that the user network is composed of only a single connected component. For example, you could find an undirected path between a user who watches Japanese streams and another user who watches Hungarian streams. It might be interesting to remove the bots from the graph and rerun the WCC algorithm. I have a hunch that Nightbot and other bots help connect disparate parts of the network into a single connected component.

With the current in-memory graph projection, we can also filter nodes or relationships at algorithm execution time. In the next example, I have chosen to consider only Stream nodes and connections between them.

CALL gds.wcc.stats('twitch', {nodeLabels:['Stream']})
YIELD componentCount, componentDistribution

Results

Results of the WCC algorithm when considering only Stream nodes in the graph. Image by the author.

With this variation of the WCC algorithm, we are effectively looking at chat communication between streamers. Only Stream nodes are considered, so only relationships between Stream nodes are used as an input to the WCC algorithm. There are a total of 1902 separate components in the streamer network. The largest component contains around 65% of all Stream nodes. The rest are primarily single-node components, where a streamer hasn’t chatted in other streamers’ broadcasts on the specific weekend the data was scraped.

PageRank

PageRank is probably one of the most famous graph algorithms. It is used to calculate node importance by considering the inbound relationships of a node as well as the importance of the nodes linking to it. PageRank was initially used to calculate the importance of websites by Google, but it can be used in many different scenarios.
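To show how the importance of linking nodes feeds into a score, here is a minimal power-iteration sketch of the textbook PageRank formulation. It is an illustration under my own assumptions: the damping factor of 0.85 and 20 iterations are common defaults, the scores are normalized to sum to 1, and the toy node names are made up. The GDS implementation may scale its scores differently.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Textbook PageRank by power iteration on a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    node_count = len(nodes)
    scores = {node: 1.0 / node_count for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_scores = {node: (1.0 - damping) / node_count for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                # Pass the node's score along its outgoing relationships.
                share = damping * scores[node] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # Nodes without outgoing relationships spread their score evenly.
                share = damping * scores[node] / node_count
                for other in nodes:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical toy network; not real Twitch data.
edges = [
    ("viewer_1", "streamer_a"),
    ("viewer_2", "streamer_a"),
    ("viewer_3", "streamer_b"),
    ("streamer_a", "streamer_b"),
]
scores = pagerank(edges)
for node, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(node, round(score, 3))
```

In this toy example, streamer_a and streamer_b both have two inbound relationships, but streamer_b ends up with the higher score because one of its inbound relationships comes from streamer_a, which is itself important.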

Use the following Cypher query to execute the PageRank algorithm on the whole user network.

CALL gds.pageRank.stream('twitch')
YIELD nodeId, score
WITH nodeId, score
ORDER BY score DESC
LIMIT 10
RETURN gds.util.asNode(nodeId).name as user, score

Results

Results of the PageRank algorithm on the whole user network. Image by the author.

Results might vary significantly if you scraped the Twitch API at a different time. In this analysis, only chat engagement between the 7th and the 10th of May is considered. My assumption is that streamers with the highest PageRank score are likely to have the highest count of other streamers engaging in their chat. We can easily validate this assumption by running the PageRank algorithm on the streamer subset of the network.

CALL gds.pageRank.stream('twitch', {nodeLabels:['Stream']})
YIELD nodeId, score
WITH nodeId, score
ORDER BY score DESC
LIMIT 10
WITH gds.util.asNode(nodeId) as node,score
RETURN node.name as streamer,
score,
size((node)<--(:Stream)) as relationships_from_streamers,
size((node)<--(:User)) as relationships_from_users

Results

Results of the PageRank algorithm on the Stream subgraph.

The top ten streamers by PageRank score are almost identical when we consider all the users or when we consider only the streamers of the Twitch network. What I found surprising is that Yassuo is in first place with only 16 inbound relationships from other streamers and 22000 relationships from regular users. I would venture a guess that the streamers who chatted in Yassuo’s broadcast are themselves important by the PageRank score.

MATCH (s:Stream{name:"yassuo"})<--(o:Stream)
RETURN collect(o.name) as other_streamers

Results

Results of the streamers who chatted in Yassuo’s live stream. Image by the author.

It seems that my assumption has some merit. Among the streamers who chatted in Yassuo’s stream are loltyler1, trainwreckstv, and benjyfishy. These streamers are also in the top ten by PageRank score. It’s not all about the number of relationships; their quality matters as well.

Community detection

The last category of graph algorithms we will look at is community detection. Community detection or clustering algorithms are used to infer the community structure of a given network. Communities are loosely defined as groups of nodes within a network that are more densely connected to one another than to other nodes. We could try to examine the community structure of the whole user network, but that does not make for a pretty network visualization of the results. First of all, we will release the existing projected network from memory.

CALL gds.graph.drop("twitch")

I sometimes like to watch chess or poker streamers. Let’s analyze the community structure of a subgraph that contains poker and chess streamers. To make our further queries easier, we will first tag the relevant nodes with an additional node label.

MATCH (s:Stream)-[:PLAYS]->(g:Game)
WHERE g.name in ["Chess", "Poker"]
SET s:PokerChess

There were a total of 63 streamers who broadcast either chess or poker on their channel. Let’s quickly visualize the streamer network to get a better sense of how it looks.

Chess and poker streamers network. Image by the author.

It is quite obvious that we are dealing with a single large connected component and mostly isolated nodes. If you only plan to run a single graph algorithm, you can use the anonymous graph feature, where you project and execute a graph algorithm in one step. Here, we will use the Louvain Modularity algorithm to infer the community structure of this subgraph. We will also treat this network as undirected. I would say that if a streamer engages in another streamer’s chat, they are probably friends, and usually, friendship relationships go both ways. This time we will store the results of the Louvain Modularity algorithm back to the stored graph, so we can use the community structure information in our visualizations.

CALL gds.louvain.write({
nodeProjection:'PokerChess',
relationshipProjection:{
ALL:{orientation:'UNDIRECTED', type:'*'}
},
writeProperty:'louvain_chesspoker'
})
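The Louvain algorithm searches for a partition of the network that maximizes a quality measure called modularity. The sketch below only computes the modularity of a given partition on a tiny undirected graph, so you can see why a good split scores higher than lumping everything together; it does not implement the Louvain optimization itself. The node names are made up, relationships are unweighted, and the function assumes at least one relationship exists.

```python
def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph.

    For each community we take the fraction of relationships that fall
    inside it and subtract the fraction we would expect if relationships
    were placed at random while keeping node degrees fixed.
    """
    edge_count = len(edges)
    internal_edges = {}
    degree_sum = {}
    for source, target in edges:
        source_community = community_of[source]
        target_community = community_of[target]
        # Each relationship adds one to the degree of both endpoints.
        degree_sum[source_community] = degree_sum.get(source_community, 0) + 1
        degree_sum[target_community] = degree_sum.get(target_community, 0) + 1
        if source_community == target_community:
            internal_edges[source_community] = (
                internal_edges.get(source_community, 0) + 1
            )
    return sum(
        internal_edges.get(community, 0) / edge_count
        - (total_degree / (2 * edge_count)) ** 2
        for community, total_degree in degree_sum.items()
    )


# Hypothetical graph: two triangles joined by a single relationship.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_communities = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_community = {node: 0 for node in "abcdef"}

split_score = modularity(edges, two_communities)
single_score = modularity(edges, one_community)
print(round(split_score, 4), round(single_score, 4))
```

Splitting the graph into its two triangles gives a modularity of 5/14, while putting every node in one community gives 0, so an algorithm that maximizes modularity would prefer the split.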

We can examine the results of the community structure algorithm in the Neo4j Bloom application.

Community structure of chess and poker streamer subgraph. Image by the author.

The node color indicates to which community a node belongs. I have ignored streamers that haven’t engaged in other streamers' chat, so there are no isolated nodes. Again, I must stress that we are only looking at a 3-day snapshot of the Twitch network, so the results might not be perfect, as some streamers like to take the weekend off and so on.

And now, I want to show you one last cool thing you can do in Neo4j. Instead of looking at which streamers interact with each other, we can examine which streamers share their audience. We don’t have all the viewers in our database, but we have the viewers that have chatted on streamers’ broadcasts. First, we will tag users who have more than a single outgoing relationship. This will help us speed up the audience comparison process.

CALL apoc.periodic.iterate("
MATCH (u:User)
WHERE NOT u:Stream AND size((u)-->(:Stream)) > 1
RETURN u",
"SET u:Audience",
{batchSize:50000, parallel:true}
)

We need to infer a new network that depicts which streamers share their audience before we can run a community detection algorithm. To start off, we must project an in-memory graph.

CALL gds.graph.create('shared-audience', 
['PokerChess', 'Audience'],
{CHATTERS: {type:'*', orientation:'REVERSE'}})

Next, we will use the Node Similarity algorithm to infer the shared audience network. The Node Similarity algorithm uses the Jaccard similarity coefficient to compare how similar a pair of nodes is, based on the neighbors they have in common. We will create a relationship between two streamers if the Jaccard similarity of their audiences is at least 0.05, that is, if the shared chatters make up at least 5% of their combined audience. The mutate mode of the algorithm stores the results back to the in-memory projected graph. This way, we can use the results of one algorithm as an input to another graph algorithm.

CALL gds.nodeSimilarity.mutate('shared-audience',
{similarityCutoff:0.05, topK:15, mutateProperty:'score', mutateRelationshipType:'SHARED_AUDIENCE'})
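As a rough illustration of what the similarity cutoff does, here is a plain-Python sketch of the Jaccard coefficient applied to made-up audiences. The streamer and user names are hypothetical, the helper functions are my own, and the sketch ignores the topK parameter and does not model how GDS treats scores that fall exactly on the cutoff.

```python
from itertools import combinations

# Hypothetical audiences: streamer -> set of users who chatted in the stream.
audiences = {
    "chess_streamer_1": {"u1", "u2", "u3", "u4"},
    "chess_streamer_2": {"u3", "u4", "u5"},
    "poker_streamer_1": {"u6", "u7"},
}


def jaccard(first, second):
    """Size of the intersection divided by the size of the union."""
    union = first | second
    return len(first & second) / len(union) if union else 0.0


def shared_audience(audiences, similarity_cutoff=0.05):
    """Return (streamer, streamer, score) for pairs at or above the cutoff."""
    pairs = []
    for first, second in combinations(sorted(audiences), 2):
        score = jaccard(audiences[first], audiences[second])
        if score >= similarity_cutoff:
            pairs.append((first, second, score))
    return pairs


shared = shared_audience(audiences)
for first, second, score in shared:
    print(first, second, score)
```

The two chess streamers share two of five distinct chatters, which gives a score of 0.4 and a SHARED_AUDIENCE-style relationship, while the poker streamer shares no chatters with anyone and stays unconnected.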

And finally, we can run the community detection algorithm Louvain on the shared audience network between poker and chess streamers.

CALL gds.louvain.write('shared-audience', 
{ nodeLabels:['PokerChess'],
relationshipTypes:['SHARED_AUDIENCE'],
relationshipWeightProperty:'score',
writeProperty:'louvain_shared_audience'})

I will visualize the results of the community structure of the shared audience between streamers in Neo4j Bloom.

Community structure of poker and chess streamers based on the shared audience relationships. Image by the author.

Conclusion

I hope this blog post got you excited about graph data science. Let me know if you would like me to write about a specific use case or dataset. If you need some help with Graph Data Science library code, you can always use the Neo4j Graph Algorithms Playground application that will generate the required code to execute graph algorithms for you.

As always, the code is available on GitHub.

--

Data explorer. Turn everything into a graph. Author of Graph algorithms for Data Science at Manning publication. http://mng.bz/GGVN