Twitchverse: Using FastRP embeddings for a node classification task

Extract the value of relationships by using the FastRP embedding algorithm to produce features for a downstream node classification task

Tomaz Bratanic
Towards Data Science


This is the third article in my Twitchverse series. The previous two are:

  1. Twitchverse: Constructing a Twitch Knowledge Graph in Neo4j
  2. Twitchverse: A network analysis of Twitch universe using Neo4j Graph Data Science

Don’t worry, this article is standalone, so you don’t need to read the previous ones if you aren’t interested. However, if you are interested in how I constructed the Twitch knowledge graph and performed a network analysis, check them out. You can also follow along by loading the database dump into Neo4j; all the code is available as a Jupyter notebook.

Agenda

In this blog post, I present how I used FastRP, a node embedding algorithm, to produce input features for an ML classifier. The idea is to predict the language in which a streamer is broadcasting based on the co-chatting network. We will assume that if a user chats in multiple streams, there is a high probability that those streamers broadcast in the same language. We will then use FastRP embeddings to engineer features for our classification model. At a high level, this post can be split into the following steps:

  1. Data cleaning
  2. Infer co-chatting network
  3. FastRP embeddings
  4. Evaluate classification accuracy

Graph Model

Twitch graph schema. Image by the author.

Our Twitch knowledge graph consists of streamers and the users who chat in their broadcasts. There is also some additional metadata available about streamers, such as the language they broadcast in and the games they play.

Example subgraph of users chatting in streams. Each streamer has a designated language. Image by the author.

Here is a small subgraph of users (purple) who chat in streams (yellow). Both of the streamers in the example broadcast in the English language (green). A special characteristic of Twitch is that streamers can behave like regular users and engage with other streamers' broadcasts by commenting in their chat. We can observe three users who have chatted in both streams. The more chatters two streamers have in common, the more likely it is that they stream in the same language.

Data cleaning

We will start by exploring the dataset a bit. Let’s first inspect the languages and how many streamers use them in their broadcasts.

MATCH (l:Language)
RETURN l.name as language,
size((l)<--()) as numberOfStreams
ORDER BY numberOfStreams DESC

Results

There are a total of 30 different languages in our graph. We have to exclude some of the languages from our classification task due to their small sample size. I have decided to exclude all the languages that have fewer than 100 streamers.

MATCH (l:Language)
WHERE size((l)<--()) < 100
MATCH (l)<--(streamer)
SET streamer:Exclude
RETURN distinct 'success' as result

Next, we will check to see if any of the streamers broadcast in more than a single language.

MATCH (s:Stream)
WHERE size((s)-[:HAS_LANGUAGE]->()) > 1
MATCH (s)-[:HAS_LANGUAGE]->(l)
RETURN s.name as streamer, collect(l.name) as languages

Results

There is only one streamer who has more than a single language assigned. Gige broadcasts in both English and Hungarian. We have already excluded all the languages with fewer than 100 streamers, which includes Hungarian, so we will ignore Gige in our further analysis.

Now, we will look at the users in our knowledge graph. It makes sense to plot the node out-degree distribution. The out-degree, in this case, tells us how many streams a user has chatted in.

MATCH (u:User)
WHERE NOT u:Exclude
WITH u, size((u)-[:CHATTER|VIP|MODERATOR]->()) as node_outdegree
RETURN node_outdegree, count(*) as count_of_users
ORDER BY node_outdegree ASC

Results

Node out-degree distribution. Image by the author.

This line chart was visualized with the Seaborn library. There are around 5,000 streamers in our database, so the maximum possible out-degree is 5,000. The data was fetched over a period of three days. I would venture a guess that users who chatted in more than 1,000 streams are highly likely to be bots. I chose 200 as the actual threshold, so users who have chatted in more than 200 streams over those three days will be ignored. Even that is generous; you would have to chat in more than 60 streams per day to reach this threshold.
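
The plotting code itself lives in the accompanying Jupyter notebook. A minimal sketch of how it might look, assuming a local Neo4j instance accessed with the official Python driver (the connection URI and credentials below are placeholders):

# Fetch the out-degree distribution with the query above and plot it with Seaborn
import pandas as pd
import seaborn as sns
from neo4j import GraphDatabase

# Placeholder connection details, not the author's setup
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

degree_query = """
MATCH (u:User)
WHERE NOT u:Exclude
WITH u, size((u)-[:CHATTER|VIP|MODERATOR]->()) as node_outdegree
RETURN node_outdegree, count(*) as count_of_users
ORDER BY node_outdegree ASC
"""

with driver.session() as session:
    df = pd.DataFrame([record.data() for record in session.run(degree_query)])

sns.lineplot(x="node_outdegree", y="count_of_users", data=df)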

MATCH (u:User)
WHERE NOT u:Exclude
WITH u, size((u)-[:CHATTER|VIP|MODERATOR]->()) as node_outdegree
WHERE node_outdegree > 200
SET u:Exclude

It is also highly likely that the most active moderators are actually bots.

MATCH (u:User)
WHERE NOT u:Exclude
RETURN u.name as user, size((u)-[:MODERATOR]->()) as mod_count
ORDER BY mod_count DESC LIMIT 10

Results

It seems my hypothesis was valid. Most of the highly active moderators have "bot" in their name. We will also exclude moderators who have participated in more than 10 streams over the three days the data was fetched.

MATCH (u:User)
WHERE NOT u:Exclude
WITH u, size((u)-[:MODERATOR]->()) as mod_count
WHERE mod_count > 10
SET u:Exclude

Train-test data split

Before we do the train-test data split, let’s quickly review how many data points per language there are.

MATCH (l:Language)<-[:HAS_LANGUAGE]-(s:Stream)
WHERE NOT s:Exclude
RETURN l.name as language, count(*) as numberOfStreams
ORDER BY numberOfStreams DESC

Results

By far, the most frequent language is English, followed by Spanish, German, and Russian. For some reason, Twitch decided to distinguish between English and United Kingdom English. We will not make that distinction in our classification task and will treat both of them as the same. We will write the language as a node property, combining English and UK English into a single category.

MATCH (s:Stream)-[:HAS_LANGUAGE]->(l:Language)
WHERE NOT s:Exclude
SET s.language = CASE WHEN l.name = 'en-gb' THEN 'en' ELSE l.name END

We will use 80% of the data points per language as the training set and leave the other 20% for testing.

MATCH (s:Stream)
WHERE NOT s:Exclude
WITH s.language as language, s
ORDER BY s.name
WITH language, count(*) as count, collect(s) as streamers
WITH language, streamers, toInteger(count * 0.8) as training_size
UNWIND streamers[..training_size] as train_data
SET train_data:Train

I have added the ORDER BY name line for easier reproducibility. Of course, you can use a random function to split the train-test dataset if you want.

Infer co-chatting network

Here, we will start using the Neo4j Graph Data Science library. If you need a quick refresher on how to use the GDS library, I suggest you check out the Twitchverse network analysis blog post. You should know that the GDS library uses a special in-memory graph to optimize the performance of the graph algorithms.

Image copied with permission from the GDS documentation.

We have two options to project an in-memory graph. Here, we will use the Cypher projection feature. Cypher projection is the more flexible way of projecting an in-memory graph but comes with a small loading performance cost. I have written an exhaustive blog about using Cypher projections, but for now, it is enough to know that we use the first Cypher statement to project nodes and the second Cypher statement to project the relationships of the in-memory graph.

CALL gds.graph.create.cypher("twitch",//node projection
"MATCH (u:User)
WHERE NOT u:Exclude
RETURN id(u) as id, labels(u) as labels,
coalesce(u.followers,0) as followers,
coalesce(u.total_view_count,0) as total_view_count",
//relationship projection
"MATCH (s:User)-->(t:Stream)
WHERE NOT s:Exclude AND NOT t:Exclude
RETURN id(t) as source, id(s) as target",
{validateRelationships:false})

In the first statement, we have projected all the User nodes that weren’t tagged with the Exclude secondary label. Adding the labels of the nodes allows us to filter nodes at algorithm execution time. This will enable us to filter only Stream nodes when calculating FastRP embeddings. We have also included the followers and total_view_count node properties. In the second statement, we project all the relationships between users and streamers. Those relationships indicate which users have chatted in a specific streamer's broadcast.

To infer the co-chatting network between streamers, we will use the Node Similarity algorithm. The Node Similarity algorithm uses the Jaccard similarity score to compare how similar a pair of nodes is based on the number of shared chatters.

Diagram of Node Similarity algorithm. Image by the author.
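
As a quick refresher, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A tiny Python sketch with made-up chatter sets (the user names are purely hypothetical):

# Jaccard similarity between two streamers' chatter sets: |A ∩ B| / |A ∪ B|
streamer_a_chatters = {"user1", "user2", "user3", "user4"}
streamer_b_chatters = {"user3", "user4", "user5"}

intersection = streamer_a_chatters & streamer_b_chatters
union = streamer_a_chatters | streamer_b_chatters
print(len(intersection) / len(union))  # 2 shared chatters out of 5 distinct users -> 0.4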

As mentioned, the more common chatters a pair of streamers have, the more similar we will deem them. We will name the resulting relationship SHARED_AUDIENCE, which is precisely what we are evaluating in this case: the shared audiences of streamers. The Node Similarity algorithm has two very important hyper-parameters that need to be tuned:

  • topK: The number of relationships to store per node. The top K relationships by similarity score will be kept.
  • similarityCutoff: The lower limit for the similarity score to be present in the result. Any relationship with a score lower than the similarityCutoff will be automatically ignored in the results.

I always like to start by evaluating the similarity score distribution using the stats mode of the algorithm.

CALL gds.nodeSimilarity.stats("twitch")
YIELD similarityDistribution
RETURN similarityDistribution

Results

Node similarity distribution results. Image by the author.

We get the distribution in the form of percentile values. On average, a pair of streamers share around 3% of users, and only 10% of streamer pairs share more than 6% of users. So, on average, streamers don’t share a lot of their audience. This is probably skewed a bit because the data was retrieved over only three days, and only the users who engaged in the chat are considered. We will leave the similarityCutoff parameter at its default value of 1E-42, which is a very tiny number, but slightly bigger than 0. A relationship will be created between a pair of streamers when they share at least one user. Now, we have to decide on the topK value. The topK parameter heavily influences how dense or sparse the resulting monopartite projection will be. With a bit of trial and error, I have decided to use a topK value of 25.

CALL gds.nodeSimilarity.mutate("twitch", 
{topK:25, mutateProperty:'score', mutateRelationshipType:'SHARED_AUDIENCE'})

FastRP embeddings

Fast Random Projection, or FastRP for short, is a node embedding algorithm in the family of random projection algorithms. These algorithms are theoretically backed by the Johnson-Lindenstrauss lemma, according to which one can project n vectors of arbitrary dimension into O(log(n)) dimensions and still approximately preserve pairwise distances among the points. In fact, a linear projection chosen in a random way satisfies this property.

Copied from the documentation.

If you want to learn more about the underlying math used to calculate node embeddings, I suggest you look at the documentation. Node embedding algorithms calculate a fixed-size vector or embedding representation of a node in a graph. This is very useful when we want to use the network features in our downstream machine learning workflow.
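
To build some intuition for the random projection idea, here is a tiny NumPy sketch, purely illustrative and not the GDS implementation, that projects high-dimensional vectors into a much smaller space and checks that pairwise distances are roughly preserved:

# Project 200 points from 1,000 dimensions down to 64 with a random Gaussian matrix
# and compare pairwise distances before and after the projection
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1000))                # 200 points in 1,000 dimensions
R = rng.normal(size=(1000, 64)) / np.sqrt(64)   # random projection matrix
X_low = X @ R                                   # projected points in 64 dimensions

original_distances = pdist(X)
projected_distances = pdist(X_low)
print(np.corrcoef(original_distances, projected_distances)[0, 1])  # close to 1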

We can easily retrieve FastRP embeddings of nodes by using the following Cypher query.

CALL gds.fastRP.stream(
"twitch",
{nodeLabels:['Stream'], relationshipTypes:['SHARED_AUDIENCE'],
relationshipWeightProperty:'score', embeddingDimension: 64}
) YIELD nodeId, embedding
WITH gds.util.asNode(nodeId) as node, nodeId, embedding
RETURN nodeId, embedding, node.language as language, CASE WHEN node:Train then 'train' else 'test' END as split

It is very common to visualize the resulting node embeddings with a t-SNE scatter plot. The code for the following visualization is available in the form of a Jupyter notebook.
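
A minimal sketch of how that plot might be produced, assuming the results of the query above have been loaded into a pandas DataFrame df (as in the earlier driver sketch), with 'embedding' and 'language' columns:

# Reduce the 64-dimensional embeddings to 2D with t-SNE and color by language
import numpy as np
import seaborn as sns
from sklearn.manifold import TSNE

X = np.array(df["embedding"].tolist())
coordinates = TSNE(n_components=2, random_state=42).fit_transform(X)

df["x"] = coordinates[:, 0]
df["y"] = coordinates[:, 1]
sns.scatterplot(x="x", y="y", hue="language", data=df)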

t-SNE results of the embeddings, colored by language. Image by the author.

The data points are colored based on their language. Just by looking at the scatter plot, it is evident that the node embeddings capture the language of the stream quite well. It seems that only the English language is a bit all over the place, while other minor languages form lovely clusters.

Classification task evaluation

Finally, let’s examine how well we can predict the broadcast language based on the FastRP embeddings. We have already done the train-test split in a previous step, so the only thing left to do is to export the embeddings and input them into a classification model. We will use the Random Forest classifier in this example. I prepared a helper function that takes in a Cypher query and returns a classification report and confusion matrix as the output.
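
The exact helper is in the notebook; below is a minimal sketch of what such a function could look like. The function name evaluate_embeddings and the driver argument are my own naming, not necessarily the notebook's:

# Run a Cypher query that returns 'embedding', 'language' and 'split' columns,
# train a Random Forest on the train split, and report results on the test split
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

def evaluate_embeddings(query, driver):
    with driver.session() as session:
        df = pd.DataFrame([record.data() for record in session.run(query)])
    train = df[df["split"] == "train"]
    test = df[df["split"] == "test"]
    clf = RandomForestClassifier(random_state=42)
    clf.fit(np.array(train["embedding"].tolist()), train["language"])
    predictions = clf.predict(np.array(test["embedding"].tolist()))
    print(classification_report(test["language"], predictions))
    print(confusion_matrix(test["language"], predictions))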

Now we can go ahead and input the same query we used above to generate a classification report.

CALL gds.fastRP.stream(
"twitch",
{nodeLabels:['Stream'], relationshipTypes:['SHARED_AUDIENCE'],
relationshipWeightProperty:'score', embeddingDimension: 64}
) YIELD nodeId, embedding
WITH gds.util.asNode(nodeId) as node, nodeId, embedding
RETURN nodeId, embedding, node.language as language, CASE WHEN node:Train then 'train' else 'test' END as split

Results

Classification report and confusion matrix. Image by the author.

Without any fine-tuning of the FastRP or Random Forest hyper-parameters, we get an f1 score of 86%. That is very awesome. It seems our hypothesis that chatters usually chat in streams that share the same language is valid. By examining the confusion matrix, we can observe that the model only misclassifies between English and the minor languages. For example, the model never wrongly classified Korean as Portuguese. This makes sense, as English is the language of the internet, so everybody speaks at least their native language and a bit of English.

Now, we will try to optimize the FastRP algorithm hyper-parameters to achieve better accuracy.

Relationship weights

I have used the relationship weights in the previous query as we have them available from the Node Similarity algorithm. We can input them with the relationshipWeightProperty parameter. After playing around with the settings, I have noticed that ignoring the relationship weight property produces better results.

CALL gds.fastRP.stream(
"twitch",
{nodeLabels:['Stream'], relationshipTypes:['SHARED_AUDIENCE'],embeddingDimension: 64}
) YIELD nodeId, embedding
WITH gds.util.asNode(nodeId) as node, nodeId, embedding
RETURN nodeId, embedding, node.language as language, CASE WHEN node:Train then 'train' else 'test' END as split

Results

Classification report and confusion matrix. Image by the author.

Now, there is nothing that implies that ignoring relationship weights will always produce better results. You should test it on your dataset and compare the results to make the final decision.

Embedding dimension

The embedding dimension hyper-parameter defines the size of the output vector or embedding. I have found some general guidelines in the documentation:

The optimal embedding dimension depends on the number of nodes in the graph. Since the amount of information the embedding can encode is limited by its dimension, a larger graph will tend to require a greater embedding dimension. A typical value is a power of two in the range 128–1024. A value of at least 256 gives good results on graphs in the order of 10^5 nodes, but in general increasing the dimension improves results. Increasing embedding dimension will however increase memory requirements and runtime linearly.

Even though our graph is quite small with only 5000 streamers, I have decided to test results for different embedding dimension parameters. It seems that increasing the embedding dimension parameter to 512 yields better results.

CALL gds.fastRP.stream(
"twitch",
{nodeLabels:['Stream'], relationshipTypes:['SHARED_AUDIENCE'],
embeddingDimension: 512}
) YIELD nodeId, embedding
WITH gds.util.asNode(nodeId) as node, nodeId, embedding
RETURN nodeId, embedding, node.language as language, CASE WHEN node:Train then 'train' else 'test' END as split

Results

Classification report and confusion matrix. Image by the author.

Iteration Weights

The next hyper-parameter we can tune is the iteration weight. The documentation states the following:

The iteration weights parameter control two aspects: the number of iterations, and their relative impact on the final node embedding. The parameter is a list of numbers, indicating one iteration per number where the number is the weight applied to that iteration.

After playing around a bit with the iteration weights parameter, I have found that using the following values increases the accuracy even more.

CALL gds.fastRP.stream(
"twitch",
{nodeLabels:['Stream'], relationshipTypes:['SHARED_AUDIENCE'],
embeddingDimension: 512,
iterationWeights:[0.1, 0.5, 0.9, 1.0, 1.0]}
) YIELD nodeId, embedding
WITH gds.util.asNode(nodeId) as node, nodeId, embedding
RETURN nodeId, embedding, node.language as language, CASE WHEN node:Train then 'train' else 'test' END as split

Results

Classification report and confusion matrix. Image by the author.

Normalization strength

Another parameter that could be optimized is the normalization strength. Again, the documentation states:

The normalization strength is used to control how node degrees influence the embedding. Using a negative value will downplay the importance of high degree neighbors, while a positive value will instead increase their importance. The optimal normalization strength depends on the graph and on the task that the embeddings will be used for.

In our case, it doesn’t make sense to try to optimize this parameter, as all the nodes should have exactly the same degree of 25. If you remember, we have used the Node Similarity algorithm with the topK value of 25, meaning that each node will be connected to its top 25 neighbors.

Using node properties

The FastRP algorithm can be extended to also take node properties into account. I have noticed that if we add the follower count into consideration, the classification accuracy increases ever so slightly. It seems like the follower count might help differentiate between English and other languages. My guess is that English streamers have a higher count of followers, but that is just a hunch.

CALL gds.beta.fastRPExtended.stream(
"twitch",
{nodeLabels:['Stream'], relationshipTypes:['SHARED_AUDIENCE'],
embeddingDimension: 512, featureProperties: ['followers'],
iterationWeights:[0.1, 0.5, 0.9, 1.0, 1.0]}
) YIELD nodeId, embedding
WITH gds.util.asNode(nodeId) as node, nodeId, embedding
RETURN nodeId, embedding, node.language as language, CASE WHEN node:Train then 'train' else 'test' END as split

Results

Classification report and confusion matrix. Image by the author.

Conclusion

We have managed to get a whopping 90% f1 score using only network features and nothing else. I think this is very awesome, and the approach lends itself to other tasks where looking at the relationships between data points might deliver very accurate results.

As always, the code is available on GitHub.
