A Deep Dive into Neo4j Link Prediction Pipeline and FastRP Embedding Algorithm

Learn how to train and optimize Link Prediction models in the Neo4j Graph Data Science library to get the best results

Tomaz Bratanic
Towards Data Science


In my previous blog post, I introduced the newly available Link Prediction pipeline in the Neo4j Graph Data Science library. Since then, I have taken more time to dig into the inner workings of the pipeline, and I've learned a couple of things along the way that I want to share with you. At first, I intended to show how the Link Prediction pipeline combines node properties to generate the input features of the Link Prediction model. However, while developing the content, I noticed a couple of insights about using the FastRP embedding algorithm. Therefore, by the end of this blog post, you will hopefully have learned more about the FastRP embedding model and how you can combine multiple node features as input to the Link Prediction model.

The code used in this post is available on GitHub.

Graph import

I had to find a small network so that I could easily visualize the results as we go along. I decided to use the interaction network from the first season of the Game of Thrones TV show, made available by Andrew Beveridge.

Graph model. Image by the author.

The graph model consists of characters and their interactions. We treat the interaction relationship as undirected: if character A interacts with character B, this directly implies that character B also interacted with character A. We also know how many times two characters interacted, and we store that information as a relationship property.

If you want to follow along with the examples in this post, I recommend using a Blank project in Neo4j Sandbox. It is a free cloud instance of the Neo4j database that comes pre-installed with both the APOC and Graph Data Science plugins.

The dataset is available on GitHub, so we can easily import it into Neo4j with the following Cypher query:

// Load the season 1 edge list and create both endpoint characters
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/mathbeveridge/gameofthrones/master/data/got-s1-edges.csv" AS row
MERGE (s:Character{name:row.Source})
MERGE (t:Character{name:row.Target})
// Treat the interaction as undirected and store the interaction count as weight
MERGE (s)-[i:INTERACTS]-(t)
SET i.weight = toInteger(row.Weight)
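
To quickly verify the import, we can count the imported characters; the network should contain 126 of them, as we will see in the analysis below:

MATCH (c:Character)
RETURN count(c) AS characters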

Link prediction pipeline

Under the hood, the link prediction model in Neo4j uses a logistic regression classifier. We are dealing with a binary classification problem, where we want to predict whether a link exists between a pair of nodes or not. On a high level, the link prediction pipeline follows these steps: define node features, combine them into link features, split the relationships into train and test sets, and train and evaluate the classification model.

Image by the author.

In this post, we will focus on the first two steps, so let’s take a closer look at what is happening there.

Feature engineering for link prediction pipeline. Image by the author.

As a first step, you have to define node features. For example, you could use custom node properties such as age or gender. You can also use graph algorithms such as PageRank or Betweenness Centrality to produce initial node features. In this blog post, we will start by using FastRP node embeddings as the initial node features. The nice thing about the FastRP embedding algorithm is that it captures the network information, so nodes that are close in the graph end up close in the embedding space. At the moment, you can't use pairwise information, such as the number of common neighbors or the length of the shortest path between a pair of nodes, as input features.

In the second step, the link feature combiner creates a single feature from a pair of node properties. Currently, there are three techniques that you can use to combine a pair of node properties into a single link feature vector:

  • Cosine distance
  • L2 or Euclidean distance
  • Hadamard product

In the above example, I used the Hadamard product to combine a pair of node properties into a single link feature vector. All the available link feature combiners are order-invariant, as the Link Prediction pipeline currently supports predicting only undirected relationships. You can use multiple link feature combiners in a single pipeline to define several feature vectors, which are then concatenated into the input for the Link Prediction model. I'll walk you through an example later in the post.

Once the node features and link feature combiner are defined, you can train the model to predict new connections. A minimal sketch of a pipeline definition is shown below.
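
To make these steps concrete, here is a minimal sketch of a pipeline definition with a cosine combiner over FastRP embeddings. The pipeline name is a placeholder, and the procedures follow the GDS 1.x alpha namespace used throughout this post; newer releases moved them to the gds.beta.pipeline.linkPrediction namespace, so check your version:

// Create an empty Link Prediction pipeline
CALL gds.alpha.ml.pipeline.linkPrediction.create('pipe')

// Step 1: compute FastRP embeddings as node features inside the pipeline
CALL gds.alpha.ml.pipeline.linkPrediction.addNodeProperty('pipe', 'fastRP',
  {mutateProperty:'embedding', embeddingDimension:56, relationshipWeightProperty:'weight'})

// Step 2: combine the embeddings of each node pair into a single link feature
CALL gds.alpha.ml.pipeline.linkPrediction.addFeature('pipe', 'cosine',
  {nodeProperties:['embedding']})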

Node Feature engineering

You can preprocess node features before defining the Link Prediction pipeline. Alternatively, if you only use graph algorithms such as node embeddings or centrality measures as node features, you can include them directly in the pipeline definition. In the first example, we will use the FastRP embeddings as the node features of our Link Prediction model, so we could potentially include them in the pipeline definition. However, we will first do a short analysis of the node embedding results, so we need to store the node embeddings in the database before diving into the pipeline definition. We start by projecting an undirected named graph. Take a look at the documentation for more information about the inner workings of the Graph Data Science library.

CALL gds.graph.create('gots1', 'Character', 
{INTERACTS:{orientation:'UNDIRECTED', properties:'weight'}})

We will use Louvain, a community detection algorithm, to help us better understand the results of the FastRP embedding algorithm. You can use the following Cypher query to store the community structure information back to the database.

CALL gds.louvain.write('gots1', 
{writeProperty:'louvain', relationshipWeightProperty:'weight'})
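
To get a feeling for the community structure before we visualize it, we can inspect the community sizes with a quick ad-hoc query:

MATCH (c:Character)
RETURN c.louvain AS community, count(*) AS members
ORDER BY members DESC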

Throughout this blog post, I will be using Neo4j Bloom to visualize the results of algorithms and link predictions. Take a look at this guide if you want to learn how to visualize networks with Bloom.

Community structure of the Network. Nodes are colored based on the community they belong to. Image by the author.

Now we can go ahead and execute the FastRP embedding algorithm. The algorithm produces an embedding, i.e., a fixed-size vector, for every node in the graph. My friend CJ Sullivan wrote an excellent article explaining the inner workings of the FastRP algorithm.

CALL gds.fastRP.write('gots1', 
{writeProperty:'embedding', embeddingDimension:56, relationshipWeightProperty:'weight'})
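
To spot-check the stored embeddings, we can sample a few nodes; the limit and slice size here are arbitrary:

MATCH (c:Character)
RETURN c.name AS character, size(c.embedding) AS dimensions, c.embedding[0..3] AS sample
LIMIT 3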

First, we will evaluate the FastRP embeddings with a t-SNE scatter plot visualization. The stored node embeddings are vectors of length 56, as defined by the embeddingDimension parameter. The t-SNE algorithm is a dimensionality reduction algorithm that we can use to reduce the embedding dimension to two. Having vectors of length two allows us to visualize them with a scatter plot. A minimal Python sketch of the dimensionality reduction and scatter plot visualization, assuming the neo4j driver, scikit-learn, and matplotlib packages and placeholder connection details, is:
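
import matplotlib.pyplot as plt
from neo4j import GraphDatabase
from sklearn.manifold import TSNE

# Placeholder connection details; adjust them to your Sandbox instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Fetch the Louvain community and FastRP embedding of every character
with driver.session() as session:
    rows = session.run("""
        MATCH (c:Character)
        RETURN c.louvain AS community, c.embedding AS embedding
    """).data()

communities = [row["community"] for row in rows]
embeddings = [row["embedding"] for row in rows]

# Reduce the 56-dimensional embeddings to two dimensions with t-SNE
coordinates = TSNE(n_components=2, random_state=42).fit_transform(embeddings)

# Color each point by its Louvain community
plt.scatter(coordinates[:, 0], coordinates[:, 1], c=communities, cmap="tab10")
plt.show()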

This code produces the following visualization:

t-SNE scatter plot visualization of the FastRP embeddings. Nodes are colored based on the community they belong to, as determined by the Louvain algorithm. Image by the author.

The FastRP embeddings and the Louvain algorithm were executed independently, and yet we can observe that the FastRP embeddings place nodes from the same community close together in the embedding space. This is no surprise, as FastRP is a community-based node embedding algorithm, meaning that nodes close in the graph will also be close in the embedding space. Next, we will evaluate the cosine similarity between nodes in the graph.

// Summary statistics of the pairwise cosine similarity between all node embeddings
MATCH (c:Character)
WITH {item:id(c), weights: c.embedding} AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.cosine.stats({
  data: data,
  topK: 1000,
  similarityCutoff: 0.1
})
YIELD nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95, p99
RETURN nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95, p99

Results

The average cosine similarity between all pairs of nodes in the graph is around 0.5, and around 25% of the node pairs have a cosine similarity greater than 0.81. Nodes are this similar in the embedding space because we have a tiny graph of only 126 nodes. Next, we will evaluate the cosine similarity and Euclidean distance between pairs of nodes connected by a relationship.
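
The original script for this step is in the notebook; a minimal sketch using the similarity functions from the GDS 1.x alpha tier could look like this:

// Average cosine similarity and Euclidean distance of connected node pairs
MATCH (c1:Character)-[:INTERACTS]->(c2:Character)
RETURN avg(gds.alpha.similarity.cosine(c1.embedding, c2.embedding)) AS averageCosineSimilarity,
       avg(gds.alpha.similarity.euclideanDistance(c1.embedding, c2.embedding)) AS averageEuclideanDistance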

Results

Most node pairs that are connected by a relationship have a high cosine similarity. Again, this is expected, as FastRP is designed to translate the network topology into the embedding space. Therefore, we expect neighboring nodes in the graph to be very similar in the embedding space. Since it is a bit more unexpected, we will examine the connected node pairs with a cosine similarity of less than 0.5. First, we have to tag them with Cypher:

// Tag relationships whose endpoints have dissimilar embeddings
MATCH (c1:Character)-[i:INTERACTS]->(c2:Character)
WHERE gds.alpha.similarity.cosine(c1.embedding, c2.embedding) < 0.5
SET i.show = true

And now, we can go ahead and visualize them with Neo4j Bloom.

Relationships between pairs of nodes that have cosine similarity of less than 0.5. Image by the author.

It seems that pairs of connected nodes with a lower cosine similarity mainly occur where different clusters or communities of the network connect. If you remember the t-SNE visualization, the nodes in the same community are nicely grouped in the embedding space, but we have a couple of relationships between nodes from different communities. When nodes from different communities are connected, their similarity decreases. It also seems that these nodes have a higher degree, meaning they have many links within their own community and a couple of links to other communities. Therefore, they are more similar to neighbors within their community and less similar to neighbors from other clusters.

Link Feature Combiner

We’ve got the embeddings ready, and we know that pairs of connected nodes are highly likely to have a high cosine similarity in the embedding space. Now, we will evaluate how different link feature combiners affect the output of the link prediction model.

Cosine combiner

Interestingly enough, the first combiner we will take a look at is the Cosine similarity combiner.

Cosine link feature combiner. Image by the author.

The link feature combiner takes the features of a pair of nodes and combines them into a single link feature, which is then used as training data for the logistic regression model that will predict new links. We have already done the cosine similarity analysis, so we know that node pairs with a high cosine similarity are likely to be connected. Therefore, you might imagine that the newly predicted links will be between pairs of not-yet-connected nodes with a high cosine similarity, as this is precisely what our training data looks like.

The full Python script we will use to produce the link predictions is available in the Jupyter Notebook. Condensed to its core, the pipeline it runs looks roughly like the Cypher sketch below, which continues the 'pipe' pipeline from earlier; the model name and split settings are placeholders, and the procedure names again follow the GDS 1.x alpha namespace:
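
// Configure how relationships are split into train and test sets
CALL gds.alpha.ml.pipeline.linkPrediction.configureSplit('pipe',
  {testFraction:0.3, trainFraction:0.3, validationFolds:7})

// Train the logistic regression model on the projected graph
CALL gds.alpha.ml.pipeline.linkPrediction.train('gots1',
  {pipeline:'pipe', modelName:'lp-cosine', randomSeed:42})
YIELD modelInfo

// Predict the 20 most probable new links and mutate them into the projection
CALL gds.alpha.ml.pipeline.linkPrediction.predict.mutate('gots1',
  {modelName:'lp-cosine', mutateRelationshipType:'PREDICTED', topN:20})

// Write the predicted relationships back to Neo4j for Bloom visualization
CALL gds.graph.writeRelationship('gots1', 'PREDICTED', 'probability')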

The full script is somewhat lengthy because it defines the entire Link Prediction pipeline once per link feature combiner option. Finally, the predicted links are stored back to Neo4j so that we can visualize them with Bloom. In my previous blog post, I gave a step-by-step explanation of the pipeline. As mentioned, I have prepared a Jupyter Notebook with all the code, so you don't have to copy it from the blog post directly.

Let’s examine the predicted links using the Cosine Link Feature combiner.

Top 20 predicted links using the Cosine Link Feature combiner. Image by the author.

The average Euclidean distance and cosine similarity of the node pairs with predicted links can be computed with a query along these lines, assuming the predicted links were written back as PREDICTED relationships as in the sketch above:
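
MATCH (c1:Character)-[:PREDICTED]->(c2:Character)
RETURN avg(gds.alpha.similarity.cosine(c1.embedding, c2.embedding)) AS averageCosineSimilarity,
       avg(gds.alpha.similarity.euclideanDistance(c1.embedding, c2.embedding)) AS averageEuclideanDistance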

As we might imagine, the cosine similarity between the endpoints of predicted links is on average 0.999. Thus, the results agree with our training data. Furthermore, we can learn a bit about FastRP embeddings from the results: peripheral nodes with a low degree are very similar in the embedding space, even if they are not directly connected. For example, the blue nodes in the bottom-left part of the visualization are all very similar. All four nodes have only a single relationship, and they share their only neighbor.

L2 link feature combiner

Next, we will take a look at the L2 link feature combiner, which is based on the Euclidean distance between two node feature vectors.

Top 20 predicted links using the L2 Link Feature combiner. Image by the author.

The results are almost identical to those of the Cosine link feature combiner. It seems that the FastRP embedding algorithm optimizes both cosine and Euclidean similarity between neighboring nodes in the network.

Hadamard link feature combiner

The Hadamard link feature combiner uses the Hadamard product to produce a link feature. The Hadamard product is simply entrywise multiplication, as the example below shows.
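
For example, for two three-dimensional vectors, (1, 2, 3) ∘ (4, 5, 6) = (1·4, 2·5, 3·6) = (4, 10, 18).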

Hadamard link feature combiner. Image by the author.

Let’s examine the predicted links using the Hadamard link feature combiner.

Top 20 predicted links using the Hadamard Link Feature combiner. Image by the author.

The average Euclidean distance and cosine similarity of the node pairs with predicted links is:

Some of the predicted links are similar to the results of the Cosine and L2 link feature combiners. For example, the predicted links in the red community are almost identical. On the other hand, the model predicted more links in the center of the blue cluster instead of on its periphery.

Using multiple Link Feature Combiners

In the previous examples, we used only a single link feature combiner. In the case of the Cosine or L2 link feature combiner, we effectively used a single input feature for the logistic regression model. In practice, it makes sense to have multiple input features that best describe your domain. As a demonstration, we will add preferential attachment as the second link feature. In the Neo4j documentation, the preferential attachment score of a node pair is defined as the product of their node degrees. In practice, the preferential attachment model assumes that nodes with a higher degree are more likely to form new connections.

Unfortunately, we can't add the preferential attachment link feature automatically just yet, but we can add it manually. To do so, we first calculate the node degree for all nodes. Because you sometimes want to normalize the input features of logistic regression models, I will also show you how to scale features directly in the Link Prediction pipeline. Then we just need to add a Hadamard link feature combiner, which multiplies the input vectors, in this case the node degrees. So if I understand the math correctly, the resulting link feature should represent preferential attachment, as we effectively multiply the node degrees of each pair of nodes.

Using multiple link feature combiners in Link Prediction pipeline. Image by the author.

In the Link Prediction pipeline, you can have as many link feature combiners as you wish. The results of all link feature combiners are concatenated into a single vector that is used as the input to the Link Prediction logistic regression model. We can add the degree calculation and scaling directly to the pipeline, so we don't have to prepare the degree features beforehand.

We need to add the following three steps to our pipeline, shown in the sketch below. The first step mutates the degree value to the named graph. The second step scales the degree value using the MinMax scaler and mutates it as the scaledDegree property. Lastly, we add a Hadamard link feature combiner over the scaled node degrees.
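
A hedged sketch of these three steps, again using the GDS 1.x alpha procedure names (the reference to the scaler procedure, in particular, may differ between GDS versions):

// Step 1: mutate the node degree into the named graph
CALL gds.alpha.ml.pipeline.linkPrediction.addNodeProperty('pipe', 'degree',
  {mutateProperty:'degree'})

// Step 2: scale the degree values with the MinMax scaler
CALL gds.alpha.ml.pipeline.linkPrediction.addNodeProperty('pipe', 'alpha.scaleProperties',
  {nodeProperties:['degree'], scaler:'MinMax', mutateProperty:'scaledDegree'})

// Step 3: combine the scaled degrees of each node pair with the Hadamard product
CALL gds.alpha.ml.pipeline.linkPrediction.addFeature('pipe', 'hadamard',
  {nodeProperties:['scaledDegree']})

In the end, we'll quickly evaluate how adding a secondary link feature affects the link prediction results.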

We get the following results using the Cosine link feature combiner with FastRP embeddings and Hadamard combiner with scaled node degrees.

Image by the author.

Adding the secondary link feature shifted the predicted links from the network periphery to the center. In a way, this makes sense: preferential attachment favors links between nodes with a higher degree, and those nodes are usually located in the center of the network.

We get the following results using the L2 (Euclidean) link feature combiner with FastRP embeddings and the Hadamard combiner with scaled node degrees.

Here, the secondary link feature didn't really affect the results. I also tried not scaling the node degree values; in that case, the predicted links shift strongly towards the center.

Using Hadamard link feature combiner without scaling node degrees. Image by the author.

If we don’t scale the node degrees before using the Hadamard combiner, the preferential attachment feature becomes dominant. Therefore, the predicted links are between nodes with a high degree instead of the low degree nodes on the periphery.

Conclusion

I have really enjoyed writing this blog post and learned a lot about the FastRP embedding algorithm and the Link Prediction pipeline along the way. A quick summary:

  • FastRP is more likely to assign high similarity between neighboring nodes with a low degree
  • On the other hand, the cosine similarity between connected nodes of different communities could be lower than 0.5
  • Using multiple link feature combiners can help you better describe your domain
  • Scaling node features influences the results of the Link Prediction logistic regression model

I encourage you to start a free Neo4j Sandbox project and start predicting new connections in your graphs. Let me know how it goes!

As always, the code is available on GitHub.
