Hands-on Tutorials

Making FastRP Graph Embeddings Work for You

How to tune the hyperparameters of FastRP to your specific problem

CJ Sullivan
Towards Data Science
12 min read · Nov 20, 2021


Photo by Brian Asare on Unsplash

Introduction

Graphs are everywhere! Individual data points can be connected to other data points in a myriad of ways. By using data on those connections, typically represented in a graph structure, we can make machine learning models that have the potential to be much more accurate than creating models out of just columnar data. But in order to incorporate graph data into a machine learning model, we must first find a way to vectorize it.

I have already written about how to get started with graph embeddings and showed a way to visualize them. My friend Tomaz Bratanic and I then wrote dueling blog posts looking at the inner workings of two of the most common graph embedding techniques (both available in the Neo4j Graph Data Science library, GDS): FastRP and node2vec. Tomaz has also written some great posts on things like using FastRP for node classification and link prediction.

However, like any vector embeddings that one might use in an ML model, there are pesky hyperparameters that must all be tuned in order to get the optimal results out of the model. In this post, I am going to walk through an example of how to tune the FastRP hyperparameters to a given node classification problem using Optuna.

Using those embeddings, we are going to compare a traditional ML approach for multi-class classification based on word embeddings to one based on graph embeddings.

So in this post we are going to do the following:

  • Building the graph: importing the CORA dataset (made available under the Apache 2.0 license) into a Neo4j Sandbox instance
  • Setting up the problem: creating the ML models for both the traditional and graph-based approaches
  • Optimizing the FastRP hyperparameters: exploring how much better we can make the graph-based model using Optuna
  • Discussion and final thoughts

In the end, I hope you will see that not only is it possible to create an ML model strictly from graph embeddings, but that sometimes you can actually get better results than with a traditional ML model, even when using vectors of fewer dimensions than the traditional approach!

The Google Colab notebook that accompanies this post can be found here.

Building the graph

In order to get up and running for yourself, you will need to create a free Neo4j Sandbox database. You will want to choose a Blank Sandbox for this exercise.

You should get a screen that looks like the following:

Creating a Blank Sandbox (Image by author)

You should particularly take note of the Bolt URL and the password, since these will be needed to make the connection to the database from Python. (This Sandbox will be destroyed by the time this blog post is published, so it is only used for demonstration purposes in this post.)

We will be using the CORA dataset, which is a classic graph dataset looking at co-citations between peer-reviewed scientific articles on the subject of machine learning. This is a useful dataset for this example for a few reasons. First, the dataset is already labeled. It is divided into 7 different classes based on the sub-branch of machine learning the given paper is attributed to (more on these classes shortly). Second, the maintainers have created a series of word vectors from the abstracts of these papers, which we will use to compare our graph-based ML approach to a traditional ML model based on natural language processing (NLP). These embeddings are one-hot encoded based on a dictionary of 1433 unique words. Finally, the graph itself is simple. It is a monopartite graph, meaning there is only one type of node (papers) and one type of relationship (citations). Many graph algorithms work best with monopartite graphs, although we will not be taking advantage of those algorithms today. If this were not the case, we would need to refactor our graph into a monopartite graph, which will be the subject of a future blog post.

Uploading data into the graph

I have staged the CORA dataset on GitHub as CSV files for ease of importing. Should you wish to know more about how to import data into Neo4j, you can check out this blog post. The graph model for this dataset is shown below.

CORA graph model (Image by author)

The first thing we do, before importing the data, is create some uniqueness constraints on the nodes based on their IDs. This has the benefit of both ensuring there are no duplicates and creating an index for the nodes, which makes searching the data much faster. To do so, we go into the Sandbox through the browser. (See this blog post, about midway down, if you would like to learn how to access the browser.) From the browser, we will enter the following Cypher command:
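Something along these lines should work (a sketch assuming the nodes are labeled Paper with an id property, per the graph model above; the syntax shown is for Neo4j 4.x):

```cypher
CREATE CONSTRAINT papers IF NOT EXISTS
ON (p:Paper) ASSERT p.id IS UNIQUE
```

(In Neo4j 5.x the equivalent syntax is `FOR (p:Paper) REQUIRE p.id IS UNIQUE`.)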

Next, from the browser, we will load in the CORA node list. In this file, we have the node ID for each paper (an integer), the class each paper is assigned to (called subject in the dataset), and the word vectors (called features in the dataset, a list of one-hot encoded integers) as provided by the CORA maintainers.
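The load might look like the following (a sketch: the URL is a placeholder for wherever you have staged the node CSV, and the column names and list delimiter are assumptions about that file):

```cypher
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/<user>/<repo>/main/cora_nodes.csv' AS row
MERGE (p:Paper {id: toInteger(row.id)})
SET p.subject = row.subject,
    p.features = [x IN split(row.features, '|') | toInteger(x)]
```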

Finally, we will bring in the relationships (also known as edges) between the papers. This file is a CSV of node IDs, with a source and a target node on each row; each such pair will be connected by a relationship I have called CITES.
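Again as a sketch (placeholder URL, assumed column names):

```cypher
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/<user>/<repo>/main/cora_edges.csv' AS row
MATCH (source:Paper {id: toInteger(row.source)})
MATCH (target:Paper {id: toInteger(row.target)})
MERGE (source)-[:CITES]->(target)
```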

So now we have a fully-populated graph to work on. We should see that there are 2708 paper nodes and 5429 citation relationships between them. The graph should look something like this:

A portion of the CORA graph (Image by author)

(Note that this is a small portion of the graph, returned by the Cypher command MATCH (n) RETURN n LIMIT 300, which prevents the screen from becoming too cluttered and the browser from getting bogged down.)

Excellent! Now we have a graph and we can move on to doing some ML on it!

Setting up the problem

Now that we have the graph, we need to create a GDS in-memory graph that we will use for calculating the FastRP embeddings. These graph projections are very useful since they put the desired nodes and relationships into a space that is efficiently stored in memory and contains only the information needed for the subsequent calculations. To do so, we will run the following in the browser:
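With GDS 1.x (current when this post was written), the projection looks something like this; the graph name 'cora-graph' is an arbitrary choice here. (In GDS 2.x the procedure is gds.graph.project.)

```cypher
CALL gds.graph.create(
  'cora-graph',
  'Paper',
  {CITES: {orientation: 'UNDIRECTED'}}
)
```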

We got lucky with using CORA, as previously stated, because the graph is already monopartite. So we can use a single node projection, Paper, and a single relationship type, CITES. Note that we are stating that this graph is undirected. This is because we are assuming that a cited article is likely to be in the same class as the one which is citing it, and vice versa. For more information on how to create in-memory graphs, check out the GDS documentation or see my quick video here.

Cool. This now gives us everything we need to create our ML models from the database!

As previously stated, we will be using a Google Colab notebook for running our models, which can be found here. Many of the packages we will be using are already pre-installed in Colab. However, we will still need to install the official Neo4j Python driver as well as Optuna (when we get there). For more information on how to use the driver, consult this video or this blog post. We see that we need to make a connection to our Sandbox instance. This information is available from the Sandbox UI as shown above.
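A minimal connection helper might look like the following sketch; the Bolt URL and password are placeholders that you should replace with the values from your own Sandbox:

```python
from neo4j import GraphDatabase

# Placeholder credentials: copy the real values from the Sandbox UI
uri = "bolt://<sandbox-ip>:7687"
driver = GraphDatabase.driver(uri, auth=("neo4j", "<password>"))

def run_query(query, params=None):
    """Run a Cypher query against the Sandbox and return a list of records."""
    with driver.session() as session:
        return list(session.run(query, params or {}))
```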

ML model based on word embeddings

Let’s begin with the word embeddings. It should be noted in advance that I did not create these embeddings, nor do I have the opportunity to tune them. So we will have to go with what we have and consider it a baseline model.

In the notebook, there are some functions to get the data from Neo4j and convert the columns to data types that are compatible with scikit-learn. I will not go over them here.

Once we have this data, the notebook illustrates a multi-class classification model using a support vector classifier (SVC) from scikit-learn. In this example we use the default hyperparameters for the SVC and 5-fold cross-validation. The purpose of this blog post is not to tune this model, but rather to highlight the difference in classification accuracy based on which embeddings are used, so this model will stay consistent throughout the post. The bulk of the model is:
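In sketch form it looks like this, with random stand-in data in place of the word vectors pulled from Neo4j (X is the embedding matrix, y the class labels):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in data: in the notebook, X holds the 1433-dimensional one-hot
# word vectors pulled from Neo4j and y holds the 7 subject classes
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(200, 1433)).astype(float)
y = rng.integers(0, 7, size=200)

# Default-hyperparameter SVC evaluated with 5-fold cross-validation
clf = SVC()
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.3f}")
```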

When this is run, the mean accuracy was found to be 0.724 (with some variation, of course, due to the stochastic nature of ML modeling) and the following confusion matrix was obtained:

Confusion matrix for the word embedding model (Image by author)

The classes, represented as numbers, are listed below, with the total number of samples (nodes) for each in parentheses:

  • Neural_Networks: 0.0 (818 nodes)
  • Rule_Learning: 1.0 (180 nodes)
  • Reinforcement_Learning: 2.0 (217 nodes)
  • Probabilistic_Methods: 3.0 (426 nodes)
  • Theory: 4.0 (351 nodes)
  • Genetic_Algorithms: 5.0 (418 nodes)
  • Case_Based: 6.0 (298 nodes)

It is clear that we have an imbalanced class problem, but such is life. Given that fact and our lack of access to the original vocabulary, an accuracy of 0.724 is really not that bad.

ML model based on FastRP graph embeddings

We will begin by creating the FastRP graph embeddings using the in-memory graph we created above. In the notebook associated with this model, we create a 256-dimensional graph embedding for each node. We then run the same basic SVC model as before and obtain the following confusion matrix:
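For reference, the call that generates those embeddings looks roughly like the following (a sketch in GDS 1.x syntax; the projection name and write property here are assumptions):

```cypher
CALL gds.fastRP.write(
  'cora-graph',
  {
    embeddingDimension: 256,
    writeProperty: 'fastrp_embedding'
  }
)
```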

Confusion matrix for basic FastRP embeddings (Image by author)

We also obtain a mean accuracy of 0.845, a pretty substantial improvement over the word embedding model! There are some good reasons for this. First, NLP data is notoriously noisy. The quality of an NLP model is very much a function of what you put into it, per the garbage in, garbage out philosophy, and without access to the vocabulary it is not really possible to improve those word embeddings. The connections between these individual nodes, however, are not noisy at all. In fact, let's explore those embeddings a bit further via t-SNE dimensionality reduction. If we reduce these 256-dimensional vectors down to 2 dimensions, we obtain the following:

t-SNE visualization of FastRP embeddings (Image by author)

In an ideal world we would see each of the classes in its own distinguishable, separable cluster. While we don't see that in all cases, what we do see is somewhat explainable. Class 0.0 corresponds to neural networks, which is a pretty broad category. In fact, it is likely that papers in the other classes reference a paper or two here, making these nodes much more connected than the other papers. To contrast that, genetic algorithms (the purple class 5.0) are a pretty narrow and specific topic. So it is not surprising that those papers are not connected to many papers outside of that discipline, which would result in them being more tightly clustered. Inspecting the other clusters in this plot leads us to similar conclusions.
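The reduction itself can be sketched as follows, with random stand-in vectors in place of the FastRP embeddings pulled from Neo4j:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in embeddings: in the notebook these are the 256-dimensional
# FastRP vectors, one row per Paper node
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 256))

# Reduce to 2 dimensions for plotting (perplexity must be < n_samples)
tsne = TSNE(n_components=2, perplexity=30.0, random_state=0)
coords = tsne.fit_transform(embeddings)
print(coords.shape)  # (100, 2)
```

The 2-dimensional coords array is what gets scatter-plotted, colored by class.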

Optimizing the FastRP hyperparameters

We have now seen the FastRP graph embeddings at work. However, these results were obtained using just the default settings of the algorithm while specifying the number of dimensions of the embedding. In the real world, it is rarely a good idea to stop there. Graph embeddings, like any other embedding used in ML, should be tuned through optimization of their hyperparameters. This is where Optuna comes in.

There are many different Python packages for hyperparameter tuning, and Optuna is one of the easiest to use. The general idea is that you specify some variables to be tuned, give them a range of values to be evaluated over, and then specify the target metric (accuracy in this case) to optimize. Optuna then conducts a guided search across that hyperparameter space for a given number of trials, returning the hyperparameter values that produced the best value of the metric.

In our case, we are going to use the very same SVC model but allow Optuna to vary the FastRP hyperparameters until we are left with the optimal values. This is demonstrated in this notebook.

We begin by specifying the hyperparameters for the trials that Optuna will be sweeping:

These values of params are chosen by Optuna through its search method. They are fed to the ML pipeline that calculates the accuracy based on each trial's values of params, which vary from run to run:

Once these things are set in place, we can begin the study:

Here we have provided some initial, arbitrarily-chosen parameters. We then tell Optuna to create the study with the target metric (accuracy) to be “maximized.” We then have Optuna run 100 trials of this study and we obtain the optimized hyperparameters back after a few minutes. In this case, we can see that our accuracy increased to 0.868! While there are still some stochastic variables involved, if you run this study several times you will see that the accuracy and optimized hyperparameter values do not vary much.

However, it is interesting to see which hyperparameters were most important in this solution:

Hyperparameter importance (Image by author)

It is clear that the embedding dimension dominates the solution. Optuna determined in this run that the optimal number of dimensions was 506. While this was at the larger end of our range, it is interesting to note how much smaller it is than the 1433-dimensional word vectors. Second, we can see that the fourth weight on the transition matrix is more dominant than the third (see this blog post), which confirms what the authors of the original algorithm reported.

Discussion and final thoughts

In this blog post I have shown a simple optimization of the FastRP hyperparameters for a given problem. It is important to note though that this process should absolutely be done with the specific graph for your problem. I also was very particular in my choice of this graph because it afforded us the chance to evaluate a traditional ML model versus a graph one. I will not claim that this is the right approach for every graph, or even every problem. For example, if your graph is not very densely connected, then the matrices involved in the FastRP calculation will likely be too sparse to obtain meaningful results.

One thing that I did not show in this post was anything beyond the very basic FastRP algorithm. There are many other graph embedding algorithms out there, both in Neo4j and elsewhere, that can be used. In fact, FastRP has a more sophisticated sibling called FastRPExtended, which can bring node properties into the embeddings as well! But that will be the subject of future blog posts. It would also be worthwhile to look at how a popular algorithm like node2vec would respond to this approach. Lastly, I used the absolute most basic of classifiers: SVC. It would be worthwhile to see what happens if you choose another classifier. And, of course, you could use the embeddings in TensorFlow or PyTorch! My friend Kristof Neys wrote up a great demonstration of this using graph neural networks.

All of that being said, I hope you are able to see that there can be a tremendous benefit to using graphs to solve interesting ML problems!

For more information…

I have created a series of mini-videos (my goal is 5 minutes or less) to teach data scientists the basics of how to do data science with Neo4j: “Bite-Sized Neo4j for Data Scientists.” I try to add one video per week about very specific subjects that are geared towards new users to Neo4j.

Some other references

Many of these are linked in various places above, but here is a proper listing of all of them.

Special thanks to Jacob Sznajdman and Tomaz Bratanic who helped with the content and review of this blog post! Also, a special thanks to Alessandro Negro for his valuable insights and coding support for this post!
