Thoughts and Theory

Behind the scenes on the Fast Random Projection algorithm for generating graph embeddings

A detailed look into FastRP and its hyperparameters

CJ Sullivan
Towards Data Science
10 min read · Aug 19, 2021


Photo by Crissy Jarvis on Unsplash

The vast majority of data science and machine learning models rely on creating a vector, or embedding, of your data. Some of these embeddings arise naturally. For example, for numerical data organized in columns, we can treat the values associated with each row as a single vector. In more complicated cases, such as natural language processing, we have to generate those embeddings from the words through a variety of approaches: one-hot encoding, skip-gram methods such as word2vec, and so on. These vectors are then used as the representation of the data to be modeled.

It is easy to think of embeddings as independent entities, where one row’s worth of data does not impact the next row. However, the same cannot be said of network graphs. Data points in this domain ARE connected, and the relationships among them are as important as the individual data points themselves. It is therefore necessary to have a way to turn the nodes of a graph into an embedding while also preserving the relationships (i.e. the context) of those nodes with their neighbors.

I previously wrote a blog post on how to get started with graph embeddings, which can be found here.

There is also a great post by Tomaz Bratanic showing the use of FastRP for a node classification task, which can be found here.

This particular post builds on the above, as well as my previous post in which I showed how to create a dashboard with Streamlit for tinkering with the hyperparameters of Neo4j-generated graph embeddings. That dashboard covered both the FastRP and node2vec graph embedding algorithms and provided a way to visualize the embeddings in two dimensions via t-SNE.

In this post we will go into the whys and hows of one of those embedding algorithms, FastRP, with a follow-up post planned on how to optimize its hyperparameters.

Also, this post is a parallel to Tomaz Bratanic’s “Complete guide to understanding Node2Vec algorithm.”

Warning!

There is math coming up! If you have ever tried to properly render math on Medium, you know it is not pretty! I will do my best…

Photo by ThisisEngineering RAEng on Unsplash

The origin of FastRP: the Johnson-Lindenstrauss lemma

The FastRP algorithm was created by H. Chen et al., and you can read about it in detail in the original paper. I really do recommend that article, since I am going to be drawing heavily from it. Because the math behind it is the basis for the Neo4j FastRP algorithm (even down to the variable names), we need to spend a little time on that math. Don’t worry…I hope to not make it too bad, and I have every confidence in you, dear reader, for getting through it!

The Johnson-Lindenstrauss lemma, also referred to as the JL lemma, is the mathematical core of the FastRP algorithm. In a nutshell, it says that if points in a vector space are of sufficiently high dimension, then they may be projected into a suitable lower-dimensional space in a way that approximately preserves the distances between the points. And if you think about it, this really is at the root of most dimensionality reduction approaches.

Basics of dimensionality reduction (Image by author)

So yes, we are trying to go from p-dimensional space to d-dimensional space, where d < p. Given two points xᵢ and xⱼ, we want to create some mapping, Ψ, which converts from the high-dimensional space to the lower one while approximately maintaining the distance between those two points. What this essentially means is that we are going to look to solve the following equation:

Dimension reduction solutions (Image by author)

where our goal is to keep ϵ as small as possible. The fact that such a mapping exists is the JL lemma.
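For reference, the standard way of writing that guarantee (the same inequality shown in the image above, modulo conventions like squared versus unsquared norms) is:

```latex
(1 - \epsilon)\,\lVert x_i - x_j \rVert^2
  \leq \lVert \Psi(x_i) - \Psi(x_j) \rVert^2
  \leq (1 + \epsilon)\,\lVert x_i - x_j \rVert^2
```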

The real trick here is understanding what Ψ should be. D. Achlioptas showed (in this paper) that you can obtain database-friendly random projections (that is, in fact, the title of the paper) using nothing more sophisticated than a coin toss per entry instead of, say, Gaussian random numbers. H. Chen et al. used this idea to implement a random projection matrix, R, which will eventually be the core of our mapping, given by

Random projection matrix (Image by author)

In fact, Li et al. reported that as long as R has zero mean, the pairwise distances between points are preserved, which is exactly the goal of dimensionality reduction. Further, Li showed that s could be chosen to be as high as the square root of the number of edges in the graph, s = sqrt(m), provided the downstream matrices (see below) are very sparse. Interestingly enough, using this random projection matrix can be faster than a Gaussian random projection by a factor of sqrt(m). However, depending on the sparsity of the matrix, choosing s = 3 was found to be adequate in the original Achlioptas paper.
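If it helps to see that in code, here is a minimal numpy sketch of the Achlioptas/Li-style construction of R. The exact scaling convention in the figure above may differ (some write-ups fold a 1/√d factor into R), so treat this as illustrative rather than as the Neo4j implementation.

```python
import numpy as np

def sparse_random_projection_matrix(n, d, s=3, seed=42):
    """Very sparse random projection matrix R with shape (n, d).

    Each entry is +sqrt(s) with probability 1/(2s), 0 with probability
    1 - 1/s, and -sqrt(s) with probability 1/(2s), giving zero mean and
    unit variance. s = 3 is Achlioptas' choice; Li et al. allow s to be
    much larger when the data are sparse.
    """
    rng = np.random.default_rng(seed)
    return rng.choice(
        [np.sqrt(s), 0.0, -np.sqrt(s)],
        size=(n, d),
        p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)],
    )
```

With s = 3, two thirds of the entries are exactly zero, which is where both the speed and the “database-friendly” label come from.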

So what does that really have to do with graphs???

I am glad you asked! Yeah, so we have this matrix that can randomize stuff. Obviously we are going to want to apply it to some other matrix. But what matrix do we have that is at all related to graphs?

The adjacency matrix is a great start here! Actually, we are going to spiff it up a bit by creating the transition matrix, A, as A = D⁻¹S where S is the adjacency matrix and D is the graph’s degree matrix (a diagonal matrix giving the degree of each node). Let’s visualize those quickly since it will make the dimensions of the subsequent matrices more clear. Suppose we have a weighted, directed graph that looks like the following

Sample weighted, directed graph (Image by Tomaz Bratanic, used with permission)

(Note that we can also do this with unweighted and/or undirected graphs, but I figured I would show the most complicated example.) So with the above graph, we would have the adjacency matrix as

Adjacency matrix for weighted, directed graph (Image by author)

We can also establish the degree matrix:

Degree matrix for above weighted, directed graph (Image by author)

So we can then construct the transition matrix via

Transition matrix for above weighted, directed graph (Image by author)

Note that each of the above matrices is a square, n x n matrix where n is the number of nodes in the graph.
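To make the recipe concrete, here is a small numpy example with a made-up 3-node weighted, directed graph (not the graph from the figures). I am treating the degree of a node as its weighted out-degree, i.e. the row sum of S, which is one reasonable convention for a directed graph.

```python
import numpy as np

# A made-up 3-node weighted, directed graph (NOT the graph in the figures),
# just to show the recipe A = D^-1 S in action.
S = np.array([
    [0.0, 2.0, 1.0],
    [0.0, 0.0, 3.0],
    [1.0, 0.0, 0.0],
])

# Degree matrix D: here taken as the weighted out-degree of each node,
# i.e. the row sums of S.
D = np.diag(S.sum(axis=1))

# Transition matrix A = D^-1 S: each row of S divided by that node's degree,
# so every row of A sums to 1.
A = np.linalg.inv(D) @ S

print(A)
print(A.sum(axis=1))  # [1. 1. 1.]
```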

Now we are starting to get into the notation actually used by the Neo4j implementation of the FastRP algorithm! This is also where the hyperparameters come in.

As with the creation of any embedding, before we do the fun stuff we want to think about normalizing our data. Remember, some of our nodes might have a very high degree relative to the others. So we are going to create a normalization matrix (an n × n diagonal matrix) given by

Normalization matrix (Image by author)

where β is the normalization strength, m is the number of edges, and dⱼ is the degree of the j-th node. So when β goes to infinity, we can get

Let’s think about what this actually means for a second, particularly when we apply this normalization matrix to the transition matrix. Say we take the k-th power of A (i.e. we multiply A by itself k times). The ij-th entry of Aᵏ is the probability of reaching j from i in exactly k steps of a random walk!

Yes, that is bold faced and italicized because it is really important. FastRP is just a random walk controlled by this coin-flipped randomizer!
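You can convince yourself of the random-walk interpretation with a couple of lines of numpy. Here is a quick sanity check on a tiny, made-up path graph (0 - 1 - 2):

```python
import numpy as np

# Tiny, made-up path graph 0 - 1 - 2 (unweighted, undirected).
S = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])
A = np.diag(1.0 / S.sum(axis=1)) @ S  # transition matrix D^-1 S

# Row i of A^k holds the probabilities of being at each node after exactly
# k steps of a random walk that starts at node i.
A2 = np.linalg.matrix_power(A, 2)
print(A2[0])  # [0.5 0.  0.5]: from node 0, two steps land on 0 or 2, 50/50
```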

Photo by Chris Briggs on Unsplash

So now that we know that…

Let’s now put A in a different form, which is more computationally efficient, and then we are ready to write out the actual FastRP algorithm.

Regenerating Aᵏ (Image by author)

And now it is time to begin FastRP!

Given our graph transition matrix A, embedding dimensionality d, normalization strength β, and some iteration weights α₁ … αₖ (more on those in a second) the steps are as follows:

  1. Produce the random projection matrix R
  2. Calculate the first iteration of the embedding matrix as N₁ = A ⋅ L ⋅ R
  3. Compute subsequent values of N for i = 2 … k as Nᵢ ← A ⋅ Nᵢ₋₁
  4. Calculate the weighted sum of all of the Nᵢ using the iteration weights

The final result here is a matrix multiplication of shapes (n × n) ⋅ (n × n) ⋅ (n × d), which leaves N with dimension (n × d), exactly what we want (one d-dimensional embedding for each of the n nodes).
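To tie the four steps together, here is a minimal, illustrative numpy sketch of the whole procedure. Two caveats: the specific diagonal I use for L, namely (dⱼ / 2m)^β, is an assumption based on the variables described above, and this toy version is not the Neo4j implementation, which has more options and is built for scale.

```python
import numpy as np

def fastrp_sketch(S, d, beta, alphas, s=3, seed=42):
    """A minimal, illustrative FastRP-style embedding (not the Neo4j code).

    S      : (n, n) adjacency matrix (weights allowed)
    d      : embedding dimension
    beta   : normalization strength
    alphas : iteration weights alpha_1 ... alpha_k
    """
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    degrees = S.sum(axis=1)
    m = S.sum() / 2.0  # number of edges, assuming an undirected graph

    # Transition matrix A = D^-1 S (divide each row of S by that node's degree).
    A = S / degrees[:, None]

    # ASSUMPTION: normalization matrix L with diagonal entries (d_j / 2m)^beta.
    L = np.diag((degrees / (2.0 * m)) ** beta)

    # Step 1: very sparse random projection matrix R (Achlioptas / Li et al.).
    R = rng.choice(
        [np.sqrt(s), 0.0, -np.sqrt(s)],
        size=(n, d),
        p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)],
    )

    # Step 2: first iteration, N_1 = A . L . R
    N = A @ L @ R
    embedding = alphas[0] * N

    # Steps 3 and 4: N_i <- A . N_{i-1}, accumulated with the iteration weights.
    for alpha in alphas[1:]:
        N = A @ N
        embedding = embedding + alpha * N

    return embedding  # shape (n, d): one d-dimensional embedding per node
```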

If we consider this in the context of the JL lemma, recall that Ψ was the thing that took us from ℝᵖ to ℝᵈ. So if we consider that N = A ⋅ L ⋅ R, this implies that Ψ = L ⋅ R.

Weights, weights, and more weights

Yes, we have another set of weights in our α values above. These are the iteration weights applied to each Nᵢ, and they need their own explanation.

It is clear that these weights control the relative impact of each iteration on the final node embedding. With one iteration, we only get embeddings based on the immediate neighbors of each node, which is not very useful. So, for example, suppose we have 4 iterations (i.e. k = 4). What does that really mean? We said before (that bold and italicized bit above) that the ij-th entry of Aᵏ is the probability of reaching j from i in exactly k steps. If node j cannot be reached from node i in 4 or fewer steps, then the corresponding entry of A⁴ is zero. In other words, with k = 4, we cannot randomly hop to a node that is, say, 5 or more hops away in the graph. So k is telling us how much of the local neighborhood around a node we are including in our embeddings. We can then weight each of the random projection steps above as

Final computation of FastRP embeddings (Image by author)

In reality, embeddings will not be very good if we only look at the immediate neighbors of any given node. So it might even be worth it to just ignore those small values of k. In fact, this is what the authors originally did. They observed that the embeddings were fine with k = 4 while only using A³ and A⁴. Based on that, they set the weights (α₁, α₂, α₃) = (0, 0, 1). In the original paper they then optimized their embeddings by parametrically tuning α₄ and β for a given value of d.
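In terms of the toy sketch from earlier, that choice of weights would look something like the following, where the α₄ and β values are placeholders rather than the paper’s tuned numbers:

```python
# Reusing fastrp_sketch() and the adjacency matrix S from the sketches above.
# alpha_1 = alpha_2 = 0 and alpha_3 = 1 as in the paper; alpha_4 and beta are
# the knobs left to tune (the values below are placeholders, not the paper's).
embedding = fastrp_sketch(S, d=64, beta=-0.3, alphas=[0.0, 0.0, 1.0, 0.7])
```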

Photo by Tony Tran on Unsplash

Wow! That is a lot! Is it really worth it???

TL;DR yes!

This algorithm is both fast and simple. For example, the authors reported that on the WWW-200k dataset, FastRP computed its embeddings in 136.0 seconds of CPU time, whereas node2vec took 63.8 days! When compared to both node2vec and another popular embedding algorithm, DeepWalk, they observed embeddings that produced node classification results that were at least as accurate, if not more so. There are plenty of other graph embedding algorithms out there, but one of the great things about FastRP is how simple it is. It is also easily extensible, allowing you to incorporate additional information about the graph, such as the properties of individual nodes, to further enhance your embeddings.

Like any graph embedding algorithm, the so-called “devil” is in the details. How you set each of the hyperparameters will depend on your graph itself and requires a fair bit of experimentation. That will be the subject of a future blog post. Stay tuned!

Special thanks to Tomaz Bratanic and Michael Hunger for their suggestions and reviews of this post.
