Imagine that one morning you wake up and you say to yourself:
Well, would it not be awesome if I could compute the betweenness centrality scores on that graph of 100 million nodes we loaded into Neo4j yesterday, the one that represents the email traffic in our company? Gosh, that would certainly help us to identify our key employees from an information-flow perspective. Yes, let's get to work!
The next question you are likely to ask yourself is: "Now, where did I put my 96,000-core supercomputer?", and unless you are part of Ziyad AlGhamdi's research team, you will quickly remember that all you have access to is something like 16 cores.
Many organizations compile graphs of 100 million or even several billion nodes, representing product networks, co-purchases, email traffic, and the like, and the ability to perform centrality analysis on these graphs is essential.
However, for many centrality metrics, such as the Betweenness centrality or the Closeness centrality score, the exact algorithm is prohibitively expensive to compute. For instance, Cohen et al. illustrate in _"Computing Classic Closeness Centrality, at Scale"_ that calculating the closeness centrality of a network of 24 million nodes using standard computing resources takes an astonishing 44,222 hours, or around 5 years.
In this post, Mark Needham and I will illustrate how a custom Machine Learning model can be used to approximate betweenness centrality scores of large graphs in Neo4j.
Using Neo4j
The Neo4j Graph Data Science library ships no fewer than 7 centrality algorithms, amongst which is the important, but expensive, Betweenness centrality. The GDS implementation is based on the Brandes algorithm, which runs in O(n·m) time on unweighted graphs, and can be called from the library using the following command:
CALL gds.betweenness.stream({
    nodeProjection: "Node",
    relationshipProjection: "EDGE"
})
YIELD nodeId, score
RETURN nodeId, score
ORDER BY score DESC
Using these Neo4j centrality algorithms is, as we illustrated in a previous article, "Fire up your Centrality Metric engines: Neo4j vs NetworkX – a drag race, of sorts…", still one of the most efficient ways of computing centrality scores. But, as that article highlighted, when the graph size goes up, the runtime climbs steeply with it; hence the need for speed.
Using a Machine Learning model
An alternative way of approximating centrality metrics is to use a machine learning approach, where we train a model on a small graph and then apply it to the larger graph.
The assumption is as follows: organizations will often have a graph in their database which is either a sample of a much larger graph or is representative of such a larger graph.
An example of such a set-up is the Gnutella peer-to-peer network. These datasets can be found in the Stanford SNAP repository and consist of 9 published networks ranging from 6,300 to 63,000 nodes. Again, our goal is to predict Betweenness centrality. In this experiment, we will train on the smallest Gnutella network and, once we have a trained model, make predictions of Betweenness centrality scores for a much larger graph in the same dataset.
The rationale is that instead of generating a set of random training graphs, this approach uses one real-world graph with the same topological properties as the test graph, together with Leave-One-Out Cross-Validation (LOOCV) as the resampling method for the model to learn the mapping.
Model architecture
Our first task is to ‘prepare’ the graph for processing through a Neural Network. For that, we make use of the NetworkX Python package, an efficient tool for constructing and analysing graphs, and for extracting artefacts such as the adjacency matrix.
Once we have the graph pre-processed we can feed it into our Neural Network.
To recap: an (artificial) Neural Network consists of artificial neurons connected by links, which in turn are characterised by numerical weights, and the network learns through repeated adjustment of these weights.
The Neural Network we deploy is a simple multilayer perceptron (MLP), but with a twist: the input to the model is an embedding matrix, for which we have used the Struc2Vec node embedding technique, as established by Dai et al.
In our case, the embedding matrix consists of the node embeddings, represented as vectors. Once such an embedding matrix is compiled, we run it through two fully connected layers and use Keras' EarlyStopping callback to control the number of epochs. We can visualize this as follows:

All the code is in TensorFlow 2.2.3 & Keras and can be found here:
Loading the graph from Neo4j
We assume that the graph is already in Neo4j, and we call on the Neo4j Python driver to fetch it with the following code:
from neo4j import GraphDatabase
import networkx as nx

driver = GraphDatabase.driver(
    "bolt://localhost:7687", auth=("neo4j", "<password>"))

# Fetch all nodes and relationships
query = "MATCH (n)-[r]->(c) RETURN *"
results = driver.session().run(query)

# Build a directed NetworkX graph from the result's graph view
G = nx.DiGraph()
nodes = list(results.graph()._nodes.values())
for node in nodes:
    G.add_node(node.id, labels=node._labels)

rels = list(results.graph()._relationships.values())
for rel in rels:
    G.add_edge(rel.start_node.id, rel.end_node.id, key=rel.id,
               type=rel.type)

print(nx.info(G))
As the snippet shows, once we have fetched the graph onto our local machine, we call on NetworkX to construct a graph object from which we can start our work.
Firstly, for the Struc2Vec method, we compute the normalized degree centrality of each node, which, combined with the adjacency matrix of the graph, allows each node to be represented by an embedding vector. That is, the method takes the specific information of a node, its degree, as well as the embeddings of that node's neighbours to construct such a vector. We can visualize the workflow as follows:

In mathematical terms, this workflow formalizes, following the structure2vec recursion of Dai et al., to an iterative update of each node's embedding:

mu_v(t) = ReLU( W1 · d_v + W2 · Σ_{u ∈ N(v)} mu_u(t-1) )

where d_v is the normalized degree of node v, N(v) the set of its neighbours, and W1, W2 the weight matrices. The sausage that comes out at the end is an embedding matrix, H, whose rows are the final vectors mu_v, and it will provide the input to our Neural Network.
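To make the recursion concrete, here is a minimal NumPy sketch of such a propagation step, with random (untrained) weights purely for illustration; the embedding dimension and the number of iterations are our assumptions, not the exact settings of the linked notebook:

import numpy as np
import networkx as nx

def struc2vec_embeddings(G, dim=512, n_iter=4, seed=42):
    # Toy structure2vec-style propagation: each node's embedding is
    # built from its normalized degree and its neighbours' embeddings.
    rng = np.random.default_rng(seed)
    A = nx.to_numpy_array(G)                      # adjacency matrix
    d = A.sum(axis=1, keepdims=True)              # degree vector
    d = d / d.max()                               # normalize
    W1 = rng.normal(scale=0.01, size=(1, dim))    # degree -> embedding space
    W2 = rng.normal(scale=0.01, size=(dim, dim))  # neighbour aggregation
    H = np.zeros((A.shape[0], dim))               # mu^(0) = 0
    for _ in range(n_iter):
        H = np.maximum(0, d @ W1 + (A @ H) @ W2)  # ReLU(W1*d_v + W2*sum(mu_u))
    return H                                      # one embedding row per node

For the smallest Gnutella graph, this produces the (6,301 x 512) matrix described below.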
As for the Neural Network, there is nothing particularly fancy: we have a simple input layer taking in the embedding matrix of shape (6,301 x 512), two fully connected hidden layers, each using the Rectified Linear Unit (ReLU) activation function, and an output layer. As this is a univariate regression model, the final layer is a fully connected layer with only one neuron.
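For concreteness, a minimal Keras sketch of such an MLP; the hidden-layer widths are illustrative assumptions, not necessarily those of the linked notebook:

import tensorflow as tf
from tensorflow import keras

# One 512-dimensional Struc2Vec embedding vector per node as input
model = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_shape=(512,)),  # hidden layer 1
    keras.layers.Dense(128, activation="relu"),                      # hidden layer 2
    keras.layers.Dense(1)   # single neuron: univariate regression output
])
model.compile(optimizer="adam", loss="mse")
model.summary()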
Training the model
As mentioned earlier, in this experiment, we use a different training protocol.
Instead of generating random training graphs that aim to capture some of the topology of the real-world graph, we use a smaller real-world graph (which is either a sample or a reduced representation) and use the Leave-One-Out-Cross-Validation (LOOCV) method to train the model on.
The LOOCV method involves splitting the set of n observations so that a single observation, (x1, y1), is used as the validation set and the remaining observations {(x2, y2), …, (xn, yn)} make up the training set.
The model is then fit on the n-1 training observations, and a prediction is made for the excluded observation from its value x1. In our case, x1 is a node's embedding vector, obtained with the Struc2Vec method, and y1 its exact Betweenness centrality score.
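As a sketch, the resampling itself can be expressed with scikit-learn's LeaveOneOut; here H (the embedding matrix) and y (the exact Betweenness scores) are our variable names, not the notebook's:

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
for train_idx, val_idx in loo.split(H):
    H_train, H_val = H[train_idx], H[val_idx]  # n-1 embeddings vs. the held-out one
    y_train, y_val = y[train_idx], y[val_idx]
    # ...fit the model on (H_train, y_train) and validate on (H_val, y_val)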
The benefit of this approach is that it captures the exact topology of the graph we want to make predictions on. Once the model is trained, we can compute the embedding matrix of any new, much larger graph for which we want to estimate the Betweenness centralities of the nodes.
The training of the model proved to be quite efficient: the Keras EarlyStopping callback halted the training at 30 epochs, which took less than 10 minutes (on Google Colab) for the smallest graph in the Gnutella set, with 6,301 nodes, and resulted in a low loss, as can be seen from the image below.

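For reference, the fit with early stopping looks schematically as follows; the patience value and epoch budget are illustrative assumptions, and H_train, y_train, H_val, y_val are the LOOCV splits sketched above:

from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

history = model.fit(
    H_train, y_train,
    validation_data=(H_val, y_val),
    epochs=200,                # EarlyStopping halts training long before this
    callbacks=[early_stop])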
How accurate?
To verify that the predicted normalized Betweenness values (and the resulting ranking of the nodes) are credible, we use the Kendall Tau rank correlation. This metric compares how similar two rankings are, where a score of 1 means a perfect match between the two lists.
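Computing the metric is a one-liner with SciPy; y_true and y_pred here stand for the exact and the predicted Betweenness scores:

from scipy.stats import kendalltau

# Rank correlation between exact and predicted Betweenness scores
tau, p_value = kendalltau(y_true, y_pred)
print(f"Kendall Tau: {tau:.3f}")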
After training for 30 epochs, we obtained a Kendall Tau score of 0.783.
This is quite a high score considering how quickly we managed to train the model, and it certainly compares favourably with, for instance, the findings Fan et al. achieved with their Deep Ranker for Betweenness (DrBC) model, where the Kendall Tau scores for the top 10% ranged from 0.64 to 0.77.
Approximating the Betweenness scores
As a next step, we can use this trained model to predict the Betweenness centrality scores for the largest graph in the Gnutella set with 62,586 nodes.
The only pre-processing we need is to compute the embedding matrix with Struc2Vec, which becomes the input to our Neural Network.
The workflow replicates only a few steps from the training workflow: we load the graph, again use the NetworkX package to construct a graph, and let the Struc2Vec function produce the embedding matrix. With our trained model saved, we only need a few TensorFlow calls to get the predictions:
import tensorflow as tf

# Re-load the trained model
path_model = '/content/drive/MyDrive/Colab_files/GN08_model.h5'
model = tf.keras.models.load_model(path_model)

# mu_all is the embedding matrix computed by Struc2Vec
x_new = mu_all
y_pred = model.predict(x_new)
Writing the predicted Betweenness back to Neo4j
Now that we have obtained the predicted Betweenness scores, we can write them back to our much larger graph in Neo4j. This then allows us to run all kinds of graph queries in Neo4j and, for instance, helps us identify those key employees. For that, we call on the Neo4j Python driver again and, with the help of our beloved Pandas, execute the following Python code:
from neo4j import GraphDatabase
import pandas as pd

driver = GraphDatabase.driver(
    "bolt://localhost:7687", auth=("neo4j", "<password>"))

# Pair each predicted score with its node id
# (assumes the row order of y_pred matches node ids 0..n-1)
df = pd.DataFrame(y_pred)
df.index.names = ['id']
df.columns = ['Betweenness']
df['NodeId'] = df.index
df_list_nodes = df.values.tolist()  # [[Betweenness, NodeId], ...]

q = """
UNWIND $nodes AS line
MATCH (f:Node {id: toInteger(line[1])})
SET f.Betweenness = line[0]
RETURN count(*)"""

with driver.session() as session:
    result = session.run(q, {"nodes": df_list_nodes})
Conclusion
In this post, Mark Needham and I have tried to illustrate that, when speed is of the essence and a good approximation is sufficient, relatively simple Machine Learning models can be deployed as ‘plug & play’ tools which, through the Neo4j Python drivers, expand the current Neo4j Graph Data Science functionality.