Computing Node Embedding with a Graph Database: Neo4j & its Graph Data Science Library

Using the new version of the Neo4j Graph Data Science plugin to compute node embeddings and extract them into a pandas DataFrame thanks to…

Image by Gerd Altmann

Machine learning these days is all about vectors. Performing a classification task requires that your data be arranged into rows (observations), each one containing the same number of features (columns). While this is easy to obtain from data originally stored in an Excel sheet or in SQL or NoSQL databases, the transformation is far from obvious when the problem involves complex objects such as texts, images or graphs.

For these objects to be represented as vectors, embedding techniques are used. Embedding algorithms assign a vector of a given "small" size to each of these complex objects, which would otherwise require thousands of features or more. The challenge of embedding is to preserve some characteristics of the object you are trying to model while reducing the number of features. For instance, word embedding tries to capture word meaning, such that words that are semantically close to each other have similar vector representations.

Graph embedding

Graph embedding covers several techniques, depending on the objects to be represented. The most common one is node embedding, where the entities to be represented as vectors are nodes, but we can also find edge embedding or whole-graph embedding. This post focuses on node embedding.

Graph databases and Neo4j

The primary purpose of graph databases is to make relationships easier to manage, whether we are talking about a web application with complex many-to-many relationships or about graph data science. Interest in this kind of store has been growing almost constantly, especially since the Cypher query language was introduced in Neo4j.

Source: https://db-engines.com/de/blog_post/53

Cypher makes it easy to write intuitive queries. For instance, I am sure you will instantly understand what this query is doing:

MATCH (:User {name: "Emil"})-[:FOLLOWS]->(u:User)
WHERE NOT u.name STARTS WITH "neo"
RETURN u.name, u.dateJoined

A few things to note from the above query:

  • Nodes, delimited by parentheses (), have a label, which is identified by the leading colon (:).
  • Relationships, delimited by square brackets [], must have a type. By convention, relationship types are written in upper case.

Trying Neo4j

Convinced to give Neo4j a try? You have two ways to do it:

In both cases, this is totally free. I’ll be using the latter in this blog, creating a new graph using Neo4j 4.1.

Importing some data

Let’s import some data into our graph. To do so, we are going to load the "Game Of Thrones" dataset through the got browser guide. Type:

:play got

in the browser, and follow the instructions.

Once you have imported the data, the graph schema looks like this:

The "Game Of Thrones" graph schema. (CALL db.schema.visualization)

It contains one node label, Character, and five relationship types: some depend on which book the characters were found to interact in, and one is a global INTERACTS relationship. We are going to use only the last one for the rest of this post. If you want to visualize some of the data, you can use:

MATCH (n)
RETURN n
LIMIT 200

Let’s go ahead and proceed to the graph analysis and node embedding.

The Graph Data Science plugin (GDS)

The GDS is the successor of the Graph Algorithms plugin, whose first release dates back to 2018. Its goal is to enable the use of graph algorithms, from path finding algorithms to graph neural networks, without having to extract data from Neo4j.

Follow the steps in https://neo4j.com/docs/graph-data-science/current/installation/#_neo4j_desktop for installation.

Projected graph

The Neo4j graph usually contains a lot of data: nodes with different labels, relationships with different types, and properties attached to them. Most of the time, a data science algorithm requires only a small portion of these entities: a few node labels, a few relationship types and a single property (the relationship weight for a shortest path algorithm, for instance). That’s why the GDS does not run on the full Neo4j graph, but on a projected (lighter) version. So, let’s start and build our projected graph. In the Neo4j browser, execute:

CALL gds.graph.create(
    "MyGraph", 
    "Character", 
    "INTERACTS"
)

Here we create a projected graph named MyGraph, containing all nodes with label Character. Additionally, we add to this projected graph relationships with type INTERACTS.
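To make the idea of a projection concrete, here is a minimal pure-Python sketch of the filtering it performs. It is only an illustration of the concept, not how the GDS stores graphs internally; the toy data, the House label, the BELONGS_TO type and the project function are all made up for this example.

```python
# Toy "stored" graph (hypothetical data): nodes carry a label,
# relationships carry a type.
nodes = {
    1: {"label": "Character", "name": "Jon"},
    2: {"label": "Character", "name": "Arya"},
    3: {"label": "House", "name": "Stark"},
}
relationships = [
    (1, "INTERACTS", 2),
    (1, "BELONGS_TO", 3),
    (2, "BELONGS_TO", 3),
]


def project(nodes, relationships, label, rel_type):
    """Keep nodes with `label` and relationships of `rel_type` between them."""
    kept = {node_id for node_id, props in nodes.items() if props["label"] == label}
    edges = [
        (source, target)
        for source, current_type, target in relationships
        if current_type == rel_type and source in kept and target in kept
    ]
    return kept, edges


projected_nodes, projected_edges = project(
    nodes, relationships, "Character", "INTERACTS"
)
print(sorted(projected_nodes), projected_edges)
```

Everything that does not match the requested label and relationship type is simply left out, which is what keeps the projected graph light.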

Our projected graph is now created, so we can go ahead and execute algorithms on it.

Executing Node2vec

The simplest way to run the node2vec algorithm on the MyGraph projected graph is this query:

CALL gds.alpha.node2vec.stream("MyGraph")

The result in the browser looks like the following image, where a list of numbers gets assigned to each node (identified by its internal Neo4j ID, nodeId):

Output of the node2vec procedure with default parameters.

If you know a bit about how node2vec works, you know there are many parameters to configure how:

  • the training data is built (random walk parameters): number of steps, number of generated walks per node, in-out and return factors.
  • the skip-gram neural network is trained: embedding size, initial learning rate, etc.
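To illustrate what the random walk parameters control, here is a minimal pure-Python sketch of a node2vec-style biased walk on a toy undirected graph. It follows the usual description of the algorithm (a step back to the previous node is weighted by 1/return factor, a step away from it by 1/in-out factor) and is not the GDS implementation. The function name, argument names and toy graph are assumptions made for this example.

```python
import random


def biased_walk(adj, start, walk_length, return_factor=1.0, in_out_factor=1.0,
                rng=random):
    """One node2vec-style second-order random walk.

    `adj` maps each node to the set of its neighbours (undirected graph).
    `walk_length` is the number of nodes in the returned walk.
    """
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbours = sorted(adj[current])
        if not neighbours:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # first step: no previous node yet, so pick uniformly
            walk.append(rng.choice(neighbours))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbours:
            if candidate == previous:
                weights.append(1.0 / return_factor)  # go back where we came from
            elif candidate in adj[previous]:
                weights.append(1.0)  # stay at distance 1 from the previous node
            else:
                weights.append(1.0 / in_out_factor)  # move further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


# Toy graph (hypothetical): a triangle A-B-C with a pendant node D attached to C.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
rng = random.Random(42)
walks_per_node = 2
walks = [
    biased_walk(adj, node, 5, return_factor=2.0, in_out_factor=0.5, rng=rng)
    for node in sorted(adj)
    for _ in range(walks_per_node)
]
print(walks)
```

The generated walks play the role of "sentences" for the skip-gram model, which is why the number of walks per node and the walk length directly affect how much training data is produced.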

The full list of parameters is reproduced below, from the GDS documentation page.

Node2vec parameters from the GDS documentation https://neo4j.com/docs/graph-data-science/current/algorithms/node-embeddings/node2vec/

Let’s try, for instance, to reduce the embedding size (and the number of walks per node):

CALL gds.alpha.node2vec.stream("MyGraph", {walksPerNode: 2, embeddingSize: 10})

Unsurprisingly, each node now gets a vector of only 10 numbers.

Before dealing with the usage of these results, let’s see how to use another embedding algorithm, GraphSAGE.

Executing GraphSAGE

While Node2vec only takes into account the graph structure, GraphSAGE is able to consider node properties, if any.

In our GoT graph, nodes only have a name property, which is not that meaningful for embedding. We will therefore use only the node degree, i.e. the number of relationships attached to a node, as the property:

CALL gds.alpha.graphSage.stream("MyGraph", {degreeAsProperty: true})

The full list of parameters includes configuration of the properties (nodePropertyNames), the aggregator function (mean by default), the batch size… See image below for a full list.

GraphSAGE parameters from the GDS documentation: https://neo4j.com/docs/graph-data-science/current/algorithms/alpha/graph-sage/
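As a rough intuition for what "degree as property" combined with a mean aggregator means, here is a simplified pure-Python sketch of a single aggregation step. It deliberately leaves out the learned weight matrices, the non-linearity and the neighbour sampling of the real algorithm, so it is not what the GDS computes; the function names and the toy graph are assumptions made for this example.

```python
def degree_features(adj):
    """Use the node degree as the single input feature of each node."""
    return {node: [float(len(neighbours))] for node, neighbours in adj.items()}


def mean_aggregate(adj, features):
    """Concatenate each node's own features with the mean of its neighbours'."""
    aggregated = {}
    for node, neighbours in adj.items():
        own = features[node]
        if neighbours:
            mean = [
                sum(features[n][i] for n in neighbours) / len(neighbours)
                for i in range(len(own))
            ]
        else:
            mean = [0.0] * len(own)  # isolated node: nothing to aggregate
        aggregated[node] = own + mean
    return aggregated


# Toy graph (hypothetical): a triangle A-B-C with a pendant node D attached to C.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
hidden = mean_aggregate(adj, degree_features(adj))
print(hidden)
```

Even in this stripped-down form, a node's representation already mixes its own degree with information about its neighbourhood, which is the core idea behind GraphSAGE.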

Making use of the results with Python

Neo4j provides a Python driver that can be easily installed through pip. However, in this post, I will talk about a small tool I have developed that lets you call GDS procedures from Python without effort: pygds. It still needs to be heavily tested, so feel free to report issues if you find any.
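For comparison, here is a minimal sketch of how the same node2vec stream call could be issued through the official driver. The helper name stream_node2vec is made up for this example, and the YIELD column names (nodeId, embedding) are assumed from the output described earlier in this post. The function only needs an object that behaves like a driver session, so the part that requires a running database is left as comments.

```python
def stream_node2vec(session, graph_name, embedding_size=10):
    """Run the node2vec stream procedure and return one dict per node.

    `session` is assumed to behave like a neo4j.Session: `run()` returns
    an iterable of records exposing a `data()` method.
    """
    query = (
        "CALL gds.alpha.node2vec.stream($graph, {embeddingSize: $size}) "
        "YIELD nodeId, embedding "
        "RETURN nodeId, embedding"
    )
    result = session.run(query, graph=graph_name, size=embedding_size)
    return [record.data() for record in result]


# Usage against a running database (not executed here):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver(
#     "bolt://localhost:7687", auth=("neo4j", "<YOUR_PASSWORD>")
# )
# with driver.session() as session:
#     rows = stream_node2vec(session, "MyGraph")
# driver.close()
```

The rest of this post relies on pygds instead, which hides the Cypher string behind Python attribute access.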

Let’s start by installing the pygds package in your favorite Python environment:

pip install "pygds>=0.2.0"

Then, you can import the library and define the credentials to connect to your Neo4j graph:

from pygds import GDS

URI = "bolt://localhost:7687"
AUTH = ("neo4j", "<YOUR_PASSWORD>")

Usage of pygds is as follows:

with GDS(URI, auth=AUTH) as gds:
    # create the projected graph
    # NB: make sure a graph with the same name does not already exist,
    # otherwise run CALL gds.graph.drop("MyGraph") first
    gds.graph.create(
        "MyGraph", 
        "Character", 
        "INTERACTS",
    )
    # run any algorithm on it

For instance, to run the node2vec algorithm, we will write (inside the with block above):

result = gds.alpha.node2vec.stream(
    "MyGraph", 
    {
        "walksPerNode": 2, 
        "embeddingSize": 10
    }
)

The result can then be parsed into a DataFrame:

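import pandas as pd

# one record per node, with the keys returned by the procedure
_tmp = pd.DataFrame.from_records(result)
# expand each embedding list into one column per dimension;
# rows keep the order of _tmp, so _tmp["nodeId"] still identifies them
df = pd.DataFrame(_tmp["embedding"].tolist())
print(df.head())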

Starting from there, you can apply any machine learning algorithm, from PCA for visualization to classification if nodes also have some target class…
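One simple thing you can do with the vectors, before reaching for a full model, is to look for the nodes whose embeddings are closest to a given node. Below is a minimal pure-Python sketch using cosine similarity. The records are made-up stand-ins shaped like the streamed result (a nodeId and an embedding per node), and the function names are assumptions for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if one of them is null)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(records, node_id, top_k=3):
    """Return the `top_k` (nodeId, similarity) pairs closest to `node_id`."""
    by_id = {record["nodeId"]: record["embedding"] for record in records}
    target = by_id[node_id]
    scores = [
        (other_id, cosine_similarity(target, embedding))
        for other_id, embedding in by_id.items()
        if other_id != node_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical records, NOT real node2vec output.
records = [
    {"nodeId": 0, "embedding": [1.0, 0.0]},
    {"nodeId": 1, "embedding": [0.9, 0.1]},
    {"nodeId": 2, "embedding": [0.0, 1.0]},
    {"nodeId": 3, "embedding": [-1.0, 0.0]},
]
neighbours = most_similar(records, 0, top_k=2)
print(neighbours)
```

With real embeddings, the same function could be called on the list returned by the stream procedure, provided it has the same keys.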

As a last step, whether you are working with Cypher or pygds, you have to drop the projected graph, which is otherwise kept in your computer’s memory: CALL gds.graph.drop("MyGraph") in Cypher, or gds.graph.drop("MyGraph") with pygds.

Want to know more?

That’s great! The GDS contains many more algorithm implementations (path finding, node importance, community detection, node similarity, link prediction) and features (e.g. writing the algorithm results as node properties to persist them in the graph instead of streaming them).

Here are a few online resources you can check if you are interested in learning more about graph algorithms and Neo4j:
