
Machine Learning these days is all about vectors. Performing a classification task requires that your data be well arranged into rows (observations), each one containing the same number of features (columns). While this is easy to obtain from data originally stored in an Excel sheet or in SQL or NoSQL databases, the transformation is far from obvious when the problem involves complex objects such as texts, images or graphs.
For these objects to be represented as vectors, embedding techniques are used. Embedding algorithms assign a vector of a given "small" size to each of these complex objects, which would otherwise require thousands of features (at least). The challenge of embedding is to preserve some characteristics of the object you are trying to model, while reducing the number of features. For instance, word embedding tries to capture the meaning of a word, such that words semantically close to each other have similar vector representations.
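To make "similar vector representations" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common way of measuring how close two embeddings are. The three-dimensional vectors are made up by hand for illustration; they do not come from any trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written toy "embeddings" (assumption: illustrative values only).
toy_vectors = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "banana": [0.10, 0.05, 0.95],
}

sim_close = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_far = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])

print(f"king / queen : {sim_close:.3f}")
print(f"king / banana: {sim_far:.3f}")
```

With a good embedding, semantically close objects should score higher than unrelated ones, which is what the toy values are built to show.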
Graph embedding
Graph embedding covers several techniques, depending on the objects to be represented. The most common one is node embedding, where the entities to be represented as vectors are nodes, but we can also find edge embedding or whole-graph embedding. This post will focus on the first of these, node embedding.
Graph databases and Neo4j
The primary purpose of graph databases is to make relationships easier to manage, whether we are talking about a web application with complex ManyToMany relationships, or graph data science. Interest in this kind of store has been growing almost constantly, especially since the Cypher query language was introduced in Neo4j.

Cypher makes it easy to write intuitive queries. For instance, I am sure you will instantly get what this query is doing:
MATCH (:User {name: "Emil"})-[:FOLLOWS]->(u:User)
WHERE NOT u.name STARTS WITH "neo"
RETURN u.name, u.dateJoined
A few things to note from the above query:
- Nodes, delimited by parentheses (), have a label, which can be identified by the leading colon (:).
- Relationships, delimited by square brackets [], must have a type. Conventionally, relationship types are written in upper case.
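If the pattern syntax is new to you, the following sketch spells out in plain Python what the query asks for: start from the user named Emil, follow outgoing FOLLOWS relationships, drop the users whose name starts with "neo", and return two properties. The in-memory data and the helper name are hypothetical; Neo4j does not execute queries this way.

```python
# Hypothetical in-memory stand-in for User nodes and FOLLOWS relationships.
users = {
    "Emil": {"dateJoined": "2010-01-01"},
    "neo_fan": {"dateJoined": "2012-03-04"},
    "Alice": {"dateJoined": "2011-05-17"},
    "Bob": {"dateJoined": "2013-08-21"},
}
follows = [
    ("Emil", "neo_fan"),  # Emil -> neo_fan
    ("Emil", "Alice"),    # Emil -> Alice
    ("Bob", "Emil"),      # wrong direction for the pattern, so it is ignored
]


def followed_by(name):
    """Rough equivalent of the MATCH ... WHERE ... RETURN query above."""
    rows = []
    for source, target in follows:
        # (:User {name: ...})-[:FOLLOWS]->(u:User)
        if source != name:
            continue
        # WHERE NOT u.name STARTS WITH "neo"
        if target.startswith("neo"):
            continue
        # RETURN u.name, u.dateJoined
        rows.append((target, users[target]["dateJoined"]))
    return rows


rows = followed_by("Emil")
print(rows)
```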
Trying Neo4j
Convinced to give Neo4j a try? There are two ways to do it:
- Use a sandbox that will run a Neo4j instance for you for a limited period (3 days, which can be extended by 7 more days): https://sandbox.neo4j.com/
- Download Neo4j Desktop, run it locally and enjoy all the features: https://neo4j.com/download/
In both cases, this is totally free. I’ll be using the latter in this blog, creating a new graph using Neo4j 4.1.
Importing some data
Let’s import some data into our graph. To do so, we are going to use the "Game of Thrones" dataset, available through the got browser guide. Type:
:play got
in the browser, and follow the instructions.
Once you have imported the data, the graph schema looks like this:

It contains one node label, Character, and five relationship types, depending on which book the characters were found to interact in, along with a global INTERACTS relationship. We are going to use only the last one for the rest of this post. If you want to visualize some of the data, you can use:
MATCH (n)
RETURN n
LIMIT 200
Let’s go ahead and proceed to the graph analysis and node embedding.
The Graph Data Science plugin (GDS)
The GDS is the successor of the Graph Algorithm plugin, whose first release dates back to 2018. Its goal is to enable the use of graph algorithms, from path finding algorithms to graph neural networks, without having to extract data from Neo4j.
Follow the steps in https://neo4j.com/docs/graph-data-science/current/installation/#_neo4j_desktop for installation.
Projected graph
The Neo4j graph usually contains a lot of data: nodes with different labels, relationships with different types, and properties attached to them. Most of the time, a data science algorithm requires only a small portion of these entities: only some node labels or some relationship types, and only one property (the relationship weight for a shortest path algorithm, for instance). That’s why the GDS does not run on the full Neo4j graph, but on a projected (lighter) version. So, let’s start by building our projected graph. In the Neo4j browser, execute:
CALL gds.graph.create(
  "MyGraph",
  "Character",
  "INTERACTS"
)
Here we create a projected graph named MyGraph, containing all nodes with label Character. Additionally, we add to this projected graph the relationships with type INTERACTS.
Our projected graph is now created, and we can go ahead and execute algorithms on it.
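Conceptually, a projection is just a filter over the stored graph. The sketch below mimics it on a tiny in-memory graph: it keeps the nodes carrying a given label and the relationships of a given type between them, and drops everything else. The data layout, the House label, the BELONGS_TO and INTERACTS_BOOK1 types and the function name are all hypothetical; this is not how the GDS stores or builds its projected graphs.

```python
def project(nodes, relationships, node_label, rel_type):
    """Keep nodes with `node_label` and `rel_type` relationships between them."""
    node_ids = {n["id"] for n in nodes if node_label in n["labels"]}
    edges = [
        (r["source"], r["target"])
        for r in relationships
        if r["type"] == rel_type
        and r["source"] in node_ids
        and r["target"] in node_ids
    ]
    return node_ids, edges


# Hypothetical toy graph: three characters, one node with another label.
nodes = [
    {"id": 0, "labels": {"Character"}, "name": "A"},
    {"id": 1, "labels": {"Character"}, "name": "B"},
    {"id": 2, "labels": {"Character"}, "name": "C"},
    {"id": 3, "labels": {"House"}, "name": "H"},
]
relationships = [
    {"source": 0, "target": 1, "type": "INTERACTS"},
    {"source": 1, "target": 2, "type": "INTERACTS"},
    {"source": 0, "target": 3, "type": "BELONGS_TO"},
    {"source": 0, "target": 2, "type": "INTERACTS_BOOK1"},
]

node_ids, edges = project(nodes, relationships, "Character", "INTERACTS")
print(sorted(node_ids), edges)
```

Note that node properties such as name are dropped too: the algorithm only sees what the projection explicitly keeps.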
Executing Node2vec
The simplest way to run the node2vec algorithm on the MyGraph projected graph is this query:
CALL gds.alpha.node2vec.stream("MyGraph")
The result in the browser looks like the following image, where a list of numbers is assigned to each node (identified by its internal Neo4j ID, nodeId):

If you know a bit about how node2vec works, you know there are many parameters to configure how:
- the training data is built (random walk parameters): number of steps, number of generated walks per node, in-out and return factors.
- the skip-gram neural network is trained: embedding size, initial learning rate, etc.
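To give an idea of what the random walk parameters control, here is a minimal sketch of the biased second-order walk described in the node2vec paper, in pure Python. It is not the GDS implementation, and the function and argument names (walk_length, walks_per_node, return_factor, in_out_factor) are my own. When choosing the next node, going back to the previous node is weighted by 1/return_factor, moving to a node that is also a neighbour of the previous node is weighted by 1, and moving further away is weighted by 1/in_out_factor.

```python
import random


def biased_walk(adj, start, walk_length, return_factor, in_out_factor, rng):
    """One node2vec-style walk on an unweighted graph given as {node: set(neighbours)}."""
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbours = sorted(adj[current])  # sorted for reproducibility
        if not neighbours:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbours))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbours:
            if candidate == previous:
                weights.append(1.0 / return_factor)   # step back
            elif candidate in adj[previous]:
                weights.append(1.0)                   # stay close
            else:
                weights.append(1.0 / in_out_factor)   # move outward
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


def generate_walks(adj, walks_per_node, walk_length,
                   return_factor=1.0, in_out_factor=1.0, seed=42):
    """Training "sentences" for skip-gram: several walks starting at each node."""
    rng = random.Random(seed)
    return [
        biased_walk(adj, node, walk_length, return_factor, in_out_factor, rng)
        for node in sorted(adj)
        for _ in range(walks_per_node)
    ]


# Small undirected toy graph (assumption: illustrative only).
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}

walks = generate_walks(toy_graph, walks_per_node=2, walk_length=5,
                       return_factor=2.0, in_out_factor=0.5)
for walk in walks[:3]:
    print(" -> ".join(walk))
```

The resulting walks play the role that sentences play in word2vec: the skip-gram network is then trained to predict which nodes appear near each other in a walk.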
The full list of parameters is reproduced below, from the GDS documentation page.

Let’s try for instance to reduce the embedding size:
CALL gds.alpha.node2vec.stream("MyGraph", {walksPerNode: 2, embeddingSize: 10})
Unsurprisingly, the output now looks like:

Before dealing with the usage of these results, let’s see how to use another embedding algorithm, GraphSAGE.
Executing GraphSAGE
While Node2vec only takes into account the graph structure, GraphSAGE is able to consider node properties, if any.
In our GoT graph, nodes only have a name property, which is not that meaningful for embedding. We will therefore use only the node degree, i.e. the number of relationships attached to the node, as a property:
CALL gds.alpha.graphSage.stream("MyGraph", {degreeAsProperty: true})
The full list of parameters includes configuration of the properties (nodePropertyNames), the aggregator function (mean by default), the batch size… See image below for a full list.
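To illustrate the idea behind the degree property and the mean aggregator, here is a deliberately simplified sketch in pure Python. It computes the degree of each node, then concatenates each node's own feature vector with the mean of its neighbours' feature vectors. A real GraphSAGE layer also samples neighbours, multiplies by a learned weight matrix and applies a non-linearity; all of that is left out here, and the function names are my own.

```python
def degree_features(adj):
    """Use the node degree as the only input feature of each node."""
    return {node: [float(len(neighbours))] for node, neighbours in adj.items()}


def mean_aggregate(adj, features):
    """Concatenate each node's features with the mean of its neighbours' features."""
    aggregated = {}
    for node, neighbours in adj.items():
        dim = len(features[node])
        if neighbours:
            mean = [
                sum(features[n][i] for n in neighbours) / len(neighbours)
                for i in range(dim)
            ]
        else:
            mean = [0.0] * dim  # isolated node: nothing to aggregate
        aggregated[node] = features[node] + mean
    return aggregated


# Small undirected toy graph (assumption: illustrative only).
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a", "d"},
    "d": {"a", "c"},
}

features = degree_features(toy_graph)
hidden = mean_aggregate(toy_graph, features)
for node in sorted(hidden):
    print(node, hidden[node])
```

Even with a single feature, the aggregated vector already distinguishes a low-degree node attached to a hub from a low-degree node attached to another low-degree node.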

Making use of the results with Python
Neo4j provides a Python driver that can be easily installed through pip. However, in this post, I will talk about a small tool that I have developed to call GDS procedures from Python with little effort: pygds. It still needs to be heavily tested, so feel free to report issues if you find any.
Let’s start by installing the package in your favorite Python environment:
pip install "pygds>=0.2.0"
Then, you can import the library and define the credentials to connect to your Neo4j graph:
from pygds import GDS
URI = "bolt://localhost:7687"
AUTH = ("neo4j", "<YOUR_PASSWORD>")
Usage of pygds is as follows:
with GDS(URI, auth=AUTH) as gds:
    # create the projected graph
    # NB: make sure a graph with the same name does not already exist,
    # otherwise run CALL gds.graph.drop("MyGraph") first
    gds.graph.create(
        "MyGraph",
        "Character",
        "INTERACTS",
    )
    # run any algorithm on it
For instance, to run the node2vec algorithm, we will write (still inside the with block):
    result = gds.alpha.node2vec.stream(
        "MyGraph",
        {
            "walksPerNode": 2,
            "embeddingSize": 10,
        },
    )
The result can then be parsed into a DataFrame:
import pandas as pd
_tmp = pd.DataFrame.from_records(result)
df = pd.DataFrame(_tmp["embedding"].tolist())
print(df.head())
From there, you can apply any Machine Learning algorithm, from PCA for visualization to classification if the nodes also have a target class…
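As a toy illustration of the classification case, here is a minimal nearest-neighbour classifier in pure Python: a node gets the class of the labelled node whose embedding is closest. The embeddings and class names are made up; in practice you would rather feed the DataFrame to a library such as scikit-learn.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(labelled, query):
    """Return the label of the labelled embedding closest to `query`."""
    _, label = min(labelled, key=lambda pair: euclidean(pair[0], query))
    return label


# Hypothetical 2-dimensional embeddings with a known class.
labelled = [
    ([0.9, 0.1], "class_a"),
    ([0.8, 0.2], "class_a"),
    ([0.1, 0.9], "class_b"),
    ([0.2, 0.7], "class_b"),
]

prediction_1 = predict_1nn(labelled, [0.85, 0.15])
prediction_2 = predict_1nn(labelled, [0.15, 0.80])
print(prediction_1, prediction_2)
```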
As a last step, whether you are working with Cypher or pygds, you have to drop the projected graph, which is kept in memory (RAM):
gds.graph.drop("MyGraph")
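If you prefer not to depend on pygds, the same create / stream / drop round trip can be sketched with the official neo4j Python driver mentioned earlier. This is a sketch under assumptions: it uses the procedure names from this post (later GDS releases renamed some of them), the function names are my own, and it has not been run against every driver or GDS version. The try/finally makes sure the projected graph is dropped even if the algorithm call fails.

```python
def embeddings_by_node(records):
    """Turn rows like {"nodeId": 1, "embedding": [...]} into {nodeId: embedding}."""
    return {row["nodeId"]: list(row["embedding"]) for row in records}


def node2vec_round_trip(uri, auth, graph_name="MyGraph"):
    """Create the projected graph, stream node2vec embeddings, then drop the graph."""
    # Imported here so that the helper above stays usable without the driver.
    from neo4j import GraphDatabase

    config = {"walksPerNode": 2, "embeddingSize": 10}
    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            session.run(
                "CALL gds.graph.create($name, 'Character', 'INTERACTS')",
                {"name": graph_name},
            ).consume()
            try:
                result = session.run(
                    "CALL gds.alpha.node2vec.stream($name, $config) "
                    "YIELD nodeId, embedding "
                    "RETURN nodeId, embedding",
                    {"name": graph_name, "config": config},
                )
                records = [record.data() for record in result]
            finally:
                # Always free the memory held by the projected graph.
                session.run(
                    "CALL gds.graph.drop($name)", {"name": graph_name}
                ).consume()
    return embeddings_by_node(records)
```

With a running database, it would be called as node2vec_round_trip("bolt://localhost:7687", ("neo4j", "<YOUR_PASSWORD>")).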
Want to know more?
That’s great! The GDS contains many more fantastic algorithm implementations (path finding, node importance, community detection, node similarity, link prediction) and features (e.g. writing the algorithm results as node properties to persist them in the graph instead of streaming them).
Here are a few online resources you can check if you are interested in learning more about graph algorithms and Neo4j:
- Free book "Graph Algorithms: Practical Examples in Apache Spark and Neo4j" by Mark Needham and Amy E. Hodler: https://neo4j.com/graph-algorithms-book/ (caveat 1: this book was written using the Graph Algorithm library, the predecessor of the GDS, but it is still a must-read to understand graph algorithm use cases, and a migration guide is available together with updated examples using the GDS; caveat 2: it may not be free indefinitely. The end of its "free period" is periodically announced by Neo4j and has so far always been extended, but who knows.)
- Check the GraphAcademy created by Neo4j and its data science course: https://neo4j.com/graphacademy/online-training/data-science/part-0/
- You can follow Tomaz Bratanic on Medium and on his blog to see many interesting use cases for all the algorithms of the GDS: https://tbgraph.wordpress.com/
- And of course, read the doc: https://neo4j.com/docs/graph-data-science/current/
- For more information about the algorithms discussed here, see for instance A Gentle Introduction to Graph Neural Networks (Basis, DeepWalk and GraphSage) by Kung-Hsiang, Huang (Steeve):