Getting Started

How to get started with the Graph Data Science Library of Neo4j

Big changes to the way graph data science is managed in Neo4j present big opportunity

CJ Sullivan
Towards Data Science
9 min read · Nov 2, 2020


Image by Ecpp, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons. I have not altered this image.

The field of graph analytics has been around for a long time. The general idea is to create a database of things connecting to other things. Those things might be people connecting to other people on social media or maybe flights between cities or any number of other examples. Graphs are regularly used to enhance search capabilities, recommend products to shoppers on e-commerce sites, detect fraud, or to map the shortest route from point A to point B.

Neo4j has long been a key player in the world of graph databases, and it has historically relied on the Cypher query language to interact with the database. There are many primers out there on how to analyze graphs or use Cypher. This post instead provides an introduction to data science tasks using the new Neo4j Graph Data Science Library (GDS)¹, which represents a significant enhancement to the original Neo4j Graph Algorithms library (now deprecated). There are some fundamental shifts from Graph Algorithms to GDS, but for those who have used the former, the new usage patterns will become second nature once you get the hang of them. For the purposes of this writeup, you do not need to have used either library to get started on solving graph problems with data science approaches.

We will start from the premise that you have a running Neo4j server. If you don’t already have one, you can download the open source Community Edition here. Note that to have access to the GDS library you will need a 4.x version of Neo4j. The most recent release is generally the best choice, since it offers the latest GDS functionality (and great new features are added regularly!).

Using .csv files for graph data

Once you have the Neo4j server up and running, you will need to have some graph data with which to populate it. While there are some sample graphs built in to Neo4j, it can be instructive to go through how to import your own data.

Graph data usually comes in the form of edge lists and node lists, which are typically separate files. Neo4j makes importing these files easy when they are in .csv format and we will discuss one of the easiest formats to import. Let’s talk about each of these files separately.

Node lists: in this .csv file we provide the information about the nodes: their labels. There can be many types of labels, or you can provide just a generic node ID. These labels will be used as the identifiers node1 and node2 in the edge list described below. There can be any number of labels associated with a node. Good choices are things like a unique identifier, the name of the node (if known), and the generic type of the node (think of things like “this node is a person” or “this node is a place,” which could be used to give a node type like “person” or “place”). The format would look something like

node1, label1, label2, …

Edge lists: in this .csv file you have the information about which nodes connect to which other nodes. It also can specify the type of relationship. So the format follows something typical like

node1, node2, relationship_type, weight

where node1 is the starting node, node2 is the terminating node, relationship_type specifies the edge label (optional) and weight indicates the strength of that relationship (optional).

Note that it is generally good practice (although not required) to have a header line for each file so you can keep track of what columns are what. Once you have these files, we can get them loaded into Neo4j. For the sake of this tutorial we will be using a graph of the characters from Game of Thrones (GoT). There are several sources of graph data on GoT out there, but I particularly like the one maintained by Andrew Beveridge in this repo due to its simplicity, organization, and ease of use.

The data is conveniently broken into .csv files for both nodes and edges, one file for each season. In this tutorial I have cloned the repo onto my local machine for reading the files, but you can also choose to read them from the web.

WARNING!!! Graph analytics is a powerful tool and will reveal some serious spoilers of GoT as we get into it! So if you don’t want the series spoiled, turn back now!

Loading data into Neo4j

Loading data into the database is pretty straightforward using the built-in LOAD CSV command. Because we are loading .csv files from our local machine, you will want to go into the neo4j.conf file and comment out the line dbms.directories.import=import, which allows files to be loaded from anywhere on the local machine. There are many ways to load .csv data into the database, and I have picked the one that I find the easiest below.

We will begin by loading in the node lists using the following command:

WITH "file:///Users/cj2001/gameofthrones/data/got-s1-nodes.csv" AS uri
LOAD CSV WITH HEADERS FROM uri AS row
MERGE (c:Character {id:row.Id})
SET c.name = row.Label

(This is the command to load in the season 1 file, and you can just repeat this command to load in whichever seasons you want. I have executed this command for each of the s1 through s8 files. Heads up: the season 5 filename has a mild typo in it.)

Using the above command, I now have a series of nodes of type Character that have properties called id and name. Note that we have loaded this file in line by line, with Id and Label corresponding to the column names available from the header line.
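
A quick sanity check after loading the node files can be sketched as follows. It assumes only the Character label and the name property created by the command above; the sample size of 5 is an arbitrary choice for illustration.

```cypher
// Count the Character nodes and peek at a handful of names
MATCH (c:Character)
RETURN count(c) AS character_count,
       collect(c.name)[..5] AS sample_names;
```

Because the load uses MERGE on id, a character who appears in several season files should still be counted only once here.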

Next, we will load in the edge lists with the following command:

WITH "file:///Users/cj2001/gameofthrones/data/got-s1-edges.csv" AS uri
LOAD CSV WITH HEADERS FROM uri AS row
MATCH (source:Character {id: row.Source})
MATCH (target:Character {id: row.Target})
MERGE (source)-[:SEASON1 {weight: toInteger(row.Weight)}]-(target)

(Again, we will repeat the above for each of the edge files we would like to incorporate.)

The above command creates the relationships between the characters, where the edge type is :SEASON1. I find it convenient to give each season its own edge type to allow for exploration of changes to the graph between seasons. The edge itself is weighted by the number of times the source and target characters interact. Note that on import Neo4j considers all columns to be strings, but for our future calculations we want Neo4j to know that Weight is actually an integer, so we have to cast it as such with toInteger(). Also observe that we have used the format (source)-[edge_type]-(target). Because there is no arrow in this pattern, MERGE matches an existing relationship in either direction and, when it has to create one, picks a direction for us (Neo4j always stores a relationship with a direction). As long as we also leave the arrows out of our later queries, the graph is effectively treated as undirected. If we wanted the direction to be meaningful, the pattern would instead look like (source)-[edge_type]->(target).
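
To confirm that the edges and their integer weights made it in, a summary along these lines may help. This is a sketch that assumes every season was loaded with its own relationship type (SEASON1, SEASON2, and so on) following the convention above.

```cypher
// Neo4j stores each relationship once, with a direction, so a
// directed pattern counts every stored edge exactly one time.
MATCH (:Character)-[r]->(:Character)
RETURN type(r) AS season,
       count(r) AS edges,
       min(r.weight) AS min_weight,
       max(r.weight) AS max_weight
ORDER BY season;
```

If min_weight or max_weight comes back null for a season, the weight was probably not set on those edges during the load.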

In-memory graphs: the big deal about moving to the GDS library and a simple PageRank calculation

The creation of in-memory graphs represents a revolutionary step for Neo4j. Basically, what it allows us to do is to create different graphs or subgraphs for analysis. We can run commands on portions of the database rather than the entire database. For example, in the GoT data, we might care about calculations done on the entire graph, or we might only want to do calculations on a single season. I cannot emphasize enough how this new philosophical approach opens a lot of doors for data science and machine learning on graphs!

We will begin by creating an in-memory graph of all 8 seasons. There are two main ways this can be done in GDS: using a Cypher projection or a “Native” projection. I will use the former here since it is a pretty straightforward set of commands and it is easy to understand what is going on. Native projections are fast and powerful, but they are beyond the scope of this tutorial. To create the in-memory graph with a Cypher projection, we use the command

CALL gds.graph.create.cypher(
'full_graph',
'MATCH (n) RETURN id(n) AS id',
'MATCH (n)-[e]-(m) RETURN id(n) AS source, e.weight AS weight, id(m) AS target'
)

The graph creation requires three things:

  1. A graph name (full_graph)
  2. A node query (MATCH (n) RETURN id(n) AS id)
  3. An edge query (MATCH (n)-[e]-(m) RETURN id(n) AS source, e.weight AS weight, id(m) AS target)

The graph name is pretty clear, but the node and edge queries require a little explanation. For speed reasons, GDS works with internal node IDs rather than the actual node information. These IDs are simply integers assigned by Neo4j, and they are not related to any of the node or edge labels we have (including the id property we loaded from the .csv files). So when you see id(n), this is how GDS is actually referring to that data. We obtain these IDs in the node query and then use them again in the edge query to identify the source and target of each edge.
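
To see the difference between the internal ID returned by id(n) and the id property we loaded from the .csv files, a small query like the following can be run. It is only an illustration; the column aliases are my own.

```cypher
// internal_id is assigned by Neo4j; csv_id and name come from our files
MATCH (c:Character)
RETURN id(c) AS internal_id, c.id AS csv_id, c.name AS name
LIMIT 5;
```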

Now we have a graph that we can do some math on. One of the more basic things we could do is to calculate the PageRank of every character in the graph. We can view and use these results in one of two different ways. To understand the difference, consider that the bulk of GDS functions can be called in either stream or write mode. The former outputs the results of the calculation to the screen. For example, to calculate PageRank on this graph and output it to the screen, we would run

CALL gds.pageRank.stream('full_graph', {
maxIterations: 20,
dampingFactor: 0.85,
relationshipWeightProperty: 'weight'
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS id, gds.util.asNode(nodeId).name as name, score as full_pagerank
ORDER BY full_pagerank DESC

(I am just using the default values for maxIterations and dampingFactor.) Note that the results come back keyed by the internal node IDs from before, so we use gds.util.asNode(nodeId) to look up each node and return properties we recognize, like id and name.

If we were to do this on the graph of all 8 seasons, we would find some unsurprising results. The five characters with the highest PageRank across all seasons are, in order, Tyrion, Jon, Daenerys, Cersei, and Sansa. (Don’t worry…Arya is #6.)

Since this result looks sensible, maybe we want to actually use it as a node property for future analyses outside of the in-memory graph. To do this, we use the write method, which takes the results of the calculation and writes them as properties to the nodes. To do that, we run

CALL gds.pageRank.write('full_graph', {
maxIterations: 20,
dampingFactor: 0.85,
relationshipWeightProperty: 'weight',
writeProperty: 'full_pagerank'})

We can now run a simple MATCH (c:Character) RETURN c.name ORDER BY c.full_pagerank DESC to see the same result as above. The difference is that in this case the PageRank has been added to each node as a property called full_pagerank.

This is interesting, but if we know anything about GoT we would expect PageRank to change with each season as characters gain and lose prominence or, well, die. To explore this theory, I am going to create two additional in-memory graphs — one for season 1 and one for season 8. For season 1, this would look something like

CALL gds.graph.create.cypher(
's1_graph',
'MATCH (n) RETURN id(n) AS id',
'MATCH (n)-[e:SEASON1]-(m) RETURN id(n) AS source, e.weight AS weight, id(m) AS target'
)

and season 8 would be similar. You can check that the graphs made it in with CALL gds.graph.list().
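
For completeness, the season 8 projection and the catalog check might look like the sketch below. The graph name s8_graph and the relationship type SEASON8 are assumptions that follow the naming convention used above, and the calls use the same GDS 1.x procedure names as the rest of this post.

```cypher
// Project only the season 8 edges (assumes they were loaded as :SEASON8)
CALL gds.graph.create.cypher(
  's8_graph',
  'MATCH (n) RETURN id(n) AS id',
  'MATCH (n)-[e:SEASON8]-(m) RETURN id(n) AS source, e.weight AS weight, id(m) AS target'
);

// List the in-memory graphs along with their sizes
CALL gds.graph.list()
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount;
```

Run the two statements one at a time. Since the node query matches every node, each seasonal graph should report the same nodeCount as full_graph, with only relationshipCount differing.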

Now, if I run the same PageRank calculation for season 1, I get that Ned, Tyrion, Catelyn, Jon, and Daenerys are the top 5 most influential characters. Repeating this then for season 8, I get that the top 5 are Tyrion, Jon, Daenerys, Jaime, and Sansa. Again, if you know the show, none of this is really surprising.
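
The season 1 calculation described above can be sketched by pointing the same stream call at s1_graph. The alias s1_pagerank and the LIMIT of 5 are my own choices for illustration.

```cypher
// Same parameters as before, but run against the season 1 projection
CALL gds.pageRank.stream('s1_graph', {
  maxIterations: 20,
  dampingFactor: 0.85,
  relationshipWeightProperty: 'weight'
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score AS s1_pagerank
ORDER BY s1_pagerank DESC
LIMIT 5;
```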

Community detection via the Louvain method

One powerful tool included in the GDS Library is the ability to run Louvain clustering on in-memory graphs. To do this on the full graph of all 8 seasons and write the calculated community as a property to each node, we would run

CALL gds.louvain.write('full_graph', 
{relationshipWeightProperty: 'weight',
writeProperty: 'full_community_id'
})

(Note that you will not get great modularity in doing this, but tuning the parameters for Louvain is beyond the scope of this tutorial.) If I want to explore the community detected across all 8 seasons for Tyrion, I would start by getting his community ID with

MATCH (c:Character {name: 'Tyrion'}) RETURN c.name, c.full_community_id

which gave me community 143 (yours will likely be different). I would then find the top PageRank characters within the same community. In my case, I would run

MATCH (c:Character {full_community_id: 143}) RETURN c.name, c.full_community_id ORDER BY c.full_pagerank DESC

and I would find that the 5 most influential characters in Tyrion’s life across all 8 seasons are Cersei, Jaime, Varys, Joffrey, and Tywin. Not surprising. I leave it as an exercise for the reader to explore other characters or look at how the detected communities change by season.
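
Since the community ID will differ from run to run, the two lookups can also be combined so that no ID has to be typed in by hand. This is a sketch that assumes the full_community_id and full_pagerank properties written earlier; the LIMIT of 5 is arbitrary.

```cypher
// Find Tyrion, then rank the other members of his community by PageRank
MATCH (t:Character {name: 'Tyrion'})
MATCH (c:Character)
WHERE c.full_community_id = t.full_community_id
  AND c <> t
RETURN c.name AS name, c.full_pagerank AS full_pagerank
ORDER BY full_pagerank DESC
LIMIT 5;
```

Swapping in a different name in the first MATCH gives a starting point for exploring other characters.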

Final thoughts

I have really been impressed by Neo4j’s move from basic querying to a formalized treatment of graphs that enables data science and machine learning. There is a lot of power here, and I have only scratched the surface in this tutorial. I hope to write others in the future discussing things like Native projections and the whole host of possibilities provided by the vector embeddings of node2vec and GraphSAGE that have moved out of alpha release in the more recent versions of the GDS Library.

[1] M. Needham and A. Hodler, Graph Algorithms: Practical Examples in Apache Spark and Neo4j (2020), O’Reilly Media.
