Understanding graph embeddings with Neo4j and Emblaze

Published in Towards Data Science · 9 min read · May 2, 2022

Graph embeddings can represent the rich network of relationships and properties in a graph as vectors. These embedding vectors are useful for comparing nodes, and they are also valuable inputs for machine learning algorithms. Neo4j Graph Data Science makes it possible to derive embeddings from a graph using only a few lines of Python code.

While it’s pretty simple to generate embeddings with Neo4j, it’s not always easy to tell if you have the right embedding for your application. Neo4j offers a choice of several embedding algorithms, and each algorithm has multiple tunable hyperparameters. How can you understand the differences between embedding results and choose the best embedding for your use case? Enter Emblaze, a JupyterLab widget developed by the Carnegie Mellon University Data Interaction Group that beautifully supports interactive exploration of multiple embedding options for a dataset.

Graph embedding visualization and animation with Emblaze (image by author)

In this tutorial, we will generate several different graph embeddings for a dataset involving airports and air routes. We will then use Emblaze to visualize and compare the embedding results. To complete this hands-on tutorial, you will need a free Neo4j Sandbox account and a JupyterLab environment. You can use pip to install JupyterLab, or get it as part of Anaconda. With those prerequisites in place, you can follow along with the video and tutorial below.

Choose the Graph Data Science project in Neo4j Sandbox (image by author)

The Graph Data Science sandbox comes with the Graph Data Science library installed, and a preloaded dataset of airports and air routes. When the sandbox has launched, you can expand the information about the project to reveal Connection details. Note the Password and Bolt URL for use later.

Connection details from sandbox (image by author)

Find the Airports_Emblaze.ipynb notebook on GitHub. Switch to the “Raw” view, and then save the file from your browser. The downloaded file will be called Airports_Emblaze.ipynb.txt by default, but rename it to remove the “.txt” extension. Launch JupyterLab, and then open Airports_Emblaze.ipynb.

If you haven’t installed the Emblaze and GraphDataScience Python libraries, you can use the commands in the first two cells of the notebook to install them.

In the fourth notebook cell, replace the bolt_url and password with the connection information from your Neo4j sandbox.

bolt_url = "bolt://44.198.160.170:7687"
user = "neo4j"
password = "specifications-lifeboats-jacket"
gds = GraphDataScience(bolt_url, auth=(user, password))

Notice that we’re using the GraphDataScience package, which is the new data science Python client for Neo4j. If you’re an experienced Neo4j user, and the syntax in the notebook looks a little different from what you’re used to, that is why.

The preloaded dataset for the data science sandbox contains nodes representing airports and geographic divisions. The HAS_ROUTE relationship that connects Airport nodes includes a distance property that tells how far apart the airports are. We would like to transform that property value into a weight that reflects a stronger relationship for airports that are closer together. The Cypher statement below applies a simple linear transformation: it subtracts each route’s distance from the maximum distance across all routes plus 1. That way, we always have a positive value for weight.

gds.run_cypher("""
match (:Airport)-[r:HAS_ROUTE]->(:Airport)
with collect(r) as routes, max(r.distance) as maxDistance
foreach(route in routes | set route.weight = maxDistance + 1 - route.distance)
""")

Feel free to experiment with other transformations that will assign routes with small distance values high weights. For example, you could try applying a negative exponent to the distance.
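
As a rough illustration of how these two approaches differ, the sketch below applies the linear transformation and a negative-exponent alternative to a few made-up distances in plain Python. The sample distances and the exponent of -1 are assumptions chosen for illustration; they are not taken from the sandbox data.

```python
# Hypothetical route distances (not from the sandbox dataset).
distances = [100, 500, 2500, 9000]

max_distance = max(distances)

# Linear transformation, mirroring the Cypher statement above:
# the longest route gets a weight of 1, shorter routes get more.
linear_weights = [max_distance + 1 - d for d in distances]

# Alternative: a negative exponent, here distance ** -1 (an assumed choice).
inverse_weights = [d ** -1 for d in distances]

for d, lw, iw in zip(distances, linear_weights, inverse_weights):
    print(f"distance={d:>5}  linear={lw:>5}  inverse={iw:.5f}")
```

Both transformations give the shortest route the largest weight. The linear version preserves the differences between distances, while the inverse version concentrates most of the weight on the very shortest routes.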

Next, we’ll create an in-memory graph projection that contains the Airport nodes and the HAS_ROUTE relationships with their weight properties. We’ll treat the HAS_ROUTE relationships as undirected.

G_routes, result = gds.graph.project(
   "air-routes", 
   "Airport",                                   
   {"HAS_ROUTE":
      {"orientation":"UNDIRECTED", 
      "aggregation":"MAX"}
   }, 
   relationshipProperties = "weight")

I like to run the weakly connected components (WCC) algorithm as part of my exploratory data analysis process. This finds sets of connected nodes in the graph. We call the algorithm in stats mode first to quickly look at the distribution of component sizes.

routes_wcc = gds.wcc.stats(G_routes)
routes_wcc['componentDistribution']
--------------------------------
{'p99': 1,
 'min': 1,
 'max': 3292,
 'mean': 16.52358490566038,
 'p90': 1,
 'p50': 1,
 'p999': 3292,
 'p95': 1,
 'p75': 1}

It looks like we have one giant connected component containing 3,292 Airport nodes, and then some small components containing single Airport nodes. Perhaps those small components represent general aviation airports that are not connected to commercial aviation routes. For the purposes of our demo, we are only interested in working with airports that are part of the giant component.
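
To make the idea of weakly connected components concrete, here is a minimal union-find sketch in plain Python. It is not how the Graph Data Science library implements WCC; the toy node names and edges are hypothetical, and the point is only to show how one large component and several single-node components can arise.

```python
from collections import Counter

def weakly_connected_components(nodes, edges):
    """Map each node to a component representative using union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        # Direction is ignored: an edge either way joins the two components.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    return {n: find(n) for n in nodes}

# Hypothetical toy graph: four connected airports and two isolated ones.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("D", "C")]

components = weakly_connected_components(nodes, edges)
component_sizes = sorted(Counter(components.values()).values(), reverse=True)
print(component_sizes)
```

The sorted list of component sizes plays the same role as the componentDistribution summary: one dominant component followed by a tail of components of size one.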

We run the WCC algorithm again, but this time in mutate mode to update the in-memory graph.

gds.wcc.mutate(G_routes, mutateProperty = 'componentId')

We would also like to write the componentId property generated by WCC to the persistent graph on disk so that we can refer to it after our GDS session. We run the gds.graph.writeNodeProperties() function to do this.

gds.graph.writeNodeProperties(G_routes, ['componentId'])

We run a Cypher query to get the componentId associated with the giant component.

gds.run_cypher("MATCH (a:Airport) RETURN a.componentId as componentId, count(*) as nodeCount ORDER BY count(*) DESC limit 1")

componentId   nodeCount
---------------------------
0             3292

Now, we can project a subgraph that contains only the nodes with the componentId of the giant component.

G_connected_airports, result = gds.beta.graph.project.subgraph("connected-airports", G_routes, "n.componentId = 0", "*")

We write a function that will create Fast Random Projection (FastRP) embeddings for each airport node in the subgraph. I chose FastRP for this project because the algorithm runs very quickly, even in a Sandbox environment. FastRP is good at representing the local neighborhoods around nodes. The embeddings will be added to the in-memory graph projection. If we decide we are happy with one of the embeddings, we can write a property for the embedding vectors to the on-disk graph later.

For each run of the algorithm, we will generate embedding vectors of length 64, and we’ll use a random seed of 45. The embedding dimension is a tunable parameter, but we’ll hold it steady here so that we can examine the impact of other parameters. I chose an embedding dimension of 64 because this dataset is relatively small and simple. Feel free to experiment with larger or smaller embedding dimensions on your own. Setting the random seed keeps things consistent across repeated runs of the algorithm. The random seed value of 45 was an arbitrary choice. We will also pass in a dictionary containing specific parameters for each embedding.

def train_fast_rp(graph, config):
    result = gds.fastRP.mutate(
        graph,
        embeddingDimension = 64,
        randomSeed = 45,
        **config
    )
    return result

We will experiment with three different values for the iterationWeights property. This property contains a list of weights that are applied to each iteration of the algorithm. With each iteration, the FastRP algorithm traverses one step farther away from the nodes for which the embedding is being calculated. A short list means that the algorithm iterates few times and each node’s embedding is influenced only by its nearest neighbors. A longer list means that the algorithm iterates more times and each node’s embedding is influenced by more distant neighbors.
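
The effect of list length on reach can be seen in a deliberately simplified sketch. This is not the FastRP implementation: it swaps the random starting vectors for one-hot vectors so that each coordinate can be traced back to a single node, and it skips the normalization steps. The function names and the toy path graph are hypothetical.

```python
def propagate(neighbors, iteration_weights):
    """Weighted sum of repeatedly neighbor-averaged vectors (simplified sketch)."""
    nodes = sorted(neighbors)
    index = {n: i for i, n in enumerate(nodes)}
    # One-hot starting vectors stand in for FastRP's random vectors so that
    # each coordinate of the result can be traced back to one node.
    current = {n: [1.0 if i == index[n] else 0.0 for i in range(len(nodes))]
               for n in nodes}
    embedding = {n: [0.0] * len(nodes) for n in nodes}

    for weight in iteration_weights:
        # One iteration: each node's vector becomes the mean of its neighbors' vectors.
        current = {
            n: [sum(current[m][i] for m in neighbors[n]) / len(neighbors[n])
                for i in range(len(nodes))]
            for n in nodes
        }
        # Add this iteration's vectors to the result, scaled by its weight.
        for n in nodes:
            embedding[n] = [e + weight * c for e, c in zip(embedding[n], current[n])]
    return embedding, nodes

def influencers(embedding, nodes, target):
    """Nodes whose starting vector contributes to the target's embedding."""
    return {n for n, value in zip(nodes, embedding[target]) if value != 0.0}

# Hypothetical path graph: A - B - C - D - E
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
             "D": ["C", "E"], "E": ["D"]}

shallow, nodes = propagate(neighbors, [1.0])
deeper, _ = propagate(neighbors, [1.0, 1.0])
skip_first, _ = propagate(neighbors, [0.0, 1.0])

print(influencers(shallow, nodes, "A"))
print(influencers(deeper, nodes, "A"))
print(influencers(skip_first, nodes, "A"))
```

In this toy version, a single weight lets only the direct neighbor B shape A's vector, a second weight brings in nodes two steps away, and a weight of 0.0 in the first position removes the one-step contribution while keeping the two-step one.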

We’ll also compare embeddings that use the weight property on the relationships with those that treat the graph as unweighted. Altogether, that will give us six embeddings to explore.

configs = [{"iterationWeights": [1.0, 1.0], 
            "mutateProperty": "shallowUnweighted"},
           {"iterationWeights": [0.0, 1.0, 1.0], 
            "mutateProperty": "mediumUnweighted"},
           {"iterationWeights": [1.0, 1.0, 1.0, 1.0], 
            "mutateProperty": "deepUnweighted"},
           {"iterationWeights": [1.0, 1.0], 
            "relationshipWeightProperty": "weight", 
            "mutateProperty": "shallowWeighted"},
           {"iterationWeights": [0.0, 1.0, 1.0], 
            "relationshipWeightProperty": "weight", 
            "mutateProperty": "mediumWeighted"},
           {"iterationWeights": [1.0, 1.0, 1.0, 1.0], 
            "relationshipWeightProperty": "weight", 
            "mutateProperty": "deepWeighted"}]

Now we run the train_fast_rp function for each of the six configurations.

embedding_results = [train_fast_rp(G_connected_airports, config) for config in configs]
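
The six configurations are the cross product of three iterationWeights lists and two weighting choices. If you want to try more combinations, one possible approach is to generate the list rather than typing it out. The helper below is a hypothetical convenience, not part of the notebook; it reproduces the same six dictionaries in the same order.

```python
# Depth presets, copied from the hand-written configurations.
depths = {
    "shallow": [1.0, 1.0],
    "medium": [0.0, 1.0, 1.0],
    "deep": [1.0, 1.0, 1.0, 1.0],
}

def build_configs(depths, weight_property="weight"):
    """Build one FastRP config per (weighting, depth) combination."""
    configs = []
    for weighted in (False, True):
        for name, iteration_weights in depths.items():
            config = {"iterationWeights": list(iteration_weights)}
            if weighted:
                # Only the weighted variants name a relationship property.
                config["relationshipWeightProperty"] = weight_property
            suffix = "Weighted" if weighted else "Unweighted"
            config["mutateProperty"] = name + suffix
            configs.append(config)
    return configs

generated_configs = build_configs(depths)
for config in generated_configs:
    print(config["mutateProperty"], config["iterationWeights"])
```

Adding a new entry to the depths dictionary would then produce both a weighted and an unweighted variant with a matching mutateProperty name.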

Next, we can run a Cypher statement that will stream the embedding results from the in-memory graph. In addition to the embedding vectors, we will return some properties of the Airport nodes, and the name of the Continent that each Airport is associated with. We will sample the top 900 Airports based on the number of HAS_ROUTE relationships. Sampling will speed up Emblaze performance. It also excludes some very small airports from the result set, making the results a little easier to interpret for people who are not airport or geography buffs.

embedding_df = gds.run_cypher("""
    call gds.graph.streamNodeProperties("connected-airports", 
        ["shallowUnweighted", 
         "mediumUnweighted", 
         "deepUnweighted", 
         "shallowWeighted", 
         "mediumWeighted", 
         "deepWeighted"]) 
    yield nodeId, nodeProperty, propertyValue
    WITH gds.util.asNode(nodeId) as a,
    MAX(case when nodeProperty = "shallowUnweighted" then       
           propertyValue end) as shallowUnweighted,
    MAX(case when nodeProperty = "mediumUnweighted" then 
           propertyValue end) as mediumUnweighted,
    MAX(case when nodeProperty = "deepUnweighted" then 
           propertyValue end) as deepUnweighted,
    MAX(case when nodeProperty = "shallowWeighted" then 
           propertyValue end) as shallowWeighted,
    MAX(case when nodeProperty = "mediumWeighted" then 
           propertyValue end) as mediumWeighted,
    MAX(case when nodeProperty = "deepWeighted" then 
           propertyValue end) as deepWeighted
    MATCH (a)-[:ON_CONTINENT]->(c:Continent)
    RETURN
    a.descr as airport_name, 
    a.iata as airport_code, 
    c.name as continent,
    shallowUnweighted,
    mediumUnweighted,
    deepUnweighted,
    shallowWeighted,
    mediumWeighted,
    deepWeighted
    ORDER BY size([(a)-[:HAS_ROUTE]-() | a]) DESC
    LIMIT 900
    """)

Next, we’ll write a function that takes a column of the embedding_df Pandas data frame and turns it into an Emblaze embedding. The Emblaze embedding will contain points for each Airport. The points will be colored by the associated Continent. We will compute the ten nearest neighbors for each Airport based on cosine similarity. The nearest neighbors calculation is similar to the k-nearest neighbors algorithm in Neo4j’s Graph Data Science library. Finally, we will project each embedding from 64-dimensional space into 2-dimensional space using Emblaze’s default UMAP dimensionality reduction algorithm.

def create_emblaze_embedding(embedding_df, column):
    emb = emblaze.Embedding({
             emblaze.Field.POSITION:
             np.array(list(embedding_df[column])),
             emblaze.Field.COLOR: embedding_df['continent']}, 
             n_neighbors = 10,
             label=column, 
             metric='cosine')
    emb.compute_neighbors()
    return emb.project()
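
For readers who want to see what a cosine-based nearest-neighbor lookup involves, here is a minimal plain-Python sketch. It is not the code Emblaze runs internally; the function names, the three-dimensional vectors, and the placeholder codes are all made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(vectors, target, k):
    """Return the k keys most similar to target by cosine similarity."""
    scores = [
        (cosine_similarity(vectors[target], vec), key)
        for key, vec in vectors.items()
        if key != target
    ]
    # Highest similarity first.
    scores.sort(reverse=True)
    return [key for _, key in scores[:k]]

# Hypothetical 3-dimensional embeddings keyed by placeholder codes.
vectors = {
    "AAA": [1.0, 0.0, 0.0],
    "BBB": [0.9, 0.1, 0.0],
    "CCC": [0.0, 1.0, 0.0],
    "DDD": [-1.0, 0.0, 0.0],
}

print(nearest_neighbors(vectors, "AAA", k=2))
```

Because cosine similarity depends only on direction, two embedding vectors that point the same way count as close neighbors even if their lengths differ.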

We run the create_emblaze_embedding function for the last six columns in the embedding_df data frame.

emblaze_embeddings = [create_emblaze_embedding(embedding_df, column)    
   for column in embedding_df.columns[3:]]

Next, we turn the embeddings into an Emblaze embedding set.

variants = emblaze.EmbeddingSet(emblaze_embeddings)

We create text thumbnails that will appear as tool tips in the Emblaze embedding viewer.

thumbnails = emblaze.TextThumbnails(embedding_df['airport_name'] + 
   " (" + embedding_df['airport_code'] + ")")

Finally, we create the Emblaze viewer and display it as a JupyterLab widget.

w = emblaze.Viewer(embeddings = variants, thumbnails = thumbnails)
w

Emblaze embedding viewer (image by author)

Along the left side of the widget, I see thumbnails for the six embeddings that we created. The bars next to the thumbnails are color-coded to represent which embeddings are most similar. I see that the pairs of embeddings that share iterationWeights parameter values have more similar colors than the sets of three embeddings that share the same relationshipWeightProperty setting. This suggests that changing the iterationWeights parameter has a bigger impact than changing the relationshipWeightProperty parameter in determining the final position of nodes in the embedding space.

You can click on one of the thumbnails to generate lines that show points that will have the biggest change in relative position between the currently displayed embedding and the thumbnail you clicked on. Click the new embedding a second time to see an animated transition from the former view to the new embedding.

Zoom in to see what European airports change most from shallowUnweighted to mediumUnweighted (image by author)

You can enter an airport code in the search box. I typed “MCI,” the code for my local airport in Kansas City. The visualization zooms to the selected airport and its nearest neighbors in the embedding space.

Search for a single airport (image by author)

Click on a different embedding thumbnail, and the “NEIGHBORS” panel shows which airports will be added or subtracted as neighbors for the Kansas City airport when you switch to the new embedding.

Comparing neighbor nodes in different embeddings for a single airport (image by author)
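
The added and removed neighbors amount to a set comparison between two neighbor lists for the same point. The sketch below shows that comparison in plain Python; it is an assumption about what such a comparison conveys rather than Emblaze's own code, and the placeholder codes are made up.

```python
def neighbor_changes(before, after):
    """Compare two neighbor lists for the same point."""
    before_set, after_set = set(before), set(after)
    return {
        "added": sorted(after_set - before_set),
        "removed": sorted(before_set - after_set),
        "kept": sorted(before_set & after_set),
    }

# Hypothetical neighbor lists for one airport in two embeddings.
neighbors_before = ["AAA", "BBB", "CCC"]
neighbors_after = ["BBB", "CCC", "DDD"]

changes = neighbor_changes(neighbors_before, neighbors_after)
print(changes)
```

A point whose added and removed lists are long has moved into a noticeably different neighborhood between the two embeddings.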

Let’s explore changes between the mediumWeighted and mediumUnweighted embeddings. Click a blank area of the visualization to deselect all airports. Then, double-click on the thumbnail for mediumWeighted. Single-click on the thumbnail for mediumUnweighted. Click on the “Suggested” button above the right panel. Emblaze generates sets of nodes that have interesting changes in their sets of neighbors. Click “Load Selection” for one of the suggestions. I chose the one that starts with Melbourne International Airport and includes six airports in Australia, New Zealand, and New Caledonia.

Changes from weighted to unweighted embedding for six Australian airports (image by author)

In the details panel on the right, I can see that the change from a weighted embedding to an unweighted embedding caused all six of the airports to lose Townsville Airport (TSV) as a close neighbor in the embedding space. Airports that are added as close neighbors in the embedding space such as Fiji (NAN) and Bali (DPS) are farther away in distance, but they have more two-hop connecting flights to the airports in the set we are examining than Townsville.

Exploring graph embeddings with Neo4j and Emblaze has helped strengthen my intuition about how different algorithm parameters influence embedding output. I hope you will find this combination of tools useful as you work with graph data.

Written by Nathan Smith