Integrate Neo4j with PyTorch Geometric to create recommendations

Leverage the power of PyTorch Geometric to develop and train custom Graph Neural Networks for your application

Tomaz Bratanic
Towards Data Science


I have wanted to write about PyTorch Geometric (pyG) ever since I saw the announcement of its collaboration with Stanford University on their workshop. PyTorch Geometric is a library built on top of PyTorch that helps you easily write and train custom Graph Neural Networks for your applications. In this blog post, I will show how you can fetch data from Neo4j and use it to create movie recommendations in PyTorch Geometric.

The graph we will be working with is the MovieLens dataset, which is handily available as a Neo4j Sandbox project. Conveniently, pyG already ships an example that uses the MovieLens dataset for a link prediction task. The links between users and movies carry a rating score. We can use a graph neural network model to predict which unseen movies a user is likely to rate highly and then use that information to recommend movies to them.

The main goal of this post is to show you how to convert a Neo4j graph into a heterogeneous pyG graph. As a side goal, we will also prepare some of the node features in Neo4j using the Graph Data Science library and export them to pyG.

Agenda

  1. Develop movie embeddings to capture movie similarity based on the actors and directors
  2. Export Neo4j Graph and construct a heterogeneous pyG graph
  3. Train the GNN model in pyG
  4. Create predictions and optionally store them back to Neo4j

As always, I have prepared a Google Colab notebook if you want to follow along with the examples in this post.

Develop movie embeddings to capture movie similarity based on the actors and directors

First, you need to create and open the Recommendations project in Neo4j Sandbox that has the MovieLens dataset already populated. Next, you need to define the Neo4j connection in the notebook.

You can find the credentials under the Connection details tab in the Sandbox interface.

Sandbox connection details. Image by the author.
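Here is a minimal sketch of what that connection could look like, assuming you use the official neo4j Python driver; the URI and password are placeholders you need to replace with your own Sandbox credentials, and fetch_data is a small helper we will reuse later to run Cypher queries and return the results as a Pandas DataFrame.

import pandas as pd
from neo4j import GraphDatabase

# Replace the placeholders with the credentials from the Connection details tab
driver = GraphDatabase.driver(
    'bolt://<sandbox_ip>:7687',
    auth=('neo4j', '<sandbox_password>'))

def fetch_data(query):
    # Run a Cypher query and return the result as a Pandas DataFrame
    with driver.session() as session:
        result = session.run(query)
        return pd.DataFrame([r.values() for r in result], columns=result.keys())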

The Neo4j instance contains a graph with the following graph schema.

MovieLens graph schema. Image by the author.

The link prediction example in pyG uses word embeddings of the title and one-hot encoding of genres as the node features. To make it a bit more interesting, we will also develop movie node features that encapsulate the similarity of actors and directors. We could one-hot encode actors and directors, similarly to genres. Instead, we will take another route and capture movie similarities based on the actors and directors that appeared in each movie. In this example, we will use the FastRP algorithm to produce node embeddings. If you want to learn more about the FastRP algorithm, I suggest you check out this excellent article by my friend CJ Sullivan.

The FastRP algorithm will only consider the bipartite network of movies and persons and ignore genres and ratings. This way, we make sure to capture the movie similarity based on only the actors and directors that appeared in the movie. First, we need to project the GDS in-memory graph.

CALL gds.graph.create('movies', ['Movie', 'Person'],
{ACTED_IN: {orientation:'UNDIRECTED'},
DIRECTED: {orientation:'UNDIRECTED'}})

Now we can go ahead and execute the FastRP algorithm on the projected graph and store the results back in the database.

CALL gds.fastRP.write('movies', 
{writeProperty:'fastrp', embeddingDimension:56})

After executing the FastRP algorithm, each movie node has a node property fastrp that contains the embeddings that encapsulate the similarity based on the present actors and directors.

Export Neo4j Graph and construct a heterogeneous pyG graph

I’ve taken a lot of inspiration for constructing a custom heterogeneous pyG graph from this example. The example creates a pyG graph from multiple CSV files. I have simply rewritten it to fetch the data from Neo4j instead of CSV files.

The generic function to create node mapping and features while retrieving data from Neo4j is the following:
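Here is a sketch of that function, adapted from the pyG loading example; it assumes the fetch_data helper defined above, which runs a Cypher query and returns a Pandas DataFrame.

import torch

def load_node(cypher, index_col, encoders=None):
    # Execute the Cypher query and retrieve data from Neo4j as a DataFrame
    df = fetch_data(cypher)
    df.set_index(index_col, inplace=True)
    # Map each Neo4j id to a consecutive integer index
    mapping = {index: i for i, index in enumerate(df.index.unique())}
    # Build node features by concatenating the output of all encoders
    x = None
    if encoders is not None:
        xs = [encoder(df[col]) for col, encoder in encoders.items()]
        x = torch.cat(xs, dim=-1)
    return x, mapping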

And similarly, the function to create edge index and features is:
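Again, this is a sketch following the pyG loading example, only with the DataFrame coming from Neo4j instead of a CSV file.

def load_edge(cypher, src_index_col, src_mapping, dst_index_col, dst_mapping,
              encoders=None):
    # Execute the Cypher query and retrieve data from Neo4j as a DataFrame
    df = fetch_data(cypher)
    # Build the edge index by translating Neo4j ids to internal node indices
    src = [src_mapping[index] for index in df[src_index_col]]
    dst = [dst_mapping[index] for index in df[dst_index_col]]
    edge_index = torch.tensor([src, dst])
    # Build edge features (the rating in our case)
    edge_attr = None
    if encoders is not None:
        edge_attrs = [encoder(df[col]) for col, encoder in encoders.items()]
        edge_attr = torch.cat(edge_attrs, dim=-1)
    return edge_index, edge_attr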

As mentioned, the code is almost identical to the pyG example. I’ve only changed it so that the Pandas DataFrame is constructed from data retrieved from Neo4j instead of CSV files. In the next step, we have to define the feature encoders, but I will skip them in this post as they are identical to the example. However, I have obviously included them in the Colab notebook, so you can check that out if you are interested.

Finally, we can fetch the data from Neo4j and construct user mappings and features that will be used as input to the pyG heterogeneous graph. We will begin by constructing the input for the user nodes. They have no node features available, so we don’t need to include any encoders.
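A sketch of the user node construction, assuming the userId property from the schema above:

user_query = """
MATCH (u:User)
RETURN u.userId AS userId
"""
# Users have no features, so we only need the id-to-index mapping
_, user_mapping = load_node(user_query, index_col='userId')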

Next, we will construct the movie mapping and features.
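The following sketch assumes encoders analogous to the pyG example’s SequenceEncoder and GenresEncoder are defined as in the Colab notebook (adapted to the list-valued genres column), and adds a small hypothetical ListEncoder that stacks the list-valued fastrp column into a tensor; the movieId, title, and IN_GENRE parts come from the graph schema above.

class ListEncoder(object):
    # Minimal encoder that turns a column of equally sized lists into a float tensor
    def __call__(self, df):
        return torch.tensor(df.values.tolist(), dtype=torch.float)

movie_query = """
MATCH (m:Movie)-[:IN_GENRE]->(g:Genre)
WITH m, collect(g.name) AS genres
RETURN m.movieId AS movieId, m.title AS title, genres, m.fastrp AS fastrp
"""
movie_x, movie_mapping = load_node(
    movie_query, index_col='movieId',
    encoders={'title': SequenceEncoder(),
              'genres': GenresEncoder(),
              'fastrp': ListEncoder()})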

With the movies, we have several node features. First, the Sequence encoder uses the sentence-transformers library to produce word embeddings based on the titles. The genres are one-hot encoded using the Genre encoder, and lastly, we simply reshape the FastRP embeddings into the structure PyTorch Geometric expects.

Before we can construct the pyG graph, we have to fetch the information about the ratings, which are represented as weighted links between users and movies.
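A sketch of the edge construction, assuming the RATED relationship and rating property from the schema above; IdentityEncoder is the simple pass-through encoder from the pyG example.

rating_query = """
MATCH (u:User)-[r:RATED]->(m:Movie)
RETURN u.userId AS userId, m.movieId AS movieId, r.rating AS rating
"""
edge_index, edge_label = load_edge(
    rating_query,
    src_index_col='userId', src_mapping=user_mapping,
    dst_index_col='movieId', dst_mapping=movie_mapping,
    encoders={'rating': IdentityEncoder(dtype=torch.float)})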

We have all the information we require. So now, we can go ahead and build a heterogeneous pyG graph. In a heterogeneous graph, distinct types of nodes contain different features.
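A sketch of the graph construction, following the pyG loading example:

from torch_geometric.data import HeteroData
import torch_geometric.transforms as T

data = HeteroData()
# Users carry no features for now, only the node count
data['user'].num_nodes = len(user_mapping)
data['movie'].x = movie_x
data['user', 'rates', 'movie'].edge_index = edge_index
data['user', 'rates', 'movie'].edge_label = edge_label
# Add reverse edge types so that messages can flow in both directions
data = T.ToUndirected()(data)
# The reverse edges should not carry rating labels
del data['movie', 'rev_rates', 'user'].edge_label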

The process of constructing a heterogeneous pyG graph seems easy enough. First, we define the node features of each node type and then add any relationships between those nodes. Remember, GNNs require all nodes to have node features. So here is one example of what you could do for user nodes with no pre-existing features.
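A simple option, borrowed from the pyG example, is to give each user a one-hot identity vector as its feature:

data['user'].x = torch.eye(data['user'].num_nodes)
del data['user'].num_nodes  # the node count is now inferred from x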

As with all machine learning workflows, we have to perform a train/test data split. The pyG library makes this very easy with the RandomLinkSplit transform.
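For example, a split that holds out 10% of the rating edges for validation and 10% for testing might look like this; since we predict rating values rather than the existence of a link, we don’t need negative sampling.

transform = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=0.0,
    edge_types=[('user', 'rates', 'movie')],
    rev_edge_types=[('movie', 'rev_rates', 'user')],
)
train_data, val_data, test_data = transform(data)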

The pyG graph is prepared. Now, we can go ahead and define our GNN. I have simply copied the definition from the pyG example, so I won’t show it here as it is identical. The GNN will predict the ratings, between 0 and 5, that users would give to movies. We can think of that as a link prediction task, where we predict the rating property of new links between users and movies. Once we have defined the GNN, we can go ahead and train our model.
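Below is a sketch of a simple training loop, assuming the model and optimizer are defined as in the pyG example, where the model takes the feature dictionary, the edge index dictionary, and the edges to score.

import torch.nn.functional as F

def train():
    model.train()
    optimizer.zero_grad()
    # Predict ratings for the training edges
    pred = model(train_data.x_dict, train_data.edge_index_dict,
                 train_data['user', 'movie'].edge_label_index)
    target = train_data['user', 'movie'].edge_label.view(-1).float()
    loss = F.mse_loss(pred, target)
    loss.backward()
    optimizer.step()
    return float(loss)

for epoch in range(1, 301):
    loss = train()
    if epoch % 50 == 0:
        print(f'Epoch {epoch:03d}, loss: {loss:.4f}')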

Since the dataset is not that big, it shouldn’t take long to train the model. If you can, use the GPU mode in the Google Colab environment.

Lastly, we will predict new links between users and movies and store the results back to Neo4j. For movie recommendations, we will only consider links where the predicted rating equals 5.0, the highest possible rating.
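One way to generate the candidate predictions is to score every user against every movie and keep only the pairs with the maximum predicted rating; this sketch assumes the trained model from the previous step.

num_movies = len(movie_mapping)
model.eval()
user_recommendations = {}
with torch.no_grad():
    for user_idx in range(len(user_mapping)):
        # Score this user against every movie in the graph
        row = torch.full((num_movies,), user_idx, dtype=torch.long)
        col = torch.arange(num_movies, dtype=torch.long)
        edge_label_index = torch.stack([row, col], dim=0)
        pred = model(data.x_dict, data.edge_index_dict, edge_label_index)
        pred = pred.clamp(min=0, max=5).view(-1)
        # Keep the first ten movies with the highest possible predicted rating
        user_recommendations[user_idx] = col[pred == 5.0][:10].tolist()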

I’ve only selected the first ten recommendations for each user to make it simple and not have to import tens of thousands of relationships back to Neo4j. One important thing to note is that we don’t filter out existing links or ratings in our predictions, so we’ll just skip them during import to the Neo4j database. Let’s import those predictions to Neo4j under the RECOMMEND relationships.
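A sketch of the import step, assuming the driver and mappings defined earlier; the Cypher query skips pairs where the user has already rated the movie.

# Translate the internal indices back to the original Neo4j ids
reverse_user_mapping = {v: k for k, v in user_mapping.items()}
reverse_movie_mapping = {v: k for k, v in movie_mapping.items()}
rows = [{'userId': reverse_user_mapping[u], 'movieId': reverse_movie_mapping[m]}
        for u, movies in user_recommendations.items() for m in movies]

import_query = """
UNWIND $data AS row
MATCH (u:User {userId: row.userId})
MATCH (m:Movie {movieId: row.movieId})
WHERE NOT (u)-[:RATED]->(m)
MERGE (u)-[:RECOMMEND]->(m)
"""
with driver.session() as session:
    session.run(import_query, data=rows)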

If you open Neo4j Browser, you should be able to see the new RECOMMEND relationships in the database.

Predicted links between users and movies with high rating that we can use for recommendations. Image by the author.

Conclusion

PyTorch Geometric is a powerful library that allows you to develop and train custom Graph Neural Network applications. I am looking forward to exploring it more and seeing all the applications I can create with it.

The MovieLens dataset contains several node features we haven’t used, such as the release date or the movie budget, that you could test out. You could also try changing the GNN definition. Let me know if you find any exciting approaches that work for you.

As always, the code is available as a Colab notebook.

Special thanks to Matthias Fey for his support and help with writing the code!
