Exploring Practical Recommendation Systems In Neo4j

Leveraging Graph Data Science & Machine Learning

Zach Blumenfeld
Towards Data Science



In this post we explore how to get started with practical and scalable recommendations in graph. We will walk through a fundamental example of news recommendation on a dataset containing 17.5 million click events and around 750K users. We will leverage Neo4j and the Graph Data Science (GDS) library to quickly predict similar news based on user preferences and enable sub-second, rank-ordered recommendation queries personalized to each user.

All the code to reproduce this analysis as well as resources to set up the graph from source data is available in this GitHub Repository.

Recommender Systems are a rich area of study and we will just barely scratch the surface here. If you are interested in going deeper into any area (other recommendation techniques, evaluating performance, optimization, including more inputs, scaling up further, or anything else related to graph-based recommenders), please leave a note in the comments.

This post is structured as follows: First, we will briefly define Recommender Systems. Next, we will go over the source dataset and the graph we will be using, along with how to query basic profiling statistics that help us understand the graph and better prepare for analysis. Then we will cover a technique called Collaborative Filtering (CF), which will be our mechanism for recommendation in this post. After that, we will get into the guts of applying CF using the Cypher query language and scaling it with the Graph Data Science (GDS) library, leveraging node embeddings and an ML technique called K-Nearest Neighbors (KNN). Finally, we will talk about next steps and follow-up resources.

What are Recommender Systems?

Put simply, Recommender Systems are a type of information filtering system that seek to generate meaningful recommendations to users for items they may be interested in.

In the context of recommender systems, “item” is a general term that can refer to anything marketed or directed towards users, including products in online retail stores, content such as written articles, videos, and/or music, or potential connections or people to follow on social media platforms.

It should go without saying that Recommender Systems are essential for increasing user satisfaction and accelerating business growth in our increasingly competitive online landscape.

Today’s Dataset: Microsoft MIND

In this post we will explore news recommendation with the MIcrosoft News Dataset (MIND) Large dataset, which is a sample of 1 million anonymized users and their click behaviors collected from the Microsoft News website [1]. It includes about 15M impression logs for about 160K English news articles.

I formatted the dataset and loaded it into a graph with the below schema:

//visualize schema in Neo4j Browser
neo4j$ CALL db.schema.visualization();

We see that news articles are modeled as nodes and can be CLICKED or HISTORICALLY_CLICKED by users, which are also modeled as nodes. In this context, CLICKED refers to a click action parsed from an impression record occurring over the time interval of our sample, approximately November 9, 2019 through November 15, 2019. HISTORICALLY_CLICKED refers to a click action from user history, occurring at some unknown time in the past.
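For orientation, a quick query like the one below (an illustrative example, not part of the original notebook) pulls back a handful of recent click events along with their impression timestamps:

//sample a few recent click events in Neo4j Browser
neo4j$ MATCH (u:User)-[c:CLICKED]->(n:News)
       RETURN u.userId AS userId, n.title AS title, c.impressionTime AS impressionTime
       LIMIT 5;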

Additionally, news articles also have subcategories and categories that were annotated by Microsoft, as well as WikiData entities that Microsoft extracted from news titles and abstracts using NLP methods. I modeled these with nodes and relationships in the graph as well. I will touch on these briefly at the end, and they will be useful for extended analysis, but we will not be using them in this post. Our primary interest here is to get started with users and their click events to derive recommendations.

Technical Analysis

Going forward we will be working through the dataset to uncover insights for news recommendation and display code snippets along the way. All the code below comes from the notebook here. While I think the code snippets are helpful for connecting the dots, there are some configurations and function definitions from the notebook that I do not repeat here for the sake of brevity. If you are interested in fully replicating the analysis in Python, please use the notebook as the source.

Graph Data Profiling

Before diving right into the analysis, it is helpful to inspect some high-level statistics of the graph to get a better sense of its size and connectivity.

Aggregate Counts

Let's start with high-level node and relationship counts. These queries assume APOC is installed on your Neo4j DB.

from graphdatascience import GraphDataScience

# HOST, USERNAME, and PASSWORD come from the notebook configuration
gds = GraphDataScience(HOST, auth=(USERNAME, PASSWORD), aura_ds=True)

# total node counts
gds.run_cypher( '''
CALL apoc.meta.stats()
YIELD labels AS nodeCounts
UNWIND keys(nodeCounts) AS label
WITH label, nodeCounts[label] AS nodeCount
WHERE label IN ['User','News']
RETURN label, nodeCount
''')
# total relationship counts
gds.run_cypher( '''
CALL apoc.meta.stats()
YIELD relTypesCount as relationshipCounts
UNWIND keys(relationshipCounts) AS type
WITH type, relationshipCounts[type] AS relationshipCount
WHERE type IN ['CLICKED','HISTORICALLY_CLICKED']
RETURN type, relationshipCount
''')

As far as recommendation systems are concerned, this is a relatively modestly sized graph, with only around 750K users, roughly 104K news articles, and around 17.5M total click events. These numbers are smaller than the totals reported for MIND Large because some of the original dataset was set aside by Microsoft as a test set for competition evaluation and therefore doesn't have complete impression information. That part of the data was excluded from this graph.

Click Event Distributions

Next, we can look at the distribution of clicks per user. It is helpful to check this distribution to make sure that:

  1. The graph is reasonably well connected, since the quality of our upcoming recommendation technique will depend on that connectivity.
  2. We do not have any large supernodes, i.e. nodes with very high numbers of relationships. What qualifies as a supernode varies greatly by use case; here, I would be concerned about users with tens of thousands of clicks.
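The plotting snippet below relies on two small helpers, degree_counts and get_percentiles, whose full definitions live in the notebook. Here is a minimal sketch of what they might look like (the notebook versions may differ in detail):

import pandas as pd
import matplotlib.pyplot as plt

def degree_counts(label, rel_types, direction):
    # count how many nodes with `label` have each out/in degree over the given relationship types
    if direction == 'OUT':
        pattern = f'(n:`{label}`)-[:{rel_types}]->()'
    else:
        pattern = f'(n:`{label}`)<-[:{rel_types}]-()'
    return gds.run_cypher(f'''
        MATCH {pattern}
        WITH n, count(*) AS degree
        RETURN degree, count(*) AS degreeCount
        ORDER BY degree
    ''')

def get_percentiles(df, percentiles=(0.25, 0.5, 0.75, 0.9, 0.99)):
    # expand the (degree, degreeCount) pairs back to one row per node and summarize
    expanded = df.degree.repeat(df.degreeCount)
    return expanded.quantile(list(percentiles))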
all_clicks_df = degree_counts('User', 'CLICKED|HISTORICALLY_CLICKED', 'OUT')
recent_clicks_df = degree_counts('User', 'CLICKED', 'OUT')

f, axs = plt.subplots(1, 2, figsize=(16, 5))

axs[0].bar(all_clicks_df.degree[:80], all_clicks_df.degreeCount[:80], width=1)
axs[0].set_title('User Clicks Distribution')
axs[0].set_ylabel('User Count')
axs[0].set_xlabel('Number of Total Clicks')
plt.figtext(0.4, 0.5, get_percentiles(all_clicks_df).to_string())
axs[1].bar(recent_clicks_df.degree[:80], recent_clicks_df.degreeCount[:80], width=1)
axs[1].set_title('User Recent Clicks Distribution')
axs[1].set_ylabel('User Count')
axs[1].set_xlabel('Number of Recent Clicks')
plt.figtext(0.83, 0.5, get_percentiles(recent_clicks_df).to_string())
plt.show()

The above graphic shows the distributions for total clicks (CLICKED and HISTORICALLY_CLICKED) as well as recent clicks (just CLICKED) per user. We see that the distributions are heavily right-skewed: activity is not uniformly distributed among users, and a relatively small fraction of users accounts for a large share of the clicks. This is somewhat expected. Importantly, every user has at least one recent click event, and we do not see users with upwards of tens of thousands of click events.

We can do a similar breakdown for news below.

all_clicks_df = degree_counts('News', 'CLICKED|HISTORICALLY_CLICKED', 'IN')
recent_clicks_df = degree_counts('News', 'CLICKED', 'IN')

f, axs = plt.subplots(1, 2, figsize=(16, 5))

axs[0].bar(all_clicks_df.degree[:80], all_clicks_df.degreeCount[:80], width=1)
axs[0].set_title('News Total Clicks Distribution')
axs[0].set_ylabel('News Count')
axs[0].set_xlabel('Number of Total Clicks')
plt.figtext(0.4, 0.5, get_percentiles(all_clicks_df).to_string())
axs[1].bar(recent_clicks_df.degree[:80], recent_clicks_df.degreeCount[:80], width=1)
axs[1].set_title('News Recent Clicks Distribution')
axs[1].set_ylabel('News Count')
axs[1].set_xlabel('Number of Recent Clicks')
plt.figtext(0.83, 0.5, get_percentiles(recent_clicks_df).to_string())
plt.show()

While news looks okay for total clicks, only a small portion of news articles (<25%) were recently clicked, i.e. actually clicked on by users in the time window of our sample. News articles that were not clicked within the time window may not be good candidates for recommendation in this context. We will deal with this below.

Labeling Recent News

To ensure CF recommendations are relevant we need to filter down to a subset of news articles for consideration.

  1. News tends to be most relevant when it is recent and can lose relevance quickly with the passage of time. As a general rule, good news recommenders should avoid stale and/or outdated news. In this case we can use Microsoft impressions as a proxy and only consider news articles that were included in an impression within our sample time window. To accomplish this, we can use a node attribute called approxTime that reflects the minimum impression time for the news article. I calculated this attribute when loading the source data into Neo4j. Long story short, approxTime will only be non-null for news articles with at least one impression in our sample.
  2. A limitation of CF is that it can only recommend content that has user feedback. As such, news articles with no user clicks cannot be used for CF here. In application, this is sometimes referred to as a cold-start problem. It can be addressed with content-based recommendation and other hybrid approaches. We may cover this in a separate post, but for this example we only consider news articles with at least one user click, i.e. at least one CLICKED or HISTORICALLY_CLICKED relationship.

We can assign a second node label, RecentNews, to allow us to easily filter to news that meets the above criteria in Cypher queries and GDS projections. Remember that Neo4j allows a node to have multiple labels, so the original News label will still be retained.

gds.run_cypher('''
MATCH(n:News)<-[:CLICKED|HISTORICALLY_CLICKED]-()
WHERE n.approxTime IS NOT NULL
SET n:RecentNews
RETURN count(DISTINCT n)
''')
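As a quick sanity check (an illustrative query, not from the original notebook), we can break down how the News nodes fall across these criteria:

gds.run_cypher('''
MATCH (n:News)
RETURN count(n) AS totalNews,
count(n.approxTime) AS newsWithRecentImpressions,
sum(CASE WHEN n:RecentNews THEN 1 ELSE 0 END) AS recentNews
''')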

We do see a significant reduction in the number of news articles, from 104K to about 22K. But being more recent and better connected, these news articles are also likely to be more relevant, so we should be able to improve our recommender by narrowing the focus to them. We can also see these improvements reflected in the click distributions.

all_clicks_df = degree_counts('RecentNews', 'CLICKED|HISTORICALLY_CLICKED', 'IN')
recent_clicks_df = degree_counts('RecentNews', 'CLICKED', 'IN')

f, axs = plt.subplots(1,2,figsize=(16,5))

axs[0].bar(all_clicks_df.degree[:80], all_clicks_df.degreeCount[:80], width=1)
axs[0].set_title('Total Clicks Distribution')
axs[0].set_ylabel('News Count')
axs[0].set_xlabel('Number of Total Clicks')
plt.figtext(0.4, 0.5, get_percentiles(all_clicks_df).to_string())


axs[1].bar(recent_clicks_df.degree[:80], recent_clicks_df.degreeCount[:80], width=1)
axs[1].set_title('Recent Clicks Distribution')
axs[1].set_ylabel('News Count')
axs[1].set_xlabel('Number of Recent Clicks')
plt.figtext(0.83, 0.5, get_percentiles(recent_clicks_df).to_string())

plt.show()

We can see now that every news article was clicked at least once, and at least 75% of the news articles have been recently clicked.

Collaborative Filtering (CF)

There are many different types of recommendation systems. In this post we will apply a technique called Collaborative Filtering (CF), which is the practice of automatically predicting a user's preferences based on the actions of similar users.

From: https://en.wikipedia.org/wiki/Collaborative_filtering : “This image shows an example of predicting of the user’s rating using collaborative filtering. At first, people rate different items (like videos, images, games). After that, the system is making predictions about user’s rating for an item, which the user hasn’t rated yet. These predictions are built upon the existing ratings of other users, who have similar ratings with the active user. For instance, in our case the system has made a prediction, that the active user won’t like the video.”: image by Moshanin, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons

User vs Item Based CF

Roughly speaking there are two main classes of collaborative filtering:

  1. User-based, which focuses on directly calculating similarities between users based on their interaction with items
  2. Item-based, which measures the similarities between pairs of items based on correlated user activity, such as the same users liking, viewing, rating, or otherwise similarly interacting with items

Our approach today will focus on the latter, item-based CF. Item-based CF will be more manageable and scalable than user-based CF for many problem domains. There are a couple of reasons for this.

First, because there are far fewer items than users in most applied cases, there are fewer entities to compare. Second, and perhaps more importantly, user preferences can be dynamic and change over time. For example, a user could be mostly interested in Sci-Fi movies but go through a brief stint of watching Westerns, and recommending Westerns to other Sci-Fi fans who historically viewed the same things may not be the best choice. Often the best indicator of a user's current interest is their most recent activity, and item-based filtering gives you better flexibility to find similarities based on the most recent item(s) clicked, viewed, rated, etc.

Implicit Collaborative Filtering

You will usually hear about collaborative filtering in the context of explicit user feedback, such as reviews for movies or products in an online store. In this post, we will apply collaborative filtering to just the news click activity, i.e. 1 for a click, implying that the user "likes" a piece of content, and 0 otherwise. This is a basic form of "implicit collaborative filtering", where a user's preferences are implied by their activity rather than stated as explicit feedback. The need for implicit CF comes up frequently, as many real-world scenarios will not have explicit ratings, or if they do, the ratings will be very sparsely populated. In a production setting, you may also want to incorporate other user activity data points to help weight the graph relationships, such as read/view time, scroll depth, etc., which are not made publicly available in the MIND dataset. The more information you have, the more accurately you will be able to model user preferences.
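To make the implicit signal concrete, the illustrative query below (not from the original notebook) tallies clicks by news category for a single user, the same user we will look at more closely in the next section. The counts give a rough picture of the preferences implied by click behavior alone:

gds.run_cypher('''
MATCH (u:User {userId: 'U218584'})-[c:CLICKED|HISTORICALLY_CLICKED]->(n:News)
RETURN n.category AS category, count(c) AS clicks
ORDER BY clicks DESC
''')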

Basic Cypher Queries for CF

From here we could try just using Cypher to accomplish basic Collaborative Filtering. For example, take the below user and the news they clicked. You can see a mixed interest between autos, finance, US news, and a couple other categories.

USER_ID = 'U218584'

gds.run_cypher('''
MATCH (u1:User {userId: $userId})
-[r1:CLICKED|HISTORICALLY_CLICKED]->(n1:RecentNews)
RETURN n1.newsId AS newsId,
n1.title AS title,
n1.abstract AS abstract,
n1.category AS category,
n1.subcategory As subcategory,
r1.impressionTime AS impressionTime,
type(r1) AS clickType
ORDER BY clickType, impressionTime DESC
''', params={'userId': USER_ID})

Assuming we can measure the similarity of user interests via commonly clicked news articles, we can do a three-hop query to find potential recommendations for user U218584 based on the activity of users that clicked on the same news as U218584.

With the below query we can get an aggregate count of the nodes we would need to traverse to get the recommendations. To keep things simple we won't worry about HISTORICALLY_CLICKED relationships yet.

gds.run_cypher('''
MATCH (u1:User {userId: $userId})
-[r1:CLICKED]->(n1:RecentNews)
<-[r2:CLICKED]-(u2:User)
-[r3:CLICKED]->(n2:RecentNews)
RETURN u1.userId AS userId,
count(DISTINCT n1) AS clickedNews,
count(DISTINCT u2) AS likeUsers,
count(DISTINCT n2) AS potentialRecommendations
''', params={'userId': USER_ID})

While the above can work well in some cases, and while it can certainly be a massive improvement over joining SQL tables or cross-walking over document stores, notice that we get a lot of potential recommendations back (almost 11K) and must traverse many user nodes (over 63K). And this is just a sample of the total Microsoft dataset, and we aren't leveraging HISTORICALLY_CLICKED information at all.
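If you want to see where that cost comes from, prefixing the pattern with PROFILE (or EXPLAIN) in Neo4j Browser shows the rows and db hits at each step of the traversal. Here is an illustrative, trimmed-down version (not part of the original notebook):

//inspect the plan for the three-hop pattern in Neo4j Browser
neo4j$ PROFILE
MATCH (u1:User {userId: 'U218584'})-[:CLICKED]->(n1:RecentNews)
      <-[:CLICKED]-(u2:User)-[:CLICKED]->(n2:RecentNews)
RETURN count(DISTINCT n2) AS potentialRecommendations;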

For production use cases where recommendations need to be queried frequently, this method will have trouble scaling as the number of users, the amount of content, and/or the observed engagement grows. We need some other strategy to help narrow down the results. There are a few different ways we can accomplish this, but one robust and scalable way to do so is with the Neo4j Graph Data Science (GDS) library.

Scaling CF with FastRP Node Embeddings and KNN

With GDS we can use FastRP node embeddings to reduce the dimensionality of the problem, then use an unsupervised machine learning technique called K-Nearest Neighbors (KNN) to identify and draw recommendation relationships between news articles with similar (i.e. close) embeddings. Because FastRP embeddings are based on the graph structure, news articles with similar embeddings should also be relatively well connected in the graph via being clicked on by the same and similar users.

Graph Projection

We will start with a graph projection leveraging the User and News nodes. It may be helpful to include all the News in this projection for the embeddings step and filter to just the RecentNews in the KNN step. User click activity on stale news still constitutes a significant portion of the graph structure and as such may help inform similarity inferences for more recent news in the KNN step.

We will also include both historic and recent impression clicks, but we will give less weight to historic clicks so as to favor more recent user activity. Lastly, we will use an UNDIRECTED orientation so FastRP can traverse the graph bidirectionally.

# Putting an index on a 'weight' attribute allows us to assign default values in the projection below
gds.run_cypher('''
CREATE INDEX weight_index IF NOT EXISTS FOR ()-[r:CLICKED]-() ON (r.weight)
''')

# Projecting the graph
g0, res = gds.graph.project('embedding-projection', ['User', 'News'], {
    'CLICKED': {
        'orientation': 'UNDIRECTED',
        'properties': {
            'weight': {
                'property': 'confidence',
                'defaultValue': 1.0
            }
        }
    },
    'HISTORICALLY_CLICKED': {
        'orientation': 'UNDIRECTED',
        'properties': {
            'weight': {
                'property': 'confidence',
                'defaultValue': 0.1
            }
        }
    }
})
res
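As a quick check (illustrative, not part of the original notebook), the returned Graph object should be able to report the size of the in-memory projection before we run anything on it:

# number of nodes and relationships in the in-memory projection
g0.node_count(), g0.relationship_count()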

FastRP

When running FastRP we will make sure to include the relationship weight property.

gds.fastRP.mutate(g0, mutateProperty='embedding', 
embeddingDimension=256, randomSeed=7474,
relationshipWeightProperty='weight');

While we should be able to do this all in one projection, depending on the GDS version you are using, I find it easiest to write the embeddings back to the database and create a separate projection just for KNN.

gds.graph.writeNodeProperties(g0, ["embedding"], ["News"])

If you are curious as to what these embeddings look like, they are just vectors of floating point numbers. In this case, they are 256 numbers long as specified in the embeddingDimension parameter above.

gds.run_cypher('''
MATCH(n:RecentNews)
RETURN n.newsId, n.embedding LIMIT 3
''')
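As a small aside (illustrative, not from the original notebook), the cosine similarity that KNN uses in the next step can be computed directly between any two of these vectors:

import numpy as np

# pull two embeddings and compare them the way KNN will, via cosine similarity
emb_df = gds.run_cypher('''
MATCH (n:RecentNews)
RETURN n.newsId AS newsId, n.embedding AS embedding
LIMIT 2
''')
a = np.array(emb_df.embedding[0])
b = np.array(emb_df.embedding[1])
print(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))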

K-Nearest-Neighbors (KNN)

We can now run KNN to estimate similarity between RecentNews articles and write the results back to the graph as USERS_ALSO_LIKED relationships.

# graph projection for knn
g1, res = gds.graph.project('cf-projection',
    {'RecentNews': {'properties': ['embedding']}}, '*')
res

# run knn
knn_stats_df = gds.knn.write(g1, nodeProperties=['embedding'],
    writeRelationshipType='USERS_ALSO_LIKED',
    writeProperty='score',
    sampleRate=1.0,
    maxIterations=1000);

Below we can see statistics related to algorithm convergence, compute and write time, the number of nodes compared, pairs considered, and relationships written. KNN is actually very well optimized in GDS and uses sophisticated sampling and parallelization techniques under the hood to make it more computationally efficient and highly scalable. See the documentation for more details if interested.

knn_stats_df[['didConverge',
'ranIterations',
'computeMillis',
'writeMillis',
'nodesCompared',
'nodePairsConsidered',
'relationshipsWritten']]

KNN relationships are only written when a positive similarity is found between node pairs, which, in this case, is based on cosine similarity between the node properties supplied to KNN. Here we are using the FastRP embeddings we calculated over the CLICKED and HISTORICALLY_CLICKED relationships as that node property. We can see a distribution of those similarity scores below.

knn_stats_df.similarityDistribution

This is a helpful distribution to check. In this case it looks reasonably healthy. If you see a lot of zeros going into higher percentiles, it is a sign that KNN was not able to find many similarities. A lack of variance in your scores between percentiles can also signal a lack of variance in your node properties.
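Since similarityDistribution comes back as a map of summary statistics and percentiles, a small convenience (not in the original notebook) for reading it as a labeled table:

import pandas as pd

# unpack the similarity score summary into a labeled Series for easier reading
pd.Series(knn_stats_df.similarityDistribution)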

CF with KNN Relationships

Now we can structure a collaborative filtering query for user U218584, but with:

  1. more refined results,
  2. fewer traversal steps, and
  3. a score from KNN that allows us to order the results by aggregate similarity.
gds.run_cypher( '''
MATCH(u:User {userId: $userId})-[:CLICKED]->(n:RecentNews)
WITH collect(id(n)) AS clickedNewsIds
//get similar News according to KNN and exclude previously clicked news
MATCH (clickedNews)-[s:USERS_ALSO_LIKED]->(similarNews:News)
WHERE id(clickedNews) IN clickedNewsIds AND NOT id(similarNews) IN clickedNewsIds
//aggregate and return ranked results
RETURN DISTINCT similarNews.newsId as newsId,
similarNews.title AS title,
similarNews.category AS category,
similarNews.subcategory As subcategory,
sum(s.score) AS totalScore ORDER BY totalScore DESC
''', params={'userId': USER_ID})

We cut back to 59 results from the almost 11K returned by the pure Cypher approach.
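If you want to verify that count directly, an illustrative aggregate variation of the query above (not from the original notebook) works:

gds.run_cypher('''
MATCH (u:User {userId: $userId})-[:CLICKED]->(n:RecentNews)
WITH collect(id(n)) AS clickedNewsIds
MATCH (clickedNews)-[:USERS_ALSO_LIKED]->(similarNews:News)
WHERE id(clickedNews) IN clickedNewsIds AND NOT id(similarNews) IN clickedNewsIds
RETURN count(DISTINCT similarNews) AS candidateRecommendations
''', params={'userId': USER_ID})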

Recommendations Based on Latest Viewed Content

A nice thing about using this type of Item-based CF with graph is that it is easy to change the range of news articles considered for recommendations.

For example, it may not be ideal to use the entire user history to generate recommendations if user interests change dynamically over time. In some instances of content recommendation, the current or latest viewed piece of content may be the best signal for what to recommend next. If we want to focus our recommendations in this way, we can easily adjust our Cypher query to do so.

We can grab the latest clicked news article for the user like so:

gds.run_cypher('''
MATCH (u:User {userId:$userId})-[r:CLICKED]->(:RecentNews)
WITH u, max(r.impressionTime) AS maxImpressionTime
MATCH (u)-[r:CLICKED]->(n:RecentNews)
WHERE r.impressionTime = maxImpressionTime
RETURN n.newsId as newsId,
n.title AS title,
n.category AS category,
n.subcategory As subcategory,
r.impressionTime AS impressionTime
''', params={'userId': USER_ID})

And we can fold that logic into a single CF query to obtain recommendations based on that last clicked article.

gds.run_cypher('''
MATCH (u:User {userId:$userId})-[r:CLICKED]->(:RecentNews)
WITH u, max(r.impressionTime) AS maxImpressionTime
MATCH (u)-[r:CLICKED]->(n:RecentNews)
WHERE r.impressionTime = maxImpressionTime
WITH n
MATCH(n)-[s:USERS_ALSO_LIKED]->(similarNews:News)
RETURN DISTINCT similarNews.newsId as newsId,
similarNews.title AS title,
similarNews.abstract AS abstract,
similarNews.category AS category,
similarNews.subcategory As subcategory,
sum(s.score) AS totalScore
ORDER BY totalScore DESC
''', params = {'userId': USER_ID})

Note that these are more narrowly focused around automobiles and the Ford GT500 specifically.

It will not always be the case that CF recommendations align to category, subcategory, and entities inside the articles to the extent that they do above, because we aren't directly leveraging those properties (at least not yet). Instead, this CF is based on correlating user interests, and if you explore this dataset further you will find that some categories/subcategories are better correlated across user interests than others.
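One rough way to eyeball that alignment (an illustrative query, not from the original notebook) is to aggregate the recommended articles by category:

gds.run_cypher('''
MATCH (u:User {userId: $userId})-[:CLICKED]->(:RecentNews)-[s:USERS_ALSO_LIKED]->(rec:News)
RETURN rec.category AS category, count(DISTINCT rec) AS recommendations
ORDER BY recommendations DESC
''', params={'userId': USER_ID})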

What’s Next?

In this post, we explored how we can begin to rapidly develop and scale recommendation with robust graph analytics using Neo4j and GDS.

Here is a GitHub repository that should have everything needed to reproduce the example we just walked through, including a notebook for this analysis and another for retrieving, formatting, and loading the source data into Neo4j.

If you are new to Neo4j, I recommend using Neo4j Desktop with GDS for this example. Please use GDS version 2.0 or higher for compatibility.

As mentioned in the introduction, we are just barely scratching the surface of graph-based Recommender Systems in this post. Just within CF there are many different ways we could expand our analysis, as well as different ways we could optimize it to improve predictive performance, such as with the FastRP optimization shown here. We also didn't touch on content-based filtering or other hybrid recommender systems, which can be particularly helpful for improving performance and recommending new content or content with otherwise sparse user feedback. This can be explored with the current graph using the category, subcategory, and WikiData entity nodes in addition to other news article attributes. How to properly evaluate Recommender Systems is also a non-trivial topic which may be worth inspecting in the context of graph. If you are interested in exploring these or other graph recommendation topics in more detail, please let me know in the comments.

Until then I hope this demonstration of graph for Recommender Systems was helpful!

Thank you to CJ Sullivan and Angela Zimmerman who helped review the content for this blog post.

[1]: Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu and Ming Zhou. MIND: A Large-scale Dataset for News Recommendation. ACL 2020.
