Link Prediction with Neo4j Part 2: Predicting co-authors using scikit-learn

This is the 2nd in a series of posts on the link prediction functions that were recently added to the Neo4j Graph Algorithms Library.

Mark Needham
Towards Data Science


Link Predictions in the Neo4j Graph Algorithms Library

In the 1st post we learnt about link prediction measures, how to apply them in Neo4j, and how they can be used as features in a machine learning classifier. We also learnt about the challenge of splitting train and test data sets when working with graphs.

In this post we’ll apply the things we learned on a citation dataset. We’re going to predict future co-authorships using scikit-learn and the link prediction algorithms.

Amy Hodler and I showed how to apply the approaches described in this post in last week’s Neo4j Online Meetup, so you can also watch the video.

Link Prediction Online Meetup

If you’re sticking around for the written version, let’s get started!

Import the citation dataset

We’re going to use data from the DBLP Citation Network, which includes citation data from various academic sources. I wrote a blog post explaining how to import the full dataset, but in this post we’re going to focus on the data from a few Software Development Conferences.

Citation Networks

We can import that subset of the data by running the following Cypher statements. You can run them all in one go as long as you have the multi statement editor enabled in the Neo4j Browser:

// Create constraints
CREATE CONSTRAINT ON (a:Article) ASSERT a.index IS UNIQUE;
CREATE CONSTRAINT ON (a:Author) ASSERT a.name IS UNIQUE;
CREATE CONSTRAINT ON (v:Venue) ASSERT v.name IS UNIQUE;
// Import data from JSON files using the APOC library
CALL apoc.periodic.iterate(
  'UNWIND ["dblp-ref-0.json", "dblp-ref-1.json", "dblp-ref-2.json", "dblp-ref-3.json"] AS file
   CALL apoc.load.json("https://github.com/mneedham/link-prediction/raw/master/data/" + file)
   YIELD value
   RETURN value',
  'MERGE (a:Article {index: value.id})
   SET a += apoc.map.clean(value, ["id", "authors", "references", "venue"], [0])
   WITH a, value.authors AS authors, value.references AS citations, value.venue AS venue
   MERGE (v:Venue {name: venue})
   MERGE (a)-[:VENUE]->(v)
   FOREACH(author IN authors |
     MERGE (b:Author {name: author})
     MERGE (a)-[:AUTHOR]->(b))
   FOREACH(citation IN citations |
     MERGE (cited:Article {index: citation})
     MERGE (a)-[:CITED]->(cited))',
  {batchSize: 1000, iterateList: true});
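
Before we move on, a quick sanity check (my addition, not from the original import script) confirms that the import created the three node types we expect:

// Count nodes by label to verify the import
MATCH (n)
RETURN labels(n) AS labels, count(*) AS count
ORDER BY count DESC;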

The following diagram shows what the data looks like once we’ve imported it into Neo4j:

Diagram showing Citation Network in Neo4j

Building a co-author graph

The dataset doesn’t contain relationships between authors describing their collaborations, but we can infer them based on finding articles authored by multiple people.

The following Cypher statement creates a CO_AUTHOR relationship between authors that have collaborated on at least one article:

MATCH (a1)<-[:AUTHOR]-(paper)-[:AUTHOR]->(a2:Author)
WITH a1, a2, paper
ORDER BY a1, paper.year
WITH a1, a2, collect(paper)[0].year AS year, count(*) AS collaborations
MERGE (a1)-[coauthor:CO_AUTHOR {year: year}]-(a2)
SET coauthor.collaborations = collaborations;

We create only one CO_AUTHOR relationship between authors that have collaborated, even if they’ve collaborated on multiple articles. We create a couple of properties on these relationships:

  • a year property that indicates the publication year of the first article on which the authors collaborated
  • a collaborations property that indicates how many articles the authors have collaborated on

Diagram showing co-authors in Neo4j
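
If you want to spot-check a few of these inferred relationships, a query along these lines (again my addition) returns a small sample:

// Inspect a handful of CO_AUTHOR relationships and their properties
MATCH (a1:Author)-[r:CO_AUTHOR]-(a2:Author)
RETURN a1.name, a2.name, r.year, r.collaborations
LIMIT 5;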

Now that we’ve got our co-author graph, we need to figure out how we’re going to predict future collaborations between authors.

We’re going to build a binary classifier to do this, so our next step is to create train and test graphs.

Train and test datasets

As mentioned in the 1st post, we can’t just randomly split the data into train and test datasets, as this could lead to data leakage.

Data leakage can occur when data outside of your training data is inadvertently used to create your model. This can easily happen when working with graphs because pairs of nodes in our training set may be connected to those in the test set.

Instead we need to split our graph into training and test sub graphs, and we are lucky that our citation graph contains time information that we can split on. We will create training and test graphs by splitting the data on a particular year.

But which year should we split on? Let’s have a look at the distribution of the first year that co-authors collaborated:

Chart showing distribution of year of collaboration
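
If you’d like to reproduce this chart, the underlying counts can be computed with a simple aggregation like the one below (this is my sketch rather than the exact query used for the chart; plotting the results, e.g. with matplotlib, is left as an exercise):

// Number of first-time collaborations per year
MATCH ()-[r:CO_AUTHOR]->()
RETURN r.year AS year, count(*) AS collaborations
ORDER BY year;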

It looks like 2006 would act as a good year on which to split the data, because it will give us a reasonable amount of data for each of our sub graphs. We’ll take all the co-authorships from 2005 and earlier as our training graph, and everything from 2006 onwards as the test graph.

Let’s create explicit CO_AUTHOR_EARLY and CO_AUTHOR_LATE relationships in our graph based on that year. The following code will create these relationships for us:

Train sub graph

MATCH (a)-[r:CO_AUTHOR]->(b) 
WHERE r.year < 2006
MERGE (a)-[:CO_AUTHOR_EARLY {year: r.year}]-(b);

Test sub graph

MATCH (a)-[r:CO_AUTHOR]->(b) 
WHERE r.year >= 2006
MERGE (a)-[:CO_AUTHOR_LATE {year: r.year}]-(b);
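
We can verify how many relationships landed in each sub graph with a couple of count queries (run them separately, or in one go with the multi statement editor):

MATCH ()-[:CO_AUTHOR_EARLY]->() RETURN count(*) AS earlyCount;
MATCH ()-[:CO_AUTHOR_LATE]->() RETURN count(*) AS lateCount;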

This split leaves us with 81,096 relationships in the early graph, and 74,128 in the late one. This is a split of 52–48. That’s a higher percentage of values than we’d usually have in our test graph, but it should be ok.

The relationships in these sub graphs will act as the positive examples in our train and test sets, but we need some negative examples as well. The negative examples are needed so that our model can learn to distinguish nodes that should have a link between them and nodes that shouldn’t.

As is often the case in link prediction problems, there are a lot more negative examples than positive ones. The maximum number of negative examples is equal to:

# negative examples = (# nodes)² - (# relationships) - (# nodes)

i.e. the number of nodes squared, minus the relationships that the graph has, minus self relationships.
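
To get a sense of the scale involved, here’s a quick back-of-the-envelope calculation in Python with hypothetical round numbers (not the exact counts from our graph):

# A hypothetical graph with 20,000 nodes and 80,000 relationships
nodes = 20000
relationships = 80000

# Applying the formula above
max_negative_examples = nodes ** 2 - relationships - nodes
print(max_negative_examples)  # 399,900,000 candidate pairs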

Instead of using almost all possible pairs, we’ll use pairs of nodes that are between 2 and 3 hops away from each other. This will give us a much more manageable amount of data to work with.

We can generate these pairs by running the following query:

MATCH (author:Author)
WHERE (author)-[:CO_AUTHOR_EARLY]-()
MATCH (author)-[:CO_AUTHOR_EARLY*2..3]-(other)
WHERE not((author)-[:CO_AUTHOR_EARLY]-(other))
RETURN id(author) AS node1, id(other) AS node2

This query returns 4,389,478 negative examples compared to 81,096 positive examples, which means we have 54 times as many negative examples.

Imbalanced data

So we still have a big class imbalance: a model that simply predicted that no pair of nodes has a link would be right about 98% of the time (4,389,478 of 4,470,574 examples), while telling us nothing useful.

To solve this issue we can either up sample the positive examples or down sample the negative examples. We’re going to go with the downsampling approach.

py2neo, pandas, scikit-learn

For the rest of the post we’re going to be working in Python with the py2neo, pandas, and scikit-learn libraries.

The py2neo driver enables data scientists to easily integrate Neo4j with tools in the Python Data Science ecosystem. We’ll be using this library to execute Cypher queries against Neo4j.

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

scikit-learn is a popular machine learning library. We’ll be using this library to build our machine learning model.

We can install these libraries from PyPi:

pip install py2neo==4.1.3 pandas scikit-learn

Once we’ve got those libraries installed we’ll import the required packages, and create a database connection:

from py2neo import Graph
import pandas as pd
graph = Graph("bolt://localhost", auth=("neo4j", "neo4jPassword"))

Building our train and test sets

We can now write the following code to create a train DataFrame containing positive and negative examples based on the early graph:

# Find positive examples
train_existing_links = graph.run("""
MATCH (author:Author)-[:CO_AUTHOR_EARLY]->(other:Author)
RETURN id(author) AS node1, id(other) AS node2, 1 AS label
""").to_data_frame()
# Find negative examples
train_missing_links = graph.run("""
MATCH (author:Author)
WHERE (author)-[:CO_AUTHOR_EARLY]-()
MATCH (author)-[:CO_AUTHOR_EARLY*2..3]-(other)
WHERE not((author)-[:CO_AUTHOR_EARLY]-(other))
RETURN id(author) AS node1, id(other) AS node2, 0 AS label
""").to_data_frame()
# Remove duplicates
train_missing_links = train_missing_links.drop_duplicates()
# Down sample negative examples
train_missing_links = train_missing_links.sample(
    n=len(train_existing_links))
# Create DataFrame from positive and negative examples
training_df = train_missing_links.append(
    train_existing_links, ignore_index=True)
training_df['label'] = training_df['label'].astype('category')

Sample of the training DataFrame
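
It’s worth verifying that the down sampling left us with a balanced DataFrame. A quick check (my addition) should show an equal number of positive and negative examples:

# Count examples per class - the two counts should now be equal
training_df.groupby("label").size()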

And now we’ll do the same to create a test DataFrame, but this time we only consider relationships in the late graph:

# Find positive examples
test_existing_links = graph.run("""
MATCH (author:Author)-[:CO_AUTHOR_LATE]->(other:Author)
RETURN id(author) AS node1, id(other) AS node2, 1 AS label
""").to_data_frame()
# Find negative examples
test_missing_links = graph.run("""
MATCH (author:Author)
WHERE (author)-[:CO_AUTHOR_LATE]-()
MATCH (author)-[:CO_AUTHOR_LATE*2..3]-(other)
WHERE not((author)-[:CO_AUTHOR_LATE]-(other))
RETURN id(author) AS node1, id(other) AS node2, 0 AS label
""").to_data_frame()
# Remove duplicates
test_missing_links = test_missing_links.drop_duplicates()
# Down sample negative examples
test_missing_links = test_missing_links.sample(n=len(test_existing_links))
# Create DataFrame from positive and negative examples
test_df = test_missing_links.append(
    test_existing_links, ignore_index=True)
test_df['label'] = test_df['label'].astype('category')

Now it’s time to create our machine learning model.

Choosing a machine learning algorithm

We’ll create a random forest classifier. This method is well suited here because our data set will contain a mix of strong and weak features. While the weak features will sometimes be helpful, the random forest method will ensure we don’t create a model that overfits our training data.

We can create this model with the following code:

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=30, max_depth=10,
                                    random_state=0)

Now it’s time to engineer some features which we’ll use to train our model.

Feature extraction is a way to distill large volumes of data and attributes down to a set of representative, numerical values, i.e. features. This is then used as input data so we can differentiate categories/values for learning tasks.

Keep in mind that if you want to try out the code samples in the next part of this post you’ll need to make sure you have your Neo4j development environment setup as described in the 1st post of this series.

Generating link prediction features

We’ll start by creating some features using the link prediction functions:

def apply_graphy_features(data, rel_type):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           algo.linkprediction.commonNeighbors(
               p1, p2, {relationshipQuery: $relType}) AS cn,
           algo.linkprediction.preferentialAttachment(
               p1, p2, {relationshipQuery: $relType}) AS pa,
           algo.linkprediction.totalNeighbors(
               p1, p2, {relationshipQuery: $relType}) AS tn
    """
    pairs = [{"node1": pair[0], "node2": pair[1]}
             for pair in data[["node1", "node2"]].values.tolist()]
    params = {"pairs": pairs, "relType": rel_type}

    features = graph.run(query, params).to_data_frame()
    return pd.merge(data, features, on=["node1", "node2"])

This function executes a query that takes each pair of nodes from a provided DataFrame, and computes:

  • common neighbors (cn)
  • preferential attachment (pa), and
  • total neighbors (tn)

for each of those pairs. These measures are defined in the first post.
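
As a quick refresher, for a pair of nodes (a, b) with neighbour sets N(a) and N(b), these measures are defined as follows, where |…| denotes the size of a set (standard definitions, consistent with the first post):

cn(a, b) = |N(a) ∩ N(b)|
pa(a, b) = |N(a)| * |N(b)|
tn(a, b) = |N(a) ∪ N(b)|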

We can apply it to our train and test DataFrames like this:

training_df = apply_graphy_features(training_df, "CO_AUTHOR_EARLY")
test_df = apply_graphy_features(test_df, "CO_AUTHOR")

For the training DataFrame we compute these metrics based only on the early graph, whereas for the test DataFrame we’ll compute them across the whole graph.

We can still use the whole graph to compute these features: the evolution of the graph depends on what it looks like across all time, not only on what happened in 2006 and later.

Sample of the training DataFrame

Now we’re ready to train our model. We can do this with the following code:

columns = ["cn", "pa", "tn"]X = training_df[columns]
y = training_df["label"]
classifier.fit(X, y)

Our model is now trained, but we need to evaluate it.

Evaluating our model

We’re going to compute its accuracy, precision, and recall. The diagram below, taken from the O’Reilly Graph Algorithms Book, explains how each of these metrics is computed.

Accuracy measures
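
In terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the three metrics boil down to:

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)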

scikit-learn has built-in functions that we can use for this. We can also return the importance of each feature used in our model.

The following functions will help with this:

from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score

def evaluate_model(predictions, actual):
    accuracy = accuracy_score(actual, predictions)
    precision = precision_score(actual, predictions)
    recall = recall_score(actual, predictions)

    metrics = ["accuracy", "precision", "recall"]
    values = [accuracy, precision, recall]
    return pd.DataFrame(data={'metric': metrics, 'value': values})

def feature_importance(columns, classifier):
    features = list(zip(columns, classifier.feature_importances_))
    sorted_features = sorted(features, key=lambda x: x[1], reverse=True)

    keys = [value[0] for value in sorted_features]
    values = [value[1] for value in sorted_features]
    return pd.DataFrame(data={'feature': keys, 'value': values})

We can evaluate our model by running the following code:

predictions = classifier.predict(test_df[columns])
y_test = test_df["label"]
evaluate_model(predictions, y_test)

Accuracy, Precision, Recall

We have quite high scores across the board. We can now run the following code to see which feature is playing the most prominent role:

feature_importance(columns, classifier)

Feature Importance

We can see above that common neighbors (cn) is by far the dominant feature in our model. Common neighbors gives us a count of the unclosed co-author triangles that a pair of authors participates in, so this perhaps isn’t that surprising.

Now we’re going to add some new features that are generated from graph algorithms.

Triangles and The Clustering Coefficient

We’ll start by running the triangle count algorithm over our test and train sub graphs. This algorithm returns the number of triangles that each node forms, as well as each node’s clustering coefficient. The clustering coefficient of a node indicates the likelihood that its neighbours are also connected.
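
For reference, the local clustering coefficient of a node u is usually defined as the number of triangles through u divided by the number of pairs its neighbours could form:

coefficient(u) = 2 * triangles(u) / (degree(u) * (degree(u) - 1))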

We can run the following Cypher query in the Neo4j Browser to run this algorithm on our train graph:

CALL algo.triangleCount('Author', 'CO_AUTHOR_EARLY', { 
write:true,
writeProperty:'trianglesTrain',
clusteringCoefficientProperty:'coefficientTrain'});

And the following Cypher query to run it on the test graph:

CALL algo.triangleCount('Author', 'CO_AUTHOR', { 
write:true,
writeProperty:'trianglesTest',
clusteringCoefficientProperty:'coefficientTest'});

We now have 4 new properties on our nodes: trianglesTrain, coefficientTrain, trianglesTest, and coefficientTest. Let’s now add these to our train and test DataFrames, with help from the following function:

def apply_triangles_features(data, triangles_prop, coefficient_prop):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           apoc.coll.min([p1[$triangles], p2[$triangles]]) AS minTriangles,
           apoc.coll.max([p1[$triangles], p2[$triangles]]) AS maxTriangles,
           apoc.coll.min([p1[$coefficient], p2[$coefficient]]) AS minCoeff,
           apoc.coll.max([p1[$coefficient], p2[$coefficient]]) AS maxCoeff
    """

    pairs = [{"node1": pair[0], "node2": pair[1]}
             for pair in data[["node1", "node2"]].values.tolist()]
    params = {"pairs": pairs,
              "triangles": triangles_prop,
              "coefficient": coefficient_prop}

    features = graph.run(query, params).to_data_frame()
    return pd.merge(data, features, on=["node1", "node2"])

These measures are different from the ones we’ve used so far: rather than being computed from the pair of nodes, they are node-specific measures.

We can’t simply add these values to our DataFrame as node1Triangles or node1Coeff because we have no guarantee over the order of nodes in the pair. We need to come up with an approach that is agnostic of that order.

We can do this by taking the average of the values, the product of the values, or by computing the minimum and maximum value, as we do here. (An aside after the next code snippet shows how to derive the average and product variants.)

We can apply that function to the DataFrame like this:

training_df = apply_triangles_features(training_df,
                                       "trianglesTrain", "coefficientTrain")
test_df = apply_triangles_features(test_df,
                                   "trianglesTest", "coefficientTest")

And now we can train and evaluate:

columns = [
    "cn", "pa", "tn",
    "minTriangles", "maxTriangles", "minCoeff", "maxCoeff"
]
X = training_df[columns]
y = training_df["label"]
classifier.fit(X, y)
predictions = classifier.predict(test_df[columns])
y_test = test_df["label"]
display(evaluate_model(predictions, y_test))

Accuracy, Precision, Recall

Those features have been helpful! Each of our measures has improved by about 4% from the initial model. Which features are most important?

display(feature_importance(columns, classifier))

Feature Importance

Common neighbors is still the most influential, but the triangles features are adding some value as well.

In Summary

This post has already gone on a lot longer than I intended so we’ll stop there, but there are certainly some exercises left for the reader.

Engineering more features

Can you think of any other features that we could add that might help us to create a model with even higher accuracy? Perhaps other community detection or even centrality algorithms might help?

Extending support for link predictions

At the moment the link prediction algorithms in the Graph Algorithms Library only work for mono-partite graphs, i.e. graphs where both nodes have the same label. The algorithms are based on the topology of nodes, and if we try to apply them to nodes with different labels, those nodes will likely have different topologies, meaning that the algorithms won’t work so well.

We’re looking at adding versions of the link prediction algorithms that work on other graphs. If you have any particular ones that you’d like to see, please let us know on GitHub issues.

What next?

If you found this blog post interesting, you might enjoy the O’Reilly Graph Algorithms Book that Amy Hodler and I have been working on over the last 9 months. We’re in the final review stage and it should be available in the next few weeks.

You can register to get your free digital copy from the Neo4j website, at: neo4j.com/graph-algorithms-book

Will Lyon and I are also working on a new online training course on Data Science with Neo4j, so keep an eye on the Online Training page for details of that.
