Fire up your Centrality Metric engines: Neo4j vs NetworkX — a drag race, of sorts…

Kristof Neys
Towards Data Science
7 min readNov 6, 2020

--

Photo by eloy carrasco on Unsplash

In this article, Mark Needham and I hold a drag-race between the Python NetworkX package and the Neo4j Graph Data Science library. In this race we let them compete to discover which one is the fastest in computing the Betweenness and Harmonic centrality measure on undirected graphs. We will also use the Neo4j Python driver so that we don’t even need to leave the cushy environment of our Python IDE.

Why NetworkX?

The NetworkX package is the old kid on the block and was released all the way back in 2008¹ (…in those days Neo4j was still having a hot chocolate for breakfast…). By many measures, it is still the most popular Python package for network analysis, and indeed by our count, at least four books on network analysis use the NetworkX package as their workhorse.

Which Centrality metric?
As mentioned, we focus on two important network centrality measures: the Betweenness and Harmonic centrality metric. The Betweenness metric measures the importance of a node based on the fraction of shortest paths passing through it, as such it is also often viewed as how much ‘information flow’ it captures. As an example in the graph below: should Alice leave, many nodes will end up isolated.

Neo4j image

The Harmonic centrality measure is an improvement on the Closeness metric and is therefore also a ‘distance-based’ metric, where a high score implies that this specific node can, on average, reach other nodes in fewer hops. As we argued in a previous article; it’s sort of the ‘Gold standard’ according to Boldi & Vigna.

We have organized this article in two main parts: we first explore the set-up of the Neo4j database and Python driver as well as the NetworkX algorithms, and in the second part we experiment and comment on the actual performance of our competitors.

Preparing for the race…

A great post on how to create a database in Neo4j, as well as the Python driver, was published by Shuyi Yang. The change we will make, however, is for the Neo4j Python driver to read files from the ‘import’ directory of the database, (as opposed to a URL, as in the article by Shuyi). This is handy for when we have multiple different files we want to perform network analysis on, which we’ll come back to later. As a final step, we install the Neo4j Python driver in our environment through `pip install neo4j` or even easier by loading the driver as a package in PyCharm.

Using the Neo4j Python driver to construct a database.

As can be witnessed from the code below; it only takes a few simple lines to access the vast wealth of tools and functions of Neo4j. We start off by instantiating the driver where the arguments are the bolt connector, which we can find in the Neo4j browser, followed by the authentication of user and password, which we established when creating the project.

Next, we define the Neo4j queries we want the driver to perform, which need to be written in the Cypher language. The first of the two main statements will create a constraint on the nodes of the graph to ensure it is unique. This constraint will also create an index on the Node(id) property which will significantly improve the runtime. The second query is the standard Cypher statement to create the database, as we would use in the Neo4j environment and we also make sure that the file-path is referenced with a file:/// prefix before the file name.

We are now ready to run the driver sessions, which we perform in two steps. The first run uses the ‘CREATE OR REPLACE DATABASE’ statement which is executed against the system database and allows the creation of a database ‘on-the-fly’. The second run executes both the constraint creation statement as well as the database import statement. As a safety check, we also counted the nodes as a query.

Calculating the Centrality Metrics in Neo4j

Once we have the data inserted we can compute the centrality metrics using the Neo4j Graph Database library (GDS). There is nothing fancy about calling the GDS procedures using Cypher, as shown in the code below, and when the respective metrics have been retrieved we need to do some formatting in order to measure the accuracy between the Neo4j and NetworkX outputs. We do this by using the Kendall Tau measure, calling on the ‘scipy’ library, which requires the two lists to be in a ‘numpy’ array. For that, we have called our Pandas friend where we sort the results in a descending list which then needs formatting to a numpy array to use in our accuracy test.

As a safety measure, we again save the data to a csv file. This is not strictly necessary of course, but as we want to run everything from one script, the reason for which will become clear in a little while, it will prove to be valuable for when things trip up.

Before we move onto the NetworkX engine, a few sentences about the real world network we used. From Stanford’s SNAP database we selected the Deezer RO set http://snap.stanford.edu/data/gemsec-Deezer.html. This is a friendship network, undirected, from the music streaming service Deezer, and consists of 41,773 nodes and 125,826 edges.

Getting our NetworkX engine ready

We can now stroll over to the NextworkX pitlane. The Deezer RO network comes in a simple edgelist for which we use the NetworkX function to load the graph:

G = nx.read_edgelist(RW_network, nodetype=int, comments='#',
delimiter=',', create_using=nx.Graph(), encoding='utf-8')

The NetworkX centrality functions are equally straightforward:

HC_nx = nx.harmonic_centrality(G)
BC_nx = nx.betweenness_centrality(G)

And finally, once again we ask our Pandas friend to bring some structure, but we also need to normalise the values. Here we apply the same method as the Neo4j algorithm by simply dividing by the number of nodes — 1.

With both engines ready, we can roll them onto the starting grid, which we combine in the following Python script, where we have every function call timed:

Let the Games begin!

As soon as the green “Run” button was pressed we saw NetworkX first off the grid by constructing the graph in just a few seconds, while Neo4j, using the Python driver, took 9.7 seconds. However, that was as good as it got for NetworkX. For the Harmonic centrality Neo4j GDS took a mere 14 seconds to return the 41,773 centrality values… It took NetworkX 3,997 seconds to cross the finish line…(that is 1 hour 6 mins). When computing the Kendall Tau the two lists matched perfectly, scoring a clean 1.00. Computing the Betweenness centrality became sort of embarassing — Neo4j returning the values in a smidget over 10 seconds…. NetworkX crawling over the line in 15,284 seconds, or around 4 hours 15 mins. The Betweenness Kendall Tau came in at 0.986, which is only marginally off and likely due to some ties or floating point differences (hardware used: macOS, 3.1 GHz Dual-Core Intel Core i5, 8GB memory).

The need for Speed

The magic sauce in the Neo4j GDS algorithm is twofold. For the Harmonic centrality the Neo4j GDS algorithm makes use of the Multi-Source Breadth-First-Search approach², which is an algorithm that allows for running multiple concurrent BFS over the same graph on a single CPU core. When running the Betweenness centrality, the NetworkX algorithm is executed on a single thread. By contrast, the Neo4j GDS partitions the node space evenly after which it runs the Brandes algorithm for each node in each partition, hence it applies a multi-threaded approach.

Conclusion

It is no secret that NetworkX is a rather slow package, but this exercise shows that for medium to large undirected graphs Neo4j GDS becomes the go-to engine. Furthermore, using the handy trick we referred to earlier of changing the Neo4j settings, the graph analysis can be done entirely using the Neo4j Python driver and as such it allows running the analysis over a bunch of different graphs stored in the ‘import’ directory using a simple Python loop. A set of say 20 graphs of between 50–100k nodes that would take days to analyse using NetworkX, can now be completed in a matter of hours.

¹ Aric A. Hagberg, Daniel A. Schult and Pieter J. Swart, “Exploring network structure, dynamics, and function using NetworkX”, in Proceedings of the 7th Python in Science Conference (SciPy2008), Gäel Varoquaux, Travis Vaught, and Jarrod Millman (Eds), (Pasadena, CA USA), pp. 11–15, Aug 2008

² Then, Manuel, et al. “The more the merrier: Efficient multi-source graph traversal.” Proceedings of the VLDB Endowment 8.4 (2014): 449–460.

--

--

Graph Data Science specialist at Neo4j, fascinated by anything with Graphs and Deep Learning. PhD student at Birkbeck, University of London