Batching Transactions in Neo4j

Learn the difference between the APOC and native Cypher approaches to batching transactions

Tomaz Bratanic
Towards Data Science


Photo by Bernd Dittrich on Unsplash

When dealing with massive graphs and large-scale updates in Neo4j, you will likely run out of heap memory unless you resort to batching transactions. While you could use your favorite scripting language to batch transactions, you can also split a single Cypher query into multiple transactions directly in Cypher or with the help of the APOC library.
This post aims to demonstrate the latter, meaning we will only look at how to split a single Cypher query into multiple transactions.

We will be using the ICIJ Paradise Papers dataset, made available by the International Consortium of Investigative Journalists (ICIJ). The ICIJ has a long history of publishing spectacular investigations like the Panama, Paradise, and, most recently, the Pandora Papers. If you are interested, you can explore the datasets on their website, or download the data (licensed under the Open Database License) and explore it with your favorite data mining tool. Luckily for us, the dataset is available as a Neo4j Sandbox project. Neo4j Sandbox is a free cloud instance of Neo4j that comes with a pre-populated database. Click on the following link to create your own Sandbox instance of the Paradise Papers dataset.

While the graph model is slightly more complex, we will only use the following subset of the graph for our demonstration.

Subset of the Paradise Papers graph. Image by the author.

We are only interested in officers (green) and entities (orange) and the OFFICER_OF relationships between the two. This is essentially a bipartite network containing two types of nodes. To analyze bipartite networks, we often transform them into a monopartite network as the first step of the analysis. In this example, we will take the above bipartite network and project a monopartite network of officers, where the relationships between them capture whether, and how many, entities they have in common.

Monopartite projection of officers. Image by the author.

This is a straightforward operation in Cypher. We simply match the original pattern and count the occurrences, and optionally store the results as relationships between officers. However, these types of operations are likely to explode in the number of rows.

MATCH (n:Officer)-[:OFFICER_OF]->(:Entity)<-[:OFFICER_OF]-(m)
// avoid duplication
WHERE id(n) < id(m)
// count the number of occurrences per pair of nodes
WITH n, m, count(*) AS common
// return the total number of rows
RETURN count(*) AS numberOfRows

In our example, we need to create 1,313,187 relationships to project a monopartite network of officers and store the number of common entities as the relationship weight. If two officers have no entities in common, no relationship is created.
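
For reference, here is what the same update would look like as a single, unbatched transaction. This is only a sketch for comparison (the COMMON_ENTITY relationship type is an arbitrary name); on a memory-constrained instance, this is exactly the query we want to avoid running in one transaction.

MATCH (n:Officer)-[:OFFICER_OF]->(:Entity)<-[:OFFICER_OF]-(m)
WHERE id(n) < id(m)
WITH n, m, count(*) AS common
// creating all ~1.3 million relationships in one transaction strains the heap
MERGE (n)-[c:COMMON_ENTITY]->(m)
SET c.count = common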

Unfortunately, the Neo4j Sandbox instance has only 1GB of heap memory. Consequently, creating more than a million relationships in a single transaction is likely to run into memory issues. Before we continue, we need to increase the transaction timeout setting. By default, the Sandbox instances have a transaction timeout of 30 seconds, meaning that if a transaction lasts longer than 30 seconds, it will be automatically terminated. We can avoid that by setting the following transaction timeout configuration.

CALL dbms.setConfigValue('dbms.transaction.timeout', '0');
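
Once the large update is done, you may want to restore the timeout instead of leaving it disabled. The sketch below assumes you want to return to the 30-second Sandbox default mentioned above.

CALL dbms.setConfigValue('dbms.transaction.timeout', '30s');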

For those of you who have some experience with Neo4j and batching, you might be familiar with the apoc.periodic.iterate procedure, which is frequently used to batch transactions.

CALL apoc.periodic.iterate(
  // first statement
  "MATCH (n:Officer)-[:OFFICER_OF]->()<-[:OFFICER_OF]-(m)
   WHERE id(n) < id(m)
   WITH n,m, count(*) AS common
   RETURN n,m, common",
  // second statement
  "MERGE (n)-[c:COMMON_ENTITY_APOC]->(m)
   SET c.count = common",
  // configuration
  {batchSize:50000})

In the first statement, we provide the data stream to operate on. The data stream can consist of millions of rows. The second statement does the actual update. In our case, it will create a new relationship between a pair of officers and store the count as the relationship weight. The batchSize parameter in the configuration defines the number of rows to be committed in a single transaction. For example, by setting batchSize to 50,000, a transaction will be committed after 50,000 executions of the second statement. In our case, the first statement produces about 1.3 million rows, which means the update will be split into 27 transactions (26 full batches of 50,000 rows and one final, smaller batch).
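
The procedure also reports how the run went, which is handy for spotting failed batches. A minimal sketch of inspecting those statistics via the procedure's YIELD columns:

CALL apoc.periodic.iterate(
  "MATCH (n:Officer)-[:OFFICER_OF]->()<-[:OFFICER_OF]-(m)
   WHERE id(n) < id(m)
   WITH n,m, count(*) AS common
   RETURN n,m, common",
  "MERGE (n)-[c:COMMON_ENTITY_APOC]->(m)
   SET c.count = common",
  {batchSize:50000})
// batches = number of transactions; failedBatches/errorMessages reveal problems
YIELD batches, total, failedBatches, errorMessages
RETURN batches, total, failedBatches, errorMessages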

In Neo4j 4.4, batching transactions was introduced as a native Cypher feature. To batch transactions using only Cypher, you define a subquery that updates the graph, followed by IN TRANSACTIONS OF X ROWS.

:auto MATCH (n:Officer)-[:OFFICER_OF]->()<-[:OFFICER_OF]-(m)
WHERE id(n) < id(m)
WITH n,m, count(*) AS common
CALL {
  WITH n,m,common
  MERGE (n)-[c:COMMON_ENTITY_CYPHER]->(m)
  SET c.count = common
} IN TRANSACTIONS OF 50000 ROWS

If you execute the above Cypher statement in Neo4j Browser, you must prepend the :auto command; you can omit it when executing the query from your favorite scripting language. The logic is identical to the APOC batching. We first define the stream of data (the outer query) and then use a subquery (the CALL block) to batch large updates.
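
If you have run both versions, a quick sanity check is to compare the two projections; both counts should match (1,313,187 in this dataset):

// count both relationship types and return them side by side
MATCH ()-[r1:COMMON_ENTITY_APOC]->()
WITH count(r1) AS apocCount
MATCH ()-[r2:COMMON_ENTITY_CYPHER]->()
RETURN apocCount, count(r2) AS cypherCount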

When to use which?

So what’s the difference, and when should you use which?

I think the most significant difference is in how they handle errors. First of all, if a single execution within a batch fails, the whole batch fails, regardless of whether you are using APOC or Cypher to batch transactions. However, the Cypher variant will not continue with the operation after a single batch has failed, while previously committed transactions are not rolled back. So, if the third transaction fails, the two that were committed before it stay committed, but no further batches are executed.
In contrast, the APOC variant has no problem if an intermediate batch fails and will iterate through all the batches regardless. Furthermore, APOC has the option to define retries in case of a failing batch, which is something native Cypher transaction batching lacks.
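
For instance, the retries configuration option tells APOC to retry a failed batch before counting it as failed; the value of three below is an arbitrary choice for illustration.

CALL apoc.periodic.iterate(
  "MATCH (n:Officer)-[:OFFICER_OF]->()<-[:OFFICER_OF]-(m)
   WHERE id(n) < id(m)
   WITH n,m, count(*) AS common
   RETURN n,m, common",
  "MERGE (n)-[c:COMMON_ENTITY_APOC]->(m)
   SET c.count = common",
  // retry each failed batch up to 3 times before giving up on it
  {batchSize:50000, retries:3})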

For example, when you want to update nodes through an external API, I would advise using apoc.periodic.iterate with a batchSize value of 1. Since external APIs are unpredictable and potentially costly, we want to store all the information we have gathered from the API and iterate through all the nodes, regardless of whether some updates fail in between.
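
As a minimal sketch of this pattern, assume a hypothetical JSON endpoint that enriches an officer by name (the URL and the response field are made up for illustration):

CALL apoc.periodic.iterate(
  "MATCH (o:Officer) RETURN o",
  // hypothetical enrichment API; batchSize:1 commits every node separately,
  // so results gathered before any failure stay in the database
  "CALL apoc.load.json('https://api.example.com/enrich?name=' + o.name)
   YIELD value
   SET o.enriched = value.data",
  {batchSize:1})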

Another feature apoc.periodic.iterate offers is the option to run update statements in parallel. We need to ensure we won't run into any deadlocks when doing parallel updates; otherwise, the transaction execution will fail. As a rule of thumb, you can't use parallel execution when creating relationships, because the query might try to create several relationships starting or ending at the same node, which would result in a node deadlock and a failed execution. However, when each execution updates a single node property once, you can be confident you won't run into deadlocks. For example, if you wanted to store the node degree as a node property, you could use parallel execution.

CALL apoc.periodic.iterate(
  "MATCH (o:Officer)
   RETURN o",
  // each execution touches only its own node, so parallel updates are safe
  "WITH o, size((o)--()) AS degree
   SET o.degree = degree",
  {batchSize:10000, parallel:true})

Conclusion

All in all, apoc.periodic.iterate seems to be the older and more mature brother in the transaction-batching family. As of now, I am more inclined to use APOC batching instead of native Cypher batching. The only advantage I see in the Cypher approach is when you want to terminate the operation after a single failed batch; otherwise, I would still suggest using APOC.

p.s. If you are interested in graph analysis, I prepared a sample graph analysis of the Paradise Papers dataset a while back.
