Working with Neo4j

RFM Analysis using E-Commerce Data

Published in

Towards Data Science

6 min readNov 11, 2020

Relational databases are logical way to manage data, but on the other hand, alternative approaches such as graph database can be more useful in many cases. It’s known that huge companies in various industries such as eBay, Airbnb, Cisco and many others use the graph database [1]. At this point, Neo4j shows itself as a graph database platform for managing the data.

In this article, I’ll try to explain how to create an example graph from the e-commerce data, using Neo4j and also touch on RFM Analysis.

The Data

First of all, you can find the e-commerce data I mentioned before here. At first glance, it’s clearly seen that the data consists of transactions. Therefore, the data includes a series of columns such as customer, purchased products, quantity and date of transaction.

It would be a step in the right direction to plan the schema before inserting data to Neo4j. The schema aimed to be builded in present study is as follows,

We can start with the customers now. Creating a constraint before creating nodes both prevents duplication and performs better because it uses MERGE locks [2]. You can create the constraint as follows,

CREATE CONSTRAINT ON (customer:Customer) ASSERT customer.customerID IS UNIQUE

Please notice that, having uniqueness for a property value is only useful in the graph if the property exists. Then you can create customer nodes as follows,

:auto 
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'https://git.io/JkIjD'
AS line
WITH 
  toInteger(line.CustomerID) AS CustomerID, 
  line WHERE NOT line.CustomerID IS null
MERGE(customer:Customer {customerID: CustomerID})
ON CREATE SET customer.country = line.Country

After creating customer nodes, it’ll be even easier to create product and transaction nodes. Likewise, firstly it would be correct to create constraint for product nodes.

CREATE CONSTRAINT ON (product:Product) ASSERT product.stockCode IS UNIQUE

There is an important point in here, when you create a constraint, Neo4j will create an index. Cypher will use that index for lookups just like other indexes. Therefore, there’s no need to create a separate index. In fact, if you try to create a constraint when there’s already an index, you’ll get an error.

After taking into account all of these, you can create product nodes as follows,

:auto 
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'https://git.io/JkIjD'
AS line
MERGE(product:Product {stockCode: line.StockCode})
ON CREATE SET product.description = line.Description

As you can see above, ON CREATE statement is used when creating nodes. If the node needs to be created, merge a node and set the properties. Similarly, you can also use the statement ON MATCH if the node already exists [3].

It’ll be nice to create transaction nodes just before start dealing with relationships as follows,

CREATE CONSTRAINT ON (transaction:Transaction) ASSERT transaction.transactionID IS UNIQUE;:auto 
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'https://git.io/JkIjD'
AS line
MERGE(transaction:Transaction {transactionID: line.InvoiceNo})
ON CREATE SET transaction.transactionDate = line.InvoiceDate

Looking at the Cypher statements above, you can see that semicolon is used to separate Cypher statements. In general, you don’t need to end a Cypher statement with a semi-colon, but if you want to execute multiple Cypher statements, you must separate them [2].

The nodes in the graph are ready, but these nodes have no connection with each other. The connections capture the semantic relationships and context of the nodes in the graph [2]. As it’s known, 3 types of nodes are available in the graph: Customer, transaction and product. As I mentioned at the beginning of this section, having relationships between customer-transaction and transaction-product will make this graph much more logical. The customer MADE a transaction and the transaction CONTAINS products. Here is the Cypher statement to building MADE relationships,

:auto 
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'https://git.io/JkIjD'
AS line
WITH toInteger(line.CustomerID) AS CustomerID, 
  line.InvoiceNo AS InvoiceNo
MATCH (customer:Customer {customerID: CustomerID})
MATCH (transaction:Transaction {transactionID: InvoiceNo})
MERGE (customer)-[:MADE]->(transaction)

Let’s finalize the graph by creating CONTAINS relationships,

:auto USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'https://git.io/JkIjD'
AS line
WITH toInteger(line.Quantity) AS Quantity, 
    toFloat(line.UnitPrice) AS UnitPrice,
    line.InvoiceNo AS InvoiceNo, 
    line.StockCode AS StockCode
MATCH (transaction:Transaction {transactionID: InvoiceNo})
MATCH (product:Product {stockCode: StockCode})
MERGE (transaction)-[r:CONTAINS]->(product) 
    ON CREATE SET r.quantity = Quantity, 
        r.price = UnitPrice * Quantity 
    ON MATCH SET r.quantity = r.quantity + Quantity, 
        r.price = r.price + UnitPrice * Quantity

You can now check the schema of the graph with the statement CALL db.schema.visualization(). The result is like that,

Keep in mind that you can create the graph in a different way. For instance, the transaction could have been a relationship, not a node, and we could call it BOUGHT. As you can imagine, which one you choose depends on your business problem. At this point, you should set the rules and build the structure accordingly.

RFM Analysis

RFM analysis is a behavior-based approach grouping customers into segments. It groups the customers on the basis of their previous purchase transactions. [4]. Here are the three dimensions of RFM,

Recency: How recently did the customer purchase?
Frequency: How often do they purchase?
Monetary Value: How much do they spend?

Segmenting customers using RFM analysis is an important point for companies that sell in many industries. Because companies want to know the customers that are valuable to them and to ensure loyalty for all their customers.

After mentioning the dimensions of the RFM and the significance of the customer segmentation, we can get the recency, frequency and monetary value with following python code,

Then, it would be a correct step to define the segments by creating percentiles for dimensions. Please keep in mind that the segmentation here can be taken to a much more advanced level and is often not that simple. Real world problems can be more complex.

When you run the above python code, you’ll see a result like the following,

As you can see in the output, there are descriptive statistics to segments. For instance, looking at the statistics of the best customers, it’s seen that they have recently purchased, frequently purchased, and the monetary value was quite high. Therefore, it’s important that customers in this segment are kept by the company.

There will be different approaches to other segments. This is natural considering the purpose of segmentation. It’s a big impact to develop different approaches for segments and to improve customer loyalty. For instance, you can see the descriptions and required actions to the four segments as follows,

Conclusion

As I mentioned at the beginning of the article, different approaches may be needed to find solutions to problems in business life. At this point, it’ll be necessary to identify the problem quite well and build the solution step by step. Although it is not a viable solution to all problems, trying different approaches to tackle an issue would be both beneficial for your company and your career development.

I hope it’ll be useful for you to get a head start on Neo4j and customer segmentation with this article.

Thank you for reading! If you’re curious about more, please visit my repository!

egemenzeytinci/neo4j-example

In the present study, an example graph was created from e-commerce data using Neo4j, then customer segmentation was…

github.com

References

[1] Helen Vdovychenko, Neo4j Database: Use Cases and Advantages For Your Project (2020)

[2] Getting More Out of Neo4j

[3] Use ON CREATE and ON MATCH

[4] Avinash Navlani, Introduction to Customer Segmentation in Python (2018)