The world’s leading publication for data science, AI, and ML professionals.

Why You Need a Knowledge Graph, And How to Build It

A Guide to Migrating from a Relational Database to a Graph Database

TLDR: A knowledge graph organizes events, people, resources, and documents in a graph database for advanced analysis. This article will explain the purpose of a knowledge graph and show you the basics of how to translate a relational data model into a graph model, load the data into a graph database, and write some sample graph queries.

Why a Knowledge Graph?

Relational databases are great for creating lists, but terrible for managing networks of diverse entities. Have you ever tried to do any of these tasks with a relational database?

  • analyze a healthcare episode of care when a patient interacted with dozens of people, places and procedures
  • find patterns in financial fraud with a web of vendors, customers and transaction types involved
  • optimize the dependencies and interconnected elements of a supply chain

These are all examples of networks of events, people and resources that create huge headaches for SQL analysts using relational databases. Relational databases become exponentially slower as the network size increases, while graph databases have a relatively linear relationship. If you are managing a network, or web, of activities and things, a Graph Database is the right choice. In the future, we should expect to see enterprise data groups adopting a combination of relational databases for isolated analysis on one business function, and knowledge graphs for complex, networked processes that span functions.

A knowledge graph, based in graph database technology, is built to handle a diverse network of processes and entities. In a knowledge graph, you have nodes that represent people, events, places, resources, documents, etc. And you have relationships (edges) that represent links between the nodes. The relationships are physically stored in the database with a name and direction. Not every graph database is a knowledge graph. To be considered a knowledge graph, the design must embed the business semantic model, reflected in clear business names for nodes and relationships, in a diverse set of nodes that span multiple business functions. You are in essence creating a seamless web out of all parts of the business that interact, and using the business semantics to closely tie data to the processes they represent. This can serve as the foundation for future generative LLM model use.

To illustrate a diverse set of data in a Knowledge Graph, let’s look at a simple example for supply chain logistics. The business process might be modeled like this:

Supply chain graph database model. Image by the author.
Supply chain graph database model. Image by the author.

This model could be extended to include any related part of the business processes: customer returns, invoices, raw materials, manufacturing processes, employees, and even customer reviews. There is no pre-defined schema, so the model can expand in any direction or depth.


From Relational Model to Dimensional Model to Graph Model

Now let’s go through the process of translating a typical relational database model into a graph model using the scenario of an ecommerce vendor. Let’s assume that this vendor is running a series of digital marketing campaigns, receiving orders on their website, and shipping product to customers. The relational model could look like this:

Ecommerce relational database model. Image by the author.
Ecommerce relational database model. Image by the author.

If we were to convert this to a dimensional model for use in a data warehouse, the model could look like this:

Ecommerce dimensional model (data warehouse). Image by the author.
Ecommerce dimensional model (data warehouse). Image by the author.

Note that the fact tables are focused on events, and the dimension tables represent all of the attributes of a business entity combined into one table. This event-centered design gives faster query time, but creates other problems. Each event is a distinct fact table, and it is difficult to see the connection from one event to a related event. There is no easy way to understand all the relationships between a dimension entity, like a product, and all of the events that it shares with an entity in another dimension, like a carrier, when those relationships are split between multiple fact tables. The Dimensional Model focuses on one event at a time, but obscures the connections between the different events.

The graph model solves the problem of showing interrelatedness between entities by modeling the process like this:

Ecommerce graph database model. Image by the author.
Ecommerce graph database model. Image by the author.

On first look, this graph model has more similarities to the relational model than to the dimensional model, but it can be used for the same analytical purposes as the data warehouse. Note that each relationship is named and has a direction. And relationships can be created between any nodes – event to event, person to person, document to event, etc. The graph queries also allow you to traverse the graph in ways not possible with SQL.

For example you can gather any nodes related to a key event and study the pattern of occurrence. Hierarchies are preserved and each level can be referenced individually, unlike a denormalized dimension table. Most importantly, graphs are much more flexible in modeling any event or entity in the business without following a strict set of schema constraints. The graph is designed to match the Semantic Model of the business.


Extract, Transform and Load (ETL)

Now let’s look at a sample relational database table, and create some sample scripts to extract, transform and load the data into a graph database. For this article, I’m going to use Cypher language, which is used by Neo4j, the most popular commercial graph database. But the concepts would apply to other variations of graph query languages (GQL). We will use the following sample Product table:

Product table. Image by the author
Product table. Image by the author

Using this query we could pull the new products updated in the last 24 hours:

SELECT product_id,
  product_name,
  cost_usd,
  product_status
FROM Product
WHERE last_updated_date > current_date -1;

We could pull those results into a Python Pandas dataframe named "df", open a graph database connection, and then merge the dataframe into the graph using this script

UNWIND $df as row
MERGE INTO (p:Product {product_id: row.product_id})
SET p.product_name = row.product_name,
  p.cost_usd = row.cost_usd,
  p.product_status= row.product_status,
  p.last_updated_date = datetime();

The first line references a parameter "df", which is the dataframe from Pandas. We will merge into the node type "Product", which is referenced by an alias "P". Then the "product_id" section is used to bind to a unique identifier in the node. After that, the Merge statement looks similar to a merge in SQL.

After we have created each of the nodes using merge statements like the one above, we create relationships. Relationships may be created either in the same script, or in a post-processing script using a merge command like this:

MATCH (p:Product), (o:Order)
WHERE p.product_id = o.order_id
MERGE (o)-[:CONTAINS]->(p);

The Match statement looks like the legacy join usage in Oracle, with two node types declared after the Match and then the join happening in the Where clause.


Queries on the Graph Model

Let’s assume that we have built the graph, and now want to query it. We can use a query like this to see the Ad Groups that have driven orders from Arizona.

MATCH (ag:AdGroup)<-[:BELONGS_TO]-(a:Ad)-[:DRIVES]->(o:Order)<-[:PLACES]-(c:Customer)
WHERE c.state = 'AZ'
RETURN ag.group_name,
  COUNT(o) as order_count

This query would return the ad group name and count of orders, filtered on the state of Arizona. Note that no Group By clause is required in Cypher, unlike SQL. From that query, we would receive the following sample output:

Sample results from a graph query. Image by the author.
Sample results from a graph query. Image by the author.

This example might seem trivial because you could easily create a similar query in a relational database or data warehouse using the order fact table. But let’s consider a more complicated query. Suppose that you want to see the time it takes from the launch of a campaign until the attributable deliveries have been received. In a data warehouse, this query would cross fact tables (not a simple task) and take considerable resources. In a relational database, this query would involve a long series of joins. In a graph database, the query would look like this:

MATCH (cp:Campaign) )<-[:BELONGS_TO]-(ag:AdGroup)<-[:BELONGS_TO]-(a:Ad)
MATCH (a)-[:DRIVES]->(o:Order)<-[:FULFILLS]-(d:Delivery)
RETURN cp.campaign_name,
  cp.start_date as campaign_launch_date,
  MAX(d.receive_date) as last_delivery_date

I used one sample query path, but there are a variety of paths that a user could take to answer different business questions. In the query, note that the path from Campaign to Delivery goes through a relationship between the Order and Delivery. Also note that for readability, I split the path into two parts, starting with the alias for Ad in the second line. The output of the query would look like this:

Sample results from a graph query. Image by the author.
Sample results from a graph query. Image by the author.

Conclusion

We have looked at some sample steps to translate an e-commerce business process from a relational model to a graph model, but we can’t cover all of the design principles in this one article. Hopefully you have seen that graph databases require about the same level of technical skill as relational databases, and that the migration isn’t a huge hurdle.

The biggest challenge is to retrain your brain away from traditional relational modeling techniques and think in terms of semantic or business modeling. If you see a potential application for graph technology, give it a try with a proof-of-concept project. The possibilities for analysis with a knowledge graph reach far beyond what you can do with two dimensional tables!

All images are by the author


Related Articles