Graph Databases. What’s the Big Deal?

Continuing the analysis on semantics and data science, it’s time to talk about graph databases and what they have to offer us.

Favio Vázquez
Towards Data Science


Introduction

Should we invest our precious time in learning a new way of ingesting, storing and analyzing data, with a touch of the mathematics of graphs?

I wasn’t sure of the answer when I started my investigation, but after a little while, my answer was:

I’m using a surprising amount of Kenan Thompson’s gifs. Wow.

In this article, I’ll discuss some ideas and concepts of graph databases: what they are, what their advantages are and how they can help us in our daily tasks.

Btw, I’m really tired of writing tons of JOINs and loooong queries to calculate the number of customers (and their average salary) who bought item X between January 2017 and October 2018 in state Y and have been customers for longer than Z months. So anything that helps me, and I think a lot of other people, reduce that time and make the work easier and more intuitive, I’m in.

What is a graph?

This is not the type of graph I’ll be talking about :)

There’s a problem in English when we talk about graphs (in Spanish we don’t have that problem). If you search for graph images online, this is what you will probably see:

google.com

But that’s not the kind of graph I want to talk about. When I talk about a graph here, this is what you should be picturing in your head:

I’m going to give two definitions of a graph. First the mathematical one, and then a simpler one.

According to Behzad and Chartrand:

A graph G is a finite, non-empty set V together with a (possibly empty) set E (disjoint from V) of two-element subsets of (distinct) elements of V. Each element of V is referred to as a vertex and V itself as the vertex set of G; the members of the edge set E are called edges. By an element of a graph we shall mean a vertex or an edge.

One of the most appealing features of graph theory lies in the geometric or pictorial aspect of the subject. Given a graph, it is often useful to express it diagrammatically, whereby each element of the set is represented by a point in the plane and each edge by a line segment.

It is convenient to refer to such a diagram of G as G itself, since the sets V and E are easily discernible. In the figure below, a graph G is shown with vertex set V = {V1, V2, V3, V4} and edge set E = {V1V2, V1V3, V2V4, V3V4}.

Copyright Favio Vázquez (you can use it of course)

As you can see, the set V contains the vertices (or points) of the graph and E the relationships between them (read V1V2 as “V1 is connected to V2”).
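To make the definition concrete, here is a minimal Python sketch of the graph G from the figure. The choice of `frozenset` for edges and the helper name `neighbors` are my own assumptions for illustration; they are not part of the Behzad and Chartrand definition.

```python
# Vertex set V and edge set E of the graph G described above.
V = {"V1", "V2", "V3", "V4"}

# Each edge is a two-element subset of V. A frozenset has no order,
# so V1V2 and V2V1 are the same edge.
E = {
    frozenset(pair)
    for pair in [("V1", "V2"), ("V1", "V3"), ("V2", "V4"), ("V3", "V4")]
}


def neighbors(vertex):
    """Return every vertex that shares an edge with `vertex`."""
    return {
        other
        for edge in E
        if vertex in edge
        for other in edge
        if other != vertex
    }


print(sorted(neighbors("V1")))        # ['V2', 'V3']
print(frozenset({"V2", "V1"}) in E)   # True
```

Storing edges as unordered pairs matches the “two-element subsets” in the definition, which is why the lookup for V2V1 succeeds.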

So in simple words, a graph is a mathematical representation of objects (or entities or nodes) and their relationships (or edges). Each one of those points can represent different things depending on what you want. By the way, here nodes and vertices mean the same thing, so we’ll use them interchangeably.

We will review an example of how to use them when we get to graph databases.

What is a database?

https://www.bmc.com/blogs/dbms-database-management-systems/

From techopedia:

A database (DB), in the most general sense, is an organized collection of data. More specifically, a database is an electronic system that allows data to be easily accessed, manipulated and updated.

In other words, a database is used by an organization as a method of storing, managing and retrieving information. Modern databases are managed using a database management system (DBMS).

Do you want to know the truth? From my experience most databases are:

  • Not organized
  • Not easily accessed
  • Not easily manipulated
  • Not easily updated

That’s especially true when we talk about doing data science. Years ago (like 20, lol) it was easier to maintain a database because the data was simpler, smaller and slower-moving.

Nowadays we can save almost whatever we want in a “database”, and I think that definition has become tied to another concept: the relational database.

In a relational database we have a set of “formally” described tables from which data can be accessed or reassembled in many different ways without having to reorganize the database tables. Basically we have schemas in which we can store different tables, inside those tables we have a set of columns and rows, and in a specific position (row and column) we have an observation.

We also have relationships between those tables, but they’re not the most important thing; the data the tables contain is. Normally they are pictured like this:

https://towardsdatascience.com/what-if-i-told-you-database-indexes-could-be-learned-6cf8f59bff94

What is a graph database?

https://www.cbronline.com/enterprise-it/software/graph-technology-data-standby-every-fortune-500-company/

Based upon the concept of a mathematical graph, a graph database contains a collection of nodes and edges. A node represents an object, and an edge represents the connection or relationship between two objects. Each node in a graph database has a unique identifier and a set of key-value pairs (its properties). Likewise, each edge has a unique identifier, a starting node and an ending node, along with its own set of properties.
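Here is a rough sketch of that structure in plain Python. The class names `Node` and `Edge`, their fields, and the sample values are assumptions I made for illustration; a real graph database has its own storage format and API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str                                      # unique identifier
    properties: dict = field(default_factory=dict)    # key-value pairs


@dataclass
class Edge:
    edge_id: str                                      # unique identifier
    start: str                                        # id of the starting node
    end: str                                          # id of the ending node
    label: str                                        # kind of relationship
    properties: dict = field(default_factory=dict)    # key-value pairs


# Hypothetical sample data: one customer, one item, one relationship.
nodes = {
    "n1": Node("n1", {"type": "Customer", "name": "John"}),
    "n2": Node("n2", {"type": "Item", "name": "Pepsi"}),
}
edges = [Edge("e1", "n1", "n2", "BUYS", {"frequency": "a lot"})]

for e in edges:
    start_name = nodes[e.start].properties["name"]
    end_name = nodes[e.end].properties["name"]
    print(start_name, e.label, end_name)   # John BUYS Pepsi
```

Note that the edge carries its own properties, so a fact about the relationship itself (how often John buys Pepsi) has a natural place to live.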

I’ll use an example from the people at Cambridge Semantics to illustrate how a graph database works.

Imagine we have some data stored by a local restaurant chain. Normally in a relational database you’d store customer information in one database table, the items you offer in another and the sales that you’ve made in a third table.

This is fine when I want to understand what I sold, what inventory to order and who my best customer is. But what’s missing is the connective tissue, the connection between the items, along with functions in the database that can let me make the most of it.
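The three-table layout can be sketched with Python’s built-in `sqlite3` module. The table names, column names and sample rows below are made up for illustration (they are not from the Cambridge Semantics example), but they show how even a simple “who bought what” question already needs two JOINs.

```python
import sqlite3

# In-memory database with hypothetical customers, items and sales tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE items     (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales (
    customer_id INTEGER REFERENCES customers(id),
    item_id     INTEGER REFERENCES items(id)
);
INSERT INTO customers VALUES (1, 'John'), (2, 'Jack');
INSERT INTO items     VALUES (1, 'Pepsi'), (2, 'Lemonade');
INSERT INTO sales     VALUES (1, 1), (1, 1), (2, 2);
""")

# Counting purchases per customer and item requires joining all three tables.
rows = conn.execute("""
SELECT c.name, i.name, COUNT(*) AS purchases
FROM sales s
JOIN customers c ON c.id = s.customer_id
JOIN items i     ON i.id = s.item_id
GROUP BY c.name, i.name
ORDER BY c.name
""").fetchall()

print(rows)   # [('Jack', 'Lemonade', 1), ('John', 'Pepsi', 2)]
```

The links between customers and items exist only implicitly, as matching ids in the `sales` table, so every question about connections has to rebuild them at query time.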

A graph database stores the same sort of data, but is also able to store linkages between the things. John buys a lot of Pepsi, Jack is married to Valerie and buys different drinks. I don’t have to run JOINs to understand how I should market to each individual customer. I can see the relationships in the data without having to make a hypothesis and test it.
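As a sketch of what “seeing the relationships” can look like, here is the same kind of data stored as explicit edges and explored with a breadth-first traversal in plain Python. The last two edges, the relationship labels and the function name `reachable` are assumptions for illustration; a real graph database would expose this through its own query language.

```python
from collections import deque

# (start, relationship, end) triples. The last two are hypothetical.
edges = [
    ("John", "BUYS", "Pepsi"),
    ("Jack", "MARRIED_TO", "Valerie"),
    ("Jack", "BUYS", "Lemonade"),
    ("Valerie", "BUYS", "Pepsi"),
]

# Adjacency list: each node points directly at its outgoing relationships.
adjacency = {}
for start, rel, end in edges:
    adjacency.setdefault(start, []).append((rel, end))


def reachable(start, max_hops):
    """Breadth-first traversal: map each node within `max_hops` of `start`
    to the chain of relationships that leads to it."""
    seen = {start}
    queue = deque([(start, [])])
    found = {}
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue  # do not expand beyond the hop limit
        for rel, nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                found[nxt] = path + [rel]
                queue.append((nxt, path + [rel]))
    return found


print(reachable("Jack", 2))
# {'Valerie': ['MARRIED_TO'], 'Lemonade': ['BUYS'], 'Pepsi': ['MARRIED_TO', 'BUYS']}
```

Two hops from Jack surface Pepsi through his spouse, a connection that nobody had to hypothesize in advance or encode as a JOIN.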

The people from Neo4j mention that:

Accessing nodes and relationships in a native graph database is an efficient, constant-time operation and allows you to quickly traverse millions of connections per second per core.

Whereas relational databases store highly structured data in tables with predetermined columns and rows, graph databases can map multiple types of relational and complex data. Thus, graph databases are not as rigid in their organization and structure as relational databases are. Relationships are stored natively as edges next to the vertices they connect, and both vertices and edges can have properties associated with them. This structure allows for a database that can depict complex relationships between otherwise unrelated data sets.

Uses of graph databases

Did you know that 2018 was touted as “The Year of the Graph”? More and more organizations, both large and small, have recently begun to invest in graph database technology, so we aren’t on a crazy path here.

I’m not saying that everything we know from relational databases and SQL will stop working. I’m saying that there are some cases (surprisingly a lot of them) where you are better off using a graph database than a relational database.

Here is an idea of when you should be using a graph database instead of something else:

  • You have highly related data.
  • You need a flexible schema.
  • You want to have a structure and build queries that are more similar to the way people think.

If instead you have highly structured data, you want to do a lot of grouping calculations and you don’t have that many relationships between your tables, then you may be better off with a relational database.

A graph database has another, less obvious advantage: it allows you to build a knowledge graph. Because they are graphs, knowledge graphs are more intuitive. People don’t think in tables, but they do immediately understand graphs. When you draw the structure of a knowledge graph on a whiteboard, it is obvious what it means to most people.

And then you can start thinking about building a data fabric, which can then allow you to rethink the way you do machine learning and data science as a whole. But that’s material for a future article.

Implementing a graph database in your company

Like traditional RDBMSs, graph databases can be either transactional or analytical.

Choose your focus when you choose your graph database. For example, the popular Neo4j, Neptune and JanusGraph are transactional (OLTP) graph databases, while something like AnzoGraph is an analytical (OLAP) graph database.

Be careful, though: you may need different engines for quick queries that touch single entities (e.g. What soda does Sean buy?) and for analytical queries that scan the whole database (e.g. What is the average price for a soda paid by people like Sean?). Graph OLAP databases are becoming very important as machine learning and AI grow, since a number of machine learning algorithms are inherently graph algorithms and run more efficiently on a graph OLAP database than on an RDBMS.
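The difference between the two kinds of query can be sketched in a few lines of Python. Everything here is hypothetical: the purchase data, the prices, and especially my reading of “people like Sean” as “customers who bought at least one soda that Sean also bought”. It is only meant to show the shape of each query, not how any particular engine runs it.

```python
from statistics import mean

# Hypothetical purchase edges: (customer, soda, price paid).
purchases = [
    ("Sean", "Cola", 1.50),
    ("Ana", "Cola", 1.70),
    ("Ana", "Orange", 1.30),
    ("Luis", "Orange", 1.10),
]

# Index the edges by customer so one entity can be looked up directly.
by_customer = {}
for customer, soda, price in purchases:
    by_customer.setdefault(customer, []).append((soda, price))

# OLTP-style query: touches a single entity.
# "What soda does Sean buy?"
sean_sodas = {soda for soda, _ in by_customer["Sean"]}

# OLAP-style query: has to scan every customer.
# "What is the average price for a soda paid by people like Sean?"
# Assumption: "like Sean" = shares at least one soda with Sean.
similar = {
    customer
    for customer, items in by_customer.items()
    if customer != "Sean" and sean_sodas & {soda for soda, _ in items}
}
avg_price = mean(price for c in similar for _, price in by_customer[c])

print(sean_sodas)            # {'Cola'}
print(similar)               # {'Ana'}
print(round(avg_price, 2))   # 1.5
```

The first query reads one adjacency list; the second walks the whole data set, which is the kind of workload an analytical engine is built for.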

Here you can find great resources for different types of graph databases and computing tools:

The use cases for graph OLAP databases are vast. For example, one can find key opinion leaders and book recommenders using the PageRank algorithm, conduct churn analysis to improve customer retention, or even do machine learning analysis to identify the top five factors that are driving book sales.
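To give a feel for the PageRank idea, here is a small pure-Python power-iteration sketch. The `follows` graph, the damping factor of 0.85 and the fixed 50 iterations are assumptions for illustration; a graph OLAP database would ship its own, far more scalable implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    `links` maps each node to the list of nodes it points to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among the nodes it points to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical graph of readers pointing at the reviewers they follow.
follows = {
    "ana": ["carla"],
    "ben": ["carla"],
    "carla": ["ana"],
    "dan": ["carla", "ben"],
}

ranks = pagerank(follows)
top = max(ranks, key=ranks.get)
print(top)   # carla
```

In this toy graph, carla collects links from everyone else, so she comes out as the “key opinion leader”.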

If you want to see more on why and how to implement a graph OLAP database, take a look at this article:

What’s next?

The following charts (taken from https://db-engines.com/) show the historical trend of the categories’ popularity. In the ranking of each month the best three systems per category are chosen and the average of their ranking scores is calculated. In order to allow comparisons, the initial value is normalized to 100.
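The calculation described above is simple enough to sketch. The function name `category_trend` and the scores are made up (they are not DB-Engines data), and this is only my reading of the description, not their actual code.

```python
def category_trend(monthly_scores):
    """Average the three best scores of each month, then rescale the
    series so that the first month equals 100."""
    averages = []
    for scores in monthly_scores:
        best_three = sorted(scores, reverse=True)[:3]
        averages.append(sum(best_three) / len(best_three))
    base = averages[0]
    return [100.0 * avg / base for avg in averages]


# Made-up ranking scores for one category over two months.
scores = [
    [10, 8, 6, 2],    # month 1: top three average to 8
    [12, 9, 9, 3],    # month 2: top three average to 10
]
print(category_trend(scores))   # [100.0, 125.0]
```

Because every category starts at 100, the charts compare growth rates rather than absolute popularity.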

Graph databases are getting a lot of attention
Together with time series databases, graph databases are at the top.

As the data sources continue to rapidly expand (with unstructured data growing the fastest of all), finding machine-based insights is becoming more and more important.

Graph databases provide an excellent infrastructure to link diverse data. With easy expression of entities and relationships between data, graph databases make it easier for programmers, users and machines to understand the data and find insights. This deeper level of understanding is vital for successful machine learning initiatives, where context-based machine learning is becoming important for feature engineering, machine-based reasoning and inferencing.

In the future I’ll discuss how graph databases can help us do machine learning and data science in general.


Data scientist, physicist and computer engineer. Love sharing ideas, thoughts and contributing to Open Source in Machine Learning and Deep Learning ;).