How I learned to stop worrying and love the graph database

Sometimes, the relational model doesn’t cut it

Ray Robinson
Towards Data Science
15 min read · Oct 5, 2019

In 1970, Edgar F. Codd, a British computer scientist working for IBM, had an idea that still governs the way we work to this day. That idea was the “relational model”, under which data would be stored in tables with specially designated columns used to relate the values in one table to another.

Edgar F. Codd (1923–2003). Father of the relational database.

Codd decreed 12 rules for his model. And while most databases today don’t follow all of them, the Codd model still pretty much rules how data is modeled. It was all very structured and mathematical, and those who followed it were always using terms like “tuples,” “relational algebra” and “third normal form.”

If you had any formal computer science training back in the 20th century, you probably had Codd’s model drilled into you. If you’re like me, it’s not a pleasant memory. My professor had a picture of Codd on the wall of the classroom, glowering down at us like Big Brother in “1984”.

But the Codd model works. It adds a rigor to data modeling that ensures a clean, row-and-column structure that is perfect for visualization tools like Tableau. So before I go any further, hats off to Dr. Codd.

That said, there’s at least one area where it doesn’t work so well.

When we try to understand the connections between various data points, especially those that might be in the same column, the model breaks down. Who in the column, for example, are the gatekeepers and influencers? The SQL to answer that question can get pretty convoluted.

Enter the graph database, which basically throws Codd’s model out the window.

A graph database, typified by products like Neo4j, is basically a database of connections. It’s the technology that allows social networking sites like Facebook or LinkedIn to map connections between their users. When Amazon recommends a product to you, it’s because the graph shows connections between you and others who purchased the product.

For this article, I examined how the capabilities of graph databases might be applied to smaller scale data analytics projects than those being tackled at Facebook and LinkedIn. I also examined a few tools to visualize the output from graph databases.

I’m actually not too fond of the term “graph database”. To the uninitiated, it sounds like a database designed to produce bar charts or something similar. But that’s the name that’s been settled on, so I’ll go with it. In this article, though, I’ll refer to a visualization created from a graph database as a “network”. I think it’s more descriptive.

Before I go on, I should make two points:

  1. While the examples here will concentrate on Neo4j and a relatively small set of visualization tools, I mean no slight to their competitors. It’s just that Neo4j, Tableau and a few others are the tools I know best.
  2. The data models and examples presented here are very simple. You shouldn’t assume that they are the limit of the complexities that can be handled with graph databases.

Graph database basics

To understand graph databases, you have to put tables, rows, columns and foreign keys out of your head for a moment and think in terms of four objects:

  1. Nodes, which are roughly the graph equivalent of a row in your database.
  2. Relationships (also known as Edges), which are the connections between nodes.
  3. Labels, which group nodes and edges with similar properties. You might envision a group of nodes grouped together under a label as akin to a table in a relational database.
  4. Properties, which are name/value pairs associated with nodes and edges.

Assume we wanted to create a graph database of people and their preferred tools for data visualization. A very simple implementation might go something like this:

“Ray Robinson” would be one of a number of nodes that are grouped together under the label “Person”. There could be properties for him such as his “city” or “company”.

Ray would be connected by a relationship with the label “TOOLS” to other nodes grouped together under labels for various products he uses, such as Tableau and Neo4j.

Each of the product nodes would have a property indicating what type of tool it is, such as “Graph DB” or “Visualization”.

A simple graph data model

And with that, you have a simple data model that provides some powerful query capabilities not easily done in a relational database.
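
As a rough illustration of those four objects, here is a minimal Python sketch of the model described above. It is plain in-memory Python, not Neo4j's storage format or API. The class names (Node, Relationship), the "Product" label, the "company" value and the helper tools_of_type are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str  # groups similar nodes, e.g. "Person"
    properties: dict = field(default_factory=dict)  # free-form name/value pairs


@dataclass
class Relationship:
    source: Node
    label: str  # e.g. "TOOLS"
    target: Node
    properties: dict = field(default_factory=dict)


# One person and two products. "Example Co" is a placeholder value.
ray = Node("Person", {"name": "Ray Robinson", "company": "Example Co"})
tableau = Node("Product", {"name": "Tableau", "type": "Visualization"})
neo4j = Node("Product", {"name": "Neo4j", "type": "Graph DB"})

relationships = [
    Relationship(ray, "TOOLS", tableau),
    Relationship(ray, "TOOLS", neo4j),
]


def tools_of_type(person_name, tool_type):
    """Names of products of the given type that the person is connected to."""
    return [
        r.target.properties["name"]
        for r in relationships
        if r.label == "TOOLS"
        and r.source.properties.get("name") == person_name
        and r.target.properties.get("type") == tool_type
    ]


print(tools_of_type("Ray Robinson", "Graph DB"))  # ['Neo4j']
```

Because properties are ordinary dictionaries, one node can carry keys that another lacks, which is the same looseness a graph database allows.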

Picture the model above with thousands of “Person” nodes, each attached to nodes for the products and types used by each person. What’s the likelihood, for example, that users will adopt one product based on the preferences of other users with whom they share connections?

It’s important to note here that graph databases do not require the same rigorous controls over data types and structure found in relational databases.

For example, one node might have a different set of properties than another. The same goes for relationships between the nodes. And new properties and labels can be added at any time.

And as for data types, no worries. If you’ve been populating some property with integer values and suddenly decide to start using it for text strings, your graph database won’t stop you.

For those of us who began our careers with relational databases like Oracle, SQL Server or Postgres, it’s a difficult concept to grasp. And it obviously carries some risks. You can make a mess of your data model pretty quickly.

But for an analyst trying to determine causes, effects and relationships, the graph model is often superior.

A word or two about what category graph databases fall into:

Some argue that graph databases are simply another type of “NoSQL” technology, along with document and key-value databases. There’s a certain logic to that because graph databases do not use SQL, although they have their own query languages.

But many graph evangelists contend that the unique capabilities of graph databases place them in a category all their own. I won’t try to settle that argument here.

With that (very) brief graph explainer out of the way, let’s move to some examples.

A graph database without the database: Gephi

The easiest way to get a taste of visualizing graph data is with a tool that’s not strictly speaking a database at all.

Gephi (https://gephi.org/) is free and open source software developed originally by students at the University of Technology of Compiegne in France and now supported and managed by the non-profit Gephi Consortium.

It is a Java-driven desktop application that runs on Windows, Mac OS and Linux, as long as you have Java 7 or 8, plenty of RAM and a dedicated graphics card.

Gephi is at once the most fun and most frustrating tool included in this article. The fun part is that it allows you to start working with graph data without setting up, configuring and learning the ins and outs of a graph database. We’ll get to the frustrating part later.

The Gephi user interface

To demonstrate Gephi’s capabilities, I worked with a set of data captured from Twitter during the three days following the suicide of Jeffrey Epstein, a convicted abuser of young girls who had connections to a variety of prominent individuals, including Donald Trump and Bill Clinton.

The Epstein suicide was a story guaranteed to gin up conspiracy theorists, and it didn’t disappoint. Almost immediately, Twitter lit up with speculation on a possible Clinton connection to Epstein’s death, circulating under the hashtag #ClintonBodyCount.

It wasn’t long until the #ClintonBodyCount story broke into mainstream media, which often uses Twitter as an early warning system for stories that are commanding lots of public attention.

In fact, there were only 48,571 #ClintonBodyCount tweets over those three days, which isn’t much by Twitter standards. That number alone raises the question, despite the media attention, of how much of a story #ClintonBodyCount really was.

But more interesting was what was uncovered when the data was visualized as a network with Gephi.

For all the attention the #ClintonBodyCount story received, the Gephi visualization below shows that it was pushed by a tiny group of influencers whose Twitter profiles (“Comedian” … “#QAnon researcher” … “Host of Red, White and F You”) suggest they are hardly in the mainstream of American political discussion.

Gephi visualization of the #ClintonBodyCount network.

The most important step in a Gephi analysis is wrangling your data into two spreadsheets or .csv files that Gephi can import and read.

The first, known as the “nodes” file, is basically the raw data you wish to visualize.

The second, known as the “edges” file, defines the relationships between your nodes. It’s the edges file, shown below, that requires the data wrangling. I created it using Tableau Desktop and Tableau Prep Builder.

But it really depends on what tool you’re comfortable with. Excel would probably work fine if you’re familiar with functions like VLOOKUP.

You could also run a few queries and output the file from any relational database, or put a list together with R or Python.

Gephi “edges” file for the #ClintonBodyCount analysis

The edges file has two required columns: “Source”, which is the node initiating some sort of action; and “Target”, which is the node receiving the action initiated by the Source column.

For the #ClintonBodyCount analysis, the logic worked like this:

1) Any user sending a tweet was listed in the Source column.

2) Any user who was mentioned (“@mention”) or retweeted (“retweet”) by the user in the Source column was listed in the Target column.

The more a user in the Target column was mentioned or retweeted, the greater his or her influence in the #ClintonBodyCount network.

A third column, “Weight”, is optional in the Gephi edges file. It signifies the strength of the connection between a source and target node. In this case, I calculated it by adding the values in the “@mention” and “retweet” columns.
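
The logic above can be sketched in a few lines of Python. The tweet records and their field names below are invented for illustration; the real file was built with Tableau Desktop and Tableau Prep Builder. Only the Source, Target and Weight column names and the weight rule (mentions plus retweets) come from the description above.

```python
import csv
import io
from collections import Counter

# Hypothetical input: one record per tweet, with the sender, the users
# mentioned, and the user retweeted (if any).
tweets = [
    {"user": "alice", "mentions": ["bob"], "retweet_of": None},
    {"user": "alice", "mentions": [], "retweet_of": "bob"},
    {"user": "carol", "mentions": ["bob", "dave"], "retweet_of": None},
    {"user": "dave", "mentions": [], "retweet_of": "carol"},
]

mention_counts = Counter()
retweet_counts = Counter()
for tweet in tweets:
    for target in tweet["mentions"]:
        mention_counts[(tweet["user"], target)] += 1
    if tweet["retweet_of"]:
        retweet_counts[(tweet["user"], tweet["retweet_of"])] += 1

# Weight = mentions + retweets for each (Source, Target) pair.
edges = [
    {"Source": s, "Target": t,
     "Weight": mention_counts[(s, t)] + retweet_counts[(s, t)]}
    for (s, t) in sorted(set(mention_counts) | set(retweet_counts))
]

# Write the edges as CSV text. Swap the StringIO buffer for
# open("edges.csv", "w", newline="") to produce a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Source", "Target", "Weight"])
writer.writeheader()
writer.writerows(edges)
print(buffer.getvalue())

# Total incoming weight per Target: a rough proxy for influence.
influence = Counter()
for edge in edges:
    influence[edge["Target"]] += edge["Weight"]
print(influence.most_common(1))  # [('bob', 3)]
```

Summing the incoming weight per Target is only a crude stand-in for the statistics Gephi computes, but it shows why heavily mentioned and retweeted accounts rise to the top.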

When the nodes and edges files are loaded into Gephi, the fun begins. You’ll first be presented with a big, incomprehensible block showing all the connections. But by using the various layout algorithms and statistical functions that are included in Gephi, you’ll be able to massage your visualization into something like the #ClintonBodyCount graphic shown above.

The statistic I used to visualize the #ClintonBodyCount network is known as modularity, which measures how strongly a network divides into communities: groups of nodes that are connected more densely to each other than to the rest of the network.
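
For readers curious about what that score means, here is a small Python sketch that computes modularity for a partition that is already known. It uses the standard Newman formulation for an undirected, unweighted graph and is not Gephi's implementation. Gephi also has to find the communities, which this sketch does not attempt. The toy graph and community assignment are made up.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q for an undirected, unweighted graph.

    Q = sum over communities c of (L_c / m) - (d_c / (2 * m)) ** 2
    where m is the total number of edges, L_c the number of edges with
    both ends in c, and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    internal_edges = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal_edges[community_of[u]] += 1
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
community_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

print(round(modularity(edges, community_of), 3))  # 0.357
```

Putting every node in a single community gives a score of zero, so a clearly positive value suggests the grouping captures real structure.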

You can also export your work from Gephi and enhance the look with the graphics tool of your choice. I used Affinity Designer, a graphics program available in the Mac App Store.

So that’s the fun of Gephi. Now, a few words about the frustrating part:

  1. Gephi crashes. A lot. And since it doesn’t auto-save your work, crashes can cost you a lot of work. I found it useful to save often. And since there is no Cmd-Z or like function to roll back your changes, I also found it useful to create new versions when I saved.
  2. It is a memory and CPU hog. I got my best results by shutting down everything I could (browsers, email client, etc.) and letting Gephi grab all the system resources it wanted.
  3. Many of its best features are the result of plug-ins created by independent developers. When a new version of Gephi comes out, it may take a while for your favorite plug-ins to catch up.

All of which sounds pretty primitive. Still, for a tool that won’t cost you a dime and doesn’t require a backend database, it has impressive features that you won’t find in traditional visualization tools like Tableau.

Gephi resources

To learn more about it, there are numerous tutorials online. Just make sure that the one you’re viewing is consistent with your version of Gephi. The latest version is 0.9.2.

One of the best video tutorials on Gephi, created by the University of Kentucky Libraries, is available at: https://www.youtube.com/watch?v=2FqM4gKeNO4

Additionally, there’s a great tutorial by Paul Oldham on using Gephi to visualize networks of patents at: https://www.pauloldham.net/gephi_patent_network/

And now for an actual graph database: Neo4j

Graph databases are still viewed with a fair amount of skepticism by the priesthood of database administrators who maintain petabytes of corporate and government data. It’s probably fair to say that relational databases will rule the world for the foreseeable future.

In the graph world, the company that seems to be getting most of the buzz is Neo4j (https://neo4j.com/), which released its first open source graph database 12 years ago in Sweden. Today, it’s headquartered in San Mateo, Calif. and counts companies including UBS, eBay and Walmart among its customers. Its government customer list includes the U.S. Army and NASA.

The Neo4j desktop (left) and data browser (right).

To investigate Neo4j’s capabilities, I chose the Foreign Agents Registration Act (FARA) dataset maintained by the U.S. Department of Justice. The FARA law essentially requires “agents” (i.e. lobbyists) working on behalf of foreign governments in the U.S. to register with the federal government.

Those who fail to register (Paul Manafort, come on down!) can be prosecuted and sent to prison. But prosecutions are rare. So the first thing to know is that the dataset might not be complete.

To make the data more interesting, I joined it with the Corruption Perceptions Index maintained by Transparency International. The index, explained fully at https://www.transparency.org/cpi2018, ranks countries by perceived level of public corruption according to government and business experts.

The combined data results in a dataset that shows which agents in the U.S. represent the countries perceived as most and least corrupt. Using the traditional row and column data structure, you can do some pretty interesting things with the data.

Here, for example, is a visualization from Tableau, showing the countries color-coded by perceived level of corruption. On clicking a country, the user can see who represents those countries in the U.S. I’ve highlighted Russia.

Foreign agents visualization with Tableau.

All well and good if that’s all you want. But if you want to investigate the networks of connections among those countries and agents, your best bet is going to be a graph database like Neo4j. And that means migrating the data from its row-and-column format into the nodes and edges required for a graph database.

Modeling data for a graph is a pretty deep subject. And if you’re so inclined, you can read all you want and more about it in the Neo4j modeling tutorial at https://neo4j.com/developer/data-modeling/.

But I would start by establishing a set of categories for your data and then considering which category each data point falls into. For the FARA data, it’s pretty apparent that there are three categories:

  1. Country, which includes all data about a particular foreign country.
  2. Foreign principals, which include entities such as businesses or government agencies in a particular foreign country.
  3. Agents, who are the lobbyists registered to represent the foreign principals in the U.S.

You can load the nodes into Neo4j from .csv files using a simple “LOAD CSV FROM file…” script, which is detailed in the Neo4j documentation at https://neo4j.com/developer/guide-import-csv/. It’s very similar to loading data into a table in a relational database.

And with that, you have three types of nodes, each grouped together under its own label. It then becomes apparent that your data model needs two different relationships: one that connects each country to the foreign principals based there and another that connects the foreign principals to their registered agents.
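
To make that step concrete, here is a rough Python sketch of how a flat, row-and-column extract splits into three sets of nodes and two sets of relationships. The rows and column names are invented, not taken from the FARA data, and the relationship names BASEOF and REPRESENTEDBY are borrowed from the Cypher query that appears later in this article.

```python
# Hypothetical flat rows, one per registration (not real FARA records).
rows = [
    {"country": "COUNTRY A", "principal": "Ministry X", "agent": "Firm 1"},
    {"country": "COUNTRY A", "principal": "Company Y", "agent": "Firm 1"},
    {"country": "COUNTRY B", "principal": "Agency Z", "agent": "Firm 2"},
    {"country": "COUNTRY A", "principal": "Ministry X", "agent": "Firm 2"},
]

# Three node sets, one per label. Sets remove the repetition in the rows.
countries = {r["country"] for r in rows}
principals = {r["principal"] for r in rows}
agents = {r["agent"] for r in rows}

# Two relationship sets, stored as (source, TYPE, target) triples.
base_of = {(r["country"], "BASEOF", r["principal"]) for r in rows}
represented_by = {(r["principal"], "REPRESENTEDBY", r["agent"]) for r in rows}

print(len(countries), len(principals), len(agents))  # 2 3 2
print(len(base_of), len(represented_by))             # 3 4
```

The point of the exercise is that each country, principal and agent appears once as a node, however many rows mentioned it, and the connections live in the relationships rather than in repeated column values.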

You can create those relationships, shown below, by executing the commands shown with each relationship in the diagram.

Neo4j data model for the Foreign Agents Registration (FARA) database, including Cypher code to build relationships.

Neo4j has a full-featured query and command language known as “Cypher”, which is essentially the graph database version of SQL. I found it relatively intuitive and easy to learn. Details on Cypher are available at https://neo4j.com/developer/cypher-basics-i/.

The key benefit of Cypher is that it allows you to do with just a few lines of code what might take many lines of nested SQL.

For example, let’s say you want to create a network style view of the connections between agents representing Russian businesses and government agencies. The Tableau visualization above shows you what you can do with row-and-column data.

But with your data in a graph model, a few lines of Cypher can create a whole network-centric view, as shown below.

Graph of agents representing Russian interests in the U.S., with the query that produced it.

While Tableau can integrate with Neo4j through an API, it doesn’t give you this view of the data. The data is returned to Tableau in a row-and-column format. So unless you’re willing to go through a number of workarounds, you still won’t get a true network view of the data.

The Neo4j data browser is nice, especially for a tool that comes with the free, community version of Neo4j. But it’s relatively bare bones. What you see above is about as far as you’ll be able to take it.

Licensing

The enterprise versions of Neo4j come with a more full featured visualization tool known as Bloom. Enterprise Neo4j can also accommodate other network visualization tools such as Linkurious (https://linkurio.us/) and Keylines (https://cambridge-intelligence.com/keylines/), both of which are commercially licensed.

Which brings us to the issue of pricing. If your organization is interested in the enterprise version of Neo4j and meets the company’s definition of a startup (https://neo4j.com/startup-program/?ref=subscription), you can set up a limited deployment of Neo4j for free.

If you’re not a startup and still need the enterprise edition, pricing gets a little murkier. The company does not publish its pricing model and asks that prospective customers contact it for a quote. “The configuration options are complex and the pricing is on a case-by-case scenario,” a representative told me.

The old “case-by-case” pricing model is never what you want to hear from an enterprise software company. But Neo4j has a great product and can apparently get away with it. If you pursue an enterprise license with them and get an answer, I’d love to hear it.

GraphXR

One interesting new visualization tool that can easily integrate with the free community edition of Neo4j is GraphXR (https://www.kineviz.com/). Once installed, GraphXR simply appears as another available data browser on the Neo4j desktop.

You can enter the same sorts of Cypher queries in the GraphXR browser as you would in Neo4j. But the results are more dynamic and, frankly, almost psychedelic. For example, the following Cypher query is written to return a graph of the agents representing Chinese interests in the U.S.:

MATCH (c:Country)-[:BASEOF]->(f:ForeignPrincipal)-[:REPRESENTEDBY]->(a:Agent)
WHERE c.Country = "CHINA"
RETURN c, f, a
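
For readers new to Cypher, the pattern in that query amounts to a two-hop walk: start at a country, follow BASEOF to its foreign principals, then follow REPRESENTEDBY to their agents. The Python sketch below mimics that walk over a small in-memory edge list. The principals and agents are invented placeholders, and this is only an analogy for what the pattern expresses, not how Neo4j executes it.

```python
# Hypothetical (source, TYPE, target) triples; not real FARA records.
edges = [
    ("CHINA", "BASEOF", "Principal A"),
    ("CHINA", "BASEOF", "Principal B"),
    ("RUSSIA", "BASEOF", "Principal C"),
    ("Principal A", "REPRESENTEDBY", "Agent 1"),
    ("Principal B", "REPRESENTEDBY", "Agent 1"),
    ("Principal B", "REPRESENTEDBY", "Agent 2"),
    ("Principal C", "REPRESENTEDBY", "Agent 3"),
]


def follow(start, rel_type):
    """Targets reachable from `start` over one edge of the given type."""
    return [t for s, r, t in edges if s == start and r == rel_type]


def match_country(country):
    """All (country, principal, agent) paths for one country."""
    return [
        (country, principal, agent)
        for principal in follow(country, "BASEOF")
        for agent in follow(principal, "REPRESENTEDBY")
    ]


for path in match_country("CHINA"):
    print(" -> ".join(path))
```

In SQL the same question would typically need a join per hop, which is where the "few lines of Cypher" advantage comes from as patterns get longer.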

In Neo4j, the result would be similar to the Russia network we’ve already seen. But in GraphXR, it looks like this:

GraphXR visualization of agents representing Chinese interest in the U.S.

GraphXR provides a few controls for adjusting the look of your visualization. For example, you can adjust the size and color of your nodes. You can also zoom in and out of your visualization, move the nodes around and even spin it around to view it from a different angle.

The background color, however, is limited to one option: black. I also found the labels of the nodes to be a little blurry and faint. At this time, there appears to be no way to change that.

Interestingly, GraphXR also supports virtual reality visualization using an Oculus Rift headset. I’m not sure I fully understand the business case for that, but I hope to see it sometime. Apparently, the brave new world of VR data visualization is upon us.

Licensing

The visualization above was created with the “Explorer” edition of GraphXR, which is free and operates as a plug-in to the Neo4j desktop. It has a limit of three projects.

The next step up is the “Analyst” edition, which has a 30-project limit and costs $1,320 per year.

Finally, there’s the “Enterprise” edition, which allows unlimited projects. Pricing for that is available through GraphXR.

In closing

I think it’s safe to say that we have a gap in capabilities here. On the one hand, we have traditional visualization tools like Tableau, which have great analytic capabilities built in, along with granular controls over almost every aspect of a visualization.

But their capabilities are bounded to a large extent by the row-and-column data model that we all know. For analysts trying to understand the relationships between individuals and organizations, that’s less than ideal.

Graph databases like Neo4j solve that problem, but tend to fall short in visualizing networks. Control over the “look and feel” of visualizations tends to be pretty rudimentary.

Commercial network visualization tools like Linkurious and Keylines solve that problem, but only at the cost of licensing an additional product just for graph databases like Neo4j. For small businesses and freelance analysts, that could be prohibitive.

An example of a Neo4j/Linkurious implementation by the International Consortium of Investigative Journalists (ICIJ) can be seen at https://offshoreleaks.icij.org/stories/wilbur-louis-ross-jr.

The ideal solution would be for tools like Tableau and its competitors to add graph database visualizations to their existing capabilities. Tableau currently has a Web Data Connector that allows it to submit a query written in Cypher to Neo4j.

You can read more about it at https://neo4j.com/blog/neo4j-tableau-integration/. But the data is returned to Tableau in tabular format, so we still don’t have the ability to produce a network visualization.

I suspect that the reason we haven’t seen this capability from tools like Tableau is that a) the demand for it isn’t there yet; and b) developing it would entail a substantial level of effort.

But as the old song says, “I can dream, can’t I?”
