Large Graph Visualization Tools and Approaches

Sviatoslav Kovalev
Towards Data Science
11 min read · Nov 15, 2019

What do you do if you need to visualize a large network graph, but every tool you try either draws a hairball or eats all your RAM and hangs your machine? I worked with large graphs (hundreds of millions of vertices and edges) for more than two years and tried a lot of tools and approaches. I still have not found a good survey, so I am writing one myself.

(This is a translation of my article on Habr, originally written in Russian.)

Why visualize a graph at all?

  1. To find what to look for
    Usually, we simply have a set of vertices and edges as input. We can compute some statistics or graph metrics from such data, but that is not enough to get an idea of the structure. A good visualization can clearly show whether a graph has clusters or bridges, or whether it is a uniform cloud, or something else.
  2. To impress the public
    It’s obvious that data visualizations are used for presentation. They are a good way to show the conclusions of the work that was done. For example, if you solved a clustering problem, you can color your plot by labels and show how the clusters are connected.
  3. To get features
    Although most graph visualization tools were created only for making pictures, they also work well as dimension reduction tools. A graph represented as an adjacency matrix is data in a high-dimensional space. Drawing it gives us (usually) two coordinates for each vertex, and these coordinates can also be used as features: vertices that are close in this space are similar.

What is the problem with large graphs?

By a “large graph” I mean roughly 10K vertices and/or edges or more. Smaller sizes usually cause no problems: any tool you can find with a few minutes of searching will most likely work at least acceptably. What is wrong with large networks? There are two main problems: readability and speed. The visualization of a large graph commonly looks messy because there are too many objects in one plot. In addition, graph visualization algorithms mostly have awful algorithmic complexity: quadratic or cubic in the number of edges or vertices. Even if you are willing to wait for one result, iterating to find better parameters takes far too long.
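
To get a feel for what quadratic complexity means in practice, here is a back-of-the-envelope sketch. It assumes a naive layout step that evaluates an interaction for every pair of vertices (the function name is mine, for illustration only). Real tools often use approximations, so treat the numbers as an upper bound rather than a measurement.

```python
def pairwise_interactions(n_vertices: int) -> int:
    """Number of unordered vertex pairs a naive all-pairs step has to visit."""
    return n_vertices * (n_vertices - 1) // 2


# Print how the amount of work per iteration grows with the graph size.
for n in (1_000, 10_000, 1_000_000):
    pairs = pairwise_interactions(n)
    print(f"{n:>9,} vertices -> {pairs:,} pairs per iteration")
```

Multiplying the number of vertices by ten multiplies the work per iteration by roughly a hundred, and a layout usually needs many iterations.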

What is already written about this problem?

[1] Helen Gibson, Joe Faith and Paul Vickers: “A survey of two-dimensional graph layout techniques for information visualization”

The authors of this paper describe which graph visualization methods exist and how they work. There is also a good table with information about the algorithms, their features, and their complexity. I used a few pictures from that paper in this article.

[2] Oh-Hyun Kwon, Tarik Crnovrsanin and Kwan-Liu Ma: “What Would a Graph Look Like in This Layout? A Machine Learning Approach to Large Graph Visualization”

The authors did a great deal of work. They tried all the algorithms they could, drew the results, and manually evaluated their similarity. Then they fit a model to predict what a graph would look like in a given layout. I used a couple of pictures from this work too.

Theoretical Part

A layout is a way to assign coordinates to each vertex. Usually these are coordinates on a 2D plane.

What is a good layout?

It is easy to say whether something looks good or bad. It is much harder to name criteria that a machine could evaluate. To produce a “good” layout, so-called aesthetic metrics can be used. Here are some of them:

  1. Minimum edge intersections
    This one is obvious: too many intersections make the plot look messy.
  2. Adjacent vertices are closer to each other than non-adjacent ones
    It is logical that connected nodes should be close to each other. This reflects the main information that a graph contains by definition.
  3. Communities are grouped into clusters
    If a set of vertices is connected internally more densely than to the rest of the graph, it should look like a dense cloud.
  4. Minimum overlap of edges and nodes
    This one is obvious too: if we cannot tell whether there are several vertices or just one, the plot is hard to read.
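
Two of these criteria are easy to turn into numbers. The sketch below is only an illustration under my own assumptions: a layout is a dict from vertex to (x, y), edges are straight segments, and the function names are hypothetical. It counts proper edge crossings and compares the mean distance between adjacent vertices with the mean distance between non-adjacent ones. Both functions are quadratic, so they are meant for small examples only.

```python
import math
from itertools import combinations


def _orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): -1, 0 or 1."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)


def count_edge_crossings(layout, edges):
    """Count pairs of edges whose straight segments properly cross."""
    crossings = 0
    for (a, b), (c, d) in combinations(edges, 2):
        if len({a, b, c, d}) < 4:
            continue  # edges sharing a vertex are not counted as crossing
        pa, pb, pc, pd = layout[a], layout[b], layout[c], layout[d]
        if (_orientation(pa, pb, pc) * _orientation(pa, pb, pd) < 0
                and _orientation(pc, pd, pa) * _orientation(pc, pd, pb) < 0):
            crossings += 1
    return crossings


def adjacency_closeness_ratio(layout, edges):
    """Mean adjacent distance / mean non-adjacent distance (lower is better)."""
    edge_set = {frozenset(e) for e in edges}
    adjacent, non_adjacent = [], []
    for u, v in combinations(layout, 2):
        d = math.dist(layout[u], layout[v])
        (adjacent if frozenset((u, v)) in edge_set else non_adjacent).append(d)
    return (sum(adjacent) / len(adjacent)) / (sum(non_adjacent) / len(non_adjacent))


# The same 4-cycle drawn twice: as a square, and with two vertices swapped.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
square = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}
twisted = {0: (0, 0), 1: (1, 1), 2: (1, 0), 3: (0, 1)}

for name, layout in (("square", square), ("twisted", twisted)):
    print(name, count_edge_crossings(layout, edges),
          round(adjacency_closeness_ratio(layout, edges), 3))
```

The square drawing has no crossings and keeps neighbors closer than non-neighbors; the twisted one loses on both metrics, which matches what the eye says.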

Which kinds of layouts exist?

I consider it important to mention these three types of layouts. They can be classified in many other ways, but this classification is enough to navigate through all the possible types.

  • Force-directed and Energy-Based
  • Dimension Reduction
  • Node Features Based

Force-Directed and Energy-Based

Examples of force-directed layouts. Source of the picture: [1]

This family of methods is based on simulating a physical system. Vertices are represented as charged particles that repel each other, and edges are treated as elastic strings. These methods either model the dynamics of this system or search for an energy minimum.

Such methods typically give very good results: the resulting plots reflect the topology of the graph well. But they are also computationally expensive and have a lot of parameters to tune.

Important members of this family are Force Atlas, Fruchterman-Reingold, Kamada-Kawai and OpenOrd. The last one uses tricky optimizations to speed up computation; for example, it cuts long edges. As a useful side effect, the graph gets more clustered.
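
To show the idea behind this family, here is a minimal sketch in the spirit of Fruchterman-Reingold: all-pairs repulsion, attraction along edges, and a "temperature" that limits how far a vertex may move per step. The constants (initial temperature, linear cooling, unit-square scale) are my own assumptions, and there are no optimizations at all, so this is an illustration and not what any of the tools mentioned here actually runs.

```python
import math
import random


def fruchterman_reingold(nodes, edges, iterations=100, seed=42):
    """Naive spring layout, O(n^2) work per iteration."""
    nodes = list(nodes)
    rng = random.Random(seed)
    pos = {v: [rng.random(), rng.random()] for v in nodes}
    k = math.sqrt(1.0 / len(nodes))        # ideal edge length (assumed scale)
    temperature = 0.1                      # maximum move per step (assumed)
    cooling = temperature / (iterations + 1)

    def direction(u, v):
        """Unit vector from v to u and the distance between them."""
        dx = pos[u][0] - pos[v][0]
        dy = pos[u][1] - pos[v][1]
        dist = max(math.hypot(dx, dy), 1e-9)
        return dx / dist, dy / dist, dist

    for _ in range(iterations):
        disp = {v: [0.0, 0.0] for v in nodes}
        # Repulsion: every pair of vertices pushes each other apart.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                ux, uy, dist = direction(u, v)
                force = k * k / dist
                disp[u][0] += ux * force
                disp[u][1] += uy * force
                disp[v][0] -= ux * force
                disp[v][1] -= uy * force
        # Attraction: every edge pulls its endpoints together like a spring.
        for u, v in edges:
            ux, uy, dist = direction(u, v)
            force = dist * dist / k
            disp[u][0] -= ux * force
            disp[u][1] -= uy * force
            disp[v][0] += ux * force
            disp[v][1] += uy * force
        # Move each vertex along its total force, capped by the temperature.
        for v in nodes:
            length = max(math.hypot(disp[v][0], disp[v][1]), 1e-9)
            step = min(length, temperature)
            pos[v][0] += disp[v][0] / length * step
            pos[v][1] += disp[v][1] / length * step
        temperature -= cooling
    return {v: (p[0], p[1]) for v, p in pos.items()}


# Two triangles joined by a single bridge edge.
nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
layout = fruchterman_reingold(nodes, edges)
for v, (x, y) in layout.items():
    print(v, round(x, 3), round(y, 3))
```

The nested loop over all pairs is exactly the quadratic cost discussed earlier, and `k`, the temperature and the number of iterations are the kind of parameters that need tuning.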

Dimension Reduction

Examples of dimension reduction layouts. Source of the picture: [1]

A graph can be defined by an N×N adjacency matrix, where N is the number of nodes. This matrix can also be treated as a table of N objects in N-dimensional space. This representation allows us to use general-purpose dimension reduction methods such as PCA, UMAP, t-SNE, etc. Another way is to compute graph-theoretic distances between nodes (for example, shortest-path lengths) and then try to preserve their proportions when moving to a lower-dimensional space.
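
The second idea can be sketched with classical multidimensional scaling (MDS) on shortest-path distances. This is my choice of technique for illustration, not necessarily what the tools in this article use. The sketch assumes a connected, unweighted graph given as an adjacency dict, and it uses plain power iteration instead of a proper eigensolver, so it only makes sense for tiny graphs.

```python
import math
import random
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def classical_mds(adj, dims=2, iterations=200, seed=0):
    """Embed nodes so that Euclidean distances approximate hop distances."""
    nodes = sorted(adj)
    n = len(nodes)
    # Squared shortest-path distance matrix (assumes a connected graph).
    d2 = []
    for u in nodes:
        dist = bfs_distances(adj, u)
        d2.append([dist[v] ** 2 for v in nodes])
    # Double centering: B = -1/2 * J * D^2 * J.
    row_mean = [sum(row) / n for row in d2]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = {v: [] for v in nodes}
    for _axis in range(dims):
        # Power iteration for the dominant eigenvector of the current B.
        vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _step in range(iterations):
            new = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in new))
            if norm < 1e-12:
                break  # nothing left to explain along further axes
            vec = [x / norm for x in new]
        b_vec = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(vec[i] * b_vec[i] for i in range(n))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i, v in enumerate(nodes):
            coords[v].append(vec[i] * scale)
        # Deflation: remove the found component before the next axis.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * vec[i] * vec[j]
    return {v: tuple(c) for v, c in coords.items()}


# A path graph 0-1-2-3-4: its hop distances fit on a straight line.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
pos = classical_mds(path)
for v in sorted(pos):
    print(v, [round(c, 3) for c in pos[v]])
```

For the path graph the first coordinate places the vertices one unit apart, which is exactly the "preserve the proportions of distances" idea. The full distance matrix needs N×N memory, which is why this direct approach does not scale to large graphs.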

Feature-Based Layout

Hive plot example. Source of the picture: [1]

Graph data usually relate to objects in the real world, so vertices and edges can have their own features. We can use these features to place them on the plane. Node features can be handled like ordinary tabular data, using the dimension reduction methods mentioned above or by directly drawing a scatter plot for pairs of features. The hive plot is worth mentioning because it is very different from all other methods. In a hive plot, nodes are aligned on several radial axes, and edges are drawn as curves between them.
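
The node placement of a hive plot is simple enough to sketch. In the example below, which axis a node goes to and how far along the axis it sits are both inputs; choosing them from a node category and a normalized feature is my assumption, as are the function name and the `inner_radius` parameter. Drawing the curved edges is left out.

```python
import math


def hive_plot_positions(axis_of, rank_of, n_axes=3, inner_radius=0.2):
    """Place each node on one of n_axes radial axes.

    axis_of: node -> axis index in [0, n_axes)
    rank_of: node -> value in [0, 1], the position along the axis
    """
    positions = {}
    for node, axis in axis_of.items():
        angle = 2 * math.pi * axis / n_axes
        radius = inner_radius + (1 - inner_radius) * rank_of[node]
        positions[node] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions


# Hypothetical nodes: the axis comes from a category, the rank from a
# feature that has already been scaled to [0, 1].
axis_of = {"a": 0, "b": 1, "c": 2, "d": 0}
rank_of = {"a": 1.0, "b": 0.5, "c": 0.0, "d": 0.0}
hive = hive_plot_positions(axis_of, rank_of)
for node, (x, y) in hive.items():
    print(node, round(x, 3), round(y, 3))
```

Because positions depend only on node features, the computation is linear in the number of nodes, unlike the force-directed methods.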

Tools for Large Graph Visualization

The joy of graph visualization.

Although the graph visualization problem is relatively old and popular, the situation with tools that can handle large graphs is very bad. Most of them have been abandoned by their developers, and almost every one has big disadvantages. I will cover only those that are worth mentioning and can handle big graphs. Small graphs are not a problem: you can easily find a tool for your purpose, and it will most probably work well.

GraphViz

Bitcoin transaction graph before 2011
Sometimes it’s hard to tune the parameters

This is an old-school CLI tool with its own graph definition language named “dot”. It is a package with several layouts. For large graphs it has the sfdp layout, from the force-directed family. The main advantage and the main disadvantage of this tool are the same thing: it runs from the command line. That is useful for automation, but it is hard to tweak parameters without interactivity. You do not even know how long you will have to wait for the result, or whether you should stop it and rerun with other parameters.
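
As an illustration of the automation side, here is a small sketch that turns an edge list into a dot file and, optionally, calls sfdp on it. The helper names and file names are hypothetical, and the sketch assumes node names contain no double quotes. Rendering only happens if you call `render_with_sfdp` and the `sfdp` executable is on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def edges_to_dot(edges, name="G"):
    """Serialize an undirected edge list in the dot language."""
    lines = [f"graph {name} {{"]
    lines += [f'  "{u}" -- "{v}";' for u, v in edges]
    lines.append("}")
    return "\n".join(lines) + "\n"


def render_with_sfdp(dot_text, dot_path="graph.dot", out_path="graph.png"):
    """Write the dot file and run sfdp; return False if sfdp is missing."""
    if shutil.which("sfdp") is None:
        return False
    Path(dot_path).write_text(dot_text)
    # Same as running on the command line: sfdp -Tpng graph.dot -o graph.png
    subprocess.run(["sfdp", "-Tpng", dot_path, "-o", out_path], check=True)
    return True


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
dot_text = edges_to_dot(edges)
print(dot_text)
```

Note that `subprocess.run` blocks until sfdp finishes, which is precisely the "you don’t know how long to wait" problem described above.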

Gephi

Image from gephi.org
A recommendation graph of 137K movies from IMDb
A few million is already too much for Gephi

The most powerful graph visualization tool that I know. It has a GUI, several layouts, and a lot of graph analysis tools. There are also a lot of plugins written by the community, for example my favorite layout, “Multigravity Force-Atlas 2”, or the sigma.js export tool, which creates an interactive web page template based on your project. In Gephi, users can color nodes and edges by their features.

But Gephi has been abandoned by its developers. It also has a slightly old-fashioned GUI and lacks some simple features.

igraph

Last.fm music recommendation graph. Source, description and interactive version are here

I need to pay tribute to this general-purpose graph analysis package. One of the most impressive graph visualizations was made by one of the igraph authors.

The disadvantage of igraph is the awful documentation of its Python API, but the sources are readable and well commented.

LargeVis

Several tens of millions of vertices (transactions and addresses) in one of the largest bitcoin clusters

It is a lifesaver when you need to draw a really huge graph. LargeVis is a dimension reduction tool and can be used not only for graphs but also for arbitrary tabular data. It runs from the command line, works fast, and consumes little RAM.

Graphistry

Addresses that could be hacked in one week and their transactions
Intuitive and pretty looking GUI, but very limited

It is the only paid tool in this survey. Graphistry is a service that takes your data and does all the calculations on its side; the client only looks at the pretty picture in the browser. Its other features are not much better than Gephi’s, except that Graphistry has reasonable default parameters, a good color scheme, and slightly better interactivity. It provides only one force-directed layout. It also has a limit of 800K nodes or edges.

Graph Embeddings

There is an approach for crazy sizes too. Starting from approximately one million vertices, it is only reasonable to look at vertex density and not to draw edges or individual vertices at all, simply because no one can make out individual objects on such a plot. Moreover, most algorithms designed for graph visualization will run for many hours, or maybe days, on such sizes. This problem can be solved by slightly changing the approach. There are a lot of ways to get a fixed-size representation that reflects the features of graph vertices. Once you have such a representation, the only thing left is to reduce its dimensionality to two in order to get a picture.
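
"Looking at density" can be as simple as binning the 2D points into a grid and drawing the counts as an image. The sketch below assumes the 2D coordinates have already been produced by some embedding plus dimension reduction step; the function name and grid size are mine. In practice you would probably use the 2D histogram of your plotting library instead.

```python
def density_grid(points, bins=4):
    """Count how many 2D points fall into each cell of a bins x bins grid."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    x_span = (max(xs) - x_min) or 1.0   # avoid division by zero
    y_span = (max(ys) - y_min) or 1.0
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        # Points on the upper boundary are clamped into the last cell.
        col = min(int((x - x_min) / x_span * bins), bins - 1)
        row = min(int((y - y_min) / y_span * bins), bins - 1)
        grid[row][col] += 1
    return grid


# Three points close together and one far away.
points = [(0.0, 0.0), (0.1, 0.1), (0.05, 0.0), (1.0, 1.0)]
for row in density_grid(points, bins=2):
    print(row)
```

The cost is linear in the number of points and independent of the number of edges, which is what makes this viable for millions of vertices.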

Node2Vec

Node2Vec + UMAP

This is an adaptation of word2vec for graphs. It uses random walks in a graph instead of sequences of words. Thus the method only uses information about node neighborhoods. In most cases that is enough.
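
The random-walk part can be sketched in a few lines. The version below follows the second-order biased walk described in the node2vec paper, with the return parameter p and the in-out parameter q, on an unweighted graph stored as an adjacency dict. The function name and the example graph are mine, and the sketch recomputes transition weights at every step instead of precomputing them, so it is for illustration only. The resulting walks would then be fed to a word2vec-style model as if they were sentences; that step is not shown.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=random):
    """One biased random walk of at most `length` nodes."""
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = adj[current]
        if not neighbors:
            break  # dead end
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)   # go back where we came from
            elif candidate in adj[previous]:
                weights.append(1.0)       # stay near the previous node
            else:
                weights.append(1.0 / q)   # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Two triangles joined by the edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
rng = random.Random(0)
walks = [node2vec_walk(adj, v, length=5, p=1.0, q=0.5, rng=rng)
         for v in adj for _ in range(2)]
for walk in walks:
    print(walk)
```

With q below 1 the walk prefers to explore outward, and with q above 1 it stays local, which changes what "neighborhood" means for the resulting embedding.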

VERSE

VERSE + UMAP

An advanced algorithm for versatile graph representations. One of the best in my experience.

Graph Convolutions

Graph Convolutions + Autoencoder. Bipartite graph.

There are a lot of ways to define a convolution on a graph. In essence, though, it is a simple “spreading” of features across vertices’ neighbors. We can also put information about local topology into the vertex features.
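
Here is a minimal sketch of such "spreading" with no neural network involved. It uses plain mean aggregation over each vertex and its neighbors, which is a simplification I chose for readability; the Simplifying Graph Convolutional Networks paper uses a symmetrically normalized adjacency matrix instead. The function name and the toy graph are mine.

```python
def spread_features(adj, features, steps=1):
    """Replace each vertex's features by the mean over itself and its neighbors."""
    current = {v: list(f) for v, f in features.items()}
    for _ in range(steps):
        updated = {}
        for v, neighbors in adj.items():
            group = [v] + list(neighbors)   # the vertex itself counts too
            dim = len(current[v])
            updated[v] = [sum(current[u][d] for u in group) / len(group)
                          for d in range(dim)]
        current = updated
    return current


# A path 0-1-2 where only vertex 0 carries a signal.
adj = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0], 1: [0.0], 2: [0.0]}
one_step = spread_features(adj, features, steps=1)
two_steps = spread_features(adj, features, steps=2)
print(one_step)
print(two_steps)
```

After one step the signal reaches vertex 1, and after two steps it reaches vertex 2, so the number of steps controls how large a neighborhood each vertex "sees". To encode local topology, you can start from features such as the vertex degree.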

Little Bonus

I have made a little tutorial on simplified graph convolutions without neural networks. It is located here. I have also made a graph embeddings tutorial where I show how to use a few of the tools mentioned above.

Links

Simplifying Graph Convolutional Networks
arxiv.org/pdf/1902.07153.pdf

GraphViz
graphviz.org

Gephi
gephi.org

igraph
igraph.org

LargeVis
arxiv.org/abs/1602.00370
github.com/lferry007/LargeVis

Graphistry
www.graphistry.com

Node2Vec
snap.stanford.edu/node2vec
github.com/xgfs/node2vec-c

VERSE
tsitsul.in/publications/verse
github.com/xgfs/verse

Notebook with simplified graph convolutions example
github.com/iggisv9t/graph-stuff/blob/master/Universal%20Convolver%20Example.ipynb

Graph Convolutions
List of papers about Graph Convolutional Networks: github.com/thunlp/GNNPapers

TL;DR:

Graphia [Update 5 June 2020]

These are the neuron connections of a part of the fly brain from https://neuprint.janelia.org/

The article was published about a year ago. Now there is a new, very promising tool for graph visualization, and for large graphs in particular: Graphia. It is under active development, as opposed to the long-abandoned Gephi, and it also works much faster. Like Gephi, it has a lot of integrated tools; you can read about them here: https://graphia.app/userguide.html

Disadvantages are excusable:

  • Currently, it has only one force-directed layout and very limited ways to tune it.
  • The rendering options are not good for large graphs. At least an option to turn off edge display or add opacity would be very useful.
  • Right now Graphia is relatively raw. For example, I had to convert graph formats with Gephi in order to load them into Graphia, which crashed on the same graphs represented as CSV. But I’m sure all these little things will be improved soon.

The advantages are too many to list. Imagine Gephi with all its analysis tools, but rewritten from scratch in C++, much quicker, and under active development. For comparison: Gephi needs several hours to lay out a graph of 173K nodes, while Graphia needs only several minutes.

I believe that this review will be outdated soon, so it is better to check the current state of the app yourself.

[Update July 2022]

I have to say that some points in this article are a bit outdated:

  • Gephi is not abandoned anymore
  • Graphia still seems raw. It’s fast, but the visualizations are too messy, and it’s impossible to tweak the view to make it usable.
  • There are a couple of great online tools now

New tools

Cosmograph

A superfast tool with a force-directed layout running on WebGL, an open-source core, and sane defaults. Easy, pretty, and minimalist, but with a number of useful features. It can handle hundreds of thousands of vertices and edges, and it is even faster than desktop tools.

Image from Cosmograph’s site

Retina

I was able to use Retina with about 10K vertices, but the main feature of this tool is publishing. Host a GEXF or GraphML file somewhere online (they even provide instructions) and you get a graph analysis tool that can be shared. There is no need for a desktop tool in a number of the most frequent scenarios, such as node search, exploration of a node’s neighbors, filtering by attributes, etc.
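
If your graph lives in code rather than in a file, a GraphML document is easy to produce with the standard library. The sketch below writes only the bare minimum (node ids and undirected edges, no attributes); the function name and the example edges are hypothetical, and which optional attributes a particular viewer understands is something to check in that tool’s own instructions.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def edges_to_graphml(edges):
    """Return a minimal GraphML document for an undirected edge list."""
    ET.register_namespace("", GRAPHML_NS)   # serialize without a prefix
    root = ET.Element(f"{{{GRAPHML_NS}}}graphml")
    graph = ET.SubElement(root, f"{{{GRAPHML_NS}}}graph",
                          id="G", edgedefault="undirected")
    node_ids = sorted({str(v) for edge in edges for v in edge})
    for node_id in node_ids:
        ET.SubElement(graph, f"{{{GRAPHML_NS}}}node", id=node_id)
    for i, (u, v) in enumerate(edges):
        ET.SubElement(graph, f"{{{GRAPHML_NS}}}edge",
                      id=f"e{i}", source=str(u), target=str(v))
    return ET.tostring(root, encoding="unicode")


# Hypothetical character interaction edges.
edges = [("harry", "ron"), ("harry", "hermione"), ("ron", "hermione"),
         ("harry", "draco")]
graphml_text = edges_to_graphml(edges)
print(graphml_text)
```

Save the returned text to a `.graphml` file and host it wherever the viewer can fetch it.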

Graph hosted online loaded in Retina

For example, I found a Harry Potter character interaction graph hosted on GitHub, and now I can share the visualization.
