Graphs with Python: Overview and Best Libraries

Graph analysis, interactive visualizations, and graph machine learning

Dmytro Nikolaiev (Dimid)
Towards Data Science

Preview. Image by Author

A graph is a long-established mathematical structure: a set of elements together with the connections between them. Because graphs are flexible and let you store information in a form that is familiar and convenient to humans, they have long been used in computer science and technology. With the rise of machine learning and deep learning, graphs have gained even more popularity and given rise to the field of graph machine learning.

In this post, I would like to share with you the most useful Python libraries I’ve used for graph/network analysis, visualization, and machine learning. Today, we will review:

  • NetworkX for general graph analysis;
  • PyVis for interactive graph visualizations right in your browser;
  • PyG and DGL for solving various graph machine learning tasks.

Before that, let me say a few words about graph theory and graph machine learning and point to some learning resources that may be helpful. If you don’t know what a graph or graph machine learning is, this is a great opportunity to lift the veil of secrecy!

Graph Theory and Graph Machine Learning: a Brief Introduction

A graph is simply a set of elements connected to each other.

Graph example. Public domain

However, the fact that these elements (called nodes) can contain any information and can be connected in any way (with edges) makes the graph one of the most general data structures. Indeed, many kinds of complex data familiar to us can be represented as a simple graph: for example, an image as a grid of pixels, or a text as a sequence (or chain) of words.
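To make the image and text examples concrete, here is a small sketch using NetworkX (introduced later in this post). The 3x4 grid size and the sample sentence are arbitrary illustrative choices:

```python
import networkx as nx

# An "image" as a graph: every pixel of a 3x4 grid is a node,
# and each pixel is connected to its horizontal and vertical neighbors.
image_graph = nx.grid_2d_graph(3, 4)

# A "text" as a graph: words form a directed chain in reading order.
words = "graphs are everywhere".split()
text_graph = nx.DiGraph()
nx.add_path(text_graph, words)

print(image_graph.number_of_nodes(), image_graph.number_of_edges())
print(list(text_graph.edges))
```

Real images and texts would of course carry features on the nodes (pixel intensities, word embeddings), but the structure is the same.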

You might wonder: are graphs really so important? Well, some tasks simply cannot be solved or even formulated without them, because some information does not fit into a flat, tabular structure. Imagine the following situation: you need to visit a list of cities, say for tourism or for work. You know the distance from one city to another or, even more interesting, the cost of tickets for different modes of transport. How do you plan an optimal route, that is, one that costs the least money or covers the shortest distance?

To me, this task is quite practical: think of its applications in logistics, at the very least. It is also an example of a problem that cannot be solved without graphs. However you choose to represent the data, you will end up with a weighted graph (a graph whose edges carry a value, called a weight). By the way, if each city needs to be visited exactly once, this task turns into the famous traveling salesman problem (TSP), which is not so easy to solve. One reason is that the number of possible routes grows very fast: for n cities there are (n - 1)!/2 distinct round trips when distances are symmetric, so even 7 cities already give 360 of them!

The solution to a TSP with 7 cities using brute force search. Public domain
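The route count and the brute-force search can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: cities are points on a plane, distances are Euclidean and symmetric, and the function names are hypothetical:

```python
from itertools import permutations
from math import dist, factorial, inf


def count_round_trips(n):
    """Number of distinct round trips through n >= 3 cities (symmetric distances)."""
    return factorial(n - 1) // 2


def brute_force_tsp(points):
    """Try every round trip that starts at city 0 and return the shortest one."""
    n = len(points)
    if n < 3:
        raise ValueError("need at least 3 cities")
    best_tour, best_length = None, inf
    for perm in permutations(range(1, n)):
        if perm[0] > perm[-1]:
            continue  # skip the mirrored copy of a tour we have already seen
        tour = (0, *perm)
        # Sum the legs of the tour, including the way back to the start.
        length = sum(dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))
        if length < best_length:
            best_tour, best_length = tour, length
    return best_tour, best_length


square = [(0, 0), (1, 0), (1, 1), (0, 1)]  # made-up city coordinates
best_tour, best_length = brute_force_tsp(square)
print(count_round_trips(7), best_tour, best_length)
```

Brute force is fine for a handful of cities, but the factorial growth makes it hopeless very quickly, which is why practical solvers rely on heuristics and approximations.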

Graph theory, which originated in the 18th century, studies graphs and solves various graph problems: finding a feasible or optimal path in a graph, building and analyzing trees (a special type of graph), and so on. It has been successfully applied in the social sciences, chemistry, biology, and other fields. With the development of computers, the use of graphs reached another level.

What really matters is that this foundation, a set of related elements, often with different types of elements and connections, is very useful for modeling real-world tasks and datasets. This is where graph machine learning comes into the picture (although amazing tasks were solved before it as well). Once suitable datasets had been collected and techniques to model them had been developed (such as Graph Convolutional Networks (GCNs), by analogy with Convolutional Neural Networks (CNNs)), it became possible to solve a wide range of graph tasks:

  • Node-level tasks, like node classification: assign a label to every node in the graph. We will see an example a little below (dividing a group of people into two clusters, knowing how they communicate with each other), but applications vary widely. The intuition here comes from social science, which says that we depend on our environment. Indeed, an entity can often be classified more accurately by taking into account not only its own features but also data about its neighborhood. For example, if your friends smoke, you are more likely to smoke, and if your friends go to the gym, you are more likely to go to the gym.
  • Edge-level tasks, like edge prediction: predict whether two nodes are connected by an edge or, more often, predict the edge type (graphs with several edge types are often called multi-relational graphs and are commonly stored as multigraphs, which allow several edges between the same pair of nodes). This task is especially relevant for knowledge graphs, which we will see in a couple of minutes.
  • Graph-level tasks. This can be graph classification, graph generation, and so on. This field is especially useful for biology and chemistry because molecules can be effectively represented as graphs. Molecule classification (determining if the molecule has certain properties) or molecule generation (and especially drug generation) sounds much cooler than some “graph-level tasks”!

Let’s take a look at examples of graphs from real life. One of the most famous graph datasets is the karate club dataset. Here, each node is a person (a club member), and each edge connects two members who interacted outside of the club.

Karate club dataset visualization. Public domain

A common problem is to find the two groups into which the club split after a conflict between the instructor and the club's administrator (today we can treat it as binary, or 2-class, node classification). The dataset was published back in 1977 and became a classic example of a human social network and its community structure.
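The dataset ships with NetworkX, so the "you resemble your neighbors" intuition is easy to check. The sketch below is not a trained model: it only counts how many members belong to the same faction as the majority of their neighbors (the helper name is hypothetical, and ties are broken arbitrarily by Counter):

```python
from collections import Counter

import networkx as nx

G = nx.karate_club_graph()
# Each node carries a "club" attribute: the faction the member joined after the split.
factions = nx.get_node_attributes(G, "club")


def neighbor_majority(graph, node, labels):
    """Most common label among the neighbors of a node."""
    votes = Counter(labels[nbr] for nbr in graph.neighbors(node))
    return votes.most_common(1)[0][0]


agree = sum(neighbor_majority(G, n, factions) == factions[n] for n in G.nodes)
print(f"{G.number_of_nodes()} members, {G.number_of_edges()} edges")
print(f"{agree} members match the majority faction of their neighbors")
```

A real node classifier would only see a few labels and have to predict the rest, but the same neighborhood signal is what it exploits.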

Another type of graph that is interpretable for humans, and therefore extremely useful for machine learning models, is the knowledge graph. In a knowledge graph, a node is some entity or concept, and an edge represents knowledge about how a pair of entities relate. Thus, each node-edge-node triple stores a certain fact about the world or about a particular system.

A simple example of the knowledge graph. Public domain

The knowledge graph in the example above contains two types of edges, is and eat, which makes it an example of the graphs with several edge types introduced earlier. The Dogs-is-Animals structure gives us the knowledge that the “dogs” set is a subset of the “animals” set, or, in simpler terms, that dogs are animals.
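One simple way to store such triples is a NetworkX MultiDiGraph with the relation name as the edge key. Only the Dogs-is-Animals fact is taken from the text; the other triples and the helper function are made up for illustration:

```python
import networkx as nx

# (head, relation, tail) facts; all but the first are hypothetical examples.
triples = [
    ("Dogs", "is", "Animals"),
    ("Cows", "is", "Animals"),
    ("Cows", "eat", "Herbs"),
]

kg = nx.MultiDiGraph()
for head, relation, tail in triples:
    # The relation is stored as the edge key, so two nodes could be linked
    # by several differently typed edges at once.
    kg.add_edge(head, tail, key=relation)


def objects(graph, head, relation):
    """All tails t such that the fact (head, relation, t) is in the graph."""
    return sorted(t for _, t, key in graph.out_edges(head, keys=True) if key == relation)


print(objects(kg, "Cows", "eat"))
print(objects(kg, "Dogs", "is"))
```

Edge prediction on a knowledge graph then amounts to guessing which (head, relation, tail) triples are missing from this structure.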

Wikidata is a huge free knowledge base run by the Wikimedia Foundation (the organization behind Wikipedia). It is constantly updated and now has more than 100 million nodes. There are more than 400 edge types, among them part of, different from, opposite of, population, and location, so the relations clearly carry real meaning.

Top 20 edge relations in the Wikidata knowledge base for 2020. Public domain

That huge knowledge base contains a lot of information about the world around us. It’s still amazing to me how humanity has collected this data, and that machines are now able to process it!

One more thing I can’t keep silent about is Wikidata's beautiful visualization capabilities. For example, here you can see the connectivity plot of the U.S. states. Note that nobody drew it by hand; it is just a subgraph of the entire Wikidata graph: we took only U.S. states as nodes and P47 (shares border with) as edges.

Connectivity of the USA states. Public domain

Take a look at the Wikidata Graph Builder and other Wikidata visualizations.

Know More about Graphs

If after that brief overview you are now interested in graphs and want to know more about them, I refer you to the wonderful Gentle Introduction to Graph Neural Networks by Google Research. In this article, you can find more examples and interactive visualizations.

Check out the Graph Theory Algorithms course by freeCodeCamp.org for an overview of various graph theory algorithms, or the Stanford CS224W: Machine Learning with Graphs course to start your graph machine learning journey.

After that brief introduction, let’s actually start with Python libraries!

NetworkX — General Graph Analysis

If you have to do some operations on graphs and you use Python as your programming language, you will most likely find the NetworkX library pretty quickly. It is probably the most fundamental and commonly used library for network analysis that provides a wide range of functionality:

  • Data structures for storing and operating on undirected or directed graphs and multigraphs;
  • Implementations of many graph algorithms;
  • Basic visualization tools.

The library is pretty intuitive and easy to use. Also, most of the fundamentals, such as the graph data structures, stay the same or at least similar across all popular graph libraries. For clarity, you can create a simple graph and visualize it with the following code:
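A minimal sketch of what such code can look like, assuming matplotlib is installed. The edge list, colors, and output file name are illustrative choices:

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend, so the script also runs without a display
import matplotlib.pyplot as plt
import networkx as nx

# Build a small undirected graph from an edge list (nodes are created implicitly).
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])

# A fixed seed makes the force-directed layout reproducible.
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_color="lightblue")

out = Path("networkx_graph.png")
plt.savefig(out)
plt.close()
```

In a notebook you would typically call plt.show() instead of saving the figure to a file.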

Basic NetworkX visualization. Image by Author

When it comes to algorithms, networkx is pretty powerful and has hundreds of graph algorithms implemented.
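To give a feel for the algorithm side, here is a short sketch with a few commonly used functions on a tiny weighted graph (the node names and weights are made up):

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 4),
    ("A", "C", 1),
    ("C", "B", 2),
    ("B", "D", 5),
    ("E", "F", 1),  # a separate component
])

# Weighted shortest path (Dijkstra under the hood) and its total length.
path = nx.shortest_path(G, "A", "B", weight="weight")
length = nx.shortest_path_length(G, "A", "D", weight="weight")

# Connected components, sorted so the result is easy to compare.
components = sorted(sorted(c) for c in nx.connected_components(G))

# Degree centrality: the fraction of other nodes each node is connected to.
centrality = nx.degree_centrality(G)

print(path, length, components, centrality["B"])
```

Note how the direct A-B edge (weight 4) loses to the detour through C (weight 1 + 2) once weights are taken into account.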

To summarize, this is a powerful and easy-to-use library that works well for small and medium-sized graphs and will definitely be useful if you are dealing with graph analysis.

PyVis — Interactive Graph Visualizations

Using networkx for graph visualization can be pretty good for small graphs, but if you need more flexibility or interactivity, you had better give PyVis a chance. The situation is similar to matplotlib vs. plotly: using matplotlib for quick and straightforward visualizations is perfectly fine, but if you need to interact with your chart or present it to somebody else, you are better off with more powerful tools.

PyVis is built on the VisJS library and produces interactive visualizations in your browser with simple code. Let’s plot the same graph as in the example above.

This code will create a graph.html file. By opening it, you will be able to interact with your visualization: zoom it, drag it, and much more.

PyVis visualization example. Gif by Author

Looks interesting, right? The library even allows you to use a web UI to tweak the display configuration dynamically. Definitely check out the official tutorial, which will walk you through the library's main capabilities.

DGL and PyG: Graph Machine Learning

Let’s now switch to a more advanced topic: graph machine learning. I will mention two of the most popular libraries for it: DGL and PyG.

DGL (Deep Graph Library) was initially released in 2018. In contrast to PyG (PyTorch Geometric), which is built on top of PyTorch and therefore supports only PyTorch tensors, DGL supports multiple deep learning frameworks, including PyTorch, TensorFlow, and MXNet.

Both libraries implement popular Graph Neural Network (GNN) cells such as GraphSAGE, GAT (Graph Attention Network), GIN (Graph Isomorphism Network), and others. It will not be difficult to build a model from pre-made blocks — the process is very similar to plain PyTorch or TensorFlow.

Here is how you can create a 2-layer GCN model for node classification in PyG:

And the same code for DGL:

Both code snippets are pretty straightforward if you are familiar with deep learning and PyTorch.

As you can see, the model definition is very similar in both libraries. The training loop can then be written in plain PyTorch for PyG and requires some modifications for DGL (since DGL graph objects store the entire dataset, and you have to address the train/validation/test sets using binary masks).

There is a slight difference in data representation here: you can see it at least in the different input parameters of the forward method. Indeed, PyG stores everything as PyTorch tensors, while DGL has a separate graph object that you have to use and that, under the hood, follows a more classical NetworkX style.

However, that is not a big deal — you can convert the PyG graph object to the DGL graph and vice versa with a few lines of code. The more important question is: how else are they different? And which one should you use?

DGL vs PyG

Trying to figure out which of the libraries is better, you will keep coming across the same answer — “try both and decide which works best for you”. Okay, but how are they at least different? Again, the answer that you will constantly encounter is “they are quite similar”.

And they really are! Moreover, you saw it for yourself by looking at the code a few minutes ago. But of course, you can find some differences digging deeper: here is a good resource list including a few thoughts by library authors, and here is a pretty detailed comparison on different sides.

In general, the answer really is to try both. DGL has a lower-level API and can be harder to use when implementing new ideas. But this makes it more flexible: DGL is not limited to message-passing networks (classical Graph Convolutional Networks) and implements several concepts that PyG does not provide, for example, Tree-LSTM.

PyTorch Geometric, on the other hand, keeps its API as simple as possible and has therefore gained popularity among researchers, who can quickly implement new ideas, i.e. new GNN cells. Recently, PyG has become more and more popular thanks to the important updates in PyG 2.0 and an active and strong team of collaborators, including Stanford University.

Number of DGL vs PyG search queries over the last 5 years. Public domain

So I still encourage you to try both of them, giving PyG the chance first.

If you are working on a relatively familiar graph problem (be it node classification, graph classification, etc.), both PyG and DGL have a huge number of GNN cells implemented. With PyG, it will also be easier to implement your own GNN as part of any research.

However, if you want to get full control over what is happening under the hood or implement something more complicated than the message-passing framework, your choice will most likely fall on DGL.

Conclusion

The target audience of this article (people interested in graphs) is quite small. Machine learning is a fairly young field of computer science, and graph machine learning is even younger. The latter mainly attracts the attention of the research community, but, believe it or not, it is used in important real-world applications such as recommendation systems and biology/chemistry studies.

In any case, I hope these materials were interesting or helpful for you, whether you were looking for something specific or just learned something new today. As a recap, today we briefly reviewed what graphs and graph machine learning are, and took a look at the following libraries:

  • NetworkX: general graph analysis;
  • PyVis: interactive graph visualizations;
  • PyG and DGL: machine learning on graphs.

Thank you for reading!

