Photo by Alina Grubnyak on Unsplash

Machine Learning on Graphs, Part 1

Collecting basic statistics

Konstantin Kutzkov
Towards Data Science
8 min read · Sep 9, 2021


In a series of posts, I will provide an overview of several machine learning approaches to learning from graph data. Starting with basic statistics that are used to describe graphs, I will go deeper into the subject by discussing node embeddings, graph kernels, graph signal processing, and eventually graph neural networks. The posts are intended to reflect on my personal experience in academia and industry, including some of my research papers. My main motivation is to present first some basic approaches to machine learning on graphs that should be used before digging into advanced algorithms like graph neural networks.

In the first post, I present some common techniques for graph analysis that should help us better understand our data.

Preliminaries

I assume familiarity with basic graph concepts. I will consider undirected graphs without self-loops (nodes connected to themselves by an edge) but this is only for the ease of presentation. A graph is denoted by G=(V, E) where V is the set of nodes or vertices, and E is the set of edges. The number of nodes is denoted by n = |V| and the number of edges by m = |E|.

What graph statistics to look for?

Before starting to work on a problem, we usually invest some time in exploratory data analysis. This provides us with insights into the data that might be crucial for our later work. Ideally, we would like to plot a graph like we do for a subway map, but most real-life graphs are too large to be represented visually. Instead, here are some important statistics to look for:

  • Degree distribution. The degree distribution of two real-life networks is shown in Figure 1. The first graph comes from email correspondence in an EU institution, and the second graph is the co-authorship graph from the DBLP database. We observe that the most common node degree in the email graph is 1, meaning that most people only communicate with a single person. In the co-authorship graph most authors have two or three co-authors.
Figure 1: degree distribution. Image by author.
  • Graph density. The quantity is defined as

    density(G) = m / (n(n-1)/2) = 2m / (n(n-1))

The explanation is simple: the maximum possible number of edges is n(n-1)/2, as every node can be connected to exactly n-1 other nodes and the edges are undirected. The density measures how close the graph is to a fully connected graph, or clique. Most real-life networks are very sparse, so this measure is in most cases not very informative.

  • Connected components. We say that node u is reachable from node v if there exists a path, a sequence of edges, between u and v. In a connected component each node is reachable from every other node in the component. The number and the corresponding size of connected components can provide information about the graph. However, in most cases we are interested in connected graphs consisting of a single connected component.
  • Graph diameter. The shortest path between two nodes u and v is the minimum number of edges that need to be traversed in order to reach v after starting at u. The graph diameter is the maximum shortest path length between any two nodes in the graph. If the graph has more than one connected component, then its diameter is infinity. In such a case, we are interested in the diameter of each connected component. The graph diameter can be used to distinguish between different types of networks. For example, subway networks are likely to have a larger diameter than small social networks.
  • The number of triangles and transitivity coefficient. In graph theory, there is the fundamental concept of Erdős–Rényi graphs. This is a theoretical model where edges between nodes are generated at random, each with a fixed probability p, independent of other edges. Real-life graphs are very different from Erdős–Rényi graphs, as edges are not created independently at random. In particular, for most real graphs it holds that “the friends of my friends are also my friends”. This has led researchers to consider triangles as a fundamental unit in graph analytics [1]. At a high level, the number of triangles shows how different our graph is from a random graph.
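Most of the statistics above (degree distribution, density, connected components, diameter) are one-liners in networkx. Here is a minimal sketch on a small toy graph; the graph is illustrative, not one of the graphs from the figures:

```python
import networkx as nx

# A toy graph: a path 0-1-2-3 plus the chord 1-3
G = nx.Graph([(0, 1), (1, 2), (2, 3), (1, 3)])

# Degree distribution: the degree of every node
print(sorted(d for _, d in G.degree()))   # [1, 2, 2, 3]

# Graph density: m / (n(n-1)/2) = 4 / 6
print(nx.density(G))

# Connected components and their sizes
components = list(nx.connected_components(G))
print(len(components), [len(c) for c in components])

# Graph diameter (defined only for connected graphs)
if nx.is_connected(G):
    print(nx.diameter(G))                 # 2
```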

The transitivity coefficient of a graph is defined as

    T(G) = 3 · #triangles / #2-paths

In the above, the numerator is the number of triangles multiplied by 3, and the denominator is the number of 2-paths in the graph, which can be computed as

    #2-paths = sum over all nodes v of d(v)(d(v)-1)/2

where d(v) denotes the degree of node v. In the graph in Figure 2, there are two triangles, namely (u, v, w) and (u, v, z). The 2-paths are sequences of 3 nodes connected by edges, such as z-u-w. Observe that each pair of neighbors of a given node results in a 2-path centered at that node, hence the above formula for the total number of 2-paths. We multiply the number of triangles by 3 because each triangle contains 3 of these 2-paths. Thus, the transitivity coefficient is a value between 0 (no triangles at all) and 1 (a fully connected graph).

In Figure 2, the number of 2-paths is 8, thus the transitivity coefficient is 0.75.
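We can verify this with networkx. Assuming my reading of Figure 2 is correct, the graph has nodes u, v, w, z and the edges u-v, u-w, v-w, u-z, v-z:

```python
import networkx as nx

# Figure 2 (as read from the plot): two triangles (u,v,w) and (u,v,z)
# sharing the edge u-v
G = nx.Graph([('u', 'v'), ('u', 'w'), ('v', 'w'), ('u', 'z'), ('v', 'z')])

# transitivity = 3 * #triangles / #2-paths = 3 * 2 / 8
print(nx.transitivity(G))  # 0.75
```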

In most real-life graphs the transitivity coefficient is a constant, i.e., it does not vanish as the graph grows. The larger the transitivity coefficient, the more tightly connected the network is. In fact, the concept was introduced in 1998 to describe the small-world network model.

Figure 2. A densely connected graph. Image by author.
  • The local number of triangles and the clustering coefficient. The local clustering coefficient of node u is defined as

    C(u) = #triangles containing u / (d(u)(d(u)-1)/2)

i.e., the number of triangles having u as a node divided by the number of 2-paths centered at u.

This is then generalized to the global clustering coefficient

    C(G) = (1/n) · sum over all nodes u of C(u)

In Figure 2, node u has a local clustering coefficient of 2/3, and the global clustering coefficient of the graph is (2/3 + 2/3 + 1 + 1)/4 ≈ 0.833.
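Using the same reading of Figure 2, networkx computes both the local and the global (average) clustering coefficients:

```python
import networkx as nx

G = nx.Graph([('u', 'v'), ('u', 'w'), ('v', 'w'), ('u', 'z'), ('v', 'z')])

# Local clustering of u: 2 triangles / C(3, 2) = 3 2-paths centered at u
print(nx.clustering(G, 'u'))       # 0.666...

# Global (average) clustering: (2/3 + 2/3 + 1 + 1) / 4
print(nx.average_clustering(G))    # 0.833...
```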

Finally, we would like to note that in the literature the transitivity coefficient is sometimes called the clustering coefficient, and the global clustering coefficient, as defined above, is called the average clustering coefficient. The confusion in terminology stems from the fact that Watts and Strogatz wrongly claimed in their seminal paper [1] that the transitivity coefficient and the clustering coefficient are identical.

How to count the number of triangles?

From the above discussion it is evident that the most important graph statistic appears to be the number of triangles and the corresponding transitivity and clustering coefficients. Let’s see how to compute them. I recommend using Python’s networkx library when working with graphs. It is easy and intuitive to use and provides many fundamental graph algorithms, such as shortest paths, connected components, and even the transitivity coefficient.

import networkx as nx
import numpy as np

G = nx.Graph()
for u, v in <list of edges or edge generator>:
    G.add_edge(u, v)

We can count triangles by using the third power of the adjacency matrix of the graph. The precise formula is

    #triangles = trace(A³) / 6

The trace is defined as the sum of the diagonal entries of a matrix. Each diagonal entry of A³ counts the number of closed walks of length exactly 3 from the corresponding node back to itself. Consider a triangle on the three nodes (u, v, w). There are 6 permutations of these three nodes, and each corresponds to one such closed walk, which is why we divide the trace by 6. In Python, the code looks as follows:

def count_triangles(G):
    # nx.to_scipy_sparse_matrix in networkx < 3
    A = nx.to_scipy_sparse_array(G)
    A2 = A.dot(A)
    A3 = A2.dot(A)
    # each triangle is counted 6 times on the diagonal of A^3
    return A3.diagonal().sum() // 6

Note that the above uses sparse matrix multiplication. However, the third power of the adjacency matrix is not guaranteed to be sparse, especially for graphs of small diameter, and the computation might become a bottleneck. In such a case I recommend using the Doulion algorithm [2]. Doulion works by sampling edges from the original graph: each edge is kept independently with probability p. Thus, a triangle survives in the sparsified graph with probability p³. We estimate the number of triangles in the original graph by multiplying the number of triangles in the sparsified graph by 1/p³.

def sparsify_graph(G, p):
    G_sparse = nx.Graph()
    for u, v in G.edges():
        # keep each edge independently with probability p
        if np.random.random() <= p:
            G_sparse.add_edge(u, v)
    return G_sparse
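Putting the two functions together, here is a self-contained sketch of the full Doulion estimate. The complete graph on 20 nodes is just an illustrative input, and the helper functions are restated so the snippet runs on its own:

```python
import networkx as nx
import numpy as np

def count_triangles(G):
    # exact count via trace(A^3) / 6
    A = nx.to_scipy_sparse_array(G)
    A3 = A.dot(A).dot(A)
    return int(A3.diagonal().sum() // 6)

def sparsify_graph(G, p):
    G_sparse = nx.Graph()
    for u, v in G.edges():
        if np.random.random() <= p:
            G_sparse.add_edge(u, v)
    return G_sparse

p = 0.5
G = nx.complete_graph(20)  # has C(20, 3) = 1140 triangles
# count triangles in the sparsified graph and scale by 1/p^3;
# this is an unbiased estimate of the 1140 triangles in G
estimate = count_triangles(sparsify_graph(G, p)) / p**3
print(estimate)
```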

I ran an experiment on the graph obtained from the DBLP network. The graph consists of 317,080 nodes and just over 1 million edges. Sampling edges with probability p = 0.1, I obtained the following running times for the exact counting algorithm and for Doulion. The achieved approximation of the number of triangles is excellent.

Elapsed time exact:   13.21 secs
Elapsed time Doulion: 1.19 secs
Approximation exact/estimated: 1.030

Some more advanced statistics

Let us briefly mention that there are also more advanced graph statistics:

  • Betweenness centrality. This measure captures the influence nodes have in the graph. For each node u, we count how many shortest paths between node pairs go through u. Nodes with high betweenness centrality are considered more important in the network. The distribution of betweenness centrality scores provides us with further insights into the graph structure. However, the computational complexity for computing the centrality for all nodes is O(m*n).
  • Densely connected subgraphs. An important research topic is the detection of densely connected subgraphs [3]. These are subgraphs in the original graph where almost all node pairs are connected by an edge. This is the basis of algorithms for community detection. But the number and density of these dense components are also important statistics.
  • Mining graph motifs. Detecting frequent graph patterns is an active area of research [4]. It generalizes algorithms for triangle counting to mining and counting more complex subgraphs, or motifs. These counts can be used in graph classification tasks. In Figure 3, I show three different motifs on 4 nodes. However, counting motifs can be computationally expensive and is limited to small subgraphs. One can argue that the success of graph neural networks is based on their ability to learn complex motifs that are relevant to the problem at hand.
Figure 3. Subgraph motifs on 4 nodes.
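Of these, betweenness centrality is directly available in networkx. A minimal sketch on a star graph, where the center lies on every shortest path between leaves (the star graph is my illustrative example, not one from the post):

```python
import networkx as nx

# Star graph: node 0 in the center, leaves 1..4
G = nx.star_graph(4)

# Normalized betweenness centrality for every node
bc = nx.betweenness_centrality(G)
print(bc[0])  # 1.0 — the center lies on all shortest paths between leaves
print(bc[1])  # 0.0 — a leaf lies on no shortest path between other nodes
```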

Code

The code for the above examples can be found at: https://github.com/konstantinkutzkov/ML_on_graphs/

[1] Duncan J. Watts, Steven H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature 393, 1998.

[2] Charalampos E. Tsourakakis, U Kang, Gary L. Miller, and Christos Faloutsos. DOULION: Counting Triangles in Massive Graphs with a Coin. KDD 2009.

[3] Charalampos E. Tsourakakis, Francesco Bonchi, Aristides Gionis, Francesco Gullo, and Maria A. Tsiarli. Denser than the Densest Subgraph: Extracting Optimal Quasi-Cliques with Quality Guarantees. KDD 2013.

[4] Guyue Han and Harish Sethu. Waddling Random Walk: Fast and Accurate Mining of Motif Statistics in Large Graphs. ICDM 2016.


Data scientist with a PhD in Computer science from IT University of Copenhagen. Interested in machine learning and algorithms.