
Machine Learning on Graphs, Part 3

Graph kernel methods that are easy to implement and yield models that can be efficiently trained

Photo by Alina Grubnyak on Unsplash

In the third post of this series I discuss graph kernels. Graph kernels are an approach that generalizes kernel machines such as support vector machines to graphs. In a nutshell, the idea of graph kernels is to compare two graphs by a function that takes the two graphs as input. Intuitively, this function should measure the similarity between the graphs under different notions of similarity.

Graph kernels were the first major approach to machine learning from graph data. After the rise of deep learning and graph neural networks, they lost popularity but remain a powerful tool. In this post, I concentrate on graph kernel methods that are easy to implement and yield models that can be efficiently trained. As discussed in the previous post in the series, the main objective is to present simple algorithms for machine learning on graphs that should be tried out before turning to graph neural networks.

General setting

The premise of graph kernels is that we have a collection of graphs as input and want to train a classification or regression algorithm. Note that these graphs can be subgraphs of a larger graph, for instance different communities in a large social network. About 20 years ago, before the dominance of deep learning, researchers came up with the idea of training support vector machines for the graph classification problem. In order to do so, one needs to design a graph-specific kernel function that compares two graphs. We can then use the standard SVM algorithm with the kernel trick, which builds a Gram matrix consisting of the kernel function evaluated on all pairs of graphs in the training set. We learn the support vectors, or in our case support graphs, that allow us to classify a new graph by simply evaluating the kernel function between the new graph and each support graph.
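
To make this concrete, here is a minimal sketch of the training and prediction workflow, assuming we already have some function graph_kernel(G, H) that evaluates a graph kernel between two graphs (concrete kernels are discussed in the following sections). It relies on scikit-learn's support for precomputed kernels; the names graph_kernel, train_graphs, train_labels, and test_graphs are placeholders, not code from the notebook.

import numpy as np
from sklearn.svm import SVC

def gram_matrix(graphs, graph_kernel):
    # Evaluate the kernel function for every pair of graphs in the collection.
    m = len(graphs)
    K = np.zeros((m, m))
    for i in range(m):
        for j in range(i, m):
            K[i, j] = K[j, i] = graph_kernel(graphs[i], graphs[j])
    return K

# Training: the Gram matrix of the training graphs plays the role of the input features.
# K_train = gram_matrix(train_graphs, graph_kernel)
# clf = SVC(kernel='precomputed').fit(K_train, train_labels)

# Prediction: evaluate the kernel between each new graph and each training graph
# (only the support graphs actually contribute to the decision).
# K_test = np.array([[graph_kernel(G, H) for H in train_graphs] for G in test_graphs])
# y_pred = clf.predict(K_test)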

Next, I will present a general family of graph kernels.

Convolutional graph kernels

Let us for a moment assume that we have collected a vector at each node. For example, this can be a node embedding vector, as discussed in the previous post. Somewhat simplified, a convolutional graph kernel for two graphs G and H is defined as

K(G, H) = Σ_{u ∈ V(G)} Σ_{v ∈ V(H)} κ(x_u, x_v)

In the above, κ is a (non-linear) kernel function between two vectors, and x_u and x_v denote the vectors collected at nodes u and v of the two graphs. For example, the polynomial kernel:

κ(u, v) = (u^T v + c)^d

Or the RBF kernel:

κ(u, v) = exp(−‖u − v‖² / (2σ²))

I refer the reader to my blog post on non-linear kernel functions for more details on kernel functions.

Please note that there is also a normalized version of the kernel definition, in which we divide by the number of node pairs:

K(G, H) = (1 / (|V(G)| · |V(H)|)) · Σ_{u ∈ V(G)} Σ_{v ∈ V(H)} κ(x_u, x_v)

Let us analyze the definition of the convolutional graph kernel.

  1. We compare all pairs of nodes because the nodes of a graph are unordered, so we are effectively comparing two sets of vectors. The kernel is called convolutional because each node vector from one graph is compared against each node vector from the other graph. (Just like in convolutional neural networks, where each filter is compared against all patches of the input image.)
  2. We use a (non-linear) kernel function for the comparison of two vectors in order to capture more complex dependencies between nodes. The choice of the kernel function assigns different meanings to similarity. For example, a suitable choice of the RBF kernel's parameters restricts the comparison to vector pairs that are almost identical to each other.

The obvious shortcoming of the above is the quadratic computational complexity. For two graphs, each on n nodes, we need to evaluate the kernel function n² times. Moreover, for a collection of m graphs in the training dataset, we need a Gram matrix with m² entries, each of which is a kernel evaluation between a pair of graphs. Next, I will present a very efficient approach to the approximate computation of graph kernels.
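
As an illustration, the following is a minimal sketch of the (normalized) convolutional graph kernel for two graphs given as arrays of node vectors, with the RBF kernel for node comparison. The double loop makes the quadratic cost explicit; the names X_G and X_H are placeholders for whatever node vectors you have collected.

import numpy as np

def rbf(u, v, sigma=1.0):
    # RBF kernel between two node vectors.
    diff = u - v
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def conv_graph_kernel(X_G, X_H, node_kernel=rbf, normalize=True):
    # X_G and X_H are arrays of shape (n_G, d) and (n_H, d) holding the node vectors.
    # Sum the node kernel over all pairs of nodes -- n_G * n_H evaluations.
    total = 0.0
    for u in X_G:
        for v in X_H:
            total += node_kernel(u, v)
    if normalize:
        total /= (len(X_G) * len(X_H))
    return total

A function like conv_graph_kernel can be plugged directly into the Gram matrix computation sketched in the previous section.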

Explicit graph feature maps

As discussed in my post on explicit feature maps for non-linear kernels, we can approximate a non-linear kernel function as

κ(u, v) ≈ ψ(u)^T ψ(v)

In the above, ψ is a function that maps a vector onto a new feature space of reasonable, user-defined dimensionality. The larger the dimensionality, the better the approximation.

The main result on explicit feature maps allows us to rewrite the graph kernel expression as follows. The following is a simplification of the main result from [3]:

K(G, H) = Σ_{u ∈ V(G)} Σ_{v ∈ V(H)} κ(x_u, x_v) ≈ (Σ_{u ∈ V(G)} ψ(x_u))^T (Σ_{v ∈ V(H)} ψ(x_v))

The above shows that we can obtain a feature representation for the entire graph from the embeddings for individual nodes. The inner product of two graph feature vectors approximates the kernel function for two graphs.

Observe that for the normalized version of the graph kernel and the linear kernel for node vectors, i.e., κ(u, v) = u^T v, the above corresponds to taking the mean vector of all node embeddings in the graph.
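
The following sketch shows one way to turn this result into code, with ψ implemented by scikit-learn's RBFSampler (random Fourier features that approximate the RBF kernel). The variable node_vectors_per_graph, a list of arrays of node vectors (one array per graph), is a placeholder of my own.

import numpy as np
from sklearn.kernel_approximation import RBFSampler

def graph_feature_vector(X, psi):
    # X is an (n, d) array with the node vectors of one graph.
    # Map each node vector with psi and take the mean, which corresponds
    # to the normalized version of the graph kernel.
    return psi.transform(X).mean(axis=0)

# psi = RBFSampler(gamma=0.5, n_components=256, random_state=0)
# psi.fit(np.vstack(node_vectors_per_graph))  # fit the random features on all node vectors
# graph_vectors = np.array([graph_feature_vector(X, psi) for X in node_vectors_per_graph])

# The inner product graph_vectors[i] @ graph_vectors[j] approximates the normalized
# convolutional kernel between graphs i and j, and graph_vectors can be fed directly
# to a linear SVM -- no Gram matrix needed.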

Implementation

There is a large collection of graph benchmark datasets for graph kernel evaluation. In the Jupyter notebook for this post I present code that reads the input graphs and converts them to a representation that we can use to train node embeddings. Note that the graphs are disjoint, thus a standard approach like node2vec won't work. Instead, I implemented a variation of the approach outlined in the second half of my previous post on node embeddings. We assume the nodes are labeled; if no labels are provided, we can create our own labels from the node degrees. I ran a number of short random walks starting from each node in each graph. For each walk I recorded the labels of the nodes encountered during the walk. Then I grouped together nodes with identical walks, i.e., walks with identical label sequences. From these groups I sampled positive pairs for a word2vec-like node embedding algorithm with negative sampling. The reader is referred to the Jupyter notebook for more details.
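
As a rough illustration of the grouping step, here is a minimal sketch under the assumption that each graph is a networkx graph whose nodes carry a 'label' attribute; the notebook contains the full pipeline and may differ in the details.

import random
from collections import defaultdict

def walk_label_sequence(G, start, length):
    # One random walk from `start`, recording the labels of the visited nodes.
    labels = [G.nodes[start]['label']]
    node = start
    for _ in range(length):
        neighbors = list(G.neighbors(node))
        if not neighbors:
            break
        node = random.choice(neighbors)
        labels.append(G.nodes[node]['label'])
    return tuple(labels)

def group_nodes_by_walks(G, length=3):
    # Nodes whose walks produce identical label sequences end up in the same group;
    # positive pairs for the word2vec-style embedding are sampled within each group.
    groups = defaultdict(list)
    for node in G.nodes():
        groups[walk_label_sequence(G, node, length)].append(node)
    return groups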

Other graph kernels

Using node embeddings as node vectors is a natural choice, as node embeddings capture the graph structure. However, I would like to briefly mention some of the most widely used methods for computing graph kernels. Note, however, that these methods were designed before the emergence of node embedding algorithms. For a thorough overview of graph kernel algorithms I refer the reader to [2].

k-walk kernel

From each node, we start a number of random walks of length exactly k. The vector at the node is then the distribution of the labels in all the walks. (Note that for unlabeled graphs it is usual to use the node degree as a label.)
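
A possible implementation of the node vectors for the k-walk kernel, assuming labeled networkx graphs and a dictionary label_index mapping each label to a vector position (both assumptions of mine, not from the original papers):

import random
from collections import Counter
import numpy as np

def k_walk_node_vector(G, start, k, num_walks, label_index):
    # Distribution of the labels encountered on num_walks random walks
    # of length exactly k starting from `start`.
    counts = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(k):
            neighbors = list(G.neighbors(node))
            if not neighbors:
                break
            node = random.choice(neighbors)
            counts[G.nodes[node]['label']] += 1
    vec = np.zeros(len(label_index))
    total = sum(counts.values()) or 1
    for label, cnt in counts.items():
        vec[label_index[label]] = cnt / total
    return vec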

Shortest path kernel

For each node, record the shortest path to each other node in the graph. Then, similarly to the k-walk kernel, create vectors that describe the label distribution in the collected paths.
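
A sketch along the same lines, again assuming labeled networkx graphs and a label_index dictionary; it uses breadth-first shortest paths, which is appropriate for unweighted graphs:

from collections import Counter
import numpy as np
import networkx as nx

def shortest_path_node_vector(G, source, label_index):
    # Collect the labels of all nodes lying on shortest paths from `source`
    # to every other node, and return their normalized distribution.
    counts = Counter()
    for target, path in nx.single_source_shortest_path(G, source).items():
        for node in path[1:]:  # skip the source itself
            counts[G.nodes[node]['label']] += 1
    vec = np.zeros(len(label_index))
    total = sum(counts.values()) or 1
    for label, cnt in counts.items():
        vec[label_index[label]] = cnt / total
    return vec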

Weisfeiler-Lehman kernel

For each node in the graph, collect the labels from its neighbors and sort them. Then create a new label from the resulting string. Iterate this relabeling t times, so that each node accumulates t+1 labels (including its original one). The node vectors then consist of the ids of these labels.
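
A minimal sketch of the relabeling procedure, again assuming a labeled networkx-style graph; the per-node vector is the sequence of label ids collected over the iterations:

def wl_relabel(G, initial_labels, t):
    # initial_labels: dict mapping each node to its original label (or its degree).
    # Returns, for each node, the list of t+1 labels produced over the iterations.
    history = {node: [initial_labels[node]] for node in G.nodes()}
    current = dict(initial_labels)
    compressed = {}  # maps a neighborhood signature to a new compressed label id
    for it in range(t):
        new_labels = {}
        for node in G.nodes():
            # The signature is the node's own label plus the sorted labels of its
            # neighbors; the iteration index keeps label ids from different
            # iterations separate.
            signature = (it, current[node],
                         tuple(sorted(current[v] for v in G.neighbors(node))))
            if signature not in compressed:
                compressed[signature] = len(compressed)
            new_labels[node] = compressed[signature]
        for node in G.nodes():
            history[node].append(new_labels[node])
        current = new_labels
    return history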

The Weisfeiler-Lehman relabeling algorithm. Source: [1]

Code

The implementation can be found in this Jupyter notebook.

[1] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt: Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res. 12: 2539–2561 (2011): https://www.jmlr.org/papers/volume12/shervashidze11a/shervashidze11a.pdf

[2] Karsten Borgwardt, Elisabetta Ghisu, Felipe Llinares-López, Leslie O’Bray, Bastian Rieck. Graph Kernels. State-of-the-Art and Future Challenges. https://arxiv.org/pdf/2011.03854.pdf

[3] Moez Draief, Konstantin Kutzkov, Kevin Scaman, Milan Vojnovic: KONG: Kernels for ordered-neighborhood graphs. NeurIPS 2018: 4055–4064 https://arxiv.org/pdf/1805.10014.pdf

