Anisotropic, Dynamic, Spectral and Multiscale Filters Defined on Graphs

As part of the “Tutorial on Graph Neural Networks for Computer Vision and Beyond”

Boris Knyazev
Towards Data Science


I’m presenting an overview of important Graph Neural Network works by distilling key ideas and explaining the simple intuition behind milestone methods using Python and PyTorch. This post continues the first part of my tutorial.

Graph of Graph Neural Network (GNN) and related works. Some other important works and edges are not shown to avoid further clutter. For example, there is a large body of works on dynamic graphs that deserve a separate overview. Best viewed on a very wide screen in color.

20+ years of Graph Neural Networks

In the “Graph of Graph Neural Network (GNN) and related works” above, I added papers on graphs that I have come across in the last year. In this graph, a directed edge between two works denotes that one paper is based on the other (while not necessarily citing it), and the color of a work denotes:

  • Red — spectral methods (require eigen-decomposition of the graph Laplacian, which will be explained below)
  • Green — methods that work in the spatial domain (do not require eigen-decomposition of the graph Laplacian)
  • Blue — equivalent to spectral methods, but do not require eigen-decomposition (so, effectively, spatial methods)
  • Black — methods complementary to GNNs and agnostic to the choice of the GNN itself (e.g., pooling, attention).

Note that some other important works and edges are not shown to avoid further clutter, and only a tiny fraction of works, highlighted in bold boxes, will be covered in this post. Disclaimer: I still found room to squeeze our own recent works in there 😊.

Most of the important methods are covered in the non-exhaustive overview graph above.

The first work where graphs were classified using a neural network seems to be a 1997 paper by Alessandro Sperduti and Antonina Starita on “Supervised Neural Networks for the Classification of Structures”.

A figure from (Sperduti & Starita, 1997), which is strikingly similar to what we are doing now, after more than 20 years.

Sperduti & Starita, 1997: “Until now neural networks have been used for classifying unstructured patterns and sequences. However, standard neural networks and statistical methods are usually believed to be inadequate when dealing with complex structures because of their feature-based approach.”

Since 1997, the body of work on learning from graphs has grown so much and in so many diverse directions that it is very hard to keep track without some smart automated system. I believe we are converging towards methods based on neural networks (based on our formula (2) explained in the first part of my tutorial), or some combination of neural networks and other methods.

Graph neural layer’s formula (2) from the first part of my tutorial that we will also need in this part: X⁽ˡ⁺¹⁾ = 𝓐X⁽ˡ⁾W⁽ˡ⁾. Keep in mind that if we need to compute a specific loss for the output features or if we need to stack these layers, we apply some activation like ReLU or Softmax.

To recap the notation we used in the first part, we have some undirected graph G with N nodes. Each node in this graph has a C-dimensional feature vector, and the features of all nodes are represented as an N×C dimensional matrix X⁽ˡ⁾. In a typical graph network, such as GCN (Kipf & Welling, ICLR, 2017), we feed these features X⁽ˡ⁾ to a graph neural layer with C×F dimensional trainable weights W⁽ˡ⁾, so that the output of this layer is an N×F matrix X⁽ˡ⁺¹⁾ encoding updated (and hopefully better in some sense) node features. 𝓐 is an N×N matrix, where the entry 𝓐ᵢⱼ indicates if node i is connected (adjacent) to node j. This matrix is called an adjacency matrix. I use 𝓐 instead of plain A to highlight that this matrix can be normalized in a way that facilitates feature propagation in a deep network. For the purpose of this tutorial, we can assume that 𝓐=A, i.e. each i-th row of the matrix product 𝓐X⁽ˡ⁾ will contain the sum of the features of node i's neighbors.
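To make formula (2) concrete, here is a minimal NumPy sketch of a single graph neural layer for a toy random graph (the sizes are arbitrary and the weights are randomly initialized rather than trained):

# A minimal sketch of formula (2): X⁽ˡ⁺¹⁾ = 𝓐X⁽ˡ⁾W⁽ˡ⁾
import numpy as np
N, C, F = 5, 4, 8                    # number of nodes, input and output feature dimensions
A = np.random.randint(0, 2, (N, N))  # toy adjacency matrix
A = ((A + A.T) > 0).astype(float)    # make it symmetric, i.e. the graph is undirected
X = np.random.randn(N, C)            # node features X⁽ˡ⁾
W = np.random.randn(C, F)            # trainable weights W⁽ˡ⁾ (here just random)
X_next = A.dot(X).dot(W)             # N×F output features X⁽ˡ⁺¹⁾

Each i-th row of A.dot(X) sums the features of node i's neighbors, exactly as described above.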

In the rest of this part of the tutorial, I’ll briefly explain the works of my choice shown in bold boxes in the overview graph. I recommend Bronstein et al.’s review for a more comprehensive and formal analysis.

Note that even though I dive into some technical details of spectral graph convolution below, many recent works (e.g., GIN in Xu et al., ICLR, 2019) are built without spectral convolution and show great results in some tasks. However, knowing how spectral convolution works is still helpful to understand and avoid potential problems with other methods.

1. Spectral graph convolution

Bruna et al., ICLR, 2014

I explain spectral graph convolution in detail in another post of mine.

I’ll briefly summarize it here for the purpose of this part of the tutorial. A formal definition of spectral graph convolution, which is very similar to the convolution theorem in signal/image processing, can be written as:

Spectral graph convolution, formula (3): X⁽ˡ⁺¹⁾ = V(VᵀX⁽ˡ⁾ ⊙ VᵀW_spectral), where ⊙ means element-wise multiplication.

where V are eigenvectors and Λ are eigenvalues of the graph Laplacian L, which can be found by eigen-decomposition: L=VΛVᵀ; W_spectral are filters. Throughout this tutorial I’m going to assume the “symmetric normalized Laplacian”. It is computed based only on the adjacency matrix A of a graph, which can be done in a few lines of Python code as follows:

# Computing the graph Laplacian
# A is an adjacency matrix
import numpy as np
N = A.shape[0] # number of nodes in a graph
D = np.sum(A, 0) # node degrees
D_hat = np.diag((D + 1e-5)**(-0.5)) # normalized node degrees (1e-5 avoids division by zero)
L = np.identity(N) - np.dot(D_hat, A).dot(D_hat) # Laplacian

Here, we assume that A is symmetric, i.e. A = Aᵀ and our graph is undirected, otherwise node degrees are not well-defined and some assumptions must be made to compute the Laplacian. In the context of computer vision and machine learning, the graph Laplacian defines how node features will be updated if we stack several graph neural layers in the form of formula (2).

So, given graph Laplacian L, node features X and filters W_spectral, in Python spectral convolution on graphs looks very simple:

# Spectral convolution on graphs
# X is an N×1 matrix of 1-dimensional node features
# L is an N×N graph Laplacian computed above
# W_spectral are N×F weights (filters) that we want to train
from scipy.sparse.linalg import eigsh # assumes L to be symmetric
Λ, V = eigsh(L, k=20, which='SM') # eigen-decomposition (i.e. find Λ, V)
X_hat = V.T.dot(X) # 20×1 node features in the "spectral" domain
W_hat = V.T.dot(W_spectral) # 20×F filters in the "spectral" domain
Y = V.dot(X_hat * W_hat) # N×F result of convolution

where we assume that our node features X⁽ˡ⁾ are 1-dimensional, e.g. MNIST pixels, but the approach can be extended to the C-dimensional case: we just need to repeat this convolution for each channel and then sum over C, as in signal/image convolution (a sketch of this extension is shown below).
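A minimal sketch of that multi-channel extension, reusing V from the eigen-decomposition above and assuming the (illustrative) convention that X is now N×C and W_spectral is a C×N×F array with one spectral filter per input channel and output filter:

# Multi-channel spectral convolution (sketch)
# X is N×C node features, W_spectral is C×N×F filters, V is N×20 eigenvectors from above
Y = 0
for c in range(C):
    X_hat = V.T.dot(X[:, c:c + 1]) # 20×1 features of channel c in the "spectral" domain
    W_hat = V.T.dot(W_spectral[c]) # 20×F filters of channel c in the "spectral" domain
    Y = Y + V.dot(X_hat * W_hat)   # sum results over channels, N×F in total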

Formula (3) is essentially the same as spectral convolution of signals on regular grids using the Fourier Transform, and so creates a few problems for machine learning:

  1. the dimensionality of trainable weights (filters) W_spectral depends on the number of nodes N in a graph;
  2. W_spectral also depends on the graph structure encoded in eigenvectors V.

These issues prevent scaling to datasets with large graphs of variable structure.

To solve the first issue, Bruna et al. proposed to smooth filters in the spectral domain, which makes them more local in the spatial domain according to spectral theory. The idea is that we can represent our filter W_spectral from formula (3) as a sum of K predefined functions, such as splines, and instead of learning N values of W, we learn K coefficients α of this sum:

Formula (4): we can approximate our N-dimensional filter W_spectral as a finite sum of K functions f, such as the splines shown below, i.e. W_spectral ≈ Σₖ αₖfₖ. So, instead of learning N values of W_spectral, we can learn K coefficients (alpha) of those functions; it becomes efficient when K ≪ N.

While the dimensionality of fₖ does depend on the number of nodes N, these functions are fixed, so we don’t learn them. The only things we learn are the coefficients α, and so W_spectral no longer depends on N. To make our approximation in formula (4) reasonable, we want K≪N to reduce the number of trainable parameters from N to K and, more importantly, make it independent of N, so that our GNN can digest graphs of any size. A minimal sketch of this parameterization is shown below.
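This sketch of formula (4) uses simple polynomials instead of splines just to keep it short; the names basis and alpha are illustrative:

# Approximating an N-dimensional spectral filter with K fixed functions (formula (4))
import numpy as np
N, K = 400, 5
basis = np.stack([np.linspace(0, 1, N) ** k for k in range(K)], axis=1) # N×K fixed basis
alpha = np.random.randn(K, 1) # K trainable coefficients (here just randomly initialized)
W_spectral = basis.dot(alpha) # N×1 filter parameterized by only K learned values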

While this smoothing method solves the first issue, it does not address the second one.

2. Chebyshev graph convolution

Defferrard et al., NeurIPS, 2016

The main drawback of spectral convolution and its smoothed version above is that it still requires eigen-decomposition of an N×N dimensional graph Laplacian L, which creates two main problems:

  1. 🙁 The complexity of eigen-decomposition is huge, O(N³). Moreover, in the case of large graphs, keeping the graph Laplacian in a dense format in RAM is infeasible. One solution is to use sparse matrices and find eigenvectors using scipy.sparse.linalg.eigs in Python. Additionally, you may preprocess all training graphs on a dedicated server with a lot of RAM and CPU cores. In many applications, your test graphs can also be preprocessed in advance, but if you have a constant influx of new large graphs, eigen-decomposition will make you sad.
  2. 🙁 Another problem is that the model you train ends up being closely related to the eigenvectors V of the graph. This can be a big problem if your training and test graphs have very different structures (numbers of nodes and edges). Otherwise, if all graphs are very similar, it is less of a problem. Moreover, if you use some smoothing of filters in the frequency domain like splines discussed above, then your filters become more localized and the problem of adapting to new graphs seems to be even less noticeable. However, the models will still be quite limited.

Now, what does Chebyshev graph convolution have to do with all that?

It turns out that it solves both problems at the same time! 😃

That is, it avoids computing the costly eigen-decomposition, and the filters are no longer “attached” to eigenvectors (yet they still are functions of the eigenvalues Λ). Moreover, it has a very useful parameter, usually denoted as K, with a similar intuition to K in our formula (4) above, which determines the locality of filters. Informally: for K=1, we feed just node features X⁽ˡ⁾ to our GNN; for K=2, we feed X⁽ˡ⁾ and 𝓐X⁽ˡ⁾; for K=3, we feed X⁽ˡ⁾, 𝓐X⁽ˡ⁾ and 𝓐²X⁽ˡ⁾; and so forth for larger K (I hope you’ve noticed the pattern). See a more accurate and formal definition in Defferrard et al. and my code below; additional analysis is given in (Knyazev et al., NeurIPS-W, 2018).

Due to the power property of adjacency matrices, when we compute 𝓐²X⁽ˡ⁾ we actually average (or sum, depending on how 𝓐 is normalized) over 2-hop neighbors, and analogously for any n in 𝓐ⁿX⁽ˡ⁾, where we average over n-hop neighbors, as illustrated below.

Chebyshev convolution for K=3 for node 1 (dark blue). Circled nodes denote the nodes affecting feature representation of node 1. The [,] operator denotes concatenation over the feature dimension. W⁽ˡ⁾ are 3C×F dimensional weights.

Note that to satisfy the orthogonality of the Chebyshev basis, 𝓐 assumes no loops in the graph, so that in each i-th row of matrix product 𝓐X⁽ˡ⁾ we will have features of the neighbors of node i, but not the features of node i itself. Features of node i will be fed separately as a matrix X⁽ˡ⁾.
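Here is a tiny NumPy sketch of this simplified K-hop propagation (concatenating X⁽ˡ⁾, 𝓐X⁽ˡ⁾, 𝓐²X⁽ˡ⁾, …); it captures the intuition rather than the exact recursive Chebyshev polynomials of the rescaled Laplacian used by Defferrard et al.:

# Simplified K-hop feature propagation (sketch)
import numpy as np

def k_hop_features(A, X, K):
    # A is a normalized N×N adjacency matrix without self-loops, X is N×C node features
    feats = [X]                          # K=1: just the node features themselves
    for k in range(1, K):
        feats.append(A.dot(feats[-1]))   # 𝓐X, 𝓐²X, ...: features of k-hop neighborhoods
    return np.concatenate(feats, axis=1) # N×KC matrix multiplied by KC×F weights W⁽ˡ⁾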

If K equals the number of nodes N, the Chebyshev convolution closely approximates a spectral convolution, so that the receptive field of filters will be the entire graph. But, as in the case of convolutional networks, we don’t want our filters to be as big as the input images for a number of reasons that I already discussed, so in practice, K takes reasonably small values.

In my experience, this is one of the most powerful GNNs, achieving great results in a very wide range of graph tasks. The main downside is the necessity to loop over K in the forward/backward pass (since Chebyshev polynomials are computed recursively, it’s not possible to parallelize them), which slows down the model.

As with the splines discussed above, instead of training filters we train coefficients, but this time of the Chebyshev polynomials.

Chebyshev basis used to approximate convolution in the spectral domain.

To generate the Chebyshev basis, you can use the following Python code:

# Generating the Chebyshev basis
import numpy as np
K = 5          # some integer >= 2, like 4 or 5 in our plots above
n_points = 100 # number of points on a curve
x = np.linspace(-1, 1, n_points)
T = np.zeros((K, n_points))
T[0, :] = 1 # T_0(x) = 1
T[1, :] = x # T_1(x) = x
for n in range(1, K - 1):
    T[n + 1, :] = 2 * x * T[n, :] - T[n - 1, :] # recursion: T_{n+1}(x) = 2xT_n(x) - T_{n-1}(x)

The full code to generate spline and Chebyshev bases is in my github repo.

To illustrate how a Chebyshev filter can look on an irregular grid, I follow the experiment from Bruna et al. again and sample 400 random points from the MNIST grid in the same way as I did to show eigenvectors of the graph Laplacian. I trained a Chebyshev graph convolution model on the MNIST images sampled from these 400 locations (the same irregular grid is used for all images), and one of the filters is visualized below for K=3 and K=20.

A single Chebyshev filter (K=3 on the left and K=20 on the right) trained on MNIST and applied at different locations (shown as a red pixel) on an irregular grid with 400 points. Compared to filters of standard ConvNets, GNN filters have different shapes depending on the node at which they are applied, because each node has a different neighborhood structure.

3. GCN

Kipf & Welling, ICLR, 2017

As you may have noticed, increasing K of the Chebyshev convolution increases the total number of trainable parameters. For example, for K=2, our weights W⁽ˡ⁾ will be 2C×F instead of just C×F, because we concatenate features X⁽ˡ⁾ and 𝓐X⁽ˡ⁾ into a single N×2C matrix. More trainable parameters means the model is more difficult to train and more data must be labeled for training.

Graph datasets are often extremely small. Whereas in computer vision MNIST is considered a tiny dataset, because images are just 28×28 dimensional and there are only 60k training images, in terms of graph networks MNIST is quite large: each graph would have N=784 nodes and 60k is a large number of training graphs. In contrast to computer vision tasks, many graph datasets have only around 20–100 nodes and 200–1000 training examples. These graphs can represent certain small molecules, and labeling chemical/biological data is usually more expensive than labeling images. Therefore, training Chebyshev convolution models can lead to severe overfitting of the training set (i.e. the model will have a training loss close to 0 yet a large validation or test error).

So, the GCN of Kipf & Welling essentially “merged” the matrices of node features X⁽ˡ⁾ and 𝓐X⁽ˡ⁾ into a single N×C matrix. As a result, the model has two times fewer parameters to train compared to Chebyshev convolution with K=2, yet has the same receptive field of 1 hop. The main trick involves adding “self-loops” to the graph by adding an identity matrix I to 𝓐 and normalizing it in a particular way, so that each i-th row of the matrix product 𝓐X⁽ˡ⁾ now contains the features of the neighbors of node i as well as the features of node i itself (see the sketch below).
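A minimal NumPy sketch of this normalization trick, assuming A is the plain (unnormalized, symmetric) adjacency matrix and X, W are node features and weights as before:

# GCN-style normalization of the adjacency matrix (sketch)
import numpy as np
A_hat = A + np.identity(A.shape[0])         # add self-loops so each node keeps its own features
D_hat = np.diag(np.sum(A_hat, 0) ** (-0.5)) # inverse square root of node degrees
A_norm = D_hat.dot(A_hat).dot(D_hat)        # 𝓐 = D^(-1/2) (A+I) D^(-1/2)
X_next = A_norm.dot(X).dot(W)               # one GCN layer, formula (2); apply ReLU if stacking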

This model seems to be a standard baseline choice, well suited for many applications due to its light weight, good performance and scalability to larger graphs.

3.1. GCN vs Chebyshev layer

The difference between GCN and Chebyshev convolution is illustrated below.
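Below is a minimal PyTorch sketch of the two layers; it simplifies the actual implementations (dense matrices, no bias or nonlinearity inside the layer, and the largest eigenvalue of the Laplacian is assumed to be ≈ 2 for rescaling):

# Minimal PyTorch sketch of GCN and Chebyshev graph layers
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, C, F):
        super().__init__()
        self.fc = nn.Linear(C, F, bias=False)            # trainable weights W⁽ˡ⁾

    def forward(self, X, A):
        A_hat = A + torch.eye(A.shape[0])                # add self-loops
        D_hat = torch.diag((A_hat.sum(1) + 1e-5) ** (-0.5))
        A_norm = D_hat @ A_hat @ D_hat                   # normalized adjacency 𝓐
        return self.fc(A_norm @ X)                       # X⁽ˡ⁺¹⁾ = 𝓐X⁽ˡ⁾W⁽ˡ⁾

class ChebLayer(nn.Module):
    def __init__(self, C, F, K):
        super().__init__()
        self.K = K
        self.fc = nn.Linear(K * C, F, bias=False)        # weights for concatenated K-hop features

    def forward(self, X, A):
        N = A.shape[0]
        D_hat = torch.diag((A.sum(1) + 1e-5) ** (-0.5))
        L = torch.eye(N) - D_hat @ A @ D_hat             # symmetric normalized Laplacian
        L_hat = L - torch.eye(N)                         # rescale eigenvalues to [-1, 1]
        Xt = [X, L_hat @ X]                              # Chebyshev terms T_0 = X, T_1 = L̂X
        for k in range(2, self.K):
            Xt.append(2 * L_hat @ Xt[-1] - Xt[-2])       # recursion: T_k = 2L̂T_{k-1} - T_{k-2}
        return self.fc(torch.cat(Xt[:self.K], dim=1))    # N×KC → N×F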

The sketch above follows the same spirit as the code in the first part of my tutorial, where I compared a classical NN and a GNN. One of the main steps in both GCN and Chebyshev convolution is the computation of the rescaled graph Laplacian L. This rescaling is done to bring the eigenvalues into the range [-1,1] to facilitate training (this might not be a very important step in practice, as weights can adapt during training). In GCN, self-loops are added to the graph by adding an identity matrix before normalizing the adjacency matrix, as discussed above. The main difference between the two methods is that in the Chebyshev convolution we recursively loop over K to capture features in the K-hop neighborhood. We can stack such GCN or Chebyshev layers interleaved with nonlinearities to build a Graph Neural Network.

Now, let me politely interrupt 😃 our spectral discussion and give a general idea behind two other exciting methods: Edge-conditioned filters by Simonovsky & Komodakis, CVPR, 2017 and MoNet by Monti et al., CVPR, 2017, which share some similar concepts.

4. Edge-conditioned filters

Simonovsky & Komodakis, CVPR, 2017

As you know, in ConvNets we learn the weights (filters) by optimizing some loss like Cross Entropy. In the same way, we learn our W⁽ˡ⁾ in GNNs. Imagine that instead of learning these weights, you have another network that predicts them. So during training, we learn the weights of that auxiliary network, which takes an image or a graph as an input and returns weights W⁽ˡ⁾ (Θ in their work) as the output. The idea is based on Dynamic Filter Networks (Brabandere et al., NIPS, 2016), where “dynamic” means that filters W⁽ˡ⁾ will be different depending on the input as opposed to standard models in which filters are fixed (or static) after training.

Using an auxiliary “filter generating network” Fˡ to predict edge-specific weights Θ for the main network. Xˡ⁻¹ are input node features and Xˡ are output features. The figure shows a single iteration of “dynamic convolution” for node 1 (in yellow). Standard GNNs would typically simply average (or sum) the features of node 1's neighbors (nodes 2, 3, 4, 5), which would correspond to having an isotropic filter (Θ would be a constant vector). In contrast, this model has anisotropic filters, because it predicts different edge values between node 1 and all its neighbors based on edge labels L, so that features Xˡ(1) are computed as a weighted average of the neighbors’ features. Figure from (Simonovsky & Komodakis, CVPR, 2017).
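A rough PyTorch sketch of this idea, not the exact architecture from the paper: a small filter-generating network maps each edge label to a C×F weight matrix Θ, which is then used to aggregate neighbor features (the hidden size and all names here are illustrative):

# Edge-conditioned convolution (sketch)
import torch
import torch.nn as nn

class EdgeConditionedLayer(nn.Module):
    def __init__(self, C, F, E):
        super().__init__()
        # auxiliary "filter generating network": E-dimensional edge labels → C×F weights Θ
        self.filter_net = nn.Sequential(nn.Linear(E, 64), nn.ReLU(), nn.Linear(64, C * F))
        self.C, self.F = C, F

    def forward(self, X, A, edge_labels):
        # X: N×C node features, A: N×N adjacency, edge_labels: N×N×E edge features
        N = X.shape[0]
        theta = self.filter_net(edge_labels).view(N, N, self.C, self.F)  # dynamic, edge-specific weights
        out = torch.zeros(N, self.F)
        for i in range(N):
            for j in torch.nonzero(A[i]).flatten():   # neighbors of node i
                out[i] += X[j] @ theta[i, j]          # features weighted by the predicted filter
            out[i] /= A[i].sum().clamp(min=1)         # average over neighbors
        return out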

This is a very general form of convolution that, besides images, can be easily applied to graphs or point clouds, as they did in their CVPR paper with excellent results. However, there is no “free lunch”, and training such models is quite challenging, because the regular grid constraint is now relaxed and the scope of solutions increases dramatically. This is especially true for larger graphs with many edges or for convolution in deeper layers, which often have hundreds of channels (number of features, C), so you might end up generating thousands of numbers in total for each input! In this regard, standard ConvNets are so effective because we don’t waste the model’s capacity on learning to predict these weights; instead, we directly enforce that the filters are the same for all inputs. But this prior makes ConvNets limited, and we cannot directly apply them to graphs or point clouds. So, as always, there is some trade-off between flexibility and performance in a particular task.

When applied to images, like MNIST, the Edge-conditioned model can learn to predict anisotropic filters — filters that are sensitive to orientation, such as edge detectors. Compared to Gaussian filters discussed in the first part of my tutorial, these filters are able to better capture certain patterns in images, such as strokes in digits.

Convolutional filters learned on MNIST sampled in low (left) and high (right) resolutions. Figure from (Simonovsky & Komodakis, CVPR, 2017).

I want to highlight one more time that whenever we have a complicated model with auxiliary networks, it becomes a chicken-or-the-egg problem in some sense. To solve it, one of the networks — the auxiliary or the main one — should receive a very strong signal, so that it can implicitly supervise the other network. In our BMVC paper, which is similar to Simonovsky & Komodakis’s work, we apply additional constraints on the edge-generating network to facilitate training. I will describe our work in detail in later posts.

5. MoNet

Monti et al., CVPR, 2017

MoNet is different from the other works discussed in this post, as it assumes the notion of node coordinates, and is therefore more suited to geometric tasks such as 3D mesh analysis or image/video reasoning. It is somewhat similar to the edge-conditioned filters of Simonovsky & Komodakis, since they also introduce an auxiliary learnable function 𝐷(𝑤, 𝜃, ρ) that predicts weights. The difference is that these weights depend on the node polar coordinates (angle 𝜃 and radius ρ), and the trainable parameters 𝑤 of that function are constrained to be means and variances of Gaussians, so that instead of learning N×N matrices, we only learn fixed-size vectors (means and variances) independently of the graph size N. In terms of standard ConvNets, it would be the same as learning only 2 values (the mean and variance of a Gaussian) for each filter instead of learning 9, 25 or 121 values for 3×3, 5×5 or 11×11 dimensional filters respectively. This parameterization would greatly reduce the number of parameters in a ConvNet, but the filters would be very limited in their power to capture image features.

Monti et al. train 𝐽 means and variances of Gaussians, and the process of transforming node coordinates is similar to fitting them with a Gaussian Mixture Model. The model is quite computationally intensive to train if we want our filters to be global enough, but it can be a good choice for visual tasks (see our BMVC paper for a comparison), yet it is often worse than a simple GCN on non-visual tasks (Knyazev et al., NeurIPS-W, 2018). Since the function D depends on coordinates, the generated filters are also anisotropic and have the shape of oriented and elongated Gaussians, as illustrated below.

Filters trained with MoNet in polar coordinates 𝜃 and ρ. Each ellipse corresponds to a slice of a Gaussian at some fixed level. The idea is that if the coordinates of the i-th node are close to the middle of the j-th Gaussian, then the generated weight at index (i,j) will have a value close to 1.
Pseudo-code of the MoNet layer using PyTorch:

# X is an N×C matrix of input node features
# coord is an N×N×2 tensor of node coordinate differences between all pairs of nodes
# (node degrees can be used instead for non-geometric tasks); coord can be viewed
# as angular and radial edges between nodes
# D is a trainable function generating J Gaussian-shaped filters; fc is a trainable linear layer
# 1. Generate J Gaussian-shaped filters based on node coordinates
weights = D(coord) # weights: J×N×N
# 2. Multiply node features X by these weights
X = torch.bmm(weights, X.expand(J, N, C)) # X: J×N×C
# 3. Project features with a learnable linear transformation
X = fc(X.permute(1, 2, 0).reshape(N, J * C)) # X: N×F
# 4. Feed X to the next layer
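For concreteness, here is one hypothetical way the function D above could be implemented as J learnable Gaussians over the 2D pseudo-coordinates (a simplified sketch, not the exact parameterization from the paper):

# A hypothetical D(coord): J learnable Gaussians over 2D pseudo-coordinates
import torch
import torch.nn as nn

class GaussianWeights(nn.Module):
    def __init__(self, J):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(J, 2))   # means of the J Gaussians
        self.sigma = nn.Parameter(torch.ones(J, 2)) # (diagonal) standard deviations

    def forward(self, coord):
        # coord: N×N×2 pairwise pseudo-coordinates → weights: J×N×N
        diff = coord.unsqueeze(0) - self.mu.view(-1, 1, 1, 2)  # J×N×N×2
        return torch.exp(-0.5 * (diff / self.sigma.view(-1, 1, 1, 2)).pow(2).sum(-1))

A weight close to 1 then means that the pseudo-coordinates of a pair of nodes fall near the center of the corresponding Gaussian, exactly as described in the caption above.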

Conclusion

Despite a lengthy discussion, we have only scratched the surface. Applications of graph neural networks are expanding far beyond typical graph reasoning tasks, like molecule classification. The number of different graph neural layers is increasing very quickly, similar to how it was for convolutional networks a few years ago, so it’s hard to keep track of them. On that note, PyTorch Geometric (PyG) — a nice toolbox to learn from graphs — frequently populates its collection with novel layers and tricks.

Acknowledgement: A large portion of this tutorial was prepared during my internship at SRI International under the supervision of Mohamed Amer (homepage) and my PhD advisor Graham Taylor (homepage). I also thank Carolyn Augusta for useful feedback.

Find me on Github, LinkedIn and Twitter. My homepage.

If you want to cite this blog post in your paper, please use:
@misc{knyazev2019tutorial,
  title={Tutorial on Graph Neural Networks for Computer Vision and Beyond},
  author={Knyazev, Boris and Taylor, Graham W and Amer, Mohamed R},
  year={2019}
}
