Anisotropic, Dynamic, Spectral and Multiscale Filters Defined on Graphs
As part of the “Tutorial on Graph Neural Networks for Computer Vision and Beyond”
I’m presenting an overview of important Graph Neural Network works, by distilling key ideas and explaining simple intuition behind milestone methods using Python and PyTorch. This post continues the first part of my tutorial.
20+ years of Graph Neural Networks
In the “Graph of Graph Neural Network (GNN) and related works” above, I added papers on graphs that I have come across in the last year. In this graph, a directed edge between two works denotes that one paper is based on the other (while not necessary citing it) and a color of the work denotes:
- Red — spectral methods (require eigen-decomposition of the graph Laplacian, which will be explained below)
- Green — methods that work in the spatial domain (do not require eigen-decomposition of the graph Laplacian)
- Blue — equivalent to spectral methods, but do not require eigen-decomposition (so, effectively, spatial methods)
- Black — are methods complementary to GNNs and agnostic to the choice of a GNN itself (i.e. pooling, attention).
Note, that some other important works and edges are not shown to avoid further clutter, and only a tiny fraction of works, highlighted in bold boxes, will be covered in this post. Disclaimer: I still found room to squeeze our own recent works there 😊.
Most of the important methods are covered in this non-exhaustive list of works:
- Nicket et al., 2015, A Review of Relational Machine Learning for Knowledge Graphs
- Bronstein et al., 2016, Geometric deep learning: going beyond Euclidean data
- Hamilton et al., 2017, Representation Learning on Graphs: Methods and Applications
- Kipf et al., 2018, Structured deep models: Deep learning on graphs and beyond, presentation slides.
- Battaglia et al., 2018, Relational inductive biases, deep learning, and graph networks
- Zhang et al., 2018 Deep Learning on Graphs: A Survey
- Zhou et al., 2018, Graph Neural Networks: A Review of Methods and Applications
- Wu et al., 2019, A Comprehensive Survey on Graph Neural
Networks - Petar Veličković, 2019, The resurgence of structure in deep neural networks, PhD Thesis.
- NIPS and CVPR video tutorials
The first work where graphs were classified using a neural network seems to be a 1997 paper by Alessandro Sperduti and Antonina Starita on “Supervised Neural Networks for the Classification of Structures”.
Sperduti & Starita, 1997: “Until now neural networks have been used for classifying unstructured patterns and sequences. However, standard neural networks and statistical methods are usually believed to be inadequate when dealing with complex structures because of their feature-based approach.”
From 1997, the body of works on learning from graphs has grown so much and in so many diverse directions that it is very hard to keep track without some smart automated system. I believe we are converging to using methods based on neural networks (based on our formula (2) explained in the first part of my tutorial), or some combination of neural networks and other methods.
To recap the notation we used in the first part, we have some undirected graph G with N nodes. Each node in this graph has a C-dimensional feature vector, and features of all nodes are represented as an N×C dimensional matrix X⁽ˡ⁾. In a typical graph network, such as GCN (Kipf & Welling, ICLR, 2017), we feed these features X⁽ˡ⁾ to a graph neural layer with C×F dimensional trainable weights W⁽ˡ⁾ , so that the output of this layer is an N×F matrix X⁽ˡ⁺¹⁾ encoding updated (and hopefully better in some sense) node features. 𝓐 is an N×N matrix, where the entry 𝓐ᵢⱼ indicates if node i is connected (adjacent) to node j. This matrix is called an adjacency matrix. I use 𝓐 instead of plain A to highlight that this matrix can be normalized in a way to facilitate feature propagation in a deep network. For the purpose of this tutorial, we can assume that 𝓐=A, i.e. each i-th row of the matrix product 𝓐X⁽ˡ⁾ will contain a sum of features of node i neighbors.
In the rest of this part of the tutorial, I’ll briefly explain works of my choice showed in bold boxes in the overview graph. I recommend Bronstein et al.’s review for a more comprehensive and formal analysis.
Note that even though I dive into some technical details of spectral graph convolution below, many recent works (e.g., GIN in Xu et al., ICLR, 2019) are built without spectral convolution and show great results in some tasks. However, knowing how spectral convolution works is still helpful to understand and avoid potential problems with other methods.
1. Spectral graph convolution
I explain spectral graph convolution in detail in my another post.
I’ll briefly summarize it here for the purpose of this part of the tutorial. A formal definition of spectral graph convolution, which is very similar to the convolution theorem in signal/image processing, can be written as:
where V are eigenvectors and Λ are eigenvalues of the graph Laplacian L, which can be found by eigen-decomposition: L=VΛVᵀ; W_spectral are filters. Throughout this tutorial I’m going to assume “symmetric normalized Laplacian”. It is computed based only on an adjacency matrix A of a graph, which can be done in a few lines of Python code as follows:
# Computing the graph Laplacian
# A is an adjacency matrix
import numpy as npN = A.shape[0] # number of nodes in a graph
D = np.sum(A, 0) # node degrees
D_hat = np.diag((D + 1e-5)**(-0.5)) # normalized node degrees
L = np.identity(N) — np.dot(D_hat, A).dot(D_hat) # Laplacian
Here, we assume that A is symmetric, i.e. A = Aᵀ and our graph is undirected, otherwise node degrees are not well-defined and some assumptions must be made to compute the Laplacian. In the context of computer vision and machine learning, the graph Laplacian defines how node features will be updated if we stack several graph neural layers in the form of formula (2).
So, given graph Laplacian L, node features X and filters W_spectral, in Python spectral convolution on graphs looks very simple:
# Spectral convolution on graphs
# X is an N×1 matrix of 1-dimensional node features
# L is an N×N graph Laplacian computed above
# W_spectral are N×F weights (filters) that we want to train
from scipy.sparse.linalg import eigsh # assumes L to be symmetricΛ,V = eigsh(L,k=20,which=’SM’) # eigen-decomposition (i.e. find Λ,V)
X_hat = V.T.dot(X) # 20×1 node features in the "spectral" domain
W_hat = V.T.dot(W_spectral) # 20×F filters in the "spectral" domain
Y = V.dot(X_hat * W_hat) # N×F result of convolution
where we assume that our node features X⁽ˡ⁾ are 1-dimensional, e.g. MNIST pixels, but it can be extended to a C-dimensional case: we will just need to repeat this convolution for each channel and then sum over C as in signal/image convolution.
Formula (3) is essentially the same as spectral convolution of signals on regular grids using the Fourier Transform, and so creates a few problems for machine learning:
- the dimensionality of trainable weights (filters) W_spectral depends on the number of nodes N in a graph;
- W_spectral also depends on the graph structure encoded in eigenvectors V.
These issues prevent scaling to datasets with large graphs of variable structure.
To solve the first issue, Bruna et al. proposed to smooth filters in the spectral domain, which makes them more local in the spatial domain according to the spectral theory. The idea is that you can represent our filter W_spectral from formula (3) as a sum of 𝐾 predefined functions, such as splines, and instead of learning N values of W, we learn K coefficients α of this sum:
While the dimensionality of fk does depend on the number of nodes N, these functions are fixed, so we don’t learn them. The only thing we learn are coefficients α, and so W_spectral is no longer dependent on N. To make our approximation in formula (4) reasonable, we want K<<N to reduce the number of trainable parameters from N to K and, more importantly, make it independent of N, so that our GNN can digest graphs of any size.
While solves the first issue, this smoothing method does not address the second issue.
2. Chebyshev graph convolution
Defferrard et al., NeurIPS, 2016
The main drawback of spectral convolution and its smooth version above is that it still requires eigen-decomposition of an N×N dimensional graph Laplacian L, which creates two main problems:
- 🙁 The complexity of eigen-decomposition is huge, O(N³). Moreover in case of large graphs, keeping the graph Laplacian in a dense format in RAM is infeasible. One solution is to use sparse matrices and find eigenvectors using
scipy.sparse.linalg.eigs
in Python. Additionally, you may preprocess all training graphs on a dedicated server with a lot of RAM and CPU cores. In many applications, your test graphs can also be preprocessed in advance, but if you have a constant influx of new large graphs, eigen-decomposition will make you sad. - 🙁 Another problem is that the model you train ends up being closely related to the eigenvectors V of the graph. This can be a big problem if your training and test graphs have very different structures (numbers of nodes and edges). Otherwise, if all graphs are very similar, it is less of a problem. Moreover, if you use some smoothing of filters in the frequency domain like splines discussed above, then your filters become more localized and the problem of adapting to new graphs seems to be even less noticeable. However, the models will still be quite limited.
Now, what does Chebyshev graph convolution have to do with all that?
It turns out that it solves both problems at the same time! 😃
That is, it avoids computing costly eigen-decomposition and the filters are no longer “attached” to eigenvectors (yet they still are functions of eigenvalues Λ). Moreover, it has a very useful parameter, usually denoted as K having a similar intuition as K in our formula (4) above, determining the locality of filters. Informally: for K=1, we feed just node features X⁽ˡ⁾ to our GNN; for K=2, we feed X⁽ˡ⁾ and 𝓐X⁽ˡ⁾; for K=3, we feed X⁽ˡ⁾, 𝓐X⁽ˡ⁾ and 𝓐²X⁽ˡ⁾; and so forth for larger K (I hope you’ve noticed the pattern). See more accurate and formal definition in Defferrard et al. and my code below, plus additional analysis is given in (Knyazev et al., NeurIPS-W, 2018).
Due to the power property of adjacency matrices, when we perform 𝓐²X⁽ˡ⁾ we actually average (or sum depending on how 𝓐 is normalized) over 2-hop neighbors, and analogously for any n in 𝓐ⁿX⁽ˡ⁾ as illustrated below, where we average over n-hop neighbors.
Note that to satisfy the orthogonality of the Chebyshev basis, 𝓐 assumes no loops in the graph, so that in each i-th row of matrix product 𝓐X⁽ˡ⁾ we will have features of the neighbors of node i, but not the features of node i itself. Features of node i will be fed separately as a matrix X⁽ˡ⁾.
If K equals the number of nodes N, the Chebyshev convolution closely approximates a spectral convolution, so that the receptive field of filters will be the entire graph. But, as in the case of convolutional networks, we don’t want our filters to be as big as the input images for a number of reasons that I already discussed, so in practice, K takes reasonably small values.
In my experience, this is one of the most powerful GNNs, achieving great results in a very wide range of graph tasks. The main downside is the necessity to loop over K in the forward/backward pass (since Chebyshev polynomials are recursive, so it’s not possible to parallelize them), which slows down the model.
Same as with Splines discussed above, instead of training filters, we train coefficients, but this time, of the Chebyshev polynomial.
To generate the Chebyshev basis, you can use the following Python code:
# Set K to some integer > 0, like 4 or 5 in our plots above
# Set n_points to a number of points on a curve (we set to 100)
import numpy as npx = np.linspace(-1, 1, n_points)
T = np.zeros((K, len(x)))
T[0,:] = 1
T[1,:] = x
for n in range(1, K-1):
T[n+1, :] = 2*x*T[n, :] - T[n-1, :] # recursive computation
return T
The full code to generate spline and Chebyshev bases is in my github repo.
To illustrate how a Chebyshev filter can look on a irregular grid, I follow the experiment from Bruna et al. again and sample 400 random points from the MNIST grid in the same way as I did to show eigenvectors of the graph Laplacian. I trained a Chebyshev graph convolution model on the MNIST images sampled from these 400 locations (same irregular grid is used for all images) and one of the filter for K=1 and K=20 is visualized below.
3. GCN
As you may have noticed, if you increase K of the Chebyshev convolution, it increases the total number of trainable parameters. For example, for K=2, our weights W⁽ˡ⁾ will be 2C×F instead of just C×F. This is because we concatenate features X⁽ˡ⁾ and 𝓐X⁽ˡ⁾ into a single N×2C matrix. More training parameters means the model is more difficult to train and more data must be labeled for training. Graph datasets are often extremely small. Whereas in computer vision, MNIST is considered a tiny dataset, because images are just 28×28 dimensional and there are only 60k training images, in terms of graph networks MNIST is quite large, because each graph would have N=784 nodes and 60k is a large number of training graphs. In contrast to computer vision tasks, many graph datasets have only around 20–100 nodes and 200–1000 training examples. These graphs can represent certain small molecules and labeling chemical/biological data is usually more expensive than labeling images. Therefore, training Chebyshev convolution models can lead to severe overfitting of the training set (i.e. the model will have the training loss close to 0 yet will have a large validation or test error). So, GCN of Kipf & Welling essentially “merged” matrices of node features X⁽ˡ⁾ and 𝓐X⁽ˡ⁾ into a single N×C matrix. As a result, the model has two times fewer parameters to train compared to Chebyshev convolution with K=2, yet has the same receptive field of 1 hop. The main trick involves adding “self-loops” to your graph by adding an identity matrix I to 𝓐 and normalizing it in a particular way, so now in each i-th row of matrix product 𝓐X⁽ˡ⁾ we will have features of the neighbors of node i, as well as features of node i.
This model seems to be a standard baseline choice well-suited for many application due to its lightweight, good performance and scalability to larger graphs.
3.1. GCN vs Chebyshev layer
The difference between GCN and Chebyshev convolution is illustrated below.
The code above follows the same structure as in the first part of my tutorial, where I compared classical NN and GNN. One of the main steps both in GCN and Chebyshev convolution is computation of the rescaled graph Laplacian L. This rescaling is done to make eigenvalues in the range [-1,1] to facilitate training (this might be not a very important step in practice as weights can adapt during training). In GCN, self-loops are added to the graph by adding an identity matrix before computing the Laplacian as discussed above. The main difference between the two methods is that in the Chebyshev convolution we recursively loop over K to capture features in the K-hop neighborhood. We can stack such GCN or Chebyshev layers interleaved with nonlinearities to build a Graph Neural Network.
Now, let me politely interrupt 😃 our spectral discussion and give a general idea behind two other exciting methods: Edge-conditioned filters by Simonovsky & Komodakis, CVPR, 2017 and MoNet by Monti et al., CVPR, 2017, which share some similar concepts.
4. Edge-conditioned filters
Simonovsky & Komodakis, CVPR, 2017
As you know, in ConvNets we learn the weights (filters) by optimizing some loss like Cross Entropy. In the same way, we learn our W⁽ˡ⁾ in GNNs. Imagine that instead of learning these weights, you have another network that predicts them. So during training, we learn the weights of that auxiliary network, which takes an image or a graph as an input and returns weights W⁽ˡ⁾ (Θ in their work) as the output. The idea is based on Dynamic Filter Networks (Brabandere et al., NIPS, 2016), where “dynamic” means that filters W⁽ˡ⁾ will be different depending on the input as opposed to standard models in which filters are fixed (or static) after training.
This is a very general form of convolution that, besides images, can be easily applied to graphs or point clouds as they did in their CVPR paper and got excellent results. However, there is no “free lunch”, and training such models is quite challenging, because the regular grid constraint is now relaxed and the scope of solutions increases dramatically. This is especially true for larger graphs with many edges or for convolution in deeper layers, which often have hundreds of channels (number of features, C), so you might end up generating thousands of numbers in total for each input! In this regard, standard ConvNets are so good, because we don’t waste the model’s capacity on training to predict these weights, instead we directly enforce that the filters should be the same for all inputs. But this prior makes ConvNets limited and we cannot directly apply them to graphs or point clouds. So, as always, there’s some trade-off between flexibility and performance in a particular task.
When applied to images, like MNIST, the Edge-conditioned model can learn to predict anisotropic filters — filters that are sensitive to orientation, such as edge detectors. Compared to Gaussian filters discussed in the first part of my tutorial, these filters are able to better capture certain patterns in images, such as strokes in digits.
I want to highlight one more time that whenever we have a complicated model with auxiliary networks, it becomes a chicken-or-the-egg problem in some sense. To solve it, one of the networks — the auxiliary or the main one — should receive a very strong signal, so that it can implicitly supervise another network. In our BMVC paper, which is similar to Simonovsky & Komodakis’s work, we apply additional constraints on the edge-generating network to facilitate training. I will describe our work in detail in later posts.
5. MoNet
MoNet is different from other works discussed in this post, as it assumes to have the notion of node coordinates, and therefore is more suited for geometric tasks such as 3D mesh analysis or image/video reasoning. It is somewhat similar to edge-conditioned filters of Simonovsky & Komodakis, since they also introduce an auxiliary learnable function 𝐷(𝑤, 𝜃, ρ) that predicts weights. The difference is that these weights depend on the node polar coordinates (angle 𝜃 and radius ρ); and trainable parameters 𝑤 of that function are constrained to be means and variances of Gaussians, so that instead of learning N×N matrices, we only learn fixed-size vectors (means and variances) independently of the graph size N. In terms of standard ConvNets, it would be the same as learning only 2 values (the mean and variance of a Gaussian) for each filter instead of learning 9, 25 or 121 values for 3×3, 5×5 or 11×11 dimensional filters respectively. This parameterization would greatly reduce the number of parameters in a ConvNet, but the filters would be very limited in their power to capture image features.
Monti et al. train 𝐽 means and variances of Gaussians and the process of transforming node coordinates is similar to fitting them into a Gaussian Mixture Model. The model is quite computationally intensive to train if we want our filters to be global enough, but it can be a good choice for visual tasks (see our BMVC paper for comparison), yet it is often worse than simple GCN on non-visual tasks (Knyazev et al., NeurIPS-W, 2018). Since function D depends on coordinates, the generated filters are also anisotropic and have a shape of oriented and elongated Gaussians as illustrated below.
Pseudo-code of the MoNet layer using PyTorch# assume X to be input N×C node features
# coord are N×N×2 node coordinate differences between all pairs of nodes (node degrees for non-geometric tasks)
# coord can be viewed as angular and radial edges between nodes1. Generate J Gaussian-shaped filters based on coordinates of nodes using some trainable function D
weights = D(coord) # weights: J×N×N
2. Multiply node features X by these weights
X = torch.bmm(weights, X.expand(J, N, C)) # X: J×N×C
3. Project features by a learnable linear transformation
X = fc(X.permute(1, 2, 0).view(N, J*C)) # X: N×F
4. Feed X to the next layer
Conclusion
Despite a lengthy discussion, we have only scratched the surface. Applications of graph neural networks are expanding far beyond typical graph reasoning tasks, like molecule classification. The number of different graph neural layers is increasing very quickly, similar to how it was for convolutional networks a few years ago, so it’s hard to keep track of them. On that note, PyTorch Geometric (PyG) — a nice toolbox to learn from graphs — frequently populates its collection with novel layers and tricks.
Acknowledgement: A large portion of this tutorial was prepared during my internship at SRI International under the supervision of Mohamed Amer (homepage) and my PhD advisor Graham Taylor (homepage). I also thank Carolyn Augusta for useful feedback.
Find me on Github, LinkedIn and Twitter. My homepage.
If you want to cite this blog post in your paper, please use:
@misc{knyazev2019tutorial,
title={Tutorial on Graph Neural Networks for Computer Vision and Beyond},
author={Knyazev, Boris and Taylor, Graham W and Amer, Mohamed R},
year={2019}
}