The Graph Convolution Network (GCN) can be mathematically challenging to understand, but follow me through this fourth post, where we'll decompose GCN step by step.

My previous posts about Graphs and ML:
- Graph Neural Networks: A learning journey since 2008 – Part 1
- Graph Neural Networks: a learning journey since 2008 – Part 2
- Graph Neural Networks: a learning journey since 2008 – Deep Walk
- Graph Neural Networks: a learning journey since 2008 – Python & Deep Walk
In our previous posts, we saw how Scarselli's Graph Neural Network idea [1–4] drastically changed the way a typical ML problem is framed. In the few years since 2008, ML on graphs has become a hot topic, completing and improving on Scarselli's first seminal approach. Last time we saw an evolution of GNN with Perozzi's DeepWalk approach [5], where, for the very first time, the community started to talk about "node embeddings" [6–8].
Today, I want to show you a very popular algorithm: the Graph Convolution Network [9]. This post covers publications between 2014 and 2017 [9–14]. In these 3 years, scientists devoted a lot of effort to applying the famous convolutional neural network approach to graphs. Indeed, why do convolutions work so well on images? Why can't they be applied directly to graphs? How can we translate graphs into convolutional terms? Is there any mathematics we can use for this purpose?
In this post, we'll cover the theoretical aspects of GCN, from the definition of the convolution operation to graph convolution. Next time, we'll go through the TensorFlow implementation of Kipf's GCN algorithm.
Convolution what?
At its heart, a neural network applies an affine transformation [15,16]: given an input vector, the network multiplies it by some matrices to generate an output. The input vector can be a 1D signal, an image, a video and, in general, a multidimensional array. This array has different axes; for example, images have an axis dedicated to the RGB channels, another axis defining the height and another one the width. A plain affine transformation does not exploit these axes, and this is where convolution starts to play an important role in helping neural networks produce their output. A discrete convolution [17] is a linear transformation, where an input feature map and a convolution kernel are multiplied together.

Each overlapping piece of the input feature map is dot-multiplied with the sliding convolution kernel, returning the final output image/vector. To fix some general concepts about convolution, we can mathematically define a convolution as an integral (eq.1):

$$(f \circledast g)(t) = \int f(\tau)\, g(t - \tau)\, d\tau \qquad \text{(eq.1)}$$
where f is the input feature map, the circled star is the convolution operation in the time domain, and g is the convolution kernel. The convolution is an integral over the input feature map domain τ, defined by dτ (e.g. this can be a cube, a plane defined by an image, a vector, etc.), and g(t − τ) is the convolution kernel, which slides by τ steps over the input feature map f(τ). This integral can be converted to a sum to get a discrete convolution. Here, the kernel can be parametrized based on:
- n, the number of output feature maps
- m, the number of input feature maps
- kⱼ, the kernel size defined on axis j
The output size oⱼ of a convolution layer along an axis j is then influenced by:
- iⱼ, the input size along axis j
- kⱼ, the kernel size along axis j
- sⱼ, the stride along axis j
- pⱼ, the zero padding along axis j
We may find these terms appearing from time to time in our convolution code, so to make them clearer (a short numpy sketch follows this list):
- the stride is the distance between two consecutive positions of the kernel
- zero padding is an additional "frame" of zeros added around the input, usually to bring the input size up to a convenient value (e.g. a power of 2) or to control the output size
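To see how stride and padding enter the computation, here is a minimal numpy sketch of a naive 1D discrete convolution (strictly speaking a cross-correlation, as commonly implemented in deep learning frameworks); the function name `conv1d` and the toy signal are my own illustrative choices, not code from the original post:

```python
import numpy as np

def conv1d(x, kernel, stride=1, padding=0):
    # Naive 1D discrete "convolution" (cross-correlation, as used in deep learning)
    x = np.pad(x, padding)                        # zero padding: a frame of zeros around the input
    k = len(kernel)
    out_len = (len(x) - k) // stride + 1          # output size: o = floor((i + 2p - k) / s) + 1
    return np.array([np.dot(x[i * stride: i * stride + k], kernel)
                     for i in range(out_len)])

signal = np.array([1., 2., 3., 4., 5., 6.])       # input feature map (a 1D signal)
kernel = np.array([1., 0., -1.])                  # convolution kernel of size k = 3
print(conv1d(signal, kernel, stride=1, padding=1))
```

The comment inside the function shows the usual output-size formula, which ties together the four quantities listed above.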
Convolution works nicely with neural networks, providing a linear transformation of an input signal. However, convolution works well only if:
- the input signal can be described on a grid (e.g. a vector, an image, a 3D cubic object)
- the input has dominant local statistics (e.g. a group of pixels dominates the information content of the image)
This is a serious problem when dealing with 3D meshes or social media data, namely graphs. These objects do not have a grid-based mathematical domain, so convolution cannot process graphs as input signals. However, mathematics comes to our aid, offering a brilliant solution.
Graphs: mathematical insights
To apply convolution to a graph, a first trick is to define a new notion of locality [10]. Locality can be thought of as the dominant local statistics in the image case. In a graph G=(N, E) (or G=(V, E), as per Kipf's paper [9]), locality can be expressed in terms of nodes' neighbourhoods, given a threshold δ (eq.2):

$$N_\delta(j) = \{\, i \in N : W_{ij} > \delta \,\} \qquad \text{(eq.2)}$$
where, given a node j, the neighbourhood is the set of all nodes i whose weights Wᵢⱼ are greater than the given threshold δ.
From here, it is possible to build a compact representation of the graph, where the average over a node's neighbourhood can be used as the value of that node's features, as depicted in fig.2.
![Fig.2: Locality in a graph. Defining locality through eq.2 it is possible to get a hidden representation of the red node by averaging its feature with all its neighbours (red circle) features' values. Image by the author, inspired by Wu et al.[18]](https://towardsdatascience.com/wp-content/uploads/2021/12/14plGqEfhzUfFfHCX7uu71w.png)
With this average-based locality, convolutional filters become receptive to the dominant statistics of each neighbourhood, with an input layer whose size is up to O(S·n), where S is the average neighbourhood size and n the number of nodes.
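Concretely, here is a tiny sketch of eq.2 and fig.2 under my own assumptions (a random 5×5 weight matrix W, a threshold δ = 0.5 and one scalar feature per node): the hidden value of node j is the average of the features of its over-threshold neighbours.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((5, 5))              # hypothetical pairwise weights W_ij for a 5-node graph
features = rng.random(5)            # one scalar feature per node
delta, j = 0.5, 0                   # threshold and the node we focus on

neighbourhood = np.where(W[:, j] > delta)[0]      # eq.2: all nodes i with W_ij > delta
h_j = features[neighbourhood].mean()              # hidden value of node j: neighbourhood average
print(neighbourhood, round(h_j, 3))
```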
A second fundamental trick is to extract properties from the graph's mathematical representation, the Laplacian matrix L [19]. The Laplacian is defined as L = D − A, where D is the degree matrix and A is the adjacency matrix. The D matrix tells us how many connections each node has, while A tells us which nodes are connected to each other (fig.3).

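As a quick illustration (a small toy graph of my own choosing, not the post's fig.3), we can assemble A, D and L with networkx and numpy:

```python
import networkx as nx
import numpy as np

# A small toy graph: 4 nodes, 4 edges
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])

A = nx.to_numpy_array(G)        # adjacency matrix: which nodes are connected
D = np.diag(A.sum(axis=1))      # degree matrix: how many connections each node has
L = D - A                       # Laplacian L = D - A

print(L)
print(nx.laplacian_matrix(G).toarray())   # networkx returns the same matrix directly
```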
Thus, the Laplacian is a concise and highly informative way to describe a graph and how nodes' connections influence the graph's dynamics. To further appreciate and understand the role of the Laplacian on a graph, we can look at a classical example, which explains how heat diffuses on a given graph. Given a heat variable Φ, where Φᵢ is the heat in node i and Φⱼ the heat in node j, from classical physics we know that the heat transferred from i to j is proportional to their difference, through a material constant κ, as κ(Φᵢ − Φⱼ). Thus, the evolution of heat in time across the graph can be described by the following equation (eq.3):

$$\frac{d\Phi_i}{dt} = -\kappa \sum_j A_{ij}\,(\Phi_i - \Phi_j) = -\kappa\,(L\Phi)_i \qquad \text{(eq.3)}$$
This equation can be implemented in Python so that we can watch the heat diffuse:

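Here is a minimal sketch of how eq.3 might be integrated numerically with a forward-Euler step; the random graph, the value of κ, the time step and the three hot starting nodes are my own illustrative choices, not the original post's code:

```python
import networkx as nx
import numpy as np

# Toy setup: a random graph, a material constant kappa, and three "hot" starting nodes
G = nx.erdos_renyi_graph(30, 0.2, seed=42)
L = nx.laplacian_matrix(G).toarray()
kappa, dt = 1.0, 0.01

phi = np.zeros(G.number_of_nodes())
phi[[0, 1, 2]] = 10.0                      # three starting heat sources

# Forward-Euler integration of eq.3: d(phi)/dt = -kappa * L * phi
for _ in range(500):
    phi = phi - dt * kappa * (L @ phi)

print(phi.round(2))                        # the heat spreads towards a uniform value
```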
As you can see, the Laplacian matrix L tells us how the heat diffuses uniformly across the graph from the starting points (three in this case).
Fill the gap: Laplacian and Fourier transform
L can be expressed in a symmetric, normalized form (eq.4):

$$L = I - D^{-1/2} A\, D^{-1/2} \qquad \text{(eq.4)}$$
where I is the identity matrix, −1/2 is the power applied to the degree matrix D, and A is the adjacency matrix. This normalized L is symmetric and positive semi-definite, which guarantees real, non-negative eigenvalues and orthonormal eigenvectors.
This condition allows us to express L through its spectral decomposition (eq.5):

$$L = U \Lambda U^{T} \qquad \text{(eq.5)}$$
where U is an orthonormal matrix whose columns are the eigenvectors of L, Λ is a diagonal matrix of non-negative eigenvalues, and the superscript T stands for transpose. The columns of U constitute an orthonormal basis in the real domain. It follows that each eigenvector of L can be associated with a corresponding node in the graph: if the nodes are permuted, the eigenvectors re-arrange accordingly. Secondly, the eigenfunctions of L can be seen as the graph analogue of complex exponentials.
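As a quick numerical check, here is a minimal sketch of eq.4 and eq.5 using the karate club graph (which also appears later in the post) as a convenient example:

```python
import networkx as nx
import numpy as np

# Symmetric normalized Laplacian of eq.4 for the karate club graph
G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L_sym = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

# Spectral decomposition of eq.5: L = U Lambda U^T, with real, non-negative eigenvalues
eigvals, U = np.linalg.eigh(L_sym)
print(eigvals.min().round(6), eigvals.max().round(6))      # eigenvalues lie in [0, 2]
print(np.allclose(U @ np.diag(eigvals) @ U.T, L_sym))      # True: U Lambda U^T rebuilds L
```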
At this point, a bell should ring in your mind. Complex exponentials can be visualized as a series of sinusoids in the time domain, through Euler's formula. This means there is a relationship between the Fourier transform and the Laplacian eigendecomposition. In the Fourier domain, a convolution (eq.1) is a multiplication between the input signal x and the convolution kernel gθ (eq.6):

$$g_{\theta} \circledast x = \mathcal{F}^{-1}\big(\mathcal{F}(g_{\theta}) \odot \mathcal{F}(x)\big) = U\, g_{\theta}(\Lambda)\, U^{T} x \qquad \text{(eq.6)}$$
where F stands for the Fourier transform and U is the matrix of eigenvectors of the normalized graph Laplacian (eq.5). This means that the Laplacian defines a Fourier basis, returning a Fourier-basis view of the input graph, and can therefore be used to compute convolution on a graph. In particular, the convolution can be computed on the spectrum of a given graph, with its weights, by computing the eigenvectors of the graph's Laplacian [20].
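The following sketch spells eq.6 out in numpy; the filter coefficients θ are random placeholders rather than learned parameters, and the karate club graph is again just a convenient example of my own choosing:

```python
import networkx as nx
import numpy as np

# Spectral graph convolution of eq.6 on the karate club graph
G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L_sym = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
_, U = np.linalg.eigh(L_sym)               # Fourier basis = eigenvectors of the Laplacian

rng = np.random.default_rng(0)
x = rng.normal(size=len(A))                # input signal: one scalar per node
theta = rng.normal(size=len(A))            # spectral filter coefficients g_theta(Lambda)

x_hat = U.T @ x                            # graph Fourier transform of the signal
y = U @ (theta * x_hat)                    # multiply in the spectral domain, transform back
print(y[:5].round(3))
```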
Graph Convolution Networks by Thomas Kipf and Max Welling, 2017
Now, all this mathematical framework is great, and we could think of applying the Laplacian spectral decomposition to a graph, passing the result to some neural network layers and an activation function, and the job would be done. Unfortunately, convolution through the full Laplacian eigendecomposition is computationally very expensive. To circumvent this problem, Kipf and Welling, building on Hammond's 2011 paper [14], propose a truncated expansion of the Fourier convolution filter gθ in terms of Chebyshev polynomials Tₖ(x) (eq.7):

$$g_{\theta'}(\Lambda) \approx \sum_{k=0}^{K} \theta'_k\, T_k(\tilde{\Lambda}), \qquad \tilde{\Lambda} = \frac{2}{\lambda_{max}}\Lambda - I \qquad \text{(eq.7)}$$
where θ′ represents the vector of Chebyshev polynomial coefficients, Tₖ is the Chebyshev polynomial defined by the recursion Tₖ(x) = 2x·Tₖ₋₁(x) − Tₖ₋₂(x), with T₀(x) = 1 and T₁(x) = x, and Λ̃ is the eigenvalue matrix rescaled by the maximum eigenvalue λmax of the graph Laplacian. Incorporating eq.7 into eq.6, the final Chebyshev approximation of the convolution is obtained (eq.8):

$$g_{\theta'} \circledast x \approx \sum_{k=0}^{K} \theta'_k\, T_k(\tilde{L})\, x \qquad \text{(eq.8)}$$

$$\tilde{L} = \frac{2}{\lambda_{max}} L - I \qquad \text{(eq.9)}$$
where L̃ is the rescaled graph Laplacian (eq.9). This expression is K-localized: it is a K-th order Chebyshev polynomial in the Laplacian, so it depends only on nodes that are at most K steps away from the central node (redefining graph locality). The computational complexity is thereby reduced to O(|E|), i.e. it depends on the number of edges.
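A minimal numpy sketch of eq.8 and eq.9 could look like the following; the coefficients θ are fixed, illustrative numbers rather than learned parameters, and K = 2 is an arbitrary choice:

```python
import networkx as nx
import numpy as np

# Chebyshev approximation of the graph convolution (eq.8, eq.9) on the karate club graph
G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L_sym = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

lam_max = np.linalg.eigvalsh(L_sym).max()
L_tilde = (2.0 / lam_max) * L_sym - np.eye(len(A))    # eq.9: rescaled Laplacian

def cheb_conv(x, theta, L_tilde):
    # sum_k theta_k T_k(L_tilde) x, using T_k = 2 L_tilde T_{k-1} - T_{k-2}, T_0 x = x, T_1 x = L_tilde x
    T_prev, T_curr = x, L_tilde @ x
    out = theta[0] * T_prev + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_prev, T_curr = T_curr, 2 * (L_tilde @ T_curr) - T_prev
        out = out + theta[k] * T_curr
    return out

x = np.random.default_rng(0).normal(size=len(A))
print(cheb_conv(x, np.array([0.5, 0.3, 0.2]), L_tilde)[:5].round(3))
```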
Given eq.8 above, and further simplifying the expansion to first order (K = 1, λmax ≈ 2, a single shared parameter per filter, as in Kipf's paper), a Graph Convolution Network can be implemented by stacking multiple convolutional layers, each followed by a point-wise non-linearity. Thus, for a given layer l+1, the layer-wise propagation rule will be (eq.10):

$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}\, H^{(l)}\, W^{(l)}\right) \qquad \text{(eq.10)}$$
where Ã = A + I is the adjacency matrix with added self-connections, D̃ is its degree matrix, H⁽ˡ⁾ is the matrix of activations of the l-th layer (with H⁽⁰⁾ = X, the input features), σ is an activation function such as ReLU, and W⁽ˡ⁾ is the layer-specific trainable weight matrix. This is the core formulation of Graph Convolution Networks, proposed by Kipf in 2017. Let's now see a Python implementation for the karate club!
Karate Club and the GCN
First, let's load the karate club graph directly from the networkx Python library:
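A possible sketch of this step, using networkx's built-in karate_club_graph:

```python
import networkx as nx

# Zachary's karate club: 34 members (nodes) and 78 friendship ties (edges)
G = nx.karate_club_graph()
print(G.number_of_nodes(), G.number_of_edges())   # 34 78
```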
Secondly, we can create the graph adjacency matrix adj and a simple one-hot encoded graph feature matrix X:
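Continuing from the previous snippet (G is the karate club graph loaded above), one way to build these matrices is to use the identity matrix as one-hot features, so that each node initially only "knows" its own identity:

```python
import numpy as np

adj = nx.to_numpy_array(G)           # 34 x 34 adjacency matrix
X = np.eye(G.number_of_nodes())      # one-hot features: one indicator column per node
```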
Then, we are going to create the degree matrix deg through a self-connected adjacency matrix adj_self, namely the adjacency matrix with self-connections added:
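Continuing the sketch: adj_self plays the role of Ã = A + I in eq.10, and deg is its diagonal degree matrix D̃:

```python
adj_self = adj + np.eye(adj.shape[0])     # A_tilde = A + I: adjacency matrix with self-connections
deg = np.diag(adj_self.sum(axis=1))       # D_tilde: diagonal degree matrix of A_tilde
```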
Now, it's time to implement the GCN approach, as shown in fig.8. Initially, we need to define the ReLU function, which is just np.maximum(input_value, 0). Secondly, we can define the layer-wise propagation rule of GCN, which is eq.10. This function takes as input the adjacency matrix, the degree matrix, the graph feature matrix and the i-th layer's weights. Finally, initialise the layers' weights and set up 2 layers – but you can set up as many layers as you want – and give GCN a try!
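Putting it all together, here is one possible sketch of an untrained forward pass, reusing X, adj_self and deg from the snippets above; the hidden size of 4, the output size of 2 and the random weight initialisation are my own arbitrary choices:

```python
def relu(x):
    return np.maximum(x, 0)

def gcn_layer(adj_self, deg, H, W):
    # Layer-wise propagation rule of eq.10: ReLU(D^-1/2 A D^-1/2 H W)
    deg_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(deg)))
    return relu(deg_inv_sqrt @ adj_self @ deg_inv_sqrt @ H @ W)

# Randomly initialised weights for a 2-layer GCN
rng = np.random.default_rng(42)
W1 = rng.normal(size=(X.shape[1], 4))
W2 = rng.normal(size=(4, 2))

H1 = gcn_layer(adj_self, deg, X, W1)
H2 = gcn_layer(adj_self, deg, H1, W2)
print(H2.shape)    # (34, 2): a 2-dimensional (untrained) embedding for each node
```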
Stay tuned for our next post: we'll dig deeper into Kipf's GCN implementation using TensorFlow.
Please feel free to send me an email with questions or comments at [email protected], or reach out directly here on Medium.
Bibliography
- Scarselli, Franco, et al. "Graph neural networks for ranking web pages." The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI’05). IEEE, 2005.
- Goldberg, Andrew V., and Chris Harrelson. "Computing the shortest path: A search meets graph theory." SODA. Vol. 5. 2005.
- Scarselli, Franco, et al. "Graph neural networks for ranking web pages." The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI’05). IEEE, 2005.
- Scarselli, Franco, et al. "The graph neural network model." IEEE transactions on neural networks 20.1 (2008): 61–80.
- Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "Deepwalk: Online learning of social representations." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014.
- Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013)
- Tang, Lei, and Huan Liu. "Relational learning via latent social dimensions." Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. 2009.
- Tang, Lei, and Huan Liu. "Leveraging social media networks for classification." Data Mining and Knowledge Discovery 23.3 (2011): 447–478.
- Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).
- Estrach, Joan Bruna, et al. "Spectral networks and deep locally connected networks on graphs." 2nd International Conference on Learning Representations, ICLR. Vol. 2014. 2014.
- Atwood, James, and Don Towsley. "Diffusion-convolutional neural networks." Advances in neural information processing systems. 2016.
- Defferrard, Michaël, Xavier Bresson, and Pierre Vandergheynst. "Convolutional neural networks on graphs with fast localized spectral filtering." Advances in neural information processing systems 29 (2016): 3844–3852.
- Duvenaud, David, et al. "Convolutional networks on graphs for learning molecular fingerprints." arXiv preprint arXiv:1509.09292 (2015).
- Hammond, David K., Pierre Vandergheynst, and Rémi Gribonval. "Wavelets on graphs via spectral graph theory." Applied and Computational Harmonic Analysis 30.2 (2011): 129–150.
- Bebis, George, et al. "Learning affine transformations." Pattern recognition 32.10 (1999): 1783–1799.
- Li, Wen-Jing, and Tong Lee. "Hopfield neural networks for affine invariant matching." IEEE transactions on neural networks 12.6 (2001): 1400–1410.
- Tolimieri, Richard, Myoung An, and Chao Lu. Algorithms for discrete Fourier transform and convolution. Springer, 1989.
- Wu, Zonghan, et al. "A comprehensive survey on graph neural networks." IEEE transactions on neural networks and learning systems 32.1 (2020): 4–24.
- Weisstein, Eric W. "Laplacian matrix." https://mathworld.wolfram.com/ (1999).
- Singh, Rahul, Abhishek Chakraborty, and B. S. Manoj. "Graph Fourier transform based on directed Laplacian." 2016 International Conference on Signal Processing and Communications (SPCOM). IEEE, 2016.