An introduction to Graph Neural Networks

Neural Networks aimed at effectively handling graph data.

Joao Schapke
Towards Data Science

--

Graph-structured data is common across many domains: molecules, social networks, citation networks and road networks are just a few examples of the vast array of data that can be represented as graphs. With the advances in machine learning, we see the potential for applying intelligent algorithms to this data. Graph Neural Networks are the branch of machine learning concerned with building neural networks that handle graph data as effectively as possible.

Notwithstanding the progress made with convolutional networks in computer vision, Graph Neural Networks (GNNs) face a more challenging problem: the awkward nature of graphs themselves. Unlike images and text, graphs do not have a well-defined structure. A node might have no connections or many, and its edges can be directed or undirected. Graphs in a dataset may have a variable number of nodes and edges and cannot be properly 'resized', and they may be cyclic or acyclic, connected or disconnected. All of this makes handling the data more challenging.

From Saleens et al.

A simple approach to dealing with graphs is to concatenate the node features into a single vector and feed this more conveniently structured input to a Multi-Layer Perceptron (MLP). While not an elegant method, it can give good results. It is not sufficient, however, for problems that involve multiple graphs of variable sizes and topologies, or for large graphs where concatenation would severely increase the dimensionality of the input. Variable-sized graphs could instead be handled with Recurrent Neural Networks (RNNs) by converting each graph into a sequence of nodes. But this strategy is also doomed to failure: one of the most important pieces of information in a graph is its structure, and the node sequence leaves no trace of it. Graph data requires more robust approaches tailored to its structure.

Despite the difficulty of dealing with graphs, the progress of convolutional networks inspired a great deal of research and many new approaches in the field of GNNs. Convolutional layers enjoy some useful properties: local connectivity (they assume that pixels which are close together are highly correlated), shift invariance (they learn the features of objects regardless of their spatial position), and the ability to convolve over images of any size. All of these are attractive properties for neural networks on graphs, which should value local connectivity, because a node's neighbors are correlated with the node itself; should be shift invariant, in the sense that a neighborhood should mean the same thing to the convolution regardless of where in the graph it is located; and should be invariant to the size of the graph.

This gave rise to different approaches for applying convolutions to graphs, which can mainly be classified as either spatial-based or spectral-based graph convolutions.


Spectral Graph Convolutions

Spectral GNNs are a more principled approach to convolutions, with a foundation in signal processing theory.

Without going into much detail on the theory behind spectral convolutions: it can be shown that a convolution in the spatial domain equals a multiplication in the frequency domain (the convolution theorem). The same theorem can be applied to graphs, but instead of using the Discrete Fourier Transform matrix as the basis, we use the eigenvectors of the graph Laplacian.

Where the (normalized) Laplacian of a graph is defined as: L = I - D^(-1/2) A D^(-1/2)

where I is the identity matrix, D is the diagonal degree matrix and A is the adjacency matrix of the graph.
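As a rough illustration of the spectral view (not from the original article), here is a minimal NumPy sketch that builds the normalized Laplacian of a toy graph, eigendecomposes it, and filters a node signal in the frequency domain; the graph and the filter are arbitrary choices:

```python
import numpy as np

# Toy undirected graph with 4 nodes; the adjacency matrix is an arbitrary example.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

# Symmetrically normalized Laplacian: L = I - D^(-1/2) A D^(-1/2)
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

# The eigenvectors of L play the role of the Fourier basis on the graph.
eigvals, U = np.linalg.eigh(L)

# A node signal (one scalar per node) and an example low-pass spectral filter g(lambda).
x = np.array([1.0, 0.0, 2.0, 0.0])
g = np.exp(-eigvals)

# Convolution theorem on the graph: transform, multiply in the spectral domain, transform back.
x_filtered = U @ (g * (U.T @ x))
print(x_filtered)
```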

The main limitation of spectral convolutions is their dependency on the eigenvectors of the Laplacian matrix, which are needed in every calculation: backward propagation has a time complexity of O(n²), and computing the eigendecomposition itself costs O(n³).

Further work alleviated this cost by approximating the spectral filters with Chebyshev polynomials and Cayley polynomials.

GCN (Graph Convolutional Networks) further simplified ChebNet, the network built on Chebyshev polynomials.


Spatial Graph Convolutions

Spatial convolutions are a more relaxed version of graph convolutions, popularized by the simplifications GCN made to previous spectral GNNs.

Spatial graph convolutions are similar to regular convolutions in the way they convolve over nodes spatially. An image convolution generates a new pixel from the pixel itself (the pixel at the same position as the new one) and its surrounding pixels; a spatial graph convolution generates a new node representation by aggregating a node and its neighbors.

Image illustrating the difference between a regular convolution in a grid structured image and a spatial graph convolution in a graph.

From Wu et al.

A spatial graph convolution for a single node can be defined as:

h_i = ( Σ_{j ∈ N(i)} n_j + n_i ) W

Where n_i is the feature vector of node i, belonging to R^c; N(i) is the set of neighbors of node i; W is the weight matrix, belonging to R^(c x f); and h_i is the new representation of node i.
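As an illustration (not from the original article), here is a tiny NumPy sketch of this per-node aggregation, with made-up dimensions and values:

```python
import numpy as np

c, f = 4, 2                          # input feature size c and output size f (arbitrary choices)
rng = np.random.default_rng(0)

n_i = rng.normal(size=c)             # features of node i, a vector in R^c
neighbors = rng.normal(size=(3, c))  # features of the three neighbors of node i
W = rng.normal(size=(c, f))          # shared weight matrix in R^(c x f)

# h_i = (sum of neighbor features + n_i) W : aggregate the local neighborhood, then transform.
h_i = (neighbors.sum(axis=0) + n_i) @ W
print(h_i.shape)                     # (f,)
```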

We note that this convolution aggregates the local information of a node into a new representation. From the GCN paper we have the propagation rule:

H' = σ( D^(-1/2) Ã D^(-1/2) X Θ )

which is similar to the previous equation. Here Θ is the weight matrix (playing the same role as W above) and σ is a non-linear activation function applied to the convolution.

à is the graph adjacency matrix plus the identity matrix, this will make take the role to aggregate every node with its neighbors.

D is the diagonal degree matrix of Ã. Note that in the first equation we sum over all the neighbors of node i, and nodes in a graph can have an arbitrary number of neighbors, so we must normalize the result of the summation; multiplying by the inverse (square-root) degree matrix on both sides does exactly that.

This simple operation is the basis of spatial convolutions. By stacking a couple of these layers with dropout, L2 regularization, ReLU activations and the Adam optimizer, GCN was able to achieve state-of-the-art results on three node classification benchmarks in the field.
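To make the propagation rule concrete, here is a minimal NumPy sketch of two stacked GCN-style layers. The graph, feature sizes and random weights are made up for illustration, and no training loop is included:

```python
import numpy as np

def gcn_layer(A, H, W, activation=lambda x: np.maximum(x, 0.0)):
    """One GCN-style layer: act( D^(-1/2) (A + I) D^(-1/2) H W )."""
    A_tilde = A + np.eye(len(A))                 # add self-loops
    deg = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return activation(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)        # toy undirected graph with 4 nodes
X = rng.normal(size=(4, 8))                      # 4 nodes, 8 input features each

H1 = gcn_layer(A, X, rng.normal(size=(8, 16)))   # first convolution + ReLU
scores = gcn_layer(A, H1, rng.normal(size=(16, 3)),
                   activation=lambda x: x)       # second layer, no activation
print(scores.shape)                              # (4, 3): one score vector per node, e.g. class logits
```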

Note that spatial-based convolutions are the de facto approach used today, as they provide a simpler and more efficient route to graph convolutions.

Graph Neural Networks can deal with a wide range of problems; below are a few of them, along with the main intuitions for how they are solved:

Node prediction is the task of predicting a value or label for the nodes in one or multiple graphs, e.g. predicting the subject of a paper in a citation network.

These tasks can be solved simply by applying the convolution described above. Most benchmarks are in a semi-supervised setting, where the dataset consists of a single graph with some node labels reserved for training/validation and others for testing.

Graph prediction is the task of predicting a value or label for an entire graph, e.g. predicting the toxicity of a particular molecule.

These tasks are usually solved by applying a few graph convolutional layers to aggregate information among nodes, then using some sort of readout operation to pool all the nodes in the graph into a fixed-size representation, and finally applying an MLP to produce the prediction.

The readout operation can be as simple as taking the mean or max-pooling over all the nodes.
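For instance, a minimal NumPy sketch of such pooling readouts; the shapes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 16))     # 7 nodes with 16 features each, after some graph convolutions

graph_mean = H.mean(axis=0)      # mean readout: a fixed-size vector of length 16
graph_max = H.max(axis=0)        # max-pooling readout
graph_sum = H.sum(axis=0)        # sum readout

# Any of these fixed-size vectors can then be fed to an MLP for the graph-level prediction.
print(graph_mean.shape, graph_max.shape, graph_sum.shape)
```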

Graph generation is the task of generating new graphs, e.g. generating novel molecules with desired properties such as drug-likeness and synthetic accessibility.

These can be solved with Graph Autoencoders.
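As a sketch of the autoencoder idea (the common GCN-encoder / inner-product-decoder formulation, not necessarily the exact models used for molecule generation): encode nodes into latent vectors, then decode edge probabilities that can be sampled into a new graph.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 8))      # latent node embeddings (in practice produced by a GCN encoder)

# Inner-product decoder: probability of an edge between each pair of nodes.
A_prob = sigmoid(Z @ Z.T)

# Sample an undirected graph from those probabilities (no self-loops).
upper = np.triu((rng.random(A_prob.shape) < A_prob).astype(int), k=1)
A_generated = upper + upper.T
print(A_generated)
```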

Edge prediction, analogous to node prediction, is the task of predicting a value or label for an edge, e.g. predicting whether two entities are related in a knowledge base.

Predicting edges is generally also solved by applying graph convolutional layers. Although most datasets and papers are aimed at node prediction, there are models focused on edge prediction, and most node prediction models can be generalized to predict edges.
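A minimal illustration of one common way to score candidate edges from node embeddings produced by graph convolutions; the dot-product scorer here is just one option, and all values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 16))                 # node embeddings from some GNN

def edge_score(h_u, h_v):
    # Dot-product scorer; a small MLP over the pair [h_u, h_v] is another common choice.
    return 1.0 / (1.0 + np.exp(-h_u @ h_v))

print(edge_score(H[0], H[3]))                # predicted probability that edge (0, 3) exists
```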

More examples of tasks (as well as benchmarks and papers) can be found here.

Next, we explore a few extensions to Graph Convolutional Networks and GNNs.


Spatial-temporal Graph Neural Networks

GNNs can also be tailored to handle time-series forecasting or prediction problems that deal with time-varying data (e.g. traffic flow prediction).

Spatial-temporal graph data comes as multiple graphs, one per timestep, and the graphs may have varying sizes.

There are two main approaches for dealing with sequences of graphs: applying RNNs or applying CNNs.

RNN-based solutions generally apply graph convolutions to aggregate node-level information and RNNs to aggregate temporal information.

Convolutional approaches follow the same spirit, generally applying graph convolutions to aggregate nodes and then 1D convolutions over the time dimension.

Then, if the task is graph-level, a readout operation is applied to generate a single output value for the graph.
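A minimal sketch of the convolutional variant, with a graph convolution applied at every timestep followed by a 1D convolution over time; all shapes, the toy graph and the unnormalized aggregation are simplifications:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, C = 10, 5, 8                       # timesteps, nodes, features per node (arbitrary sizes)
X = rng.normal(size=(T, N, C))           # one node-feature matrix per timestep

# Random undirected toy graph shared across timesteps.
A = (rng.random((N, N)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T

# 1) Spatial step: a simple (unnormalized) graph convolution applied independently at each timestep.
A_tilde = A + np.eye(N)                  # add self-loops
W_spatial = rng.normal(size=(C, C))
H = np.maximum(np.einsum('ij,tjc,cd->tid', A_tilde, X, W_spatial), 0.0)

# 2) Temporal step: a 1D convolution (kernel size 3) along the time axis, per node and channel.
kernel = rng.normal(size=3)
H_time = np.apply_along_axis(lambda s: np.convolve(s, kernel, mode='valid'), 0, H)
print(H_time.shape)                      # (T - 2, N, C)
```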

A caveat of convolutional neural networks is that they do not handle non-local correlation well.

Graph convolutions only take first-order neighbors into consideration, and although multiple convolutional layers can be stacked to gather information from a larger neighborhood, input graphs come in arbitrary sizes and cannot be properly resized, so nodes usually do not get global information about the graph.

An example.

Consider the image below, in a traffic flow prediction task.

From Xu Geng and Yaguang Li.

The image contains several points of interest (POIs) whose traffic flow should be predicted. This task is well suited for spatio-temporal GNNs because roads are naturally graph-structured and all the regional information (such as the presence of a nearby school or park) can be embedded in the graph. The task is also time-dependent: traffic flow changes with the hour of the day and the day of the week.

Naturally, we see that there is an important local correlation in the prediction: traffic flow depends on what is nearby. But we also notice a non-local, contextual correlation. In particular, we may assume that POIs 1 and 3 will have highly correlated traffic flows, since both are located at a school with a hospital and an amusement park nearby. Being able to identify such contextual correlations should improve the generalization of the model. The solution proposed by Xu Geng and Yaguang Li was to measure the similarity between points of interest and, for highly similar POIs, take each other's information into consideration:

'As' is an N x N matrix which serves a purpose similar to that of the adjacency matrix in the GCN equation above.

This is, however, a task-specific problem; many tasks do not have non-local correlations, and in graph-level tasks the problem is alleviated by the readout operation.

Attention-based approaches

Another great source of inspiration for new research was the development of attention mechanisms and fully end-to-end attention models, which achieved new state-of-the-art results across a wide range of NLP benchmarks.

Attention methods have been successfully applied to many graph convolutional networks, such as GAT and GAAN. Unlike standard graph convolutional networks such as GCN, which give the same weight to all neighboring nodes when performing a convolution, GAT introduces an attention mechanism that assigns different weights to a node's neighbors.
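A rough sketch of this idea for a single node, in the spirit of GAT's attention scoring (single head, made-up shapes and values, no training):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
F_in, F_out = 8, 4                          # arbitrary feature sizes
W = rng.normal(size=(F_in, F_out))          # shared linear transform
a = rng.normal(size=2 * F_out)              # attention vector

h_i = rng.normal(size=F_in)                 # features of node i
h_neigh = rng.normal(size=(3, F_in))        # features of its neighborhood

z_i = h_i @ W
z_neigh = h_neigh @ W

# Score each neighbor j with LeakyReLU(a^T [z_i || z_j]), then softmax over the neighborhood.
scores = np.array([np.concatenate([z_i, z_j]) @ a for z_j in z_neigh])
scores = np.where(scores > 0, scores, 0.2 * scores)     # LeakyReLU with slope 0.2
alpha = softmax(scores)

# New representation of node i: attention-weighted sum of the transformed neighbors.
h_i_new = (alpha[:, None] * z_neigh).sum(axis=0)
print(alpha, h_i_new.shape)
```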

Interestingly, the original Transformer model from NLP has also been adapted to work with graph data; unsurprisingly, the model was named Graph Transformer.

It was applied to a graph-to-sequence task, in which the model receives a graph and outputs a sequence: generating text from Abstract Meaning Representation (AMR) graphs. Common approaches to this task revolve around generating embeddings for the graph (mainly using GNNs) and then training NLP models on top of them.

Although graph convolutions are commonly used to generate such embeddings, the authors noted that they have difficulty learning distant relationships between nodes. Briefly, they instead created embeddings of the graphs using a graph autoencoder based on the shortest distance between nodes.

Conclusion

Like every other field of machine learning, Graph Neural Networks are a growing and fast-moving field.

But unlike many others, most problems in GNNs do not have a well-established solution or a superior model architecture that serves as a plug-and-play solver. Problems in the graph domain ask for lateral thinking, and a wide range of solutions can yield superior results depending on the problem and the data. One example comes from a Kaggle competition on predicting the strength of the interaction between atoms in a particular bond of a molecule: the first-place solution solved the problem with an end-to-end Graph Transformer, while the second- through fourth-place solutions hand-crafted sequences from the graphs and used standard NLP Transformers; other approaches used more general graph convolutional networks such as GAT and GCN, or GNNs built specifically for the domain, such as SchNet. This shows the heterogeneity of approaches within the field.

There are many more interesting topics within GNNs that are not covered here, among them: Graph Recurrent Neural Networks, Graph Autoencoders, Graph Adversarial Methods and Graph Reinforcement Learning.

GNNs are an interesting field which I have been studying for the last couple of months, and I'm by no means an expert. If you notice any mistakes, make sure to leave a comment.

References:

A Comprehensive Survey on Graph Neural Networks — Zonghan Wu

Deep Learning on Graphs: A Survey — Ziwei Zhang

Graph Neural Networks: A Review of Methods and Applications — Jie Zhou, Ganqu Cui, Zhengyan Zhang
