Beyond Graph Convolution Networks

A Dive into Advanced Graph Neural Network Architectures

Aishwarya Jadhav
Towards Data Science


Graph Neural Networks, though a fairly new concept, have gained immense popularity as an interesting methodology for analyzing graphs the deep learning way. Because they naturally fit the many real-world datasets that have an inherent graph structure, GNNs have found applications in domains ranging from online advertising to road traffic prediction and drug design. The blog post Applications of Graph Neural Networks enumerates the variety of areas where GNNs have achieved significant results.

Early GNNs would learn a target node's representation by propagating neighbour information iteratively until a stable fixed point was reached. However, GNNs did not stay in this nascent form for long, quickly adopting ideas from other successful areas of deep learning and evolving into the architecture we know today as the Graph Convolution Network.

Graph Convolution Networks

As the name suggests, Graph Convolution Networks (GCNs) draw on the idea of Convolutional Neural Networks, re-defining convolution for the graph domain. Just as it does for images, the convolution framework aims to capture neighbourhood information, this time for graph nodes.

(Figure: Image Convolution and Graph Convolution)

GCNs, however, come in two flavours: Spectral GCNs and Spatial GCNs. Spectral-based approaches define graph convolutions by introducing filters from the perspective of graph signal processing, which is grounded in graph spectral theory. Spatial-based approaches formulate graph convolutions as aggregating feature information from neighbours. A drawback of the spectral approach is that it requires the entire graph to be processed at once, which can be impractical for large graphs with billions of nodes and edges, such as social network graphs. Another disadvantage stemming from this requirement is that it is difficult to take advantage of the parallel processing power available today. Spatial methods do not suffer from this, as they perform convolution directly in the graph domain by aggregating the neighbour nodes' information. Together with sampling strategies, the computation can be performed on a batch of nodes instead of the whole graph, which has the potential to improve efficiency.

Another pain-point for spectral convolution methods is that they assume a fixed graph, thus generalizing poorly to new or different graphs. Spatial models, on the other hand, perform graph convolution locally on each node, and hence, can easily share weights across different locations and structures. Due to all these factors, spatial models have attracted increasing attention in recent years.

For spatial graph convolution, the key is to learn a function f that generates a node vi's representation by aggregating its own features Xi and its neighbours' features Xj. The aggregation function must be invariant to permutations of node orderings, such as the mean, sum, and max functions. After feature aggregation, a non-linear transformation such as ReLU or Tanh is applied to the resultant outputs. To explore the depth and breadth of a node's field of influence, a common practice is to stack multiple graph convolution layers together. Each hidden GCN layer is one iteration of the aggregation operation followed by the non-linear activation, and each iteration/layer uses the node representations from the previous one to compute the representations for the current one.

The first layer of GCN facilitates information flow between first-order neighbours; the second layer gets information from the second-order neighbours i.e., from the neighbour’s neighbours. Continuing this way, by stacking multiple layers, the final hidden representation of each node receives messages from a further neighbourhood.
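To make the aggregate-and-transform step concrete, here is a minimal PyTorch sketch of one spatial graph convolution layer using mean aggregation over a dense adjacency matrix (the class name SimpleGraphConv and the toy 5-node graph are illustrative assumptions, not a specific published model):

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """One spatial graph convolution layer: mean-aggregate neighbour features
    (plus the node's own features) and apply a linear map + ReLU."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x:   (num_nodes, in_dim) node feature matrix
        # adj: (num_nodes, num_nodes) 0/1 adjacency matrix
        adj_hat = adj + torch.eye(adj.size(0))        # add self-loops
        deg = adj_hat.sum(dim=1, keepdim=True)        # self-included degree
        agg = (adj_hat @ x) / deg                     # mean aggregation
        return torch.relu(self.linear(agg))           # non-linear transform

# Stacking two layers lets each node see its 2-hop neighbourhood.
x = torch.randn(5, 8)                                 # 5 nodes, 8 features each
adj = torch.tensor([[0,1,0,0,1],
                    [1,0,1,0,0],
                    [0,1,0,1,0],
                    [0,0,1,0,1],
                    [1,0,0,1,0]], dtype=torch.float)
h = SimpleGraphConv(8, 16)(x, adj)                    # first-order neighbours
h = SimpleGraphConv(16, 16)(h, adj)                   # second-order neighbours
```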

While GCNs operate at the node level, graph pooling modules can be interleaved with GCN layers to coarsen graphs into high-level sub-structures. Such an architecture design can be used to extract graph-level representations and to perform graph classification tasks.
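A minimal sketch of a graph-level readout in the same spirit: pool the node embeddings produced by the convolution layers into a single vector and feed it to a classifier (mean pooling is used here purely for simplicity; pooling modules that hierarchically coarsen the graph are more involved):

```python
import torch
import torch.nn as nn

# h: (num_nodes, hidden_dim) node embeddings from stacked graph conv layers
h = torch.randn(5, 16)                   # stand-in for the GCN output above
graph_vec = h.mean(dim=0)                # mean pooling: one vector for the graph
logits = nn.Linear(16, 3)(graph_vec)     # e.g. scores for 3 graph classes
```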

Due to the ever-improving efficiency of GCNs, they are yielding increasingly better results across many application domains and have, hence, become the baseline architecture for many other GNN models.

Moving beyond GCNs, the following is a classification of some of the more complex graph neural network architectures that apply to a variety of different problem classes:

1. Graph Attention Networks

2. Graph Auto-Encoders

3. Graph Generative Networks

4. Graph Spatio-Temporal Networks

Graph Attention Networks

Attention mechanisms have almost become a standard in sequence-based tasks, thanks to their ability to focus on the most important parts of the input. This has proven especially beneficial for tasks such as machine translation and natural language understanding. Thanks to the increased model capacity that attention provides, graph neural networks can also benefit from the attention model. Graph Attention Networks are similar to GCNs in that they seek an aggregation function to fuse the neighbouring nodes' representations, random walks, or outputs from multiple candidate models into a new representation. The key difference is that graph attention networks employ attention mechanisms which assign larger weights to the more important nodes, walks, or models. The attention weights are learned with a neural network within an end-to-end framework that encompasses both the graph neural architecture and the attention network.

For the purpose of understanding how the attention mechanism applies to graph-structured data, we will see how neighbour node information is aggregated using attention models to get the current node’s representation. The following figure depicts the differences:

In figure (a), GCNs explicitly assign a non-parametric weight aij (typically derived from the degrees of vi and vj) to the neighbour vj of vi during the aggregation process. Graph Attention Networks, in figure (b), instead capture the weight aij implicitly via an end-to-end neural network architecture, so that more important nodes receive larger weights.
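For illustration, these weights can be computed in the style of Graph Attention Networks: score each neighbour with a small learnable function of the two nodes' transformed features, then normalize the scores with a softmax over the neighbourhood. The sketch below is a simplified single-head version; exact parameterizations vary across papers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGraphAttention(nn.Module):
    """Single-head attention aggregation over a node's neighbours,
    loosely following the GAT scoring function."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared feature transform
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention scorer

    def forward(self, x, adj):
        h = self.W(x)                                     # (N, out_dim)
        n = h.size(0)
        # Raw score e_ij from the concatenation [h_i || h_j] for every pair.
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))       # (N, N) raw scores
        # Keep only real edges (plus self-loops); softmax over each neighbourhood.
        adj_hat = adj + torch.eye(n)
        e = e.masked_fill(adj_hat == 0, float('-inf'))
        alpha = torch.softmax(e, dim=1)                   # learned weights a_ij
        return torch.relu(alpha @ h)                      # attention-weighted aggregation

# Toy usage: 5 nodes with 8 features each on a small ring-like graph.
x = torch.randn(5, 8)
adj = torch.tensor([[0,1,0,0,1],
                    [1,0,1,0,0],
                    [0,1,0,1,0],
                    [0,0,1,0,1],
                    [1,0,0,1,0]], dtype=torch.float)
out = SimpleGraphAttention(8, 16)(x, adj)                 # (5, 16)
```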

Similarly, for random-walk-based GNNs, different attention weights are assigned to different walks. The advantage of Graph Attention Networks is that they can adaptively learn the importance of each neighbour. However, the computation cost and memory consumption increase rapidly, since attention weights must be computed between every pair of neighbouring nodes.

Graph Auto-Encoders

Graph auto-encoders aim to learn low-dimensional vectors via an encoder and then reconstruct the graph via a decoder. These vectors can be at the node level, i.e., one vector per node, together forming a dense low-dimensional matrix representation of the graph, or at the graph level, i.e., a single vector that captures all the structural and node information of the graph. Graph auto-encoders are a popular approach to learning graph embeddings because they generate exactly such an intermediate vector representation.

A typical solution leverages multi-layer perceptrons as the encoder to obtain node embeddings; a decoder then reconstructs a node's neighbourhood statistics, such as its edges to other nodes, via link prediction mechanisms. The encoder can capture both first- and second-order information about nodes. For plain graphs, many algorithms directly process the adjacency matrix to construct a new dense matrix with rich information. For attributed graphs, graph auto-encoder models tend to employ GCN as a building block for the encoder. Another variant of the graph auto-encoder model uses graph generative mechanisms (discussed in the next section) to re-create the graph from the embeddings.

The following figure depicts the general structure of a graph auto-encoder model:

The encoder uses GCN layers to get latent representations for each node. The decoder computes the pair-wise distance between the node vectors produced by the encoder and, after applying a non-linear activation function, reconstructs the graph adjacency matrix. The whole model is trained so that a plain pair-wise distance between the computed node vectors (followed by a non-linear operation) is enough to reproduce the edge information of the graph. That is, the encoder and the decoder are trained jointly, so the learned node representations allow the decoder to complement the encoder.
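A minimal sketch of this reconstruction step, assuming node embeddings z produced by a GCN-style encoder and an inner-product decoder (one common choice of pairwise scoring) trained against the adjacency matrix with a binary cross-entropy loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InnerProductDecoder(nn.Module):
    """Reconstruct edge probabilities from pairwise scores between node vectors."""
    def forward(self, z):
        return torch.sigmoid(z @ z.t())        # (N, N) predicted adjacency

# z: node embeddings from a GCN encoder (stand-in random values here)
z = torch.randn(5, 16, requires_grad=True)
adj = torch.tensor([[0.,1,0,0,1],
                    [1,0,1,0,0],
                    [0,1,0,1,0],
                    [0,0,1,0,1],
                    [1,0,0,1,0]])

adj_hat = InnerProductDecoder()(z)
# Train encoder + decoder jointly so the reconstruction matches the true edges.
loss = F.binary_cross_entropy(adj_hat, adj)
loss.backward()                                # gradients flow back to the encoder
```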

After training, the decoder module can be discarded and the encoder used on its own to produce the vectors, which then serve as graph/node embeddings for further downstream processing.

Graph Generative Networks

Given a data distribution, Graph Generative Networks (GGNs) aim to generate plausible structures from the data, optionally subject to certain constraints. That is, they try to capture the entities and relationships implicit in the given data and model them as graph structures. Alternatively, GGNs may try to generate a new graph with certain properties, given a set of observed graphs. One promising application domain of GGNs is chemical compound synthesis. In a chemical graph, atoms are treated as nodes and chemical bonds as edges. The task is to discover new synthesizable molecules that possess certain chemical and physical properties, given a set of existing molecules. Another application of GGNs is in natural language processing, where generating a semantic or knowledge graph is often conditioned on a given sentence.

However, generating graphs from an empirical graph distribution is fundamentally challenging, mainly because graphs are complex data structures. One approach is to factor the generation process into forming nodes and edges alternately. Deep Generative Models of Graphs (DGMG) is one work representative of this approach. DGMG utilizes a GCN to obtain vector representations of existing input graphs, and the decision process for generating nodes and edges is conditioned on this vector. DGMG recursively proposes a node to add to a growing graph until a stopping criterion is met. In each step, after adding a new node, DGMG repeatedly decides whether to add an edge to that node until the decision turns to false.
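Schematically, the alternating add-node/add-edge loop looks like the sketch below. Note that the decision functions here (should_add_node, should_add_edge, pick_neighbour) are random placeholders for what in DGMG are learned neural modules conditioned on the current graph representation:

```python
import random

def should_add_node(graph):
    """Placeholder for DGMG's learned 'add a node?' module."""
    return len(graph["nodes"]) < 6 and random.random() < 0.8

def should_add_edge(graph, new_node):
    """Placeholder for DGMG's learned 'add an edge?' module."""
    return len(graph["nodes"]) > 1 and random.random() < 0.5

def pick_neighbour(graph, new_node):
    """Placeholder for DGMG's learned edge-destination module."""
    return random.choice([v for v in graph["nodes"] if v != new_node])

graph = {"nodes": [], "edges": []}
while should_add_node(graph):                  # stop when the model says so
    new_node = len(graph["nodes"])
    graph["nodes"].append(new_node)
    while should_add_edge(graph, new_node):    # repeatedly decide on edges
        graph["edges"].append((new_node, pick_neighbour(graph, new_node)))
print(graph)
```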

Another approach is based on Generative Adversarial Networks (GANs), which were originally used to generate images. Here, instead of Convolutional Neural Networks, Graph Convolution Networks are used as the building blocks of the adversarial networks. A GAN consists of a generator and a discriminator competing with each other to improve the authenticity of the generator's output. First, the generator produces a candidate graph from a given data distribution; this candidate is often referred to as a fake graph. The discriminator aims to distinguish the fake sample from the empirical data. Additionally, a reward network can be introduced in parallel with the discriminator to encourage the generated graphs to possess certain properties. The discriminator and the reward network are both fed the vector representation of the fake graph, and each outputs a score that is used as feedback to the generator. Following is a representation of the architecture:

A disadvantage of both of the approaches mentioned above is that they do not scale to large graphs, as the output space for a graph with n nodes grows as n^2.
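As a toy illustration of the adversarial setup described above, the sketch below uses plain MLPs over a flattened adjacency matrix for both the generator and the discriminator (the approaches discussed actually use graph convolution blocks, and the reward network is omitted here):

```python
import torch
import torch.nn as nn

N = 5                                              # number of nodes per graph
gen = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                    nn.Linear(64, N * N), nn.Sigmoid())    # soft adjacency matrix
disc = nn.Sequential(nn.Linear(N * N, 64), nn.ReLU(),
                     nn.Linear(64, 1), nn.Sigmoid())       # real/fake score

z = torch.randn(1, 16)                             # noise sample
fake_adj = gen(z)                                  # generated ("fake") graph, flattened
real_adj = torch.randint(0, 2, (1, N * N)).float() # stand-in for an empirical graph

# Discriminator tries to tell real from fake; generator tries to fool it.
d_loss = -(torch.log(disc(real_adj)) + torch.log(1 - disc(fake_adj))).mean()
g_loss = -torch.log(disc(fake_adj)).mean()
```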

Graph Spatio-Temporal Networks

Spatio-temporal graphs are graph structures whose node and/or edge features change with time t. For example, a road traffic network can be modelled as a graph where each key location is a node and the sensor at that location monitors the traffic speed over a number of time steps; the node values (the traffic speed at each location) therefore keep changing with time. Edges could simply be the distances between pairs of sensor or location nodes. Using this temporal information about the spatial graph of the road network, the traffic at each node of the spatio-temporal graph can be forecast for a future time step.

A variant of the traffic forecasting problem is taxi demand prediction, which has a very practical use case in intelligent transportation systems that pre-allocate taxi resources to meet travel demand and to reduce the number of empty taxis on the streets, which waste energy and worsen traffic congestion. This is especially useful given the boom of online transportation markets such as Uber and Ola. As in the traffic prediction case, key locations can be modelled as nodes, with features such as the traffic and weather conditions at those locations, which keep changing with the time of day or month. The node labels could be the number of taxis requested in a particular area; this again varies with time, and the problem is to predict these values for a future time step.

The key idea of graph spatio-temporal networks is to consider spatial dependency and temporal dependency simultaneously. Many current approaches apply GCNs to capture the spatial dependency together with an RNN or CNN to model the temporal dependency. One such approach, the Diffusion Convolutional Recurrent Neural Network (DCRNN), introduces diffusion convolution as the graph convolution for capturing spatial dependency and uses a sequence-to-sequence architecture with gated recurrent units (GRU) to capture temporal dependency. DCRNN processes the inputs of the GRU using a graph (diffusion) convolution layer, so that the recurrent unit simultaneously receives history information from the last time step and neighbourhood information from the graph convolution. An advantage of DCRNN is that it can handle long-term dependencies because of its recurrent architecture.
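A drastically simplified sketch of this idea is shown below: a graph convolution mixes each time step's node features with neighbour information before a GRU cell combines them with the history. (DCRNN itself goes further and replaces the GRU's internal matrix multiplications with diffusion convolutions; names such as GraphConvGRU are illustrative.)

```python
import torch
import torch.nn as nn

class GraphConvGRU(nn.Module):
    """Per-node GRU whose inputs are first mixed with neighbour information
    by a simple mean-aggregation graph convolution (simplified vs. DCRNN)."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.gconv = nn.Linear(in_dim, in_dim)
        self.gru = nn.GRUCell(in_dim, hidden_dim)

    def forward(self, x_seq, adj):
        # x_seq: (time_steps, num_nodes, in_dim), adj: (num_nodes, num_nodes)
        n = adj.size(0)
        adj_hat = adj + torch.eye(n)
        adj_norm = adj_hat / adj_hat.sum(dim=1, keepdim=True)
        h = torch.zeros(n, self.gru.hidden_size)
        for x_t in x_seq:                                  # temporal dependency via GRU
            x_t = torch.relu(self.gconv(adj_norm @ x_t))   # spatial dependency
            h = self.gru(x_t, h)                           # combine with history
        return h                                           # per-node state for forecasting

# 12 past time steps, 5 sensor nodes, 2 features each (e.g. speed, occupancy)
x_seq = torch.randn(12, 5, 2)
adj = torch.eye(5)                                 # placeholder road-network graph
h = GraphConvGRU(2, 32)(x_seq, adj)                # (5, 32) states, one per node
```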

In the sections above, we went over a few of the advanced Graph Neural Network architectures that draw upon previously successful architectures from the wider deep learning domain for a variety of problems. Almost all of them use graph convolutions as a building block. Even so, these more complex architectures fit certain types of graphs and/or problem use cases better and hence produce better results there. As the efficiency of GCNs increases, we will see more such variations coming up and the taxonomy of GNNs ever growing!

References:

Wu et al., A Comprehensive Survey on Graph Neural Networks: https://arxiv.org/abs/1901.00596


Google Intern | Carnegie Mellon University Masters of Computational Data Science | Ex-Morgan Stanley https://linkedin.com/in/aishwaryajadhav8/