Graph Neural Ordinary Differential Equations

Extending Graph Neural Networks into a continuous depth domain

Michael Poli
Towards Data Science

--

Rayleigh–Bénard convection. Finite element methods discretize spatially continuous problems into sets of entities that display complex inductive relational biases. Source: original author.

Multi-agent systems are prevalent across a variety of scientific fields: from physics to robotics, game theory, finance and molecular biology, among others. Often, closed-form analytic formulations are not available, and forecasting or decision-making tasks have to rely on noisy, irregularly sampled observations.

This class of systems offers a crystal-clear example of inductive relational biases. Introducing inductive biases in statistics or machine learning is a well-known approach to improving sample efficiency and generalization performance. From the choice of objective function to the design of ad-hoc deep learning architectures suited to the specific problem at hand, biases are common and effective.

Relational inductive biases [1] represent a special class of biases, concerned with relationships between entities. Graphical models, probabilistic or otherwise, are a traditional class of models specialized in imposing relational biases in the form of prior structures on entities. These graph structures are useful in different ways: namely, reducing computational complexity by introducing conditional independence assumptions, and increasing sample efficiency by encoding prior knowledge in graph form.

Graph neural networks (GNNs) are the deep learning counterpart to graphical models. They are usually utilized when the target problem structure can be encoded as a graph or in settings where prior knowledge about relationships among input entities can itself be described as a graph. GNNs have shown remarkable results in various application areas such as node classification [2], graph classification and forecasting [3][4] as well as generative tasks [5].

Ordinary differential equations in deep learning

A different but equally important class of inductive biases is concerned with the class of systems from which the data is collected. Although deep learning has traditionally been a field dominated by discrete models, recent advances propose a treatment of neural networks as models equipped with a continuum of layers [6]. This view allows a reformulation of the forward pass as the solution of the initial value problem of an ordinary differential equation (ODE), which in turn enables direct modeling of ODEs and enhances the performance of neural networks on tasks involving continuous time processes.
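Concretely, this is the construction of [6]: a residual layer h_{t+1} = h_t + f(h_t, θ_t) can be read as one Euler step of an ordinary differential equation, and letting the step size go to zero yields the continuous-depth model

$$\frac{dh(t)}{dt} = f\big(h(t), t, \theta\big), \qquad h(0) = x,$$

whose forward pass amounts to solving this initial value problem with a numerical ODE solver.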

Our work is aimed at bridging the gap between geometric deep learning and continuous models. Graph neural ordinary differential equations (GDEs) cast common tasks on graph-structured data into a system-theoretic framework:

GDEs model vector fields defined on graphs, whether the structure is fixed or evolves in time. This is achieved by equipping the model with a continuum of GNN layers.

Because their structure is defined by a continuum of GNN layers, GDEs are flexible and can accommodate irregularly sampled sequential data.

The primary purpose of GDEs is to offer a data-driven approach to the modeling of structured systems, particularly when the dynamics are nonlinear and therefore challenging to approach with classical or analytical methods.

What follows is an introduction to GDEs. We refer to the full paper for additional details and derivations. A GitHub repository with introductory examples in the form of commented Jupyter notebooks is currently under development. We encourage requests and suggestions for additional applications of GDEs: our plan is to eventually include working examples of GDE variants of all major graph neural network (GNN) architectures, deployed across various settings (forecasting, control…).

Preliminaries and background

GDEs, like GNNs, operate on graphs. We refer to the excellent survey on GNNs, as well as the background section of our paper, for a more detailed introduction to notation and basic definitions. The GDE introduction that follows is given in distilled form; only two basic facts about graphs are immediately necessary:

  1. Graphs are collections of nodes (entities) connected by edges. It is common for deep learning models to work on attributed graphs, that is, graphs whose nodes are described by a set of features (often in the form of an embedding vector or tensor). For graphs with n nodes, each described by d features, we denote the n × d node embedding matrix as H.
  2. The structure of the graph is captured by its adjacency matrix A. The connectivity structure between nodes represents the main difference between standard deep learning models and GNNs [1], since the latter utilize it directly in a variety of ways to perform operations on node embeddings. A toy example of both objects is sketched below.
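As a toy illustration, a graph with n = 3 nodes and d = 2 features per node could be represented as:

```python
import torch

# Node embedding matrix H: n x d = 3 x 2, one feature row per node.
H = torch.tensor([[0.1,  1.0],
                  [0.5, -0.2],
                  [1.3,  0.7]])

# Adjacency matrix A: n x n, A[i, j] = 1 if nodes i and j are connected.
A = torch.tensor([[0., 1., 1.],
                  [1., 0., 0.],
                  [1., 0., 0.]])
```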

Graph Neural Ordinary Differential Equations

A graph neural ordinary differential equation (GDE) is defined as follows:

General GDE formulation
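In compact form (notational details are spelled out in the paper), with H(0) given by the input node features, this reads:

$$\dot{H}(s) = F\big(s, H(s), \Theta\big), \qquad s \in S,$$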

where H is the node feature matrix. The above defines a vector field for H, parametrized by a function F which can be any known graph neural network (GNN) layer. In other words, F utilizes the connectivity information of graph G, as well as its node features, to specify the evolution of H over S. Here S is the depth domain of the model; unlike GNNs, whose depth domain is a subset of the natural numbers, S is continuous and represents the integration domain of the ordinary differential equation defined by F. GDEs can be trained in a variety of ways, much like standard Neural ODEs [6]. Well-posedness of the system is discussed in full in the paper.
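As a rough sketch of the idea (not the implementation used in our experiments), a GDE with a single GCN-style layer as its vector field can be put together with the torchdiffeq solver released alongside [6]; the adjacency normalization and the depth interval S = [0, 1] below are illustrative choices:

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # solver library released with [6]

class GCNVectorField(nn.Module):
    """F(H): a single GCN-style layer used as the vector field dH/ds."""
    def __init__(self, A, d):
        super().__init__()
        # Symmetrically normalized adjacency with self-loops (illustrative choice).
        A_hat = A + torch.eye(A.shape[0])
        deg_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
        self.register_buffer("A_norm", deg_inv_sqrt[:, None] * A_hat * deg_inv_sqrt[None, :])
        self.lin = nn.Linear(d, d)

    def forward(self, s, H):
        # The instantaneous change of the node features is one graph convolution.
        return torch.tanh(self.A_norm @ self.lin(H))

class GDE(nn.Module):
    """Integrates the GNN vector field over the depth domain S = [0, 1]."""
    def __init__(self, vector_field):
        super().__init__()
        self.vf = vector_field

    def forward(self, H0):
        s_span = torch.linspace(0.0, 1.0, 2)
        # Adaptive-step solver; only the final state H(1) is kept.
        return odeint(self.vf, H0, s_span, method="dopri5")[-1]

# Usage with the toy H, A defined earlier:
# model = GDE(GCNVectorField(A, d=2)); H_out = model(H)
```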

The general GDE formulation carries with it several implications. In the case of general Neural ODEs, it has been observed that the choice of discretization scheme can describe previously known discrete multi-step variants of ResNets [7]. The continuous, dynamical system point of view of deep learning is therefore not limited to the modeling of differential equations and can guide discovery of novel general purpose models by leveraging a rich literature of numerical methods.

Compared to ResNets, GNNs are a relatively young model class. The literature on multi-step variants, or on architectures with complicated, fractal-like residual connections, is therefore not as developed; the discovery of new GNN variants can be guided by applying various discretization schemes to GDEs, instead of starting from scratch.

Results on static graphs: node classification

We show that GDEs can be utilized as high-performance general purpose models by performing a series of experiments on semi-supervised node classification tasks on Cora, PubMed and CiteSeer. These datasets contain static graphs, where the adjacency matrix A remains fixed, and are thus far removed from the dynamical system setting where GDEs shine. We evaluate the performance of Graph Convolutional Ordinary Differential Equations (GCDEs), defined as:
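Up to notational details given in the paper, and taking the standard GCN propagation rule of [8] as the vector field, this reads roughly:

$$\dot{H}(s) = \sigma\!\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H(s)\,\Theta\right),$$

with $\tilde{A}$ the adjacency matrix with added self-loops, $\tilde{D}$ its degree matrix and $\sigma$ an activation function.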

GCDE model. A more detailed version is included in our paper, along with GDE versions of popular GNN variants

as well as their fully discretized counterpart, Graph Convolutional Networks (GCNs) [8]. We include the well-known Graph Attention Networks (GAT) [9] for reference:

Accuracy on node classification tasks. Mean and standard deviation of 100 runs.

GCDEs are shown to be competitive with state-of-the-art models and outperform their discrete counterparts. We evaluate two versions of GCDEs: one discretized with a fixed step scheme, Runge-Kutta 4 (GCDE-rk4), and one with an adaptive step scheme, Dormand-Prince (GCDE-dpr5). Fixed step discretizations do not ensure that the ODE approximation remains close to the analytical solution; in this case, where solving a proper ODE is not necessary, GCDE-rk4 simply offers a computationally efficient, FractalNet-like structure for GCNs that improves accuracy.
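In terms of the sketch above, the two variants differ only in the solver handed to the integrator (the step size and tolerances below are illustrative values, not the ones used in the experiments):

```python
# vf, H0, s_span as in the earlier GDE sketch.

# Fixed-step discretization (GCDE-rk4): a fixed stack of GCN-style evaluations.
H_rk4 = odeint(vf, H0, s_span, method="rk4", options={"step_size": 0.25})[-1]

# Adaptive-step discretization (GCDE-dpr5): the solver picks the number of
# function evaluations (NFE) needed to satisfy the error tolerances.
H_dpr5 = odeint(vf, H0, s_span, method="dopri5", rtol=1e-4, atol=1e-4)[-1]
```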

Training loss and accuracy on Cora. Shaded area is the 95% confidence interval

On the other hand, training GCDEs with adaptive step solvers naturally leads to deeper models than are possible with vanilla GCNs, whose performance degrades sharply with layer depth. In our experiments, we successfully trained GCDE-dpr5 with up to 200 ODE function evaluations (NFE), a significantly higher amount of computation on the graph than vanilla GCNs can sustain. GDEs do not require more parameters than their discrete counterparts, since they reuse their parameters across function evaluations. Interestingly, adaptive step GDEs do not seem to suffer from over-smoothing of node features. The over-smoothing issue [10] prevents effective use of deep GNNs in various domains, particularly multi-agent reinforcement learning (MARL); we are currently exploring this property of GDEs and will include a more detailed analysis soon.

Spatio-Temporal GDEs

A key setting for GDEs involves spatio-temporal graph data. When handling sequences of graphs, a recurrent version of GNNs is necessary [11][12]. However, much like regular recurrent neural networks (RNNs) and their variants, fixed discretizations do not allow operations on irregularly sampled data. This fact has motivated further developments in the form of RNNs with prior assumptions on the dynamics between arrival times [13], as well as ODE versions of RNNs [14].

In scenarios involving a temporal component, the depth domain S of GDEs coincides with the time domain and can be adapted depending on the requirements. For example, given a time window Δt, the prediction performed by a GDE takes the form:
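In compact form (notational details in the paper), this is roughly

$$H(t + \Delta t) = H(t) + \int_t^{t + \Delta t} F\big(\tau, H(\tau), \Theta\big)\, d\tau,$$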

regardless of the specific GDE architecture. GDEs represent a natural model class for autoregressive modeling of sequences of graphs and lead to an extension of classical spatio-temporal architectures in the form of hybrid dynamical systems, i.e. systems characterized by interacting continuous and discrete-time dynamics. The core idea is to have a GDE smoothly steer the latent node features between two time instants and then apply some discrete operator, resulting in a jump of node features H which is then processed by an output layer.

Given a set of time instants {t_k} and a state-graph data stream {(X_t, G_t)}, the general formulation of autoregressive GDEs is:
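Up to notational details spelled out in the paper, the system alternates a continuous flow with a discrete update at each arrival time:

$$\dot{H}(s) = F\big(H(s), \Theta\big), \qquad s \in [t_k, t_{k+1}],$$
$$H^+ = G\big(H(t_{k+1}), X_{t_{k+1}}\big),$$
$$Y_{t_{k+1}} = K\big(H^+\big).$$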

Autoregressive GDE. Continuous variants of known spatio-temporal GNN models can be obtained from this system by choosing appropriate F, G, K

where F, G, K are GNN-like operators or general neural network layers, and H⁺ represents the value of H after the discrete transition. The evolution of the system can be visualized by means of hybrid automata:

Hybrid automata schematic of autoregressive GDEs.

Compared to standard recurrent models, which are equipped only with discrete jumps, autoregressive GDEs incorporate a continuous flow of latent node features H between jumps. This feature of autoregressive GDEs allows them to track dynamical systems from irregular observations. Different combinations of F, G, K can yield continuous variants of most common spatio-temporal GNN models.

To evaluate the effectiveness of autoregressive GDE models on forecasting tasks, we perform a series of experiments on the established PeMS traffic dataset. We follow the experimental setup of [15] with an additional preprocessing step: undersampling of the time series with a 0.7 probability of removal for each entry, in order to simulate a challenging environment with irregular timestamps and missing values.
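As a rough sketch of this preprocessing step (the exact pipeline is described in the paper), each entry is independently masked with probability 0.7:

```python
import numpy as np

def undersample(series, p_remove=0.7, seed=0):
    """Mask each entry of a (T, n) time series with probability p_remove,
    simulating missing values and irregularly observed node signals."""
    rng = np.random.default_rng(seed)
    mask = rng.random(series.shape) > p_remove  # True where the entry is kept
    return np.where(mask, series, np.nan), mask
```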

To measure the performance gains obtained by GDEs in settings with data generated by continuous-time systems, we employ a GCDE-GRU as well as its discrete counterpart GCGRU [12], and we contextualize the results with vanilla GRU metrics. For each model under consideration we collect normalized RMSE (NRMSE) and mean absolute percentage error (MAPE). More details about the chosen metrics and data can be found in our paper.

Non-constant differences between timestamps result in a challenging forecasting task for a single model, since the average prediction horizon changes drastically over the course of training and testing. For a fair comparison between models, we include delta timestamp information as an additional node feature for GCGRUs and GRUs.

Results of irregular data forecasting task. Mean and standard deviation across 5 training runs.

Since GCDE-GRUs and GCGRUs are designed to match in structure and number of parameters, we can measure a performance increase of 3% in NRMSE and 7% in MAPE over GCGRU metrics. A variety of other application areas with continuous dynamics and irregular datasets could similarly benefit from adopting GDEs as modeling tools: medicine, finance or distributed control systems, to name a few. We are working on additional experiments in these domains and welcome requests, ideas or collaborations.

Conclusions

As mentioned before, we are currently working on a GitHub repository with a series of examples and applications of different types of GDEs.

We encourage requests and suggestions for additional applications of GDEs: our plan is to eventually include working examples of GDE variants of all major graph neural network (GNN) architectures, deployed across various settings (forecasting, control…). Our paper is available as a preprint on arXiv: consider citing us if you find our work useful.

References:

[1] P. W. Battaglia et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

[2] J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.

[3] Z. Cui, K. Henrickson, R. Ke, and Y. Wang. Traffic graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting. arXiv preprint arXiv:1802.07007, 2018.

[4] J. Park and J. Park. Physics-induced graph neural network: An application to wind-farm power estimation. Energy, 187:115883, 2019.

[5] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.

[6] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In Advances in neural information processing systems, pages 6571–6583, 2018.

[7] Y. Lu, A. Zhong, Q. Li, and B. Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121, 2017.

[8] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[9] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[10] D. Chen et al. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. arXiv preprint arXiv:1909.03211, 2019.

[11] Y. Li, R. Yu, C. Shahabi, and Y. Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926, 2017.

[12] X. Zhao, F. Chen, and J.-H. Cho. Deep learning for predicting dynamic uncertain opinions in network data. In 2018 IEEE International Conference on Big Data (Big Data), pages 1150–1155. IEEE, 2018.

[13] Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu. Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 8(1):6085, 2018.

[14] Y. Rubanova, R. T. Q. Chen, and D. Duvenaud. Latent ODEs for irregularly-sampled time series. arXiv preprint arXiv:1907.03907, 2019.

[15] B. Yu, H. Yin, and Z. Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), 2018.

--

Researcher at KAIST. Working at the intersection of deep learning, dynamical systems, optimization and control.