Thoughts and Theory, State of the Art Digest

Graph ML in 2022: Where Are We Now?

Hot trends and major advancements

Michael Galkin
Towards Data Science
16 min read · Dec 28, 2021


It’s been quite a year for Graph ML — thousands of papers, numerous conferences and workshops… How can one catch up with so many cool things happening all around? Well, we are puzzled as well, so we decided to present a structured look at Graph ML, highlighting 🔥 trends and major advancements.

The image was generated by ruDALL-E with a prompt “graphs floating in space”.

Whether you are working on a narrower topic or just starting in Graph ML — we hope this article will be a good reference point.

Surely, the field is so large that any such review would be somewhat myopic, so let us know in the comments if we missed something important!

Table of Contents:
1. Graph Transformers + Positional Features
2. Equivariant GNNs
3. Generative Models for Molecules
4. GNNs + Combinatorial Optimization & Algorithms
5. Subgraph GNNs: Beyond 1-WL
6. Scalable and Deep GNNs: 100 Layers and More
7. Knowledge Graphs
8. Generally Cool Research with GNNs
9. New Datasets, Challenges, and Tasks
10. Courses and Books
11. Libraries & Open Source
12. How to Stay Updated

Graph Transformers + Positional Features

While GNNs operate on usual (normally sparse) graphs, Graph Transformers (GTs) operate on the fully-connected graph where each node is connected to every other node in the graph. On one hand, this brings back the O(N²) complexity in the number of nodes N. On the other hand, GTs do not suffer from oversmoothing, a common problem of long-distance message passing. In the fully-connected graph we have ‘true’ edges from the original graph and ‘fake’ edges added by the transformation, and we want to distinguish between them. Even more importantly, we need a way to imbue nodes with some positional features; otherwise GTs fall behind GNNs (as shown in the 2020 paper of Dwivedi and Bresson).
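To make the true/fake-edge distinction concrete, here is a minimal sketch (function names and structure are my own, not taken from any of the papers) of building the fully-connected graph a GT attends over, with a flag separating original edges from added ones:

```python
# Sketch: full connection of a sparse graph with true/fake edge flags.
def fully_connect(num_nodes, true_edges):
    """Return all ordered node pairs with a flag: 1 = edge existed in the
    original sparse graph, 0 = edge added by the fully-connected transform."""
    true_set = set(true_edges)
    return {
        (i, j): int((i, j) in true_set)
        for i in range(num_nodes)
        for j in range(num_nodes)
        if i != j
    }

edges = [(0, 1), (1, 2)]            # a 3-node directed path graph
full = fully_connect(3, edges)

print(len(full))                    # 6 ordered pairs for N = 3
print(full[(0, 1)], full[(0, 2)])   # 1 0
```

In a real GT this flag would typically become an edge-type embedding added to the attention computation.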

Perhaps the two most visible graph transformer models this year were SAN (Spectral Attention Nets) and Graphormer. SAN by Kreuzer, Beaini, et al employs the top-k eigenvalues and eigenvectors of the graph Laplacian, showing that spectral features alone can distinguish graphs considered isomorphic by the 1-WL test. With spectral features concatenated to input node features, SAN outperforms sparse GNNs on many molecular tasks.

Source: Kreuzer, Beaini, et al

Graphormer by Ying et al takes a different approach and employs spatial characteristics. First, node features are enriched with centrality encoding — learnable embeddings of in- and out-degrees. Then, the attention mechanism has two bias terms: 1️⃣ distance of the shortest path between nodes i and j; 2️⃣ edge feature encoding that depends on one of the available shortest paths.

Graphormer positional encoding scheme. Source: Ying et al
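The first bias term (shortest-path distance) can be sketched in plain Python; the bias table below is a hypothetical stand-in for Graphormer’s learned embeddings, not the paper’s actual values:

```python
from collections import deque

def spd_matrix(num_nodes, edges):
    """All-pairs shortest-path distances via BFS (unweighted, undirected)."""
    adj = {i: [] for i in range(num_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[None] * num_nodes for _ in range(num_nodes)]
    for s in range(num_nodes):
        dist[s][s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if dist[s][w] is None:
                    dist[s][w] = dist[s][u] + 1
                    q.append(w)
    return dist

spd = spd_matrix(4, [(0, 1), (1, 2), (2, 3)])    # path 0 - 1 - 2 - 3
bias = {0: 0.0, 1: -0.1, 2: -0.4, 3: -0.9}       # hypothetical learned values
raw_score = 1.0                                  # q_i · k_j placeholder
biased = raw_score + bias[spd[0][3]]             # distant pairs down-weighted
print(spd[0][3])                                 # 3: the farthest pair
```

The attention score between nodes i and j then receives this scalar before softmax, letting the model modulate attention by graph distance.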

🏅 Graphormer accomplished the 2021 Grand Slam of Graph ML: first place in the graph regression task of the OGB Large-Scale Challenge and in the Open Catalyst Challenge! (more on those challenges below)

🤔 Open problems: scalability and computational overhead. SAN and Graphormer were evaluated on molecular tasks where graphs are rather small (50–100 nodes on average) and we could afford, e.g., running an O(N³) Floyd-Warshall all-pairs shortest paths computation. Besides, Graph Transformers are still bottlenecked by the O(N²) attention mechanism. Scaling to graphs larger than molecules would require tackling those issues. Did you just think “Linear transformers from NLP”? 😉 Yes, they might help, but as they never materialize the attention matrix, you’d need to find a smart way to feed edge features into such models. Surely, we’ll see more research on that in 2022!

Equivariant GNNs

What is so special about equivariance that even Geoffrey Hinton acknowledges its coolness?

Generally speaking, equivariance is defined over certain transformation groups, e.g., 3D rotations form the group SO(3), the special orthogonal group in 3 dimensions. Equivariant models took ML by storm 🌪 in 2021, and in Graph ML they were particularly disruptive in many molecular tasks. Applied to molecules, equivariant GNNs require an additional input besides node features — namely, some representation of the physical coordinates of molecules that can be rotated/reflected/translated in n-dimensional space.

With an equivariant model, we arrive at the same final state regardless of the order of transformations. Source: Satorras, Hoogeboom, and Welling

Satorras, Hoogeboom, and Welling propose EGNN, E(n) equivariant GNNs, whose key difference from vanilla GNNs is the addition of physical coordinates to the message passing and update steps. Eq. 3 👇 adds the relative squared distance to the message m, and Eq. 4 updates positional features. EGNNs show impressive results in modeling n-body systems, as an autoencoder, and in quantum chemistry tasks (QM9 dataset).

Important differences from vanilla GNNs: Eq.3 and 4 add physical coordinates into the message passing and update steps. Source: Satorras, Hoogeboom, and Welling
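The key property Eq. 3 relies on — the squared distance between two nodes is unchanged by rotations and translations — is easy to verify in a toy 2D setting (a sketch of my own, not the paper’s code):

```python
import math

def sq_dist(a, b):
    """Relative squared distance, the invariant term entering the message."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def transform(p, theta, shift):
    """Rotate a 2D point by angle theta, then translate by `shift`."""
    x, y = p
    return (x * math.cos(theta) - y * math.sin(theta) + shift[0],
            x * math.sin(theta) + y * math.cos(theta) + shift[1])

xi, xj = (0.0, 0.0), (3.0, 4.0)
theta, shift = 1.234, (5.0, -2.0)      # an arbitrary rigid transformation
before = sq_dist(xi, xj)
after = sq_dist(transform(xi, theta, shift), transform(xj, theta, shift))
print(round(before, 6), round(after, 6))   # 25.0 25.0
```

Because the messages depend on coordinates only through such invariant quantities, the learned features transform consistently when the molecule is rotated or translated.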

Another option is to incorporate angles between atoms, as done in GemNet by Klicpera, Becker, and Günnemann. This might require transforming an input graph into a line graph, i.e., a graph of edges, where edges of the original graph become nodes in the line graph. That way, we can incorporate angles as edge features in the new graph.

Using angles as primary coordinates. Source: Klicpera, Becker, and Günnemann
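A minimal line-graph construction might look like this (a simplified sketch; GemNet’s actual transformation additionally tracks directed edges and the angles themselves):

```python
from itertools import combinations

def line_graph(edges):
    """Edges of the original graph become nodes; two such nodes are
    connected iff the original edges share an endpoint."""
    nodes = [frozenset(e) for e in edges]
    return [
        (i, j)
        for i, j in combinations(range(len(nodes)), 2)
        if nodes[i] & nodes[j]          # shared endpoint -> an angle exists
    ]

# Path 0-1-2-3: consecutive edges share a node, so its line graph is a path too.
print(line_graph([(0, 1), (1, 2), (2, 3)]))   # [(0, 1), (1, 2)]
```

Each edge of the line graph corresponds to a pair of incident bonds, which is exactly where an angle feature can be attached.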

GemNet shows strong results on molecular dynamics tasks: COLL, MD17, and Open Catalyst20. Clearly, the equivariance area is only taking off 🛫 and we’ll see more advancements in 2022!

Generative Models for Molecules

The whole field of drug discovery (DD) received a significant boost 🚀 in 2021 thanks to geometric DL. One of the many crucial challenges of DD is generating molecules (graphs) with desired properties. The field is huge so we’ll just highlight three branches of models.

  • Normalizing Flows.
    Satorras, Hoogeboom et al apply the equivariance framework described above to create E(n) equivariant normalizing flows capable of generating 3D molecules with positions and features 💪
  • Probabilistic Models.
    Shi, Luo, et al study the problem of generating 3D conformers, i.e., 3D structures, given a 2D graph. The model, ConfGF, estimates the gradient field of the log density of atomic coordinates. Those fields are roto-translational equivariant, and the authors come up with a way to incorporate this equivariance property into the estimator. Conformer sampling itself is done via the annealed Langevin dynamics sampling.
  • RL approaches.
    To describe them in a very non-scientific way, these approaches generate molecules by appending ‘building blocks’ step by step. We could broadly classify such approaches by the way they condition this building process.

For example, Gao, Mercado, and Coley condition the building process on synthesizability, i.e., whether we could actually create 🧪 the molecule in the lab or not. For that, they first learn how to create synthetic trees (a kind of template) out of the building blocks.

A somewhat more generic framework is proposed by the team of Mila and Stanford researchers led by Yoshua Bengio who introduce Generative Flow Networks (GFlowNets). It’s hard to summarize them in a few sentences — for starters, GFlowNets can be used in active learning cases when we want to sample diverse candidates with probability proportional to a reward function. Further, their recent NeurIPS’21 paper shows the benefits of GFlowNets applied to molecule generation tasks. Check out the blog post by Emmanuel Bengio describing the framework in more detail and with more experimental evidence.
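The annealed Langevin dynamics that ConfGF (above) uses for conformer sampling can be illustrated on a toy 1-D density with a known score function; everything below is a simplified stand-in for intuition, not the paper’s model:

```python
import random

def score(x, sigma):
    """Score (gradient of log-density) of N(0, 1) smoothed with noise sigma."""
    return -x / (1.0 + sigma ** 2)

def annealed_langevin(x, sigmas, steps=200, eps=0.05, seed=0):
    """Langevin updates with a step size annealed from large to small noise."""
    rng = random.Random(seed)
    for sigma in sigmas:                 # coarse-to-fine noise schedule
        step = eps * sigma ** 2
        for _ in range(steps):
            x += step * score(x, sigma) + (2 * step) ** 0.5 * rng.gauss(0, 1)
    return x

samples = [annealed_langevin(10.0, [2.0, 1.0, 0.5, 0.1], seed=s)
           for s in range(200)]
mean = sum(samples) / len(samples)
print(abs(mean) < 0.5)     # samples concentrate around the N(0, 1) mean
```

ConfGF does the analogous thing over atomic coordinates, with the score estimated by a roto-translationally equivariant network instead of a closed-form expression.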

GNNs + Combinatorial Optimization & Algorithms

2021 was a big year for this emerging subfield!

Xu et al in their outstanding ICLR’21 paper study extrapolation of neural nets and arrive at several spectacular outcomes. Using the notion of algorithmic alignment (introduced by the same authors in the spotlight ICLR’20 paper), the authors show that GNNs are well-aligned with dynamic programming (DP) (check the illustration 👇). Compared to the ICLR’20 paper, here the authors discuss a stronger extrapolation condition — linear alignment to DP. Indeed, compare an iteration of the classical Bellman-Ford algorithm for finding shortest paths with an Aggregate-Combine step of a message passing GNN — you’ll find a lot of commonalities. Further, the authors show that choosing a proper aggregation function for a GNN is crucial when modeling a particular DP algorithm, e.g., for Bellman-Ford we need a min-aggregator. I would also recommend checking out the very illustrative talk by Stefanie Jegelka explaining the main outcomes of this work at the Deep Learning and Combinatorial Optimization 2021 workshop.

Source: Xu et al
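The alignment is easy to see in code: one Bellman-Ford iteration is precisely a message-passing step with a min-aggregator (a sketch in my own notation):

```python
def bellman_ford_step(dist, edges):
    """One Bellman-Ford iteration as message passing: each node
    min-aggregates 'distance through a neighbor' messages."""
    new = dict(dist)
    for v in dist:
        msgs = [dist[u] + w for u, t, w in edges if t == v]   # incoming messages
        if msgs:
            new[v] = min(new[v], min(msgs))                   # min-aggregate
    return new

edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
dist = {0: 0.0, 1: float("inf"), 2: float("inf"), 3: float("inf")}
for _ in range(3):           # |V| - 1 iterations suffice on this graph
    dist = bellman_ford_step(dist, edges)
print(dist[1], dist[3])      # 3.0 4.0 via the path 0 -> 2 -> 1 -> 3
```

Swap `min` for `sum` or `max` and the update no longer simulates shortest paths — which is the paper’s point about choosing the right aggregator.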

For a more thorough intro to the area, I’d like to highlight a comprehensive IJCAI’21 survey by Cappart et al covering GNNs in combinatorial optimization. The article features the first appearance of the Neural Algorithmic Reasoning blueprint that was later described in the position paper in Patterns by Veličković and Blundell.

Source: Veličković and Blundell

📘 The blueprint explains how neural networks can mimic and empower the execution process of usually discrete algorithms in the embedding space. In the Encode-Process-Decode fashion, abstract inputs (obtained from natural inputs) are processed by the neural net (Processor), and its outputs are decoded into abstract outputs which can then be mapped to more natural task-specific outputs. For example, if abstract inputs and outputs can be represented as graphs, then GNNs could be Processor nets. A common preprocessing step of discrete algorithms is to squash whatever we know about the problem into scalars like ‘distance’ or ‘edge capacity’ and run an algorithm over those scalars. Instead, vector representations and neural execution make it easy to use high-dimensional inputs instead of simple scalars and to attach backprop for optimizing the processor. For more info, see the talk by Petar Veličković 🎥.
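The pipeline shape can be sketched with a toy Encode-Process-Decode composition; all three functions below are illustrative stand-ins of my own, not the blueprint’s actual components:

```python
def encode(natural_input):
    """Lift scalar inputs to 'high-dimensional' abstract features."""
    return [[x, 0.0] for x in natural_input]        # 2-d embedding per node

def process(h, steps=3):
    """Processor: repeatedly mixes each node with the running mean,
    standing in for the message-passing iterations of a GNN."""
    for _ in range(steps):
        mean = [sum(col) / len(h) for col in zip(*h)]
        h = [[(a + m) / 2 for a, m in zip(node, mean)] for node in h]
    return h

def decode(h):
    """Read abstract embeddings back out as task-specific scalars."""
    return [round(node[0], 3) for node in h]

out = decode(process(encode([0.0, 4.0])))
print(out)   # nodes pulled toward their mean: [1.75, 2.25]
```

In the real blueprint the encoder, processor, and decoder are learned networks, and backprop flows through all three.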

The blueprint enjoys increasing adoption — NeurIPS’21 features a handful of cool works! Xhonneux et al study whether transfer learning can be applied to generalize learned neural executors to new tasks; Deac et al discover a connection between algorithmic reasoning and implicit planning in reinforcement learning. Even more to come in 2022!

Subgraph GNNs: Beyond 1-WL

🤝 If 2020 was a year of first attempts to leave the 1-WL-landia of GNN expressiveness, we can confirm that 2021 was the year of establishing solid contacts with aliens in the beyond-1WL-landia 👾. The contacts have definitely proven themselves useful as we now have a handful of powerful and more expressive GNN architectures that extend message passing to higher-order structures like simplicial complexes (MPSN nets by Bodnar, Frasca, Wang et al), cell complexes (CW Networks by Bodnar, Frasca, et al) or subgraphs.

For the latter, check out this new wonderful blog post by Michael Bronstein, Leonardo Cotta, Fabrizio Frasca, Haggai Maron, Christopher Morris, and Lingxiao Zhao on subgraph GNNs!

If you want to go really deep into recent studies of WL, check this fresh survey Weisfeiler and Leman go Machine Learning: The Story so far by Christopher Morris, Yaron Lipman, Haggai Maron, and co-authors.

Scalable and Deep GNNs: 100 Layers and More

If you were jealous of deep ResNets or huge Transformers with 100+ layers while working with 2–4 layer GNNs, it’s time to rejoice 🤩! 2021 brought us two papers that casually train GNNs with 100–1000 layers, and one work on almost-constant-size neighborhood sampling.

Li et al propose two mechanisms to dramatically reduce GPU memory consumption from O(L) in the number of layers L to O(1) when training extremely deep overparameterized networks. The authors show how to employ 1️⃣ reversible layers, used for many years in CV and in efficient Transformer architectures like Reformer; 2️⃣ sharing weights among the layers (weight-tying). It is then possible to train GNNs of up to 1000 layers 👀. The charts below exhibit constant memory scaling with the number of layers at modest GPU requirements.

Nice scaling of Reversible (Rev) and Weight-Tying (WT) GNNs. Source: Li et al
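The reversible-layer trick (mechanism 1️⃣) can be sketched with toy sub-layers F and G: since the inputs can be recomputed from the outputs, intermediate activations never need to be stored (a minimal sketch, not the paper’s implementation):

```python
def F(x): return 2 * x + 1          # stand-ins for learned sub-layers
def G(x): return x * x

def rev_forward(x1, x2):
    """Reversible block: the outputs alone suffice to recover the inputs,
    so no activations are cached -> O(1) memory in depth."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_backward(y1, y2):
    """Recompute the inputs from the outputs during the backward pass."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

y = rev_forward(3.0, 5.0)
print(rev_backward(*y))   # (3.0, 5.0) -- inputs recovered exactly
```

During backprop, each block recomputes its inputs on the fly instead of reading them from memory, trading a little extra compute for constant memory.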

Godwin et al introduce a way to learn deep GNNs leveraging recurrence — message passing steps are organized into blocks, and each block can have M message passing layers. Then, N blocks are applied recurrently, which means the blocks share weights. If you have 10 message passing layers per block and 10 blocks, you’d get a 100-layer GNN. An important component is the Noisy Nodes regularization technique that perturbs node and edge features and computes an additional denoising loss. The architecture is better suited to molecular tasks and was evaluated on the QM9 and OpenCatalyst20 datasets.

Source: Godwin et al
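Weight-tied block recurrence in a nutshell (a toy sketch with a hand-written ‘block’ standing in for M learned message-passing layers):

```python
def block(h, weight=0.5):
    """One message-passing block; here it just pulls features toward
    their mean, standing in for M learned layers."""
    mean = sum(h) / len(h)
    return [x + weight * (mean - x) for x in h]

def deep_gnn(h, num_blocks=10):
    for _ in range(num_blocks):    # the SAME block applied recurrently:
        h = block(h)               # depth grows, parameter count does not
    return h

out = deep_gnn([0.0, 8.0], num_blocks=10)
print(round(out[0], 4), round(out[1], 4))   # both features approach the mean 4.0
```

The effective depth is `M * num_blocks`, while the parameter budget stays that of a single block — exactly the weight-sharing trade-off the paper exploits.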

Finally, if we want to scale any arbitrary GNN to very large graphs, we have no other choice than to sample subgraphs. Usually, sampling k-hop subgraphs results in exponential memory costs and computational graph sizes.

The authors of PyG, Matthias Fey et al, created GNNAutoScale, a framework to scale GNNs by leveraging historical embeddings (a fancy version of caching from previous message passing steps) and graph clustering (the well-known METIS algorithm in this case). During pre-processing, we partition the graph into B clusters (mini-batches) such that connectivity among the clusters is minimized. Then we run message passing over those clusters, keeping track of updated node features in the cache. Experimentally, GNNAutoScale with deep networks (up to 64 layers) performs as well as the full-batch setup but with dramatically lower memory requirements (about 50 times smaller) — so you can fit deep GNNs and large graphs on a commodity-level GPU 💪

GNNAutoScale. Source: Fey et al
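A toy rendition of the historical-embedding idea (the names and the simple averaging update are my own simplifications, not GNNAutoScale’s actual operators):

```python
def cluster_step(h, cache, edges, cluster):
    """Update the nodes of one cluster; neighbors outside the cluster are
    read from the historical-embedding cache instead of being recomputed."""
    new = {}
    for v in cluster:
        neigh = [u for u, t in edges if t == v]
        vals = [h[u] if u in cluster else cache[u] for u in neigh]
        new[v] = (h[v] + sum(vals)) / (1 + len(vals)) if vals else h[v]
    h.update(new)
    for v in cluster:              # refresh the cache with the new embeddings
        cache[v] = h[v]
    return h, cache

edges = [(0, 1), (1, 0), (1, 2), (2, 1)]    # path 0 - 1 - 2, both directions
h = {0: 0.0, 1: 3.0, 2: 9.0}
cache = dict(h)                              # historical embeddings
h, cache = cluster_step(h, cache, edges, cluster={0, 1})   # mini-batch {0, 1}
print(h[0], h[1], h[2])   # 1.5 4.0 9.0 -- node 2 untouched, read from cache
```

Because out-of-cluster neighbors are served from the cache, the computational graph of a mini-batch stays small no matter how deep the GNN is.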

Knowledge Graphs

Representation learning on KGs has finally broken through the ceiling of transductiveness.

Before 2021, models were clearly separated into transductive and inductive ones, with different inductive biases, architectures, and training regimes. In other words, transductive models had no chance of adapting to unseen entities, while inductive models were way too expensive to train on mid-to-large graphs. 2021 brought two architectures that

  • work in both transductive and inductive settings,
  • do not require node features,
  • can be trained in the inductive mode in the same way as in the transductive one,
  • are scalable to real-world KG sizes.

The first one is Neural Bellman-Ford nets by Zhu et al. The authors find a very elegant way to generalize the classic Bellman-Ford algorithm to a higher-level framework and show how we can obtain other well-known methods like the Katz index, PPR, or widest path by instantiating the framework with particular operators. Even more importantly, they show that the generalized Bellman-Ford is essentially a relational GNN architecture (another confirmation of the algorithmic alignment between GNNs and dynamic programming). NBFNet does not learn entity embeddings (only relation and GNN weights), which makes the model inductive by design and able to generalize to unseen graphs. The model works exceptionally well on link prediction tasks on both relational and non-relational graphs. Applied to KGs, NBFNet delivers the biggest performance boost on FB15k-237 and WN18RR since 2019 while having 100x fewer parameters 💪.

From vanilla Bellman-Ford to neural Bellman-Ford GNN. Source: Zhu et al

🍰 The other approach, by Galkin et al (disclaimer: one of the authors of this article is the author of the paper), is inspired by tokenization algorithms in NLP that have a fixed vocabulary of tokens capable of tokenizing any word, even those unseen during training. Applied to KGs, NodePiece represents each node as a set of its top-k nearest anchor nodes (optionally sampled in a pre-processing step) and the m unique relation types around the node. Anchors and relation types are encoded into a node representation that can be used in any downstream task (classification, link prediction, relation prediction, to name a few) and in any inductive/transductive setup. NodePiece features can be used directly by non-parametric decoders like RotatE or sent to GNNs for message passing. The model performs competitively with NBFNet on inductive link prediction datasets and exhibits high parameter efficiency on large graphs — a NodePiece model on OGB WikiKG 2 needs ~100x fewer parameters than shallow transductive-only models. Check out the full blog post about NodePiece if you’d like to know more!

NodePiece tokenization. Source: Galkin et al
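The anchor-based half of the tokenization can be sketched as follows (a simplification: the real NodePiece additionally hashes the m relation types around the node, which is omitted here):

```python
from collections import deque

def bfs_dist(adj, source):
    """Hop distances from `source` to every reachable node."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def tokenize(node, adj, anchors, k=2):
    """Represent a node by its k nearest anchors as (distance, anchor) pairs."""
    d = bfs_dist(adj, node)
    ranked = sorted((d[a], a) for a in anchors if a in d)
    return ranked[:k]

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
anchors = [0, 4]                           # a fixed 'vocabulary' of anchor nodes
print(tokenize(2, adj, anchors))           # [(2, 0), (2, 4)]: both anchors 2 hops away
```

Since the anchor vocabulary is fixed, a previously unseen node can still be tokenized — the same property that lets subword tokenizers handle unseen words.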

Generally Cool Research with GNNs 👀

This section mentions several particularly cool works that use GNNs but don’t fall neatly into a particular category.

Huang, He, et al presented Correct & Smooth at ICLR’21 — a simple procedure to improve model predictions with label propagation. Paired with just an MLP, the approach stormed the OGB leaderboards with top scores without using any GNNs and with far fewer parameters! Today, pretty much all top models in the node classification track of OGB use Correct & Smooth to squeeze out a few more points.

Source: Huang, He, et al
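The ‘smooth’ half of Correct & Smooth is essentially label propagation with clamped training labels; here is a toy scalar version (my own simplification — the actual method also runs a ‘correct’ step that propagates residual errors first):

```python
def smooth(preds, edges, train_labels, alpha=0.8, iters=20):
    """Label-propagation 'smooth' step: blend each node's base prediction
    with the mean of its neighbors', keeping known training labels fixed."""
    h = dict(preds)
    h.update(train_labels)               # clamp known labels
    for _ in range(iters):
        new = {}
        for v in h:
            neigh = [u for u, t in edges if t == v]
            if neigh:
                m = sum(h[u] for u in neigh) / len(neigh)
                new[v] = (1 - alpha) * preds[v] + alpha * m
            else:
                new[v] = h[v]
        new.update(train_labels)         # re-clamp after every iteration
        h = new
    return h

edges = [(0, 1), (1, 0), (1, 2), (2, 1)]    # path 0 - 1 - 2
preds = {0: 1.0, 1: 0.5, 2: 0.0}            # noisy base-model scores
h = smooth(preds, edges, train_labels={0: 1.0})
print(h[2] > preds[2])    # node 2 is pulled toward its labeled side
```

The base model can be as simple as an MLP; the propagation step injects the graph structure only at post-processing time, which is why it is so cheap.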

🪄 Knyazev et al shook the ML community in November with their work on predicting parameters of various neural network architectures in one forward pass. That is, instead of initializing a model randomly, it is possible to predict good parameters right away, and such a model is already dramatically better than a random one 👀. Of course, if you optimize a randomly initialized net with n SGD steps you’d get much higher numbers, but the principal contribution of the paper is that it is possible at all to find decent parameters without training the particular architecture.

Parameter prediction is actually framed as a graph learning task — any neural net architecture (ResNet, ViT, Transformers, you name it) can be represented as a computational graph where nodes are modules with learnable parameters, node features are those parameters, and we have a bunch of node types (say, Linear layer, Conv layer, Batch Norm; the authors use ~15 node types). Parameter prediction is then a node regression task. A computational graph is encoded with a GatedGNN, and the resulting representations are sent to the decoder module. For training, the authors collected a new dataset of 1M architectures (graphs). The approach works for any neural net architecture, even for other GNNs!

For more info, check the official repo with examples and Jupyter notebooks, and Boris Knyazev’s interview with Yannic Kilcher.

Predicting parameters for unseen models: the pipeline. Source: Knyazev et al

🗺 DeepMind and Google dramatically improved the quality of ETA predictions in Google Maps by modeling road networks as graphs of supersegments and applying GNNs on top of them. In the paper by Derrow-Pinion et al, the task is framed as node- and graph-level regression. Besides that, the authors describe numerous engineering challenges to be solved in order to deploy the system at the Google Maps scale. A perfect example of applying GNNs to solve practical problems faced by millions of users!

Another potentially impactful application of GNNs, in cancer research, was recently announced by Transgene and NEC. According to Mathias Niepert, chief researcher at NEC, GNNs were used to impute missing patient data with embedding propagation (check the recent Twitter thread about this method) as well as to score peptide candidates for eliciting an immune response.

Finally, Davies et al from DeepMind recently employed GNNs to help formulate conjectures pertaining to hardcore math problems (about 🪢 knots 🪢 in their work), and actually did find and prove a new theorem! You see, GNNs can work with quite abstract matters as well 👏

New Datasets, Challenges, and Tasks

If you are sick of Cora, Citeseer, and Pubmed — we feel you. 2021 brought a good bunch of new datasets of various sizes and characteristics.

  • OGB organized the Large Scale Challenge at KDD’21 with 3 very large graphs for node classification (240M nodes), link prediction (the whole Wikidata, 90M nodes), and graph regression (4M molecules). In the KDD Cup, most winning teams used ensembles of 10–20 models — check the workshop recordings to learn more about their approaches. New versions of the LSC datasets are now available with new leaderboards!
  • Open Catalyst NeurIPS’21 Challenge by Meta AI offered a large molecular task — predict the relaxed state energy given an initial structure with atom positions. The dataset is huge and requires tons of compute, but the organizers hinted at releasing a smaller version that would be a bit more friendly to smaller labs with limited GPU budgets. The results and recordings are available — equivariant models and transformers reached the top. In fact, Graphormer took top-1 places in both OGB LSC and OpenCatalyst’21, pretty much collecting the Grand Slam of Graph ML in 2021 🏅
  • Workshop on Graph Learning Benchmarks @ The WebConf 2021 brought a collection of new datasets including non-homophilous graphs by Lim et al, graph simulations by Tsitsulin et al, spatiotemporal graphs by Rozemberczki et al, and many more
  • NeurIPS’21 Datasets & Benchmarking Track is like an SXSW festival of new datasets: this year we have MalNet — graph classification where average graph size is 15k nodes and 35k edges, dramatically larger than molecules; ATOM3D — a collection of new 3D molecular tasks; RadGraph — information extraction from radiology reports. Finally, Liu et al report on the challenge of creating a taxonomy of graph learning datasets — a long-anticipated effort the community would definitely benefit from.

Courses and Books

New & noteworthy courses & books:

  • Geometric Deep Learning proto-book & course by Michael Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Contains 12 lectures and hands-on tutorials & seminars. If you like video recordings, Michael’s ICLR’21 keynote is the best video about graphs released this year.
  • A new open book on knowledge graphs by 18 (!) authors. The entire book is available for free in the web form; it contains a lot of details on methodology, patterns, and queries.

  • Graph Representation Learning book by William Hamilton. While technically released in 2020, it is still the best short introduction to GML from the modern deep learning perspective.

Libraries & Open Source

New libraries released in 2021:

  • TensorFlow GNN — GNNs as first-class citizens in the Tensorflow world.
  • TorchDrug — PyTorch-based GNN library for molecular and KG tasks

Established Graph ML libraries that got updated:

  • PyG 2.0 — now supporting heterogeneous graphs, GraphGym, and a flurry of improvements and new models
  • DGL 0.7 — graph sampling on a GPU, faster kernels, more models
  • PyKEEN 1.6 — the go-to library for training KG embeddings: more models, datasets, metrics, and NodePiece support!
  • Jraph — GNNs for JAX aficionados, check this fresh intro by Lisa Wang (DeepMind) and Nikola Jovanović (ETH Zurich) on building and evaluating GNNs

How to stay updated 🤓

That’s it for 2021! Most probably, we overlooked something important in your subfield of graph representation learning — let us know in the comments what you think we missed.

How to stay updated about all new cool stuff happening in Graph ML:

You made it to the end of the longread — you deserve a picture of Michael Bronstein being isomorphic to Santa Claus! 🎅

I would like to thank Anton Tsitsulin, Anvar Kurmukov, and Sergey Ivanov for helping with the content of the blog post, and Petar Veličković for correcting a few imprecise formulations.


AI Research Scientist @ Intel Labs. Working on Graph ML, Geometric DL, and Knowledge Graphs