What’s new in Graph ML?

Graph Machine Learning @ ICML 2023

Recent advancements and hot trends, August 2023 edition

Michael Galkin
Towards Data Science
16 min read · Aug 6, 2023


Magnificent beaches and tropical Hawaiian landscapes 🌴did not turn brave scientists away from attending the International Conference on Machine Learning in Honolulu and presenting their recent work! Let’s see what’s new in our favorite Graph Machine Learning area.

Image By Author.

Thanks to Santiago Miret for proofreading the post.

Graph Transformers: Sparser, Faster, and Directed

We presented GraphGPS about a year ago, and it is pleasing to see many ICML papers building upon our framework and expanding GT capabilities even further.

➡️ Exphormer by Shirzad, Velingker, Venkatachalam et al adds a missing piece of graph-motivated sparse attention to GTs: instead of BigBird or Performer (originally designed for sequences), Exphormer’s attention builds upon 1-hop edges, virtual nodes (connected to all nodes in a graph), and a neat idea of expander edges. Expander graphs have constant degree and are known to approximate fully-connected graphs. With all components combined, attention costs O(V+E) instead of O(V²). This allows Exphormer to outperform GraphGPS almost everywhere and scale to really large graphs of up to 160k nodes. Amazing work, with every chance of making Exphormer the standard sparse attention mechanism in GTs 👏.
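To make the O(V+E) cost concrete, here is a minimal Python sketch of such a sparse attention pattern: attention pairs come only from 1-hop edges, a few random permutations standing in for constant-degree expander edges, and virtual nodes connected to everything. The function name and the permutation-based expander are illustrative simplifications, not the paper’s exact construction.

```python
import random

def exphormer_style_pairs(num_nodes, edges, expander_degree=3, num_virtual=1, seed=0):
    """Build a sparse set of (query, key) attention pairs: 1-hop edges,
    expander edges (union of random permutations -> constant degree),
    and virtual nodes connected to all real nodes."""
    rng = random.Random(seed)
    pairs = set()
    for u, v in edges:                       # real graph edges, both directions
        pairs.add((u, v)); pairs.add((v, u))
    for _ in range(expander_degree):         # expander edges via random permutations
        perm = list(range(num_nodes))
        rng.shuffle(perm)
        for u, v in zip(range(num_nodes), perm):
            if u != v:
                pairs.add((u, v)); pairs.add((v, u))
    for k in range(num_virtual):             # virtual nodes attend everywhere
        vn = num_nodes + k
        for u in range(num_nodes):
            pairs.add((vn, u)); pairs.add((u, vn))
    return pairs

pairs = exphormer_style_pairs(8, [(0, 1), (1, 2), (2, 3)])
```

The pair count grows linearly in the number of nodes and edges, in contrast to the V² pairs of dense attention.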

➡️ Concurrently with graph transformers, expander graphs can already be used to enhance the performance of any MPNN architecture as shown in Expander Graph Propagation by Deac, Lackenby, and Veličković.

In a similar vein, Cai et al show that MPNNs with virtual nodes can approximate linear Performer-like attention, such that even classic GCN and GatedGCN imbued with virtual nodes show essentially SOTA performance in long-range graph tasks (we released the LRGB benchmark last year exactly for measuring the long-range capabilities of GNNs and GTs).

Source: Shirzad, Velingker, Venkatachalam et al

➡️ A few patch-based subsampling approaches for GTs inspired by vision models: “A Generalization of ViT/MLP-Mixer to Graphs” by He et al split the input into several patches, encode each patch with a GNN into a token, and run a transformer over those tokens.

Source: “A Generalization of ViT/MLP-Mixer to Graphs” by He et al
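The patch idea above can be sketched in a few lines; here mean pooling stands in for the per-patch GNN encoder, and the function name is mine, not the paper’s.

```python
import numpy as np

def patch_tokens(X, patches):
    """Mean-pool node features inside each patch (a stand-in for the per-patch
    GNN encoder); a transformer / MLP-Mixer then runs over these few tokens
    instead of over all nodes."""
    return np.stack([X[list(p)].mean(axis=0) for p in patches])

X = np.arange(12, dtype=float).reshape(6, 2)      # 6 nodes, 2-dim features
tokens = patch_tokens(X, [[0, 1, 2], [3, 4, 5]])  # 2 patches -> 2 tokens
```

With P patches, the downstream transformer attends over P tokens rather than all nodes, which is where the savings come from.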

In GOAT by Kong et al, node features are projected into a codebook of K clusters with K-Means, and a sampled 3-hop neighborhood of each node attends to the codebook. GOAT is a 1-layer model and scales to graphs of millions of nodes.
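A rough numpy sketch of GOAT’s global branch, assuming a toy K-Means and dot-product attention (both simplified relative to the paper): every node attends to a k-entry codebook, so the global step costs O(N·k) rather than O(N²).

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Toy K-Means producing a (k, d) codebook of node-feature centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return centers

def goat_style_global_attention(X, k=4):
    """Each node attends to the k codebook entries instead of all N nodes."""
    C = kmeans(X, k)                                 # (k, d) codebook
    scores = X @ C.T / np.sqrt(X.shape[1])           # (N, k) logits
    w = np.exp(scores - scores.max(1, keepdims=True))
    w /= w.sum(1, keepdims=True)
    return w @ C                                     # (N, d) global messages

rng = np.random.default_rng(1)
out = goat_style_global_attention(rng.normal(size=(16, 4)), k=4)
```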

➡️ Directed graphs got some transformer love as well 💗. “Transformers Meet Directed Graphs” by Geisler et al introduces the Magnetic Laplacian — a generalization of the Laplacian for directed graphs with a non-symmetric adjacency matrix. Eigenvectors of the Magnetic Laplacian, paired with directed random walks, are strong input features for the transformer and enable setting a new SOTA on the OGB Code2 graph property prediction dataset by a good margin!
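The construction is simple enough to sketch: edge direction is encoded as a complex phase on a symmetrized adjacency, which keeps the matrix Hermitian, so its eigenvalues stay real and its eigenvectors are usable as positional features. A minimal dense version (the phase parameter q is a common choice, not the paper’s tuned value):

```python
import numpy as np

def magnetic_laplacian(A, q=0.25):
    """Dense magnetic Laplacian of a directed graph.
    A: (n, n) binary adjacency with A[i, j] = 1 for an edge i -> j.
    Direction is encoded as a complex phase, so L stays Hermitian."""
    As = ((A + A.T) > 0).astype(float)        # symmetrized support
    theta = 2 * np.pi * q * (A - A.T)         # antisymmetric phase matrix
    H = As * np.exp(1j * theta)               # Hermitian "magnetic" adjacency
    return np.diag(As.sum(1)) - H

A = np.zeros((3, 3))
A[0, 1] = A[1, 2] = 1                         # directed path 0 -> 1 -> 2
L = magnetic_laplacian(A)
evals = np.linalg.eigvalsh(L)                 # Hermitian => real spectrum
```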

🏅 Last but not least, we have a new SOTA GT on the community standard ZINC dataset — GRIT by Ma, Lin, et al incorporates the full d-dimensional random walk matrix, coined as relative random walk probabilities (RRWP), as edge features to the attention computation (for comparison, popular RWSE features are just the diagonal elements of this matrix). RRWP are provably more powerful than shortest path distance features and set a record-low 0.059 MAE on ZINC (down from 0.070 by GraphGPS). GRIT often outperforms GPS in other benchmarks as well 👏. In a similar vein, Eliasof et al propose a neat idea to combine random and spectral features as positional encodings that outperform RWSE but were not tried with GTs.
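A small sketch of the RRWP idea: stack the first K powers of the random-walk matrix as edge features, with the diagonals giving the familiar RWSE node features (K and the dense matrix form are simplifications for illustration).

```python
import numpy as np

def rrwp_features(A, K=4):
    """Stack the first K powers of the random-walk matrix D^{-1} A.
    Entry [i, j, :] is an RRWP edge feature; the diagonals recover RWSE."""
    M = A / np.maximum(A.sum(1, keepdims=True), 1.0)   # row-stochastic walk matrix
    P, powers = np.eye(len(A)), []
    for _ in range(K):
        P = P @ M
        powers.append(P)
    return np.stack(powers, axis=-1)                   # (n, n, K)

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
R = rrwp_features(A, K=3)
rwse = np.einsum('iik->ik', R)   # node-level RWSE: diagonal of each power
```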

Image by Author.

Theory: VC dimension of GNNs, deep dive into over-squashing

➡️ VC dimension measures model capacity and expressiveness. It is well studied for classical ML algorithms but, surprisingly, has never been applied to study GNNs. In “WL meet VC” by Morris et al, the connection between the WL test and VC dimension is finally uncovered — it turns out the VC dimension can be bounded by the bitlength of GNN weights, i.e., float32 weights would imply a VC dimension of 32. Furthermore, the VC dimension depends logarithmically on the number of unique WL colors in the given task and polynomially on the depth and number of layers. This is a great theoretical result and I’d encourage you to have a look!

Source: “WL meet VC” by Morris et al

🍊🖐️ The over-squashing effect — information loss when you try to stuff messages from too many neighboring nodes into a single fixed-size vector — is another common problem of MPNNs, and we don’t fully understand how to properly deal with it. This year, three papers were dedicated to this topic. Perhaps the most foundational is the work by Di Giovanni et al that explains how MPNN width, depth, and graph topology affect over-squashing.

Source: Di Giovanni et al

It turns out that width might help (but with generalization issues), depth does not really help, and graph topology (characterized by the commute time between nodes) plays the most important role. We can reduce the commute time with various graph rewiring strategies (adding and removing edges based on spatial or spectral properties), and there are many of them (you might have heard about the Ricci flow-based rewiring that took home an Outstanding Paper award at ICLR 2022). In fact, there is a follow-up work to this study that goes even deeper and derives some impossibility statements w.r.t. over-squashing and certain MPNN properties — I’d highly encourage you to read it as well!

➡️ Effective resistance is one example of spatial rewiring strategies, and Black et al study it in great detail. The Ricci flow-based rewiring works with graph curvature and is studied further in the work by Nguyen et al.
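Effective resistance is easy to compute for small graphs: treat each edge as a unit resistor and read the resistance off the pseudoinverse of the graph Laplacian. A minimal sketch:

```python
import numpy as np

def effective_resistance(A, u, v):
    """Effective resistance between u and v from the pseudoinverse of the
    graph Laplacian — the quantity this line of rewiring work targets."""
    L = np.diag(A.sum(1)) - A
    Lp = np.linalg.pinv(L)
    e = np.zeros(len(A))
    e[u], e[v] = 1.0, -1.0
    return float(e @ Lp @ e)

# Path graph 0-1-2: two unit resistors in series between the endpoints
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
r02 = effective_resistance(A, 0, 2)
```

High-resistance pairs are exactly the ones where commute times blow up, which is why rewiring methods add shortcuts there.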

➡️ Subgraph GNNs continue to be in the spotlight: two works (Zhang, Feng, Du, et al and Zhou, Wang, Zhang) concurrently derive expressiveness hierarchies of the recently proposed subgraph GNNs and their relationship to the 1- and higher-order WL tests.

Image By Author.

New GNN architectures: Delays and Half-hops

If you are tired of yet another variation of GCN or GAT, here are some fresh ideas that can work with any GNN of your choice:

⏳ As we know from the Theory section, rewiring helps combat over-squashing. Gutteridge et al introduce “DRew: Dynamically Rewired Message Passing with Delay”, which gradually densifies the graph in later GNN layers so that long-distance nodes see the original states of earlier nodes (the original DRew), or adds those skip-connections with a delay depending on the distance between two nodes (the vDRew version). For example ( 🖼️👇), in vDRew delayed message passing, a starting node from layer 0 will show its state to 2-hop neighbors at layer 1 and to 3-hop neighbors at layer 2. DRew significantly improves the ability of vanilla GNNs to perform long-range tasks — in fact, a DRew-enabled GCN is the current SOTA on the Peptides-func dataset from the Long Range Graph Benchmark 👀

Source: Gutteridge et al

🦘 Another neat idea by Azabou et al is to slow down message passing by inserting new, slow nodes at each edge with a special connectivity pattern — only an incoming connection from the starting node and a symmetric edge with the destination node. Slow nodes improve the performance of vanilla GNNs on heterophilic benchmarks by a large margin, and it is also possible to use slow nodes for self-supervised learning by creating views with different locations of slow nodes for the same original graph. HalfHop is a no-brainer-to-include SSL component that boosts performance and should be in a standard suite of many GNN libraries 👍.

Source: Azabou et al
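The slow-node rewiring can be sketched as a simple graph transformation; the connectivity pattern follows the description above (incoming edge from the source, symmetric edge with the destination), while the function name and edge-list representation are mine.

```python
import random

def insert_slow_nodes(num_nodes, edges, p=1.0, seed=0):
    """Rewire a directed edge (u, v) into u -> s, s <-> v through a new
    'slow node' s, for a sampled fraction p of edges."""
    rng = random.Random(seed)
    new_edges, next_id = [], num_nodes
    for u, v in edges:
        if rng.random() < p:
            s = next_id
            next_id += 1
            new_edges += [(u, s), (s, v), (v, s)]   # u -> s, s <-> v
        else:
            new_edges.append((u, v))
    return next_id, new_edges

n_total, rewired = insert_slow_nodes(3, [(0, 1), (1, 2)])
```

Sampling different subsets of edges (p < 1) with different seeds yields the different views used for self-supervised learning.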
Image By Author.

Generative Models — Stable Diffusion for Molecules, Discrete Diffusion

➡️ Diffusion models might work in the feature space (e.g., pixel space in image generation, like the original DDPM) or in the latent space (like Stable Diffusion). In the feature space, you have to design the noising process to respect the symmetries and equivariances of your feature space. In the latent space, you can just add Gaussian noise to the features produced by a (pre-trained) encoder. Most 3D molecule generation models work in the feature space (like the pioneering EDM), and the new GeoLDM model by Xu et al (authors of the prominent GeoDiff) is the first to define latent diffusion for 3D molecule generation. That is, after training an EGNN autoencoder, GeoLDM is trained on the denoising objective where noise is sampled from a standard Gaussian. GeoLDM brings significant improvements over EDM and other non-latent diffusion approaches 👏.

GeoLDM. Source: Xu et al
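The appeal of the latent space is that the forward noising is just the standard DDPM corruption, with no symmetry bookkeeping. A generic sketch (the latents here are random placeholders, not outputs of a real EGNN encoder):

```python
import numpy as np

def forward_noise(z0, alpha_bar, rng):
    """q(z_t | z_0): z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps, eps ~ N(0, I).
    A denoiser is then trained to predict eps from z_t."""
    eps = rng.normal(size=z0.shape)
    zt = np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps
    return zt, eps

rng = np.random.default_rng(0)
z0 = rng.normal(size=(5, 8))        # stand-in latents from a (pretend) encoder
zt, eps = forward_noise(z0, alpha_bar=0.9, rng=rng)
```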

➡️ In the realm of non-geometric graphs (just with an adjacency and perhaps categorical node features), discrete graph diffusion pioneered by DiGress (ICLR’23) seems the most applicable option. Chen et al propose EDGE, a discrete diffusion model guided by the node degree distribution. In contrast to DiGress, the final target graph in EDGE is a disconnected graph without edges, a forward noising model removes edges through a Bernoulli distribution, and a reverse process adds edges to the most recent active nodes (active are the nodes whose degrees changed in the previous step). Thanks to the sparsity introduced by the degree guidance, EDGE can generate pretty large graphs up to 4k nodes and 40k edges!

Graph Generation with EDGE. Source: Chen et al

➡️ Finally, “Graphically Structured Diffusion Models” by Weilbach et al bridges the gap between continuous generative models and probabilistic graphical models that induce a certain structure in the problem of interest — such problems often have a combinatorial nature. The central idea is to encode the problem’s structure as an attention mask that respects permutation invariances and to use this mask in the attention computation of the Transformer encoder (which by definition is equivariant to input token permutation unless you use positional embeddings). GSDM can tackle binary and continuous matrix factorization and boolean circuits, generate sudokus, and perform sorting. Particularly enjoyable is the pinch of irony the paper is written with 🙃.

GSDM task-to-attention-bias. Source: “Graphically Structured Diffusion Models” by Weilbach et al
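The masking mechanism itself is plain masked attention: disallowed (query, key) pairs get -inf logits, so attention only flows along the problem’s structure. A generic numpy sketch (the toy mask here is illustrative, not a GSDM task mask):

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Softmax attention restricted by a boolean mask: pairs with mask=False
    receive -inf logits and therefore zero attention weight."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    return w, w @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 2))
mask = np.eye(4, dtype=bool) | np.eye(4, k=1, dtype=bool)  # self + next token
w, out = masked_attention(Q, K, V, mask)
```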
Image By Author

Geometric Learning: Geometric WL, Clifford Algebras

Geometric Deep Learning thrives! There were so many interesting papers presented that would take pretty much the whole post, so I’d highlight only a few.

➡️ Geometric WL has finally arrived in the work by Joshi, Bodnar, et al. Geometric WL extends the notion of the WL test with geometric features (e.g., coordinates or velocity) and derives the expressiveness hierarchy up to k-order GWL. Key takeaways: 1️⃣ equivariant models are more expressive than invariant ones (with a note that in fully connected graphs the difference disappears), 2️⃣ the tensor order of features improves expressiveness, 3️⃣ the body order of features improves expressiveness (see the image 👇). That is, spherical > cartesian > scalars, and many-body interactions > just distances. The paper also features the amazing learning resource Geometric GNN Dojo where you can derive and implement most SOTA models from first principles!

Source: Joshi, Bodnar, et al
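The invariant-vs-equivariant distinction at the heart of GWL can be seen in two lines of numpy: pairwise distances (invariant features) survive a random rotation untouched, while the coordinates themselves (equivariant quantities) do not.

```python
import numpy as np

def pairwise_distances(X):
    """Invariant features: distances do not change under rotation, while the
    coordinates themselves transform equivariantly."""
    diff = X[:, None] - X[None]
    return np.sqrt((diff ** 2).sum(-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
D_before, D_after = pairwise_distances(X), pairwise_distances(X @ R)
```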

➡️ Going beyond vectors to Clifford algebras, Ruhe et al derive Geometric Clifford Algebra Networks (GCANs). Clifford algebras naturally support higher-order interactions by means of bivectors, trivectors, and (in general) multivectors. The key idea is the Cartan-Dieudonné theorem that every orthogonal transformation can be decomposed into reflections in hyperplanes, and geometric algebras represent data as the elements of the Pin(p,q,r) group. GCANs introduce a notion of linear layers, normalizations, non-linearities, and how they can be parameterized with neural networks. Experiments include modeling fluid dynamics and Navier-Stokes equations.

Source: Ruhe et al

In fact, there is already a follow-up work introducing equivariant Clifford NNs — you can learn more about Clifford algebras foundations and the most recent papers on CliffordLayers supported by Microsoft Research.

💊 Equivariant GNN (EGNN) is the Aspirin of Geometric DL: it gets applied to almost every task and has seen quite a number of improvements. Eijkelboom et al marry EGNN with simplicial networks that operate on higher-order structures (namely, simplicial complexes) into EMPSN. This is one of the first examples that combines geometric and topological features and has great improvement potential! Finally, Passaro and Zitnick derive a neat trick to reduce SO(3) convolutions to SO(2), bringing the complexity down from O(L⁶) to O(L³) with mathematical equivalence guarantees 👀. This finding allows scaling up geometric models on larger datasets like OpenCatalyst and has already made it into Equiformer V2 — coming soon to many other libraries for geometric models 😉

Image By Author.

Molecules: 2D-3D pretraining, Uncertainty Estimation in MD

➡️ Liu, Du, et al propose MoleculeSDE, a new framework for joint 2D-3D pretraining on molecular data. In addition to standard contrastive loss, the authors add two generative components: reconstructing 2D -> 3D and 3D -> 2D inputs through the score-based diffusion generation. Using standard GIN and SchNet as 2D and 3D models, MoleculeSDE is pre-trained on PCQM4M v2 and performs well on downstream fine-tuning tasks.

Source: MoleculeSDE Github repo

➡️ Wollschläger et al perform a comprehensive study of Uncertainty Estimation in GNNs for molecular dynamics and force fields. Identifying key physics-informed and application-focused principles, the authors propose a Localized Neural Kernel, a Gaussian Process-based extension to any geometric GNN that works on invariant and equivariant quantities (tried on SchNet, DimeNet, and NequIP). In many cases, LNK’s estimations from one model are on par with or better than costly ensembling where you’d need to train several models.

Source: Wollschläger et al
Image By Author.

Materials & Proteins: CLIP for proteins, Ewald Message Passing, Equivariant Augmentations

CLIP and its descendants have become a staple in text-to-image models. Can we do the same for text-to-protein? Yes!

➡️ Xu, Yuan, et al present ProtST, a framework for learning joint representations of textual protein descriptions (via PubMedBERT) and protein sequences (via ESM). In addition to a contrastive loss, ProtST has a multimodal mask prediction objective — e.g., masking 15% of tokens in the text and the protein sequence and predicting them jointly from the latent representations — as well as mask prediction losses based on the sequence or language alone. Additionally, the authors design a novel ProtDescribe dataset with 550K aligned protein sequence-description pairs. ProtST excels across many protein modeling tasks in the PEER benchmark, including protein function annotation and localization, but also allows for zero-shot protein retrieval right from a textual description (see the example below). Looks like ProtST has a bright future as a backbone behind many protein generative models 😉

Source: Xu, Yuan, et al

Actually, ICML features several protein generation works like GENIE by Lin and AlQuraishi and FrameDiff by Yim, Trippe, De Bortoli, Mathieu, et al — those are not yet conditioned on textual descriptions, so incorporating ProtST there looks like a no-brainer performance boost 📈.

Gif Source: SE(3) Diffusion Github

⚛️ MPNNs on molecules have a strict locality bias that inhibits modeling long-range interactions. Kosmala et al derive Ewald Message Passing, applying the idea of Ewald summation that breaks down the interaction potential into short-range and long-range terms. The short-range interaction is modeled by any GNN, while the long-range interaction is new and is modeled with a 3D Fourier transform and message passing over Fourier frequencies. It turns out this long-range term is pretty flexible and can be applied to any network modeling periodic or aperiodic systems (like crystals or molecules), e.g., SchNet, DimeNet, or GemNet. The model was evaluated on the OC20 and OE62 datasets. If you are interested in more details, check out the 1-hour talk by Arthur Kosmala at the LOG2 Reading Group!

Source: Kosmala et al
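The classic Ewald trick behind this split is tiny: the 1/r kernel decomposes exactly into a fast-decaying short-range term (erfc) and a smooth long-range term (erf), and it is the smooth part that Ewald summation handles in Fourier space. A minimal sketch with an illustrative splitting parameter:

```python
import math

def ewald_split(r, alpha=1.0):
    """Split the 1/r kernel: erfc(alpha*r)/r decays fast (short-range GNN
    territory), erf(alpha*r)/r is smooth (handled in Fourier space)."""
    return math.erfc(alpha * r) / r, math.erf(alpha * r) / r

short, long_range = ewald_split(2.0)   # the two parts sum back to 1/r
```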

A similar idea of using Ewald summation for 3D crystals is used in PotNet by Lin et al, where the long-range connection is modeled with incomplete Bessel functions. PotNet was evaluated on the Materials Project dataset and JARVIS — so after reading those two papers you will have a good understanding of the benefits brought by Ewald summation for many crystal-related tasks 😉

Source: Lin et al

➡️ Another take on imbuing any GNN with equivariance for crystals and molecules is given by Duval, Schmidt, et al in FAENet. The standard way is to bake certain symmetries and equivariances right into the GNN architecture (as in EGNN, GemNet, and Ewald Message Passing) — a safe but computationally expensive approach (especially when it comes to spherical harmonics and tensor products). Another option, often used in vision, is to show many augmentations of the same input so that the model eventually learns the invariances from the augmentations. The authors go for the second path and design a rigorous way to sample invariant or equivariant 2D / 3D data augmentations (e.g., for energies or forces, respectively), all with fancy proofs ✍️. For that, the data augmentation pipeline projects 2D / 3D inputs to a canonical representation (based on a PCA of the covariance matrix of distances) from which rotations can be uniformly sampled.

The proposed FAENet is a simple model that uses only distances but shows very good performance with the stochastic frame averaging data augmentation while being 6–20 times faster. Works for crystal structures as well!

Augmentations and Stochastic Frame Averaging. Source: Duval, Schmidt, et al
Image By Author.

Cool Applications: Algorithmic Reasoning, Inductive KG Completion, GNNs for Mass Spectra

A few papers in this section did not belong to any of the above but are still worthy of your attention.

➡️ “Neural Algorithmic Reasoning with Causal Regularisation” by Bevilacqua et al tackles a common issue in graph learning — OOD generalization to larger inputs at test time. Studying OOD generalization in algorithmic reasoning problems, the authors observe that there exist many different inputs on which the algorithm performs identical computations at a certain step. This means that some subset of the input does not (and should not) affect the prediction result. This assumption allows designing a self-supervised objective (termed Hint-ReLIC) that prefers a “meaningful” step over a bunch of steps that do not affect the prediction result. The new objective significantly bumps the performance on many CLRS-30 tasks to 90+% micro-F1. It is an interesting question whether we could leverage the same principle in general message passing and improve OOD transfer in other graph learning tasks 🤔

Source: ”Neural Algorithmic Reasoning with Causal Regularisation” by Bevilacqua et al

If you are further interested in neural algorithmic reasoning, check out the proceedings of the Knowledge and Logical Reasoning workshop which has even more works on that topic.

➡️ “InGram: Inductive Knowledge Graph Embedding via Relation Graphs” by Lee et al seems to be one of the very few knowledge graph papers at ICML’23 (to the best of my search). InGram is one of the first approaches that can inductively generalize to both unseen entities and unseen relations at test time. Previously, inductive KG models needed to learn at least relation embeddings in some form to generalize to new nodes, and in that paradigm new, unseen relations are non-trivial to model. InGram builds a relation graph on top of the original multi-relational graph — that is, a graph of relation types — and learns representations of relations by running a GAT over this graph. Entity representations are obtained from random initialization and a GNN encoder. Having both entity and relation representations, a DistMult decoder is applied as the scoring function. There are good chances that InGram will be as influential for unseen relations as GraIL (ICML 2020) was for unseen entities 😉.

Source: “InGram: Inductive Knowledge Graph Embedding via Relation Graphs” by Lee et al
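A toy version of the relation-graph construction: relation types become nodes, connected with weights counting how many entities two relations co-occur on. This is a simplified sketch (InGram distinguishes head/tail co-occurrence and normalizes the weights; the function name is mine).

```python
from collections import defaultdict
from itertools import combinations

def relation_graph(triples):
    """Relation types become nodes; two relations get an edge weighted by
    how many entities they share (appearing as head or tail)."""
    rels_at = defaultdict(set)
    for h, r, t in triples:
        rels_at[h].add(r)
        rels_at[t].add(r)
    weights = defaultdict(int)
    for rels in rels_at.values():
        for r1, r2 in combinations(sorted(rels), 2):
            weights[(r1, r2)] += 1
    return dict(weights)

triples = [("a", "r1", "b"), ("b", "r2", "c"), ("a", "r2", "d")]
rg = relation_graph(triples)
```

Because this graph only involves relation types, a GAT over it can produce embeddings for relations never seen during training, as long as they appear in the test graph’s structure.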

🌈 “Efficiently predicting high resolution mass spectra with graph neural networks” by Murphy et al is a cool application of GNNs to the real physics problem of predicting mass spectra. The main finding is that most of the signal in mass spectra is explained by a small number of components (product ion and neutral loss formulas), and it is possible to mine a vocabulary of those formulas from training data. The problem can thus be framed as graph classification (or graph property prediction): given a molecular graph, we predict tokens from a vocabulary that correspond to certain mass spectra values. The approach, GRAFF-MS, builds a molecular graph representation through a GIN with edge features and Laplacian features (via SignNet), pooled with covariate features. Compared to the baseline CFM-ID, GRAFF-MS performs inference in ~19 minutes instead of 126 hours while reaching much higher performance 👀.

Source: ”Efficiently predicting high resolution mass spectra with graph neural networks” by Murphy et al

The Concluding Meme Part

Four Michaels (+ epsilon in the background) on the same photo!

The meme of 2022 has finally converged to Michael Bronstein!



AI Research Scientist @ Intel Labs. Working on Graph ML, Geometric DL, and Knowledge Graphs