FastFlows: Flow-Based Models for Molecular Graph Generation

Deep learning invertible transformations for efficient generative modeling

Nathan C. Frey, PhD
Towards Data Science


This post was co-authored by Bharath Ramsundar from Deep Forest Sciences.

Normalizing flow-based deep generative models learn a transformation between a simple base distribution and a target distribution. In this post, we show how to use FastFlows to model a dataset of small molecules and generate new molecules. FastFlows allows us to generate thousands of valid molecules in seconds and shows the advantages and challenges of flow-based molecular generative models.

An interactive tutorial accompanies this post and is available to run through Google Colab.

Why would we want to use normalizing flows in chemistry and biology?

There is a lot of great work using variational autoencoders (VAEs), recurrent neural networks (RNNs), graph neural networks (GNNs), and generative adversarial networks (GANs) for generating molecules. Normalizing Flows (NFs) can be used to generate molecular graphs, as in the MoFlow and GraphNVP papers. They can also be used to accelerate calculations of the binding affinity between small molecules and proteins.

Our goal in this paper is not to replace these methods, but instead to provide an easy-to-use framework that anyone can apply to their distribution of interest and start doing generative and probabilistic modeling quickly.

What is a normalizing flow?

NFs have two key capabilities: density calculation and sampling. After learning a mapping between probability distributions, NFs can exactly compute the likelihood that any given sample was drawn from the target distribution. They can also generate new samples from the target distribution. This makes NFs distinct from other generative models like VAEs and GANs, which can only estimate likelihoods or can’t compute them at all.
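To make those two capabilities concrete, here is a minimal sketch using TensorFlow Probability (one of the libraries we'll use later): a single Exp bijector turns a standard Normal into a LogNormal, and the resulting distribution object supports both exact density calculation and sampling.

import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors

# A one-bijector "flow": pushing a standard Normal through exp() gives a LogNormal.
base = tfd.Normal(loc=0.0, scale=1.0)
flow = tfd.TransformedDistribution(distribution=base, bijector=tfb.Exp())

samples = flow.sample(5)            # sampling: draw new points from the target
log_probs = flow.log_prob(samples)  # density calculation: exact log-likelihoods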

An example of how a normalizing flow transforms a two-dimensional Normal distribution into a target distribution, in this case a point cloud that spells out the word “SIGGRAPH.” Image by Eric Jang.

Another key difference is that the layers in an NF are bijective transformations — they provide a one-to-one mapping between inputs and outputs, rather than compressing inputs into a latent space. So NFs are useful for any application that requires a probabilistic model with either or both density calculation and sampling. I recommend these excellent blog posts by Eric Jang and Brad Saund for more details on NFs and some nice examples.

Preparing a dataset of SELFIES strings

To show how NFs work, we’ll try to model distributions from the QM9 and ChEMBL datasets. To represent the molecules, we’ll use Self-Referencing Embedded Strings (SELFIES) [1]. SELFIES are a 100% robust string representation for molecules, meaning that even totally random SELFIES strings correspond to chemically valid molecules. This is great for design applications, where deep learning models (like NFs) generate new molecules without knowing the underlying rules of chemistry that determine whether a molecule is valid. With the functions and examples in the selfies repo, it’s easy to translate the Simplified Molecular-Input Line-Entry System (SMILES) strings provided with most chemical datasets into SELFIES.
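As a quick illustration, here is a minimal sketch of that round trip with the selfies package; the aspirin SMILES string is just an example input.

import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"     # aspirin, just an illustrative input
selfies_str = sf.encoder(smiles)      # SMILES -> SELFIES
roundtrip = sf.decoder(selfies_str)   # SELFIES -> SMILES

print(selfies_str)
print(roundtrip)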

(a) A representative molecule with its SMILES and SELFIES representations. (b) Schematic of a normalizing flow transforming input samples from a base distribution to a target distribution. Image by author.

We transform the SELFIES strings into one-hot encoded vectors and then dequantize those vectors by adding random noise from the interval [0, 0.95). This gives us continuous, numerical inputs that our model will be able to read, and we can recover the original SELFIES one-hot encodings by applying a simple floor function.

Encoding molecules as SELFIES strings and dequantized tensors for training a normalizing flow. Image by author.
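Concretely, here is a small NumPy sketch of that dequantize/requantize round trip; the three-token molecule and four-symbol alphabet are just toy placeholders.

import numpy as np

def dequantize(one_hot, noise_scale=0.95, rng=None):
    """Add uniform noise from [0, noise_scale) so the one-hot vectors become continuous."""
    rng = np.random.default_rng() if rng is None else rng
    return one_hot + noise_scale * rng.random(one_hot.shape)

def requantize(samples):
    """Recover one-hot encodings by flooring and clipping back to {0, 1}."""
    return np.clip(np.floor(samples), 0, 1)

# Toy example: a 3-token molecule over a 4-symbol alphabet.
one_hot = np.eye(4)[[0, 2, 1]]        # shape (3, 4)
dequantized = dequantize(one_hot)     # continuous input for the flow
assert np.array_equal(requantize(dequantized), one_hot)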

Training a Normalizing Flow

An NF is just a sequence of bijective transformations, and we can construct deep neural networks where the layers are bijective. That way, we can train the model on data and learn the parameters of a transformation between our base distribution and the target. Luckily, the TensorFlow Probability and nflows libraries have many built-in probability distributions and bijectors, including some of the most popular architectures for learning distributions.

The choice of base distribution is easy — we pick a multivariate Normal distribution with the same number of dimensions as our one-hot encodings. That is about as simple as it can get. Then we have to build our NF. For FastFlows [2], we use Real NVP layers.

You might be wondering how layers in a neural network can be invertible; in general, they aren’t. Real NVP uses affine coupling layers: the first d elements of the input pass through unchanged, and each remaining element x_i (for i > d) is transformed as y_i = x_i * exp(alpha_i) + mu_i, where the scale alpha_i and shift mu_i are computed by passing the first d elements through a neural network. This gives a simple scale-and-shift (affine) transformation that nevertheless can capture complex target distributions, because the scale and shift depend sensitively on the untransformed variables from the base distribution.

The forward pass in Real NVP. Image from Normalizing Flows Tutorial.

The key here is that the scale-and-shift operation can be computed in a single pass, so Real NVP can perform fast, parallelized sampling and density estimation. The sampling speed is orders of magnitude faster than for autoregressive models like causal sequence-based models and Masked Autoregressive Flow, but it comes at the cost of reduced expressivity.

We alternate the NVP layers with permutations so that the layers can operate on different parts of the inputs. With this setup, we can “undo” transformations by reversing the shift-and-scale operation, so we have an invertible transformation even though we’re using a neural network with learnable parameters.

For Real NVP we have to specify the number of layers and the number of hidden units in each NVP layer. In this example we’ll use 8 layers and [512, 512] hidden units. We can increase these numbers to make more expressive flows to capture more complex target distributions.
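Here is a rough sketch of what that construction looks like with TensorFlow Probability’s built-in RealNVP and Permute bijectors. The encoding dimension is a placeholder, and real_nvp_default_template is just the library’s convenience network; in practice you could pass any network of your own as shift_and_log_scale_fn.

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors

ENCODING_DIM = 640  # placeholder: length of a flattened one-hot SELFIES vector

def make_flow(dim=ENCODING_DIM, num_layers=8, hidden_units=(512, 512)):
    """Stack Real NVP coupling layers, alternating with permutations."""
    bijectors = []
    for _ in range(num_layers):
        bijectors.append(
            tfb.RealNVP(
                num_masked=dim // 2,
                shift_and_log_scale_fn=tfb.real_nvp_default_template(
                    hidden_layers=list(hidden_units)),
            )
        )
        # Permute so the next coupling layer transforms the other half of the input.
        bijectors.append(tfb.Permute(permutation=list(reversed(range(dim)))))
    chain = tfb.Chain(bijectors[:-1])  # drop the final, unneeded permutation

    # Base distribution: a standard multivariate Normal with matching dimensionality.
    base = tfd.MultivariateNormalDiag(loc=tf.zeros(dim))
    return tfd.TransformedDistribution(distribution=base, bijector=chain)

flow = make_flow()
# Training minimizes the negative log-likelihood of dequantized one-hot batches:
# loss = -tf.reduce_mean(flow.log_prob(batch))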

Generating new molecules

With a trained model, it’s easy to generate new molecules and evaluate their log likelihood. We have to do a bit of post-processing: applying the floor function and clipping by value to turn the noisy, continuous samples back into one-hot encoded vectors. We’ll also have to add padding characters to any vectors that were generated with all zeros. After that,

 selfies.multiple_hot_to_selfies() 

will give us back SELFIES representations, which we can decode into SMILES and analyze with the open-source cheminformatics software RDKit.

RDKit will look at our generated SMILES strings and check to see if they are valid molecules. With FastFlows, we can generate 100K valid molecules in 4.2 seconds on a single GPU! If you have any experience trying to generate valid SMILES strings with autoregressive models, that figure is pretty impressive! That’s the speed of Real NVP and the robustness of the SELFIES representation at work. Now it’s time to actually look at the molecules our NF came up with.
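Here is a hedged sketch of that generate-and-validate loop, assuming a trained flow, a SELFIES symbol list idx_to_symbol, and a padded string length max_len from the earlier preprocessing steps (all placeholders).

import numpy as np
import selfies as sf
from rdkit import Chem

# `flow`, `idx_to_symbol`, and `max_len` are assumed to come from the steps above.
samples = flow.sample(1000).numpy()
one_hot = np.clip(np.floor(samples), 0, 1)
one_hot = one_hot.reshape(-1, max_len, len(idx_to_symbol))

valid_smiles = []
for encoding in one_hot:
    # Map each row back to a SELFIES symbol; all-zero rows become padding ("[nop]").
    symbols = [idx_to_symbol[int(row.argmax())] if row.any() else "[nop]"
               for row in encoding]
    smiles = sf.decoder("".join(symbols))
    # RDKit returns None for strings it can't parse into a valid molecule.
    if smiles and Chem.MolFromSmiles(smiles) is not None:
        valid_smiles.append(smiles)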

High-throughput virtual screening

With FastFlows, our goal is not to come up with the exact molecules that are going to go into clinical trials, nor to ensure that the generated molecules are precisely engineered to fulfill all the constraints we care about. FastFlows is designed to enable anyone to do deep generative modeling in a low-data limit, and to take advantage of the computationally cheap sampling capabilities. It’s a way to experiment with NFs for chemistry, not a production-ready system by any means.

We know from work on the synthesizability of generated molecules that goal-oriented deep generative models tend to propose molecules that can’t be easily made in the lab. To ensure relevance in a lab setting, you either need to generate molecules directly with synthetic planning, or have fast ways to generate lots of candidates and evaluate their synthetic accessibility and complexity to filter out the unreasonable candidates.

Here, in the interest of exploratory, curiosity-driven investigations of chemical space, we opt for coupling our fast molecular generation method to equally fast multi-objective optimization to identify druglike, synthesizable molecules (at least as measured by some simple proxy metrics).

FastFlows workflow diagram. Image by author.

FastFlows is quite good at generating chemically valid, unique, and novel molecules at rates faster than other deep generative models. That might be interesting, or it could mean that even though these molecules are chemically valid and unique, they aren’t very realistic. FastFlows doesn’t do well on distribution-learning benchmarks like GuacaMol and MOSES, because it isn’t designed to.

We can use the fast generation of FastFlows and couple it to post-hoc filters to flexibly weed out molecules that our medicinal chemists don’t like the look of. After that, we compute scores like quantitative estimate of druglikeness, synthetic accessibility, and learned synthetic complexity — all simple proxies that don’t add a substantial time cost to generation, but do give some sense of which generated molecules could be interesting to study further.
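For example, QED ships with RDKit, so computing it for a generated molecule is a one-liner; the SA score and learned synthetic complexity scores need extra code (the SA scorer lives in RDKit’s Contrib directory, and learned complexity models like SCScore are separate packages).

from rdkit import Chem
from rdkit.Chem import QED

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # placeholder generated molecule
druglikeness = QED.qed(mol)  # quantitative estimate of druglikeness, in [0, 1]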

With those scores computed, we find the Pareto frontier of generated samples that are maximally druglike, synthetically accessible, and complex. By considering the tradeoffs between these metrics, we avoid picking out molecules that over-optimize for a particular score, which could lead to generated samples that are too similar to existing drugs (maximizing QED), too simple and easy to make (maximizing synthetic accessibility), or too complex and unrealistic (maximizing synthetic complexity). By balancing these factors, we can explore any area of chemical space we like and find interesting molecules, without baking in a ton of complexity into our deep neural network architecture that makes it difficult to train and sample from.
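Here is a minimal sketch of that Pareto-frontier step, treating each molecule as a row of scores where higher is taken to be better on every objective; the score matrix below is a placeholder.

import numpy as np

def pareto_front(scores):
    """Return the indices of non-dominated rows (maximize every column)."""
    n = scores.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        others = np.delete(scores, i, axis=0)
        # Row i is dominated if some other row is at least as good on every
        # objective and strictly better on at least one.
        dominated = np.all(others >= scores[i], axis=1) & np.any(others > scores[i], axis=1)
        if dominated.any():
            keep[i] = False
    return np.where(keep)[0]

# Placeholder scores: columns are druglikeness, accessibility, complexity.
scores = np.array([[0.7, 0.8, 0.3],
                   [0.9, 0.4, 0.5],
                   [0.6, 0.6, 0.2]])
print(pareto_front(scores))  # -> [0 1]; the third molecule is dominated by the first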

Limitations and next steps

FastFlows enables fast training and sampling: more than 20,000 chemically valid molecules per second after training on QM9, and nearly 500 molecules per second after training on the more complex distributions in ChEMBL. That speed means deep generative modeling can be coupled to equally time-efficient post-hoc filters and multi-objective optimization for efficient molecular generation. But because FastFlows relies on fully bijective transformations, the normalizing flow loses expressivity as the dimensionality of the target representation grows (roughly 18,000 for one-hot encoded SELFIES from ChEMBL), and it requires increasing depth to capture the target distribution.

FastFlows illustrates the advantages of stable deep neural network training (no mode collapse or other difficulties commonly encountered in other deep generative models) and fast, computationally cheap sampling for high-throughput virtual screening campaigns. However, the low expressivity and need for many bijective transformations presents an opportunity for architectural improvements and more flexible base distributions to improve generation.

Key takeaways

  • Normalizing flows are a flexible and interesting deep learning framework for generating novel molecules.
  • Using SELFIES, a simple normalizing flow architecture, and multi-objective optimization, it’s possible to identify novel, chemically valid molecules in a high-throughput virtual screening workflow.

Getting in touch

If you liked this tutorial or have any questions, feel free to reach out to Nathan over email or connect on LinkedIn and Twitter.

This work was presented at the 2021 ELLIS Machine Learning for Molecule Discovery Workshop and is available on arXiv.

You can find out more about Nathan’s projects and publications on his website.

References

[1] Krenn, Mario, et al. “Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation.” Machine Learning: Science and Technology 1.4 (2020): 045024.

[2] Frey, Nathan C., Vijay Gadepally, and Bharath Ramsundar. “FastFlows: Flow-Based Models for Molecular Graph Generation.” arXiv preprint arXiv:2201.12419 (2022).
