Modeling Protein-Ligand Interactions with Atomic Convolutional Neural Networks

Exploit local structure of three-dimensional molecular complexes to predict binding affinities

Nathan C. Frey, PhD
Towards Data Science
6 min read · Mar 17, 2021


This post was co-authored by Bharath Ramsundar from DeepChem.

Atomic convolutional neural networks (ACNNs) learn chemical features directly from the three-dimensional structure of protein-ligand complexes. In this post, we show how to use the open-source implementation of ACNNs in DeepChem and the PDBbind dataset to predict protein-ligand binding affinities.

An interactive tutorial accompanies this post and is available to run through Google Colab.

A complex problem

A key challenge in drug discovery is finding small molecules that preferentially bind to a target protein. We can use molecular docking or free energy perturbation calculations to predict the binding affinity of candidate molecules, but these techniques are computationally intensive and require a good deal of expertise to apply. In this post, we show how to apply the powerful idea of convolutional neural networks (ConvNets) to 3D atomic coordinates to directly learn chemical features and predict binding affinities. We’ll walk through how an Atomic ConvNet works, how to quickly get a dataset of protein-ligand complexes, and how to train a model to predict binding affinities, all using the open-source library DeepChem.

Image by Thiago Reschützegger from DeepChem forums.

What is an Atomic ConvNet?

ACNNs were introduced in this paper [1] by Joseph Gomes, Bharath Ramsundar, Evan Feinberg, and Vijay Pande. First, let’s break down the architecture of an ACNN. The building blocks of the ACNN are two primitive convolutional operations: atom-type convolution and radial pooling. The Cartesian atomic coordinates of the molecules, X, are used to construct a distance matrix, R: each row of R lists the distances from one atom to its M nearest neighbors, for some fixed maximum number of neighbors M. An atom-type matrix, Z, is constructed in parallel to record the atomic numbers of those neighboring atoms.

R is fed into a filter with depth Nat, where Nat is the number of unique atom types in the complex. The atom-type convolutional kernel is a step function that operates on R and Z: it splits the distance matrix into separate copies, one per atom type, so that each copy keeps only the distances to neighbors of that type (a one-hot encoding of the neighbor’s atom type).
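To make R, Z, and the step-function kernel concrete, here is a minimal NumPy sketch. The coordinates, neighbor count, and atom types are invented for illustration; the real layer in DeepChem is implemented as a differentiable op, but the bookkeeping is the same.

import numpy as np

# Toy inputs: 4 atoms with 3D coordinates and atomic numbers (C, H, O, H)
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.2, 0.0],
              [1.0, 1.2, 0.5]])
atomic_numbers = np.array([6, 1, 8, 1])
M = 2                    # maximum number of nearest neighbors per atom
atom_types = [1, 6, 8]   # unique atom types in the complex, so Nat = 3

# Pairwise distances, then each atom's M nearest neighbors (excluding itself)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
neighbors = np.argsort(D, axis=1)[:, 1:M + 1]    # (N_atoms, M) neighbor indices
R = np.take_along_axis(D, neighbors, axis=1)     # distance matrix, (N_atoms, M)
Z = atomic_numbers[neighbors]                    # atom-type matrix, (N_atoms, M)

# Atom-type convolution: a step function that copies R once per atom type,
# keeping a distance only where the neighbor has that type
E = np.stack([np.where(Z == a, R, 0.0) for a in atom_types], axis=-1)
print(E.shape)  # (N_atoms, M, Nat) -- one channel per atom type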

3D atomic coordinates and atom types are inputs to the atom-type convolution. Image by Joseph Gomes from Gomes, et al.

Next, the outputs from atom-type convolutions are down-sampled with radial pooling layers. This form of dimensionality reduction prevents overfitting and reduces the number of learned parameters. Radial pooling layers apply a radial filter (a pooling function with learnable parameters) and sum the result over neighbors. We can think of radial pooling as a featurization that sums the pairwise interactions between an atom and all of its neighboring atoms, separated by atom type.
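Here is a matching sketch of radial pooling with a single Gaussian filter. The functional form follows the spirit of Gomes et al., but the parameter values are illustrative; the real layer learns r_s, σ, and the scale and bias, and stacks several filters.

import numpy as np

def radial_filter(r, r_s, sigma, cutoff):
    # Gaussian radial filter with a hard distance cutoff; in the real layer,
    # r_s and sigma are learnable and several such filters are applied
    f = np.exp(-((r - r_s) ** 2) / sigma ** 2)
    return np.where((r > 0) & (r < cutoff), f, 0.0)

# E: output of an atom-type convolution, shape (N_atoms, M_neighbors, Nat)
E = np.array([[[1.0, 0.0, 0.0], [0.0, 1.6, 0.0]],
              [[1.0, 0.0, 0.0], [0.0, 0.0, 1.3]]])

beta, bias = 1.0, 0.0  # learnable scale and bias in the real layer
# Sum the filtered pairwise terms over neighbors: one pooled feature per atom type
P = beta * radial_filter(E, r_s=1.5, sigma=1.0, cutoff=4.0).sum(axis=1) + bias
print(P.shape)  # (N_atoms, Nat)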

Outputs from atom-type convolution are fed to a radial pooling layer. Image by Joseph Gomes from Gomes, et al.

Atomic convolution layers (atom-type convolution + radial pooling) can be stacked by flattening the outputs of the radial pooling layer and feeding them into another atom-type convolution. The final tensor output is fed row-wise (per atom) into a fully-connected network to predict the “energy” of each atom. Summing the per-atom energies gives the predicted total energy of the molecule.
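A toy version of the per-atom network and the energy sum, with random weights purely to show the shapes (the actual model shares the network weights across atoms and trains them end to end):

import numpy as np

rng = np.random.default_rng(0)

def atomic_net(features, W1, b1, w2, b2):
    # Per-atom fully-connected net: pooled features -> scalar atomic "energy"
    h = np.maximum(features @ W1 + b1, 0.0)  # one ReLU hidden layer
    return h @ w2 + b2

P = rng.random((4, 6))                   # pooled features, (N_atoms, N_features)
W1, b1 = rng.random((6, 8)), np.zeros(8)
w2, b2 = rng.random(8), 0.0

atom_energies = atomic_net(P, W1, b1, w2, b2)  # one scalar per atom
E_total = atom_energies.sum()                  # molecular energy = sum over atoms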

Fully-connected layers produce a scalar output. Image by Joseph Gomes from Gomes, et al.

This allows for an interesting interpretation of the outputs, where the binding energy is calculated as ACNN(complex) - ACNN(protein) - ACNN(ligand). It’s also possible to decompose the contributions of each atom to the binding energy by taking the difference between the energy evaluated by the ACNN for an atom in a free-standing molecule and the same atom in the complex. It’s important to note that the input dimension only depends on the number of features (number of atom types and radial filters), not the number of atoms in a molecule, so the ACNN generalizes to systems that are larger than those in the training set.

Protein-ligand complex data from PDBbind

Now that we understand the basics of an ACNN, let’s get a dataset of molecules to work with and start training models! We’ll use the “core” set of high-quality protein-ligand complexes and measured binding affinities from PDBbind [2]. Thanks to the loader function in MoleculeNet, we can retrieve training, validation, and test sets from PDBbind with a single line of Python code. To minimize the computational burden, we featurize only the protein binding pockets (rather than entire proteins) and consider only the ~200 complexes in the core set. If we want the much larger “refined” or “general” sets from PDBbind, we simply change set_name in load_pdbbind().
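The loader call below takes a featurizer instance, acf. A minimal setup, with atom counts and neighbor cutoff borrowed from the DeepChem tutorial that accompanies this post (treat the exact values as adjustable starting points), looks like this:

from deepchem.feat import AtomicConvFeaturizer

f1_num_atoms = 100      # maximum number of atoms to consider in the ligand
f2_num_atoms = 1000     # maximum number of atoms to consider in the protein pocket
max_num_neighbors = 12  # maximum number of spatial neighbors per atom

acf = AtomicConvFeaturizer(frag1_num_atoms=f1_num_atoms,
                           frag2_num_atoms=f2_num_atoms,
                           complex_num_atoms=f1_num_atoms + f2_num_atoms,
                           max_num_neighbors=max_num_neighbors,
                           neighbor_cutoff=4)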

from deepchem.molnet import load_pdbbind

tasks, datasets, transformers = load_pdbbind(featurizer=acf, pocket=True, set_name='core')

The dataset of 3D atomic coordinates is featurized using an AtomicConvFeaturizer, which pre-computes features (coordinate matrix, atom-type matrix, and neighbor list) for the protein, ligand, and the complex. The featurizer controls how many atoms will be considered in each molecule and the maximum number of neighbors considered for each atom.

Training an ACNN on protein-ligand complexes

We’ll set up an ACNN model with similar hyperparameters to the model from the original paper.

from deepchem.models import AtomicConvModel

acm = AtomicConvModel(n_tasks=1,
                      frag1_num_atoms=f1_num_atoms,
                      frag2_num_atoms=f2_num_atoms,
                      complex_num_atoms=f1_num_atoms + f2_num_atoms,
                      max_num_neighbors=max_num_neighbors,
                      batch_size=12,
                      layer_sizes=[32, 32, 16],
                      learning_rate=0.003)

Here, we pay attention to layer_sizes, which controls the number and size of the dense layers in the fully-connected network. The other parameters related to the number of atoms are equal to the inputs we specified for AtomicConvFeaturizer. We’ll keep the defaults of 15 atom types and 3 radial filters.

We train the model for 50 epochs and visualize the loss curves. Unsurprisingly, they aren’t particularly smooth. Using a larger dataset might help with this, but this is a starting point for rapidly prototyping a binding affinity prediction model.
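In code, the fitting loop looks roughly like this. We fit one epoch at a time so we can record the losses, approximating the validation loss by squaring the RMS error on the validation split (train, val, and test are unpacked from the datasets tuple returned by load_pdbbind):

import deepchem as dc

train, val, test = datasets

metric = dc.metrics.Metric(dc.metrics.rms_score)
losses, val_losses = [], []
for epoch in range(50):
    loss = acm.fit(train, nb_epoch=1)  # fit() returns the average training loss
    losses.append(loss)
    val_losses.append(acm.evaluate(val, metrics=[metric])['rms_score'] ** 2)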

Training and validation loss over 50 epochs for the core PDBbind set. Image by author.

Predicting binding affinities

The original ACNN paper used a random 80/20 train/test split on the core PDBbind dataset. They showed Pearson R² values of 0.912 and 0.448 on the training and test sets, respectively, for predicting binding affinity. Our simple model achieves similar performance on the training set (0.944), but doesn’t do too well on the validation (0.164) or test (0.250) sets. This is expected, because the original paper showed that, while ACNNs succeed in learning chemical interactions from small datasets, they are prone to overfitting and fail to generalize outside the training set.
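These numbers come from DeepChem’s built-in Pearson R² metric, evaluated on each split:

import deepchem as dc

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
for name, split in [('train', train), ('validation', val), ('test', test)]:
    print(name, acm.evaluate(split, metrics=[metric])['pearson_r2_score'])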

There are many things we can (and should!) experiment with to build a more robust model: adding regularization/dropout, using a larger dataset and a larger fully-connected network, changing the number of atom types and radial filters, and trying different splits. A couple of these are quick changes, sketched below. We might even be able to do pretty well by featurizing only the ligands and ignoring the binding pockets completely!
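For example, reusing the featurizer and variables from above (the dropouts value here is just our guess at a sensible starting point):

# A larger dataset: swap the core set for the refined set
tasks, datasets, transformers = load_pdbbind(featurizer=acf, pocket=True,
                                             set_name='refined')

# A wider fully-connected network with dropout for regularization
acm = AtomicConvModel(n_tasks=1,
                      frag1_num_atoms=f1_num_atoms,
                      frag2_num_atoms=f2_num_atoms,
                      complex_num_atoms=f1_num_atoms + f2_num_atoms,
                      max_num_neighbors=max_num_neighbors,
                      batch_size=12,
                      layer_sizes=[128, 64, 32],
                      dropouts=0.2,
                      learning_rate=0.003)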

It’s pretty remarkable that we can get a dataset of protein-ligand complexes and train a fairly sophisticated deep neural net to predict a complicated physical quantity like binding energy in less than 15 minutes! However, predicting binding affinities remains a challenging problem, whether we use deep learning methods like ACNNs or physics-based simulations. For an overview of more machine learning methods for protein-ligand interaction prediction, check out this helpful post. Hopefully, open-source tools like ACNNs in DeepChem will make it easier for researchers to experiment with deep learning and develop even better methods for modeling protein-ligand interactions.

Getting in touch

If you liked this tutorial or have any questions, feel free to reach out to Nathan over email or connect on LinkedIn and Twitter.

You can find out more about Nathan’s projects and publications on his website.

Thanks to Prof. Joe Gomes for providing feedback on this post.

References

[1] J. Gomes, B. Ramsundar, E. N. Feinberg, V. S. Pande. Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity. arXiv:1703.10603, 2017.

[2] R. Wang, X. Fang, Y. Lu, S. Wang. The PDBbind Database: Collection of binding affinities for protein−ligand complexes with known three-dimensional structures. Journal of Medicinal Chemistry, 47(12):2977–2980, 2004.
