
Chemical Predictions with 3 lines of code

State-of-the-art results using Chemprop and graph neural networks

Designed by macrovector – www.freepik.com

In this post, we use machine learning to predict the properties of small molecules (a task known as QSAR), using state-of-the-art graph neural networks from the open-source library Chemprop.


Typical pharmaceuticals come in the form of small molecules that can regulate biological processes in our bodies. Unfortunately, an enormous number of things can go wrong in this process: the compounds can be toxic, clear very slowly from the body, interact with unintended targets, etc. We therefore want to test these small molecules very carefully before they are ever given to anyone.

During the early phases of drug discovery, many different variations of small molecules are typically tested in lab-scale experiments for various properties, e.g., solubility, different forms of toxicity, binding affinities, etc. This process can be extraordinarily laborious, so wouldn’t it be great to use ML to predict these properties based on experiments already performed? This task, which is well known within cheminformatics, has received increasing attention in recent years due to advancements in deep learning.

Plenty of libraries exist for these kinds of analyses. In this post, we will use the open-source library Chemprop, which is actively maintained by a research group at MIT and achieves excellent results across a wide array of benchmark datasets while being extremely simple to use.


What is the input data?

Models that predict chemical properties are traditionally known as Quantitative Structure–Activity Relationship (QSAR) models. The input for these models is a string representation of the molecule, known as a SMILES string. For example, aspirin can be written as `CC(=O)Oc1ccccc1C(=O)O`:

Different SMILES strings can represent the same molecule, but each SMILES string corresponds to only one molecule. Image by author.

Essentially, our datasets consist of molecules represented by their SMILES strings, together with a set of properties that we want to predict for each molecule. In the example below, the properties are binary variables indicating whether a molecule was FDA-approved and whether it passed toxicity testing.

Each sample is a molecule represented by its SMILES string and one or more properties. Image by author.
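As a concrete sketch, such a dataset is just a CSV file with a `smiles` column plus one column per property. The SMILES strings and labels below are made up for illustration:

```python
import csv
import io

# Hypothetical toy dataset: a smiles column plus one binary column
# per property we want to predict (labels here are invented).
raw = """smiles,fda_approved,passed_tox
CC(=O)Oc1ccccc1C(=O)O,1,1
c1ccccc1,0,0
CCO,1,1
"""

rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    print(row["smiles"], row["fda_approved"], row["passed_tox"])
```

This is the only format Chemprop needs; all featurization happens internally from the SMILES strings.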

What is this Chemprop model?

The Chemprop model was published in 2019 in [1] but has continuously been updated since then and shown its worth in a range of later publications, most notably in [2], where they used it to uncover new potential antibiotic compounds.

Underlying Chemprop is a message-passing neural network implemented in PyTorch, meaning the input to the model is the graph representation of a molecule. In addition to working directly on the graph, Chemprop also automatically uses more classical derived chemical features, which helps it perform well (compared to other models) on both small and large datasets. For more information, I recommend reading the paper [1].
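To make "message passing" concrete, here is a deliberately tiny sketch of one round on a toy three-atom graph. Scalar features and an unweighted neighbor sum stand in for the learned embeddings, weight matrices, and nonlinearities a real model like Chemprop uses:

```python
# Toy molecular graph (C-C-O chain) as an adjacency list.
neighbors = {0: [1], 1: [0, 2], 2: [1]}

# One scalar feature per atom, standing in for an atom embedding.
h = {0: 1.0, 1: 2.0, 2: 3.0}

def message_pass(h, neighbors):
    # Each atom aggregates (sums) its neighbors' features and combines
    # the result with its own state. Real message-passing networks apply
    # learned transformations at both steps; this is just the skeleton.
    return {
        v: h[v] + sum(h[u] for u in neighbors[v])
        for v in neighbors
    }

h1 = message_pass(h, neighbors)
print(h1)
```

After a few such rounds, each atom's state encodes information about its wider neighborhood, which is what lets the network reason about substructures.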

Chemprop can be installed from PyPI: `pip install chemprop`.

Let’s get started: 3 lines of code

Say we have a dataset like the one shown above, with some SMILES strings and two properties we wish to predict: whether the compound was FDA-approved and whether it passed toxicology testing during clinical trials.

Plenty of similar datasets can be found online, e.g., check out these benchmarks. In typical usage of Chemprop, we would go through the following steps: 1) optimize hyperparameters of the model, 2) train the model, and 3) run predictions on a new set of molecules.

Step 1: Tune the hyperparameters.

Chemprop has a few tunable hyperparameters that can be adjusted to get the best possible results on a given dataset. Having installed Chemprop, we can run 50 iterations of TPE (Tree-structured Parzen Estimator) hyperparameter tuning:
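With the Chemprop v1 command-line interface, the tuning step looks roughly like the following. The file names `data.csv` and `data/config.json` are assumptions for this walkthrough, and newer Chemprop versions use a different CLI, so check the documentation for your installed version:

```shell
chemprop_hyperopt --data_path data.csv \
    --dataset_type classification \
    --num_iters 50 \
    --config_save_path data/config.json
```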

Step 2: Train the model

With the ideal hyperparameters identified and saved in data/config.json, we can then train a model with these parameters:
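A sketch of what that training command can look like with the Chemprop v1 CLI; the paths `data.csv`, `data/config.json`, and `model/` are assumptions for this walkthrough:

```shell
chemprop_train --data_path data.csv \
    --dataset_type classification \
    --config_path data/config.json \
    --num_folds 5 \
    --ensemble_size 3 \
    --split_type scaffold_balanced \
    --save_dir model
```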

A few interesting parameters of note here:

  • num_folds: we create 5 folds (i.e., different train/test splits) and train/evaluate the model on each of them.
  • ensemble_size: we create an ensemble of 3 models with different initializations, which improves mean performance. Note that each of the 5 folds trains its own ensemble of 3 models, so in total we are training 15 models!
  • split_type: when creating the internal train/test splits, we use a scaffold-based split, which puts similar molecules into the same fold and thus gives a more realistic measure of generalization.

On my training run, I received the following output: `Overall test AUC: 0.871 +/- 0.036` – i.e., quite promising results from the training cross-validation.

Step 3: Evaluate the model on new data.

Now that we have trained the model, the next step is to evaluate its performance on a test dataset. We already know from the cross-validation that the performance should be around an AUC of 0.87, but let’s see:
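With the Chemprop v1 CLI, prediction is a single command. The file names `test.csv` and `predictions.csv` and the checkpoint directory `model/` are assumptions carried over from the earlier steps of this walkthrough:

```shell
chemprop_predict --test_path test.csv \
    --checkpoint_dir model \
    --preds_path predictions.csv
```

Pointing `--checkpoint_dir` at the training directory makes Chemprop average the predictions of all checkpoints found there.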

This will generate a file predictions.csv with predictions for the compounds in our test dataset. Note that these predictions are the mean over the 15 models trained in the previous step. Holding this up against the true values, we get the following:

Evaluation of the trained Chemprop model on the test dataset. Plot by author.

Quite good for minimal effort! 👏

Taking it a step further: Interpretation

Sometimes it is not enough to get the point predictions; we might want to know why the model returns a given prediction. If the molecule, for instance, is predicted to be toxic, it would be valuable to know which part of the molecule is causing said toxicity. By fetching my Chemprop branch from here, we can create such interpretations using a variation of the BayesGrad method:
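A sketch of how invoking the branch's interpretation script might look; the flags below are illustrative assumptions on my part, so check the branch's `interpret_local.py` for the actual arguments it accepts:

```shell
# Hypothetical invocation: reuse the trained checkpoints to generate
# per-atom/per-bond explanation plots for the test compounds.
python interpret_local.py --data_path test.csv --checkpoint_dir model
```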

For each compound evaluated by the interpret_local.py script, the algorithm returns an explanation plot. For a solubility model trained in the same way as above, for instance, it would show how hydrophilic atoms (oxygens) increase the solubility of the molecule:

Heatmap for interpreting a solubility prediction. Image by author.

Briefly, the "importance score" of a given atom or bond in a molecule is calculated as the gradient of the predicted target with respect to the features of that atom or bond; the larger the absolute sum of gradients attributable to an atom or bond, the more important that atom or bond is assumed to be for the prediction.

Sensitivity maps generated directly in this fashion are known to be very noisy. BayesGrad therefore proposes using dropout to sample from the posterior distribution p(W|D) of the network parameters W, allowing us to average the gradients over all the sampled networks and thus smooth the results. Rather than relying on dropout in Chemprop, we instead sample p(W|D) by simply reusing the weights W from the cross-validation folds and ensemble models – i.e., for a set of 20 models, we calculate the average sum of gradients on each atom and bond.
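The averaging step itself is simple. As a minimal sketch with made-up numbers (and under one reasonable reading of the scheme: average the per-atom gradient sums over the sampled models, then take the magnitude):

```python
# Hypothetical per-atom gradient sums from 4 sampled models (rows)
# for a molecule with 3 atoms (columns); all values are invented.
grads = [
    [0.9, -0.1, 0.05],
    [1.1, -0.2, 0.00],
    [0.8,  0.1, 0.10],
    [1.0, -0.1, -0.05],
]

n_models = len(grads)
n_atoms = len(grads[0])

# Importance of each atom: magnitude of the gradient sum averaged over
# all sampled models -- the averaging is what smooths out the noise.
importance = [
    abs(sum(g[atom] for g in grads) / n_models)
    for atom in range(n_atoms)
]
print(importance)
```

Here atom 0 has a consistently large gradient across the sampled models, so it dominates the resulting heatmap, while the noisy, sign-flipping contributions on the other atoms largely cancel out.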

Final Remarks

Being able to predict chemical properties is extremely powerful; imagine you have a set of 100 potential molecular candidates. Synthesizing each one of them is a laborious task. If we can rank them according to how likely they are to fulfill various properties, we can first synthesize the most promising candidates. In addition, by interpreting model predictions, we may learn more about why the molecules work or do not work and thus maybe even develop even more promising candidates based on this information.

Using 3 commands to train & predict with these models may give the impression that this is a solved problem. This is not the case. I recommend reading, e.g., this post by Andreas Bender for some reflections on how dangerous and difficult it can be to draw conclusions from these models.


[1] Kevin Yang et al., Analyzing Learned Molecular Representations for Property Prediction (2019), J. Chem. Inf. Model. 59, 8, 3370–3388

[2] Jonathan M. Stokes et al., A Deep Learning Approach to Antibiotic Discovery (2020), Cell, 180, 4, P688–702.E13

