MoleculeNet Part 1: Datasets for Deep Learning in the Chemical and Life Sciences

Towards an “ImageNet moment” in Molecular Machine Learning

This post was co-authored by Bharath Ramsundar from DeepChem.

Benchmark datasets are an important driver of progress in machine learning. Unlike in computer vision and natural language processing, the diversity and complexity of datasets in the chemical and life sciences have made it difficult to curate benchmarks that are widely accepted across the community. In this post, we show how to add datasets to the MoleculeNet benchmark for molecular machine learning and make them programmatically accessible with the DeepChem API.

Image from moleculenet.ai.

Molecular ML Dataset Curation

MoleculeNet [1] collects datasets in six major categories: quantum mechanics, physical chemistry, proteins, biophysics, physiology, and materials science. The “first generation” of MoleculeNet showed what a molecular ML benchmark might look like, and revealed some interesting trends with respect to data scarcity, class imbalances, and the power of physics-aware featurizations over model architectures for some datasets.

It isn’t easy to cover the entire breadth and depth of molecular ML, which is why MoleculeNet is evolving into a flexible framework for contributing datasets and benchmarking model performance in a standardized way, powered by DeepChem.

Why Should We Care about Benchmarks?

Image and speech recognition seem like gargantuan tasks, but they are really pretty simple compared to the kinds of problems we see in physics, chemistry, and biology. That’s why it’s comparatively rare to see anyone claim that a problem in physical or life science has been “solved” by machine learning. Better datasets, dataset generating methods, and robust benchmarks are essential ingredients to progress in molecular machine learning, maybe even more so than inventing new deep learning tricks or architectures.

In many subfields of deep learning, the standard avenue of progress goes something like

1. Pick a widely used benchmark dataset (e.g., ImageNet, CIFAR-10, or MNIST).

2. Develop and test a model architecture that achieves “state of the art” performance on some aspect of the benchmark.

3. Come up with an ad hoc “theoretical” explanation for why your particular architecture outperforms the rest.

4. Publish your results in a top conference.

If you’re lucky, other researchers might even use your model or build on it for their own research before the next SOTA architecture comes out. There are obvious issues with this paradigm, including bias in datasets, distribution shifts, and the Goodhart-Strathern Law — when a metric becomes a target, it is no longer a good metric. Still, there’s no question that benchmarks provide a kind of clarity of purpose and fuel interest in machine learning research that is lacking in other fields.

Maybe more importantly, benchmarks encourage and reward researchers for creating high-quality datasets, work that has historically been underappreciated in many fields. Benchmark datasets also enable striking breakthroughs, like DeepMind’s AlphaFold, which was made possible by decades of effort assembling high-resolution protein structures in the Protein Data Bank. AlphaFold represents a sort of “ImageNet moment” in protein folding, meaning that the problem is, in some sense, “solved.”

MoleculeNet contains hundreds of thousands of compounds and measured/calculated properties, all accessible through the DeepChem API. It brings a flavor of the traditional evaluation frameworks popularized in ML conferences, but also provides a standardized way to contribute and access new datasets.

Contributing a Dataset to MoleculeNet

Dataset contribution has been significantly streamlined and documented. The first step is to open an issue on GitHub in the DeepChem repo to discuss the dataset you want to add, emphasizing what unique molecular ML tasks the dataset covers that aren’t already part of MolNet. If you created or curated a dataset yourself, this is a great way to share it with the molecular ML community! Next, you need to

  • Write a DatasetLoader class that inherits from deepchem.molnet.load_function.molnet_loader._MolnetLoader. This involves documenting any special options for the dataset and the targets or “tasks” for ML.
  • Implement a create_dataset function that creates a DeepChem Dataset by applying acceptable featurizers, splitters, and transformations.
  • Write a load_[dataset] wrapper function (like load_qm9) that documents the dataset and gives users a one-call way to load it.

The QM9 MolNet loader source code is a nice, simple starting point for writing your own MolNet loader.
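To make those three pieces concrete, here is a rough sketch of what a new loader module might look like, modeled on the structure of the QM9 loader. Every “mydata” name below (the URL, file name, task list, _MyDataLoader class, and load_mydata function) is hypothetical, and the exact _MolnetLoader constructor and load_dataset signatures should be double-checked against the current DeepChem source.

import os

import deepchem as dc
from deepchem.data import Dataset
from deepchem.molnet.load_function.molnet_loader import _MolnetLoader

# Hypothetical dataset: a CSV of SMILES strings with one regression target.
MYDATA_URL = "https://example.com/mydata.csv"  # placeholder URL
MYDATA_TASKS = ["solubility"]  # target column(s) that become the ML tasks


class _MyDataLoader(_MolnetLoader):

    def create_dataset(self) -> Dataset:
        # Download the raw file once, then featurize it into a DeepChem Dataset.
        dataset_file = os.path.join(self.data_dir, "mydata.csv")
        if not os.path.exists(dataset_file):
            dc.utils.data_utils.download_url(url=MYDATA_URL, dest_dir=self.data_dir)
        loader = dc.data.CSVLoader(
            tasks=self.tasks, feature_field="smiles", featurizer=self.featurizer)
        return loader.create_dataset(dataset_file)


def load_mydata(featurizer="ECFP", splitter="scaffold",
                transformers=["normalization"], reload=True,
                data_dir=None, save_dir=None, **kwargs):
    """Load the hypothetical mydata solubility dataset."""
    loader = _MyDataLoader(featurizer, splitter, transformers, MYDATA_TASKS,
                           data_dir, save_dir, **kwargs)
    return loader.load_dataset("mydata", reload)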

This framework allows a dataset to be used directly in an ML pipeline with any reasonable combination of featurization (converts raw inputs like SMILES strings into a machine-readable format), splitter (controls how training/validation/test sets are constructed), and transformations (e.g., if the targets need to be normalized before training).
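As a minimal sketch of that pipeline (using the Delaney solubility loader and DeepChem’s GraphConvModel; the featurizer and splitter strings, epoch count, and metric below are illustrative choices, not prescriptions):

import deepchem as dc

# Load with a graph featurizer and a scaffold split, then train a model
# directly on the resulting DeepChem Datasets.
tasks, (train, valid, test), transformers = dc.molnet.load_delaney(
    featurizer="GraphConv", splitter="scaffold")

model = dc.models.GraphConvModel(n_tasks=len(tasks), mode="regression")
model.fit(train, nb_epoch=10)

# Evaluate on the validation set; transformers undo the target normalization.
metric = dc.metrics.Metric(dc.metrics.rms_score)
print(model.evaluate(valid, [metric], transformers))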

Splitters are particularly important here. When comparing how different models perform on the same task, it’s crucial that each model “sees” the same training data and is evaluated on the same data. We also want to know how a model does on samples that are similar to what it’s seen before (using a randomized train/val/test split) versus how it does on samples that are dissimilar (e.g., using a split based on chemical substructures).
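As a quick illustration, swapping the splitter argument is all it takes to go from the easier setting to the harder one (again a sketch with the Delaney loader; most modern MolNet loaders accept the same splitter strings):

import deepchem as dc

# Random split: test molecules look like the training distribution.
tasks, random_splits, _ = dc.molnet.load_delaney(
    featurizer="ECFP", splitter="random")

# Scaffold split: test molecules contain core substructures never seen in
# training, a harder and usually more realistic test of generalization.
tasks, scaffold_splits, _ = dc.molnet.load_delaney(
    featurizer="ECFP", splitter="scaffold")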

Accessing Datasets with the DeepChem API

MolNet loaders make it possible to access a dataset and pre-process it for ML with a single function call:

from deepchem.molnet.load_function.qm9_datasets import load_qm9

tasks, (train, val, test), transforms = load_qm9()

To actually make the dataset available through the DeepChem API, you simply provide a tarball or zipped folder to a DeepChem developer, who will add it to the DeepChem AWS S3 bucket. Finally, add documentation for your loader and dataset.

We Want YOUR Datasets!

After taking a look at the long list of datasets in MoleculeNet, you might find that there’s something crucial missing. The good news is that you (yes, YOU!) can contribute new datasets! If you’re not comfortable with Python programming, you can simply open an issue on GitHub, include information on why the dataset should be added to MolNet, and request help from a DeepChem developer. If you are comfortable programming, even better — you can follow the steps outlined above and make a contribution.

The real power of an open-source benchmark is that anyone can contribute; this allows MolNet to evolve and expand beyond what a single research group can support.

Next Steps: Molecular ML Model Performance

In the next post, we’ll discuss how to use DeepChem and MolNet scripts to add performance metrics for ML models.

Getting in touch

If you liked this tutorial or have any questions, feel free to reach out to Nathan over email or connect on LinkedIn and Twitter.

You can find out more about Nathan’s projects and publications on his website.

Thanks to Zhenqin Wu for providing feedback on this post.

References

[1] Z. Wu et al., “MoleculeNet: a benchmark for molecular machine learning,” Chem. Sci., 2018, 9, 513–530. DOI: 10.1039/C7SC02664A.

