Molecule synthesis using AI

A step-by-step guide to using GANs in the search for new drugs and materials

Neeraj Jain
Towards Data Science


Progressive GAN generated molecule structures (three molecules are real)

Molecule synthesis is the process of constructing complex chemical molecules from simple precursors. It is key to the development of next-generation medicines, smart materials, pesticides and electronics. Hitherto, molecule synthesis has been a manual, costly, time-consuming, multi-step process, guided mainly by chemical intuition and a sound knowledge of chemical reactions. Artificial intelligence (AI) has the potential to make any molecule at will, inexpensively and on a meaningful timescale, which would unlock unimagined opportunities for future scientific advancement.

A generative adversarial network (GAN) is an AI model that consists of a ‘Generator’ and a ‘Discriminator’. The Generator captures the training data distribution and generates samples from it. The Discriminator estimates the probability that a sample came from the training data rather than from the Generator. This two-player game is played until the Generator maximizes the probability of the Discriminator making a mistake.
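
To make this two-player game concrete, the sketch below shows a single GAN training step in PyTorch. It is a minimal illustration with placeholder fully-connected networks and hyper-parameters, not the model used in this project:

import torch
import torch.nn as nn

# Toy Generator and Discriminator; real models are convolutional.
generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # Train the Discriminator to separate real samples from generated ones.
    fake_batch = generator(torch.randn(batch_size, 100)).detach()
    d_loss = (loss_fn(discriminator(real_batch), real_labels)
              + loss_fn(discriminator(fake_batch), fake_labels))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the Generator so the Discriminator labels its samples as real.
    g_loss = loss_fn(discriminator(generator(torch.randn(batch_size, 100))), real_labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()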

In this article, I will demonstrate how a GAN can be trained to generate molecular structures. This is the first step towards automating molecule synthesis. Similar work is being done by the Machine Learning for Pharmaceutical Discovery and Synthesis Consortium (Can AI Create Molecules? by the MIT-IBM Watson AI Lab). I chose this example to accentuate the ease with which any individual can use AI to perform cutting-edge research. The only limiting factor is imagination.

Data

The first step in any artificial intelligence (AI) project is data. There are multiple websites containing chemical databases in various formats. PubChem contains a list of ~96M chemical compounds in SMILES notation. SMILES entries can be classified into functional groups and converted into molecular structure images using the Python RDKit library. Out of 52 functional groups, only the Alcohol Aliphatic functional group was used, to limit the scope of this work. The first 50,000 compounds from the Alcohol Aliphatic group were converted into images of size 128x128. The code to categorize and convert SMILES entries into images is available on GitHub.
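
As an illustration of this step, here is a minimal RDKit sketch; the SMILES strings, the SMARTS pattern for an aliphatic alcohol and the output file names are illustrative, not the exact ones used in the GitHub code:

from rdkit import Chem
from rdkit.Chem import Draw

smiles_list = ["CCO", "OCC(O)CO", "CC(C)O"]            # example SMILES entries
alcohol_aliphatic = Chem.MolFromSmarts("[CX4][OX2H]")  # rough pattern for an aliphatic alcohol

for i, smiles in enumerate(smiles_list):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                                    # skip entries that fail to parse
        continue
    if mol.HasSubstructMatch(alcohol_aliphatic):       # keep only the target functional group
        Draw.MolToFile(mol, "alcohol_%d.png" % i, size=(128, 128))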

Training

The next step was to select a GAN model. After a few failed attempts, I selected the PyTorch implementation of Progressive GAN (PGAN) by Facebook Research (pytorch_GAN_zoo). The novelty of Progressive GAN is that it starts training with images at low resolution and adds new layers that introduce higher-resolution details as training progresses. According to its authors, these characteristics make PGAN more stable and faster to train.

The first step in training was to clone pytorch_GAN_zoo:

git clone https://github.com/facebookresearch/pytorch_GAN_zoo.git

The training process for PGAN is slightly different from that of other GANs. Before training starts, the data has to be resized into low-resolution images. This is done with the following command:

python datasets.py celeba $PATH_TO_IMAGES -o $OUTPUT_DIR -f

In this command, “celeba” is the name of a pre-trained dataset configuration. pytorch_GAN_zoo provides several such pre-trained dataset configurations for this model. The “celeba” configuration corresponds to images of 128x128 pixels, which is the same size as the images used in this project. Unless the user is a hyper-parameter wizard, it is advisable to adapt the data to the hyper-parameters of a pre-trained configuration rather than to tune the hyper-parameters to the data. After the first run, the hyper-parameters can be changed to fine-tune the model. The following table contains the pre-trained configuration names and the image sizes they support.

Dataset name and image size

datasets.py will also generate a config file. This config file contains 1) the paths to the resized images and the original images, and 2) the number of iterations per scale. ‘Scale’ is a concept unique to PGAN: every scale is associated with a number of layers in the model, a number of iterations and an image resolution. The maximum scale is calculated using the following formula.

image_size = 2**(2+max_scale)
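
For the 128x128 images used here, a short Python check of this formula (assuming square, power-of-two image sizes) gives a maximum scale of 5:

import math

image_size = 128
max_scale = int(math.log2(image_size)) - 2              # inverse of image_size = 2**(2 + max_scale)
scales = [2 ** (2 + s) for s in range(max_scale + 1)]   # resolution trained at each scale
print(max_scale, scales)                                # 5 [4, 8, 16, 32, 64, 128]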

The constant 2 is added to the maximum scale because the training layers start from a resolution of 4x4. datasets.py will resize images in steps of (64, 128, 512, 1024) up to the image size of the data. “$PATH_TO_IMAGES” is the location of the training images. “$OUTPUT_DIR” is the location where the resized images will be saved. The “-f” option generates the resized images before training starts. Once the resized images and the config file were created, the next step was to start training with the following command:

python train.py PGAN -c $CONFIG_FILE -n $DATASET_NAME -d $WEIGHTS_DIR

In this command, “PGAN” refers to Progressive GAN; pytorch_GAN_zoo also supports DCGAN. “$CONFIG_FILE” is the path to the config file generated by datasets.py. “$DATASET_NAME” is the name of the custom dataset. “$WEIGHTS_DIR” is the location where the weights will be stored. More options are defined in train.py. I used the “--np_vis” option for numpy-based visualization instead of installing the “visdom” package, and I set the “-e” and “-s” options to 2000.
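
Putting these options together, the full training invocation looked roughly like the line below; the config file name, dataset name and weights directory are illustrative placeholders:

python train.py PGAN -c config_alcohol.json -n alcohol_aliphatic -d weights --np_vis -e 2000 -s 2000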

Training Analysis

This training ran for 9 days on a Tesla K80 based server. I stopped the training after 62,000 iterations on the last scale. The weights of the trained model can be downloaded using this link. One of the main reasons GAN training takes so long is the lack of transfer learning in generative models; transfer learning reduces the data requirement and convergence time in discriminative models, and there is a recent paper that proposes to address this issue in GANs. One way of reducing training time is to use the minimum image size relevant to the data.

In PGAN, training starts with layers at 4x4 resolution and goes up to the image size of the dataset. For an image size of 128, layers were added at scales of 4, 8, 16, 32, 64 and 128. Plotting training time against iterations clearly shows how the addition of each new layer increased the training time at every scale.

Training time vs iterations

The video below shows the training progress. Most of the refinement in the generated images happened during the last scale. The video was created by stitching together the evaluation images that were saved every 2000 iterations.

Training progress on Progressive GAN
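
The stitching itself can be done in a few lines, for example with imageio (assuming the ffmpeg backend is available and the evaluation snapshots are PNG files in a single folder; paths and frame rate are illustrative):

import glob
import imageio

frames = sorted(glob.glob("eval_images/*.png"))   # evaluation images saved every 2000 iterations
writer = imageio.get_writer("training_progress.mp4", fps=10)
for path in frames:
    writer.append_data(imageio.imread(path))      # append one video frame per snapshot
writer.close()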

pytorch_GAN_zoo provides tools to evaluate the performance of the model on generated images. The most popular one is the inception score. The inception score of the generated images was calculated with the following command.

python eval.py inception -c $CONFIGURATION_FILE -n $modelName -m $modelType -d $WEIGHTS_DIR

Sliced Wasserstein distance (SWD) is another method used to evaluate high-resolution GANs. The laplacian SWD score was calculated with the following command. More information about evaluating GAN performance is given in this paper.

python eval.py laplacian_SWD -c $CONFIGURATION_FILE -n $modelName -m $modelType -d $WEIGHTS_DIR

pytorch_GAN_zoo implements another tool, “Inspirational Adversarial Image Generation”. This tool takes an image as input and extracts an input vector using gradient descent. The input vector is then used to generate new images that share characteristics of the input image. Inspirational generation is a two-step process.

python save_feature_extractor.py {vgg16, vgg19} $PATH_TO_THE_OUTPUT_FEATURE_EXTRACTOR --layers 3 4 5

In this command, vgg16/vgg19 specifies the model to be used for input vector extraction. Once the feature extractor was saved, the following command was used to generate molecule structures based on the input image:

python eval.py inspirational_generation -n $modelName -m $modelType --inputImage $pathTotheInputImage -f $PATH_TO_THE_OUTPUT_FEATURE_EXTRACTOR -d $WEIGHTS_DIR

The image below shows the input and the generated molecule structures. It is admirable how well PGAN learned the input vector and generated the pattern.

Inspirational Generation

Future Work

The model trained above generates random molecular structures. This work can easily be extended with an LC-GAN to generate molecules with specific attributes, by constraining the latent space of a Variational Autoencoder (VAE) with a separate GAN trained in that latent space. The model could also be trained on 3D molecular structures, and its output could be fed to another model that validates and predicts the properties of the generated structures. All of these directions are currently being pursued by the MIT-IBM Watson AI Lab.

Thanks to open access to information, data and resources, any individual can pursue research into new technologies.

Acknowledgement

I am indebted to the open source communities on whose work and knowledge this article depends. I thank Preeti Gupta, professor of Chemistry, for giving me a crash course in functional group classification.
