Pl-ai (play) models — quiz yourself

Play around with AI models of molecules and DNA

--


This time, I thought we could have some fun with AI models of molecules and DNA. I have composed a short self-assessment quiz for you to test your understanding. Below are four figures, one per question, each describing a bioinformatics application of an AI architecture. The illustrations have missing components, numbered 1 to 10 across the four figures. Can you fill in the missing components of each illustration? The answers and explanations are provided below.

Questions:

Question 1/4: Multimodal neural network. This architecture predicts a drug response. Given a molecular structure (the drug) and gene expression (the cell line's genetic profile), which types of encoders (numbered 1, 2) would represent these inputs?

Question 1: Which kinds of encoders produce embeddings for the drug and the cell line, which are then fed into a predictor to estimate drug response?

Question 2/4: Cascade model. Given a drug molecule, which encoders (numbered 3, 4) are fed in cascade into the predictor to classify the drug molecule?

Question 2

Question 3/4: Given a standing DNA double helix on skates, wearing a hat and eyeglasses (the “input character”): which model (numbered 7) reconstructs the exact “input character”? Which representation is constructed (numbered 5), and what does the encoder extract (numbered 6)?

Question 3

Question 4/4: This network recognizes a fake molecule (the one wearing a hat) and returns a poison molecule. Which sub-models (numbered 8a, 8b) make up this model? Which data does the output return (numbered 9)? Which model is numbered 10?

Question 4

Answers:

Note that the answers for components 1–4 are diverse, whereas components 5–10 are pretty standard.

Explanations:

Question 1: Multimodality

Drug response is predicted by concatenating two modalities: gene expression and a molecule. An RNN (2) represents the molecule, and a CNN (1) represents the genetic profile. By concatenating the outputs of these models, a combined representation is generated that carries both sequential and convolutional information.
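
To make the idea concrete, here is a minimal PyTorch sketch of such a multimodal network. The layer sizes, SMILES vocabulary, and gene-panel length are illustrative assumptions of mine, not values taken from the figure.

```python
import torch
import torch.nn as nn

class DrugResponseNet(nn.Module):
    def __init__(self, smiles_vocab=64, n_genes=978, latent=128):
        super().__init__()
        # (2) RNN encoder: embeds a tokenized SMILES string of the drug.
        self.embed = nn.Embedding(smiles_vocab, 32)
        self.rnn = nn.GRU(32, latent, batch_first=True)
        # (1) CNN encoder: 1-D convolutions over the gene-expression profile.
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1), nn.Flatten(),
            nn.Linear(16, latent),
        )
        # Predictor: consumes the concatenated multimodal embedding.
        self.predictor = nn.Sequential(
            nn.Linear(2 * latent, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    def forward(self, smiles_tokens, expression):
        _, h = self.rnn(self.embed(smiles_tokens))    # drug embedding
        cell = self.cnn(expression.unsqueeze(1))      # cell-line embedding
        combined = torch.cat([h.squeeze(0), cell], dim=-1)
        return self.predictor(combined)               # predicted response
```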

Question 2: Cascade Model

The output of the CNN (3) is the input of the RNN (4). This cascade model acts like a descriptor that defines the drug locally (the CNN contributes local feature selection) and globally (the RNN contributes global feature discovery). Here, we assume that the local features represent the atom types and functional groups, and the global feature is the atomic arrangement.
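
Below is a minimal sketch of this cascade, again in PyTorch and assuming a tokenized SMILES input; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CascadeClassifier(nn.Module):
    def __init__(self, vocab=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, 32)
        # (3) CNN: local feature selection (atom types, functional groups).
        self.cnn = nn.Conv1d(32, 64, kernel_size=5, padding=2)
        # (4) RNN: global feature discovery over the CNN feature sequence
        # (the atomic arrangement).
        self.rnn = nn.GRU(64, 128, batch_first=True)
        self.predictor = nn.Linear(128, n_classes)

    def forward(self, tokens):
        x = self.embed(tokens).transpose(1, 2)        # (batch, 32, seq)
        x = torch.relu(self.cnn(x)).transpose(1, 2)   # (batch, seq, 64)
        _, h = self.rnn(x)                            # global summary state
        return self.predictor(h.squeeze(0))           # drug class logits
```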

Question 3: Variational Autoencoder (VAE) (7, 8a)

The principal goal of an autoencoder (AE) is to construct a low-dimensional latent space (5) of compressed representations such that each input can be reconstructed back to the original. The module that maps the high-dimensional original input to a low-dimensional representation is called the encoder, while the module that reverses this mapping and reconstructs the original input from the low-dimensional representation is called the decoder. The encoder and decoder are usually neural networks, often with RNN or CNN architectures. Once the molecular representations are computed, a typical AE workflow for molecule generation starts by encoding the input into the low-dimensional latent space. Within the latent space, the axes of variation (6) of the input can be encoded.
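
Here is a minimal VAE sketch. I use small MLP modules for brevity, although, as noted above, RNN and CNN encoders/decoders are typical in practice; all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MolecularVAE(nn.Module):
    def __init__(self, input_dim=2048, latent_dim=32):
        super().__init__()
        # Encoder: high-dimensional input -> low-dimensional latent space (5).
        self.enc = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)       # axes of variation (6)
        self.to_logvar = nn.Linear(256, latent_dim)
        # Decoder (7): reconstructs the original input from the latent code.
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

# Training minimizes a reconstruction loss plus a KL term that pulls
# (mu, logvar) toward a standard Gaussian prior.
```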

Question 4: Adversarial Autoencoder (AAE)

In generative chemistry, a GAN (8b) generates strings, molecular graphs, or fingerprints, depending on the chosen molecular representation, from latent random inputs (6). The generated molecules are mixed with samples of real compounds and fed to the discriminator (10). The discriminative loss evaluates whether the discriminator can distinguish the real compounds from the generated ones, while the generative loss (9) estimates whether the generator can fool the discriminator by producing indistinguishable molecules. Together, the two losses mean that even a well-trained discriminator can be misled into classifying generated molecules as real once the generator has learned the authentic data patterns well enough to create new compounds. However, to fool the discriminator and minimize the generative loss, the generator can only explore the limited chemical space defined by the real compounds; the restricted chemical space covered by the generated molecules may therefore be a limitation.
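
The following sketch shows this adversarial loop under the assumption of a fingerprint representation; the generator, discriminator, and all sizes are illustrative, not a specific published model.

```python
import torch
import torch.nn as nn

latent_dim, mol_dim = 32, 2048
# (8b) Generator: latent random input (6) -> molecular fingerprint.
gen = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                    nn.Linear(256, mol_dim), nn.Sigmoid())
# (10) Discriminator: real compound vs. generated one.
disc = nn.Sequential(nn.Linear(mol_dim, 256), nn.ReLU(),
                     nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()

def gan_losses(real_mols):
    n = len(real_mols)
    fake = gen(torch.randn(n, latent_dim))
    # Discriminative loss: can real compounds be told from generated ones?
    d_loss = bce(disc(real_mols), torch.ones(n, 1)) + \
             bce(disc(fake.detach()), torch.zeros(n, 1))
    # Generative loss (9): can the generator fool the discriminator?
    g_loss = bce(disc(fake), torch.ones(n, 1))
    return d_loss, g_loss
```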

The architecture of the adversarial autoencoder (AAE) is quite similar to the variational autoencoder (8a), except for the additional discriminator network (10). An AAE trains three modules: an encoder, a decoder, and a discriminator (10). The encoder learns the input data and maps the molecule into the latent space. The decoder reconstructs molecules by sampling from the latent space according to the probabilistic decoding distribution (6). The discriminator distinguishes the distribution of the latent space from a chosen prior distribution. During training, the encoder is updated at each iteration to minimize the discriminator's cost, so its latent codes are pushed toward that prior. A simple prior, such as a Gaussian, is assumed in a VAE, whereas an AAE can impose alternative priors in real-world practice.
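
And here is a minimal AAE sketch with its three modules, where the discriminator now operates on latent codes rather than on molecules; the Gaussian prior and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, input_dim = 32, 2048
# Encoder: molecule -> latent code.
encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                        nn.Linear(256, latent_dim))
# Decoder: latent code -> reconstructed molecule.
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, input_dim))
# (10) Discriminator: prior sample vs. encoded latent code.
disc = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                     nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

def aae_losses(x):
    z = encoder(x)
    recon_loss = F.mse_loss(decoder(z), x)         # reconstruction term
    prior = torch.randn_like(z)                    # assumed Gaussian prior
    # Discriminative loss: label prior samples 1, encoded codes 0.
    d_loss = bce(disc(prior), torch.ones(len(x), 1)) + \
             bce(disc(z.detach()), torch.zeros(len(x), 1))
    # Adversarial loss for the encoder: push latent codes toward the prior.
    g_loss = bce(disc(z), torch.ones(len(x), 1))
    return recon_loss, d_loss, g_loss
```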

Note:

  1. If you liked this post and want to learn more about drug discovery, I recommend reading my blog Questions you should ask AI-based drug discovery companies.
  2. I put a lot of effort into my blogs. Please write me an email (miritrope@gmail) or connect via LinkedIn and tell me that you liked my work.
  3. Additionally, I’ve just opened my website. You are welcome to visit.
  4. All the diagrams were drawn by the author.
