Bridging the Gap Between Genetics and Neural Networks

Building and Analysing Neural Networks on Genetic Data

--

An illustration of a neural network merged with aligned DNA double helices, drawn by the author

I recently conducted research on genetic sequences. The main question that occupied my mind was: “What is the simplest neural network architecture that is well suited to genetic data?” After much literature review, I discovered that the most “down to earth” yet fascinating work on this topic took place in Prof. Yoshua Bengio’s lab. The paper is named “Diet Networks: Thin Parameters for Fat Genomics,” and its main goal was to classify the genetic sequences of 3,450 individuals into 26 ethnicities. That paper inspired me, and here I would like to explain the basics of building neural networks for solving that sort of problem. No prior background in biology is needed to follow this blog; I will try to cover most of the necessary parts so we can jump straight into the computational sections.

Motivation

We are facing challenging times: the SARS-CoV-2 virus has left us helpless against the powerful forces of nature. By learning new tools, gaining intuition about genomic data, and exploring which machine learning methods generalize best on that data, I hope we can join forces and make a change for better days, or at least use the incredible intelligence of neural networks for something beyond entertainment applications: saving our lives and even our planet.

Why Do I Find Genetics so Appealing?

Your genetics reveal not just your own biological information but also the genetic history of your ancestors, recording which dominant parts survived through the years (check out “ancestral sequence reconstruction”).

In other words, it is the biological, evolutionary code of your family; even more, according to Darwin’s theory of evolution, the entire collection of organic creatures (plants, animals, etc.) shares the same genetic principles.

Intuition

Let me walk you through other types of data, such as images and sentences, to highlight the uniqueness of genetic data. On the one hand, images are two-dimensional data (or three-dimensional for volumes) with well-defined neighbor relationships. Sentences are one-dimensional vectors of up to about a thousand values, whose hierarchical structure is captured by models trained in an unsupervised manner.

On the other hand, a genetic sequence is a one-dimensional vector of at least hundreds of thousands of values, with no well-defined relations between neighbors and no ready-made set of pre-trained models.

Thus, the Gaussian smoothing filter that is so popular in image processing is not relevant here, and the whole bunch of pre-trained models in vision (ImageNet-trained VGG and ResNet) and natural language processing (Word2Vec, GloVe, BERT) are benched out of the game.

Why Is It a Challenge?

Think of a database consisting of thousands of genetic samples. You need to find a method that generalizes well (accuracy over 90%) on input data with tens of millions of combinations. A neural network can be a good fit because it utilizes the power of fully connected units in a way that is missing from “classical” algorithms like PCA, SVM, and decision trees. Nevertheless, even the simplest network architecture requires tens of millions of free parameters in the weights of its first layer. Dimensionality reduction (to avoid a surfeit of free parameters) is one way to face that problem; we will discuss it later in this blog.
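To make the scale concrete, here is a quick back-of-the-envelope count (assuming roughly 300K SNPs as input and 100 units in the first hidden layer, the sizes used later in this blog):

n_snps = 300_000          # number of SNP features per sample
n_hidden = 100            # units in the first fully connected layer
print(n_snps * n_hidden)  # 30,000,000 free parameters in the first layer's weight matrix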

Biological Background

To keep things clear and not burden the main purpose of this blog, I present here only a high-level view of the biological parts needed. Needless to say, you are more than welcome to explore any of these biological topics further.

What Is a Genetic Sequence?

A DNA molecule is a sequence of four types of bases, represented by the letters A, C, G, and T. Specific parts of the sequence (even if remotely located) are correlated with a phenotype. For instance, a recent study, “A Pneumonia Outbreak Associated With a New Coronavirus of Probable Bat Origin,” indicates that the ACE2 gene could encode the host receptor of the SARS-CoV-2 virus. That example and many others show how much valuable information (identifying criminals, matching cannabis strains, nutrition, and personalized medication) can be obtained based solely on your DNA.

What Are SNP Genotypes?

Nowadays, we are closer than ever to sequencing nearly complete human genomes; however, we are still far from routinely covering the entire sequence. Single Nucleotide Polymorphisms (SNPs) are specific genotyped locations in the genomic sequence, generally identified as RS[number]. Different populations carry different variants at these positions, while within families they are likely to be about the same (hence Asians look different from Europeans). The analysis of SNP sequences will be a key point throughout the rest of this blog.

Method

In this section, I describe the data and the two main network architectures (plus another network with improved parameters that overcomes some of the major problems in machine learning), as well as some technical tips.

Data

Relative to other data types, medical datasets are difficult to find, mainly due to privacy restrictions. In light of this, the 1000 Genomes Project achieved a remarkable breakthrough by publishing a publicly available dataset of 3,450 human DNA samples, each with 315K SNPs, covering 26 worldwide populations. The next figure shows a histogram derived from the 1000 Genomes data, depicting the number of individuals per population (ethnicity); on average, there are about 133 genetic samples per population.

The 1000 Genomes Population Distribution (Ethnicity)

Dimensionality Reduction

As mentioned above, reducing the number of free parameters in a model is preferable (in our case, we are dealing with about 30 million parameters). The proposed method achieves this with an auxiliary network on top of the discriminative network, which takes as input a per-class histogram (an embedding matrix calculated in an unsupervised manner). The output of this auxiliary network provides the weights of the first layer of the discriminative network. The embedding matrix is the normalized genotype histogram per population, and its size is [number of SNPs] x [4 x 26], where four stands for the genotype options {00, 01, 11, NA} (bi-allelic) and 26 for the number of classes (populations).

Generating per class histogram

The implementation of such an embedding matrix is described below.

Anyhow, that is their solution; mine is to reduce the number of hidden units per layer (see the Architecture section). I call this new architecture the improved model, and one of its benefits is that it helps overcome overfitting, as discussed later in the Results section.

Architecture

Two main networks are compared in this blog. Both consist of two fully connected hidden layers followed by a softmax layer, but the second (see the next figure) includes the auxiliary network that predicts the discriminative network’s free parameters. The auxiliary network takes the embedding matrix as input and returns the weights of the discriminative network’s first layer (Fig. 1).

Figure 1: Two Discriminative Models, Without (upper) and With the Auxiliary Network (lower). Drawn by the author. Note that the embedding dimension of the auxiliary network is 10K (the explanation is below).

A detailed description of the architecture can be seen in Fig. 2: batch norm followed by a dropout layer is required before each fully connected layer.

Figure 2: The Internal Network Architecture
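For readers who prefer code to diagrams, below is a minimal PyTorch sketch of the basic discriminative model, following my reading of Fig. 2; the layer sizes, the activation, and the exact placement of batch norm and dropout are assumptions for illustration, not a copy of the repository.

import torch
import torch.nn as nn

class DiscrimNet(nn.Module):
    def __init__(self, n_snps=300_000, n_hidden=100, n_classes=26, dropout_1=0.5, dropout_2=0.5):
        super().__init__()
        self.hidden_1 = nn.Linear(n_snps, n_hidden)    # ~30M weights: 300K SNPs x 100 units
        self.hidden_2 = nn.Linear(n_hidden, n_hidden)
        self.out = nn.Linear(n_hidden, n_classes)
        self.bn_1 = nn.BatchNorm1d(n_hidden)
        self.bn_2 = nn.BatchNorm1d(n_hidden)
        self.drop_1 = nn.Dropout(dropout_1)
        self.drop_2 = nn.Dropout(dropout_2)

    def forward(self, x):
        # batch norm and dropout sit before the next fully connected layer
        h = self.drop_1(self.bn_1(torch.relu(self.hidden_1(x))))
        h = self.drop_2(self.bn_2(torch.relu(self.hidden_2(h))))
        return self.out(h)  # logits; the softmax is applied by the cross-entropy loss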

Implementation

I wrote all the code in this work from scratch in PyTorch; it can be found in the public GitHub repository named “human genome”. Below are some general points that I find most relevant here.

  1. Managing the Data Structure

To obtain objective results, we generate five folds, one per experiment, and compute the statistics across them at the end.

  • Split and Shuffle

We split the 3.5K samples into train (60%), validation (20%), and test (20%) sets. As usual, we randomly shuffle the data and normalise the values:

mu = x.mean(axis=0)
sigma = x.std(axis=0)
train[0] = (train[0] - mu[None, :]) / sigma[None, :]
valid[0] = (valid[0] - mu[None, :]) / sigma[None, :]
test[0] = (test[0] - mu[None, :]) / sigma[None, :]
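For completeness, here is a minimal sketch of how the shuffle and the 60/20/20 split used above might be produced (x holds the genotype matrix, y the population labels; the variable names are mine, not necessarily the repository's):

import numpy as np

n = x.shape[0]                                 # x: genotype matrix (samples x SNPs), y: population labels
idx = np.random.permutation(n)                 # random shuffle of the sample indices
n_train, n_valid = int(0.6 * n), int(0.2 * n)
train_idx = idx[:n_train]
valid_idx = idx[n_train:n_train + n_valid]
test_idx = idx[n_train + n_valid:]
train = [x[train_idx], y[train_idx]]           # train[0] holds the features, train[1] the labels
valid = [x[valid_idx], y[valid_idx]]
test = [x[test_idx], y[test_idx]]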

2. Dimensionality Reduction

Generating the embedding matrix is done in two steps: the first generates a histogram of genotypes per class with bincount(), and the second normalises that histogram. The outcome is a drastic dimensionality reduction: each SNP is now represented by the 4 x 26 = 104 histogram values instead of its raw values across all individuals.

import numpy as np

n_genotypes = 4  # genotype options {00, 01, 11, NA}
# genome: matrix of genotypes (n_features SNPs x n_samples), labels: population index per sample
embed = np.zeros((n_features, n_targets * n_genotypes))
for i in range(n_features):
    if i % 5000 == 0:
        print("processing snp no: ", i)
    for j in range(n_targets):
        # generate, for each SNP, a histogram of four bins, one for each allele's option
        embed[i, j*n_genotypes: j*n_genotypes + n_genotypes] += \
            np.bincount(genome[i, labels == j].astype('int32'), minlength=n_genotypes)
        # normalise the result for each class
        embed[i, j*n_genotypes: j*n_genotypes + n_genotypes] /= \
            embed[i, j*n_genotypes: j*n_genotypes + n_genotypes].sum()

Here is an example of a histogram per class of a specific SNP:

Normalized per class histogram of a specific SNP

3. Connecting the Networks

Both networks (auxiliary and discriminative) have separate computational graphs that are not linked in any way. The computational graph of the discriminative net, which contains the loss, has no information about the dependency between the loss and the embedding tensor. A solution is to manually set the gradient of the embedding tensor to the gradient computed for the first-layer weights of the discriminative net, and then call torch.autograd.backward() on the embedding net, because in the computational graph of the embedding net the dependency between tensors is known.

The training loop is:

  • Forward pass in the embedding net
  • Set the weights of the first hidden layer in the discriminative net with the embedding values
  • Forward pass in the discriminative net
  • Backward pass in the embedding and discriminative nets
  • Update the parameters

In the discriminative model’s first hidden layer, we initialise its 30 million weights with the output of the auxiliary network (which is the embedding):

with torch.no_grad():
    self.hidden_1.weight.copy_(embedding)

The forward passes:

embedding = emb_model(feat_emb) # forward pass in embedding net
discrim_model.hidden_1.weight.data = embedding
embedding.grad = discrim_model.hidden_1.weight.grad

The optimizer should update the parameters of both networks:

params = list(discrim_model.parameters()) + list(emb_model.parameters())
optimizer = torch.optim.Adam(params, lr=learning_rate)

The backward passes:

Because the embedding tensor was produced inside the embedding net’s computational graph, where the dependencies between tensors are known, calling backward on it propagates the gradients through the embedding net:

loss.backward(retain_graph=True) # compute the gradients in the discriminative net, including the first hidden layer's weights
torch.autograd.backward(embedding, embedding.grad) # backpropagation through the embedding net
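Putting these pieces together, one training iteration might look like the following sketch (train_minibatches, x_batch, and y_batch are assumed names for illustration):

for x_batch, y_batch in train_minibatches:
    optimizer.zero_grad()
    embedding = emb_model(feat_emb)                       # forward pass in the embedding net
    discrim_model.hidden_1.weight.data = embedding        # inject the predicted first-layer weights
    y_hat = discrim_model(x_batch)                        # forward pass in the discriminative net
    loss = loss_fn(y_hat, y_batch)                        # prediction vs. target
    loss.backward(retain_graph=True)                      # gradients inside the discriminative net
    embedding.grad = discrim_model.hidden_1.weight.grad   # reuse the first layer's gradient for the embedding output
    torch.autograd.backward(embedding, embedding.grad)    # backpropagate through the embedding net
    optimizer.step()                                      # update the parameters of both networks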

4. Training Session

Daniel Godoy explains the training process in his blog; I would like to expand on his work in the context of training with minibatches.

  • Minibatch and Epoch

The function loss_fn(y, yhat) returns the mean squared error (squared L2 norm) between each element of the input y and the target yhat, averaged over the minibatch. To accumulate the loss over an epoch, we multiply that value by the batch size and sum the results over all minibatches:

train_loss += loss * batch_size

At the end, each epoch holds the value accumulated over all minibatches; to recover the reported loss, we divide by the number of minibatches in that loop:

test_loss = test_loss / len(test_minibatches)

I highly recommend using early stopping: based on the validation loss, it automatically decides when is the right time to stop training and saves the best model seen so far.
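As a rough sketch, early stopping can be implemented as follows (the patience value, the helper functions, and the checkpoint file name are illustrative assumptions):

max_epochs = 200                               # arbitrary upper bound on the number of epochs
best_valid_loss = float('inf')
patience, epochs_without_improvement = 10, 0   # arbitrary patience for illustration
for epoch in range(max_epochs):
    train_one_epoch()                          # hypothetical helper wrapping the training loop above
    valid_loss = evaluate(valid_minibatches)   # hypothetical helper returning the validation loss
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        epochs_without_improvement = 0
        torch.save(discrim_model.state_dict(), 'best_model.pt')   # keep the best model so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                              # stop training; best_model.pt holds the best result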

Frameworks

Below are some cool tools that I find useful (and free):

  • Neural Network Library

I must mention the benefit of using PyTorch: from my experience, compared with many other neural network libraries, it is the best in many ways. The paper “A Comparative Measurement Study of Deep Learning as a Service Framework” presents an empirical analysis of the frameworks TensorFlow, Caffe, Torch, and Theano.

  • Training on the Cloud

You would benefit from training your model on the cloud and saving time. Ashwin De Silva describes in his blog how you can work locally, commit and push to your GitHub repository, pull your commits on the cloud, and run your training there. In my opinion, it’s worth the time and effort to script some tests with different parameters, such as the number of hidden units, dropout values, and so forth.

Results

In retrospect, I walked through some difficulties that are well known among data scientists while analysing the results, and I find it important to share them here to give you an honest picture of the dynamics of developing such networks. The following remarks cover the main aspects to investigate when evaluating the performance of your network:

  • Loss

Let’s start with the loss function: this is the “bread and butter” of network performance, and it should decrease roughly exponentially over the epochs. A model that generalizes well keeps the validation loss close to the training loss; the validation loss is usually slightly higher simply because the model is evaluated on unseen data. If the gap is large, your model is probably overfitting. Solutions to overfitting can be one or a combination of the following: reducing the number of hidden units or removing layers to cut the number of free parameters (as discussed above, both the improved network and the auxiliary network come to the rescue here), increasing the dropout value, or adding regularisation. Mazid Osseni, in his blog, explains different types of regularization methods and implementations. Fig. 3 shows the loss function of the simpler version of my network before (left) and after (right) dealing with the overfitting problem.

My solution is a combination of lowering the number of hidden units (from 100 to 50) and increasing the dropout value of the first layer (from 0.5 to 0.8).
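In terms of the sketch from the Architecture section, the improved model amounts to changing two hyperparameters (the class and argument names are mine, not the repository's):

# improved model: 50 hidden units and dropout 0.8 on the first layer
model = DiscrimNet(n_snps=300_000, n_hidden=50, n_classes=26, dropout_1=0.8)
# basic model, for comparison: DiscrimNet(n_snps=300_000, n_hidden=100, n_classes=26, dropout_1=0.5)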

Figure 3: The Loss Functions (training in orange and validation in blue) without (left) and with (right) the auxiliary network
  • Accuracy

The test accuracy was calculated for each architecture. Surprisingly, it seems that overcoming the overfitting or reducing the number of free parameters does not guarantee higher accuracy. Fig. 4 shows the test accuracy of the three architectures: notice that the accuracy is higher despite the overfitting problem.

The number of free parameters in the first layer of such a model is roughly the number of features (SNPs) times the number of units in the first layer (~300K x 100). With the auxiliary network, those 300K x 100 free parameters are predicted instead of being learned directly. The auxiliary network takes as input a feature embedding, that is, some transformation of the vector of values each feature (SNP) takes across individuals. The question is then what this embedding looks like. Following the embeddings considered in the paper, each SNP gets a 4 x 26 = 104-dimensional per-class histogram, which the auxiliary network maps to the 100 units of the first layer, i.e. roughly 104 x 100 ≈ 10K free parameters (the 10K embedding dimension mentioned in Fig. 1).

Figure 4: Test Accuracy (batch size: 64) and the number of free parameters of the first layer.
  • Batch Sizes

Testing the performance with different batch sizes is an amusing task. Kevin Shen, in his blog, investigates the effect of batch size on training dynamics. Looking at the total training times, the training time decreases as the batch size grows (Fig. 6), probably because of the diversity of the data within each batch. For the same reason, the loss increases with the batch size (Fig. 5).

Figure 5: The Loss Functions — Training and Validation (dashed) Over Different Batch Sizes
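One simple way to run such an experiment is to loop over a few batch sizes and time each training run; in the sketch below, train_dataset and train_model are illustrative placeholders:

import time
from torch.utils.data import DataLoader

for batch_size in [16, 64, 256]:               # illustrative values
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    start = time.time()
    train_model(loader)                         # hypothetical helper running the full training loop
    print(batch_size, time.time() - start)      # total training time for this batch size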
  • Training Times

Fig. 6 clearly shows the effect of different batch sizes on training time, and both architectures behave the same way: a higher batch size is more computationally efficient but does not ensure generalization. Read the paper “Train longer, generalize better: closing the generalization gap in large batch training of neural networks” to understand more about this generalization phenomenon and about methods that improve generalization while keeping the training time of large-batch training intact.

Figure 6: Total Training Times Over Different Batch Sizes

Notice the effect of changing the architecture on the training time (Fig. 7). The training time is significantly lower with the improved model (about 15 million free parameters in the first layer) than with the auxiliary network.

Figure 7: The Epoch Times of the Three Architectures (batch size: 64)
  • Alternative Machine Learning Method

I also compared the performance of the improved model to a decision tree approach, specifically the Light Gradient Boosting Machine that is commonly used in the data science domain. However, its misclassification error exceeded our limits (see the Appendix for more details).

Discussion

Which Is the Best Model? (The Thousand Dollar Question)

In this blog, I described two main networks, with and without the auxiliary network, plus an additional network with improved parameters. The benefit of the parameter prediction network is that it considerably reduces the number of free parameters in the first layer of a model when the input is very high dimensional, as with genetic sequences.

I showed how changing the parameters of the basic network reduced overfitting and yielded better generalization. I validated these approaches on the publicly available 1000 Genomes dataset, addressing the task of ancestry prediction based on SNP data. This work demonstrates the potential of neural network models for tasks where the number of samples is small relative to the very high dimensionality of the data, as in DNA sequencing.

Given the high accuracy achieved in the ancestry prediction task, I believe that neural network techniques can improve standard practices in the analysis of genetic data. I expect that these techniques will allow us to tackle more challenging genetic association studies.

Acknowledgments

Thanks to Camille Rochefort-Boulanger (University of Montreal) for giving me some good advice on the implementation process.

For those wondering which computer I have: I would like to say some good words about my MacBook Pro (16 GB of memory, Intel Core i7) for allowing me to work on such an amazing task, leaving me with those satisfying training times (see the Results section) and the whole “computer laboratory” experience (while working from home during the coronavirus closure period).

Appendix

Results Using Decision Trees

In addition to the neural network method highlighted in this blog, I would like to mention my results with Gradient Boosting Decision Trees. The implementation can be found in this GitHub repository. The parameters of the algorithm are:

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'num_class': 26,
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    'num_threads': 36,
    'zero_as_missing': True,
    'verbose': 1
}
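With these parameters, a minimal training call with the LightGBM Python API might look as follows (the dataset variables are illustrative assumptions):

import lightgbm as lgb

train_set = lgb.Dataset(x_train, label=y_train)
valid_set = lgb.Dataset(x_valid, label=y_valid, reference=train_set)
model = lgb.train(params, train_set, num_boost_round=100, valid_sets=[valid_set])
y_pred = model.predict(x_test)   # per-class probabilities; take the argmax for the predicted population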

Results show that the classification error is higher with decision trees; explaining the reasons is beyond the scope of this blog.

Figure 8: A Comparison Between the Errors of Using Decision Trees vs. Neural Networks

Closure

Dear readers, thanks for reading this blog; any thoughts would be much appreciated. Please feel free to email me (miritrope@gmail.com) or reach out on LinkedIn.
