Leveraging AI to Cure Cancer

Shagun Maheshwari
Towards Data Science
8 min read · Jan 1, 2019


Cancer is one of the leading causes of death worldwide. This isn’t a surprise.

1 in 2 Canadians is expected to develop cancer in their lifetime: that's a whopping 18,500,000 people expected to be diagnosed with this deadly disease. Not to mention that 1 in 4 Canadians is predicted to actually die of cancer, roughly 9,250,000 people.

Recently, many advances in technologies such as Artificial Intelligence are helping researchers revolutionize the future of healthcare, from identifying patterns in medical images to predicting new target proteins for drugs! This technology is showing significant ability to change the lives of millions around the world.

I leveraged AI and delved deep into the subfield of generative models. Within generative models I focused on variational autoencoders (VAEs) specifically, for their ability to:

a) Learn a meaningful underlying representation of data

b) Disentangle sources of variation from different classes of data

Because of these factors, I was able to construct a variational autoencoder to identify and extract known and unknown biological signals within a dataset of 5000 variably expressed genes.

The dataset stemmed from The Cancer Genome Atlas (TCGA), which has profiled over 10,000 tumors across 33 different cancer types, revealing genomic features such as the expression levels of numerous genes.
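To get a sense of the data's shape, here is how such an expression matrix might be loaded (the file name is hypothetical; the matrix has one row per tumour and one column per gene):

```python
import pandas as pd

# Hypothetical file: tumours as rows, the 5000 most variably
# expressed genes as columns, values are expression levels.
rnaseq_df = pd.read_csv("tcga_rnaseq_5000_genes.tsv", sep="\t", index_col=0)

print(rnaseq_df.shape)  # roughly (number of tumours, 5000 genes)
```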

Gene expression levels measure how active each gene is in a sample. When a gene is active, its information is transcribed into RNA (transcription) and the RNA is translated into a protein (translation). Because the product of gene expression, a protein, dictates a cell's function, expression levels capture substantial information about the state of a tumour.

Not all genes are expressed all the time, so the genes that are expressed help researchers identify which specific gene pathways to target when treating a particular disease. Identifying the genes that are highly expressed in different tumours, and the biological effect of their expression in a patient, is crucial to designing specific treatments for diseases such as cancer.

The variational autoencoder I built was able to successfully compress the input data and regenerate similar data for the 5000 genes and their expression levels. It was also able to disentangle sources of biological variation in the data and identify the contribution of specific genes to distinct biological patterns that could have led to a tumour's cancerous state.

This post is composed of two parts.

Part 1) Components — under the hood of a VAE

Part 2) Interpretation — extracting/identifying meaningful biological signals within the data

Part 1

What is it?

A variational autoencoder is a generative model. This means it can learn an underlying distribution of the input data and generate a replica of the data based on what it has learned.

Notice how the word "encoder" is part of "variational autoencoder"? This is because one function of a VAE is to compress the input data (5000 genes) into a lower-dimensional (hidden) space called the latent space. The latent space holds a distribution over the input data, which a decoder network then samples from to generate a similar version of the input data.

A VAE consists of an encoder, a decoder, and a loss function.

[Figure: structure of a VAE]

The encoder is a neural network that takes the input data of 5000 genes and encodes it into just 100 features.

[Figure: snippet of the input data, 5000 genes and their gene expression levels (the table continues far beyond what is shown)]
[Figure: compressing the input data]
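As a rough sketch, here is what such an encoder might look like in Keras. The layer names and the single-layer design are my own assumptions for illustration, not the exact architecture used; the two outputs it produces are explained just below.

```python
from tensorflow.keras import layers, Model

original_dim = 5000  # number of input genes
latent_dim = 100     # number of encoded features

# Encoder: compresses a 5000-gene expression vector into the
# parameters of a 100-dimensional distribution.
inputs = layers.Input(shape=(original_dim,), name="gene_expression")
z_mean = layers.Dense(latent_dim, name="z_mean")(inputs)
z_log_var = layers.Dense(latent_dim, name="z_log_var")(inputs)

encoder = Model(inputs, [z_mean, z_log_var], name="encoder")
```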

The encoded features are probability distributions representing only the relevant features of the input data. Since the VAE is a generative model, its goal is to generate variations similar to the input data. In order for the encoder to compress the data and represent it in probabilistic terms within this minimized space, it outputs the compressed data as two vectors: a mean vector and a standard deviation vector.

Intuitively, the mean vector controls where the encoding of the input data should be centred, while the standard deviation vector controls the "area": how much the encoding can vary around the mean.

The decoder network then samples from the mean and standard deviation vectors to get an input vector to feed into itself. This sampled vector is called the hidden layer. From it, the decoder is able to reconstruct the original input.
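In code, this sampling step is usually written with the "reparameterization trick": draw random noise from a standard normal distribution and shift and scale it by the encoder's outputs, which keeps the network trainable by backpropagation. A sketch, continuing from the encoder above (the sigmoid output assumes the expression values were scaled to the range 0 to 1):

```python
import tensorflow as tf
from tensorflow.keras import layers

def sample_z(args):
    """Reparameterization trick: z = mean + sigma * epsilon."""
    z_mean, z_log_var = args
    epsilon = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * epsilon

# The sampled vector: the "hidden layer" the decoder reconstructs from.
z = layers.Lambda(sample_z, name="z")([z_mean, z_log_var])

# Decoder: maps the 100-feature sample back to 5000 gene values.
# Sigmoid assumes the expression levels were scaled to [0, 1].
decoder_layer = layers.Dense(original_dim, activation="sigmoid", name="decoder")
outputs = decoder_layer(z)
```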

But how do we make sure the output from the decoder network matches the original input data fed into the encoder network?

This is where the loss function comes to the rescue! The loss function is composed of two parts: a generative loss and a latent loss. The generative loss helps the decoder generate data similar to the input, improving its accuracy. It does this by taking the error between the output of the decoder and the input of the encoder network; the error is then backpropagated through both networks, updating their weights and parameters to improve the accuracy of the decoder network. The latent loss measures how far the encoded distribution, described by the mean and standard deviation vectors, strays from a standard normal distribution. This is extremely important in VAEs because the encoded features are ultimately what the decoder samples from to learn and generate data similar to the input, and the latent loss keeps that latent space well behaved.
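Written out in code, a standard formulation (again a sketch continuing from above, using binary cross-entropy for the generative loss and KL divergence for the latent loss) looks like this:

```python
import tensorflow as tf
from tensorflow.keras import Model, losses

# Generative loss: how far the reconstruction is from the input.
reconstruction_loss = original_dim * losses.binary_crossentropy(inputs, outputs)

# Latent loss: KL divergence between the encoded Gaussian
# and a standard normal distribution.
kl_loss = -0.5 * tf.reduce_sum(
    1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)

vae = Model(inputs, outputs, name="vae")
vae.add_loss(tf.reduce_mean(reconstruction_loss + kl_loss))
vae.compile(optimizer="adam")
# vae.fit(rnaseq_df.values, epochs=50, batch_size=50)  # epochs/batch size are illustrative
```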

[Figure: data generated by the decoder network]

Part 2

Identifying Biological Signals

I wanted to see if the encoded features in the VAE were able to recapitulate and preserve biological variance present in the gene data, such as the sex of a patient.

To do this I extracted the weights in the first layer of the decoder network. These weights decode the hidden layer, which consists of information sampled from the compressed input data.

The weights used to decode the features in the hidden layer were indeed able to capture important and consistent biological patterns in the gene expression data. By extracting the weights from the decoder network, I was able to identify that feature 82 consisted mostly of genes relating to the sex of a patient. This means that the encoder network's compression of the 5000 genes had learned a real pattern within the dataset.

By extracting the weights from the decoder, we are now able to look at which genes contributed to a specific feature encoding that the encoder network created. You can note that all of the genes in feature 82 are located on sex chromosomes.
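A sketch of that extraction step, continuing from the model above (here gene_names, a list of the 5000 gene symbols, is a hypothetical variable):

```python
import pandas as pd

# The first decoder layer holds a weight matrix of shape
# (latent_dim, original_dim): one row of 5000 gene weights per feature.
weights, biases = decoder_layer.get_weights()
weight_df = pd.DataFrame(weights, columns=gene_names)

# Which genes drive feature 82, the "patient sex" feature?
feature_82 = weight_df.iloc[82].sort_values(ascending=False)
print(feature_82.head(10))
```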

[Figure: genes in feature 82]

We can predict the sex of a patient by looking at the genes with high positive weights; these include X-inactivation genes such as XIST and TSIX.

The VAE was also able to construct two features that comprised primary and metastatic skin cutaneous melanoma (SKCM) tumours, respectively.

From these two features, we extracted the genes with high weights to identify GO (Gene Ontology) terms that indicate the biological processes these genes carry out. We extract the high weight genes because these are the genes most strongly expressed, meaning they have a larger effect on the state of the tumour, as their expression dictates cellular function.

[Figure: code to extract high weight genes from the SKCM features]
[Figure: output of high weight genes for the metastatic tumour]
[Figure: output of high weight genes for the primary tumour]
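As a hedged sketch of what that extraction might look like (the 2.5-standard-deviation cutoff is an illustrative choice, not necessarily the exact threshold used in the figures above):

```python
def high_weight_genes(weight_df, feature, num_std=2.5):
    """Return genes whose decoder weight lies more than num_std
    standard deviations away from the feature's mean weight."""
    w = weight_df.iloc[feature]
    extreme = (w - w.mean()).abs() > num_std * w.std()
    return w[extreme].sort_values(ascending=False)

metastatic_genes = high_weight_genes(weight_df, feature=66)  # metastatic SKCM
primary_genes = high_weight_genes(weight_df, feature=53)     # primary SKCM
```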

Over-representation pathway analysis can now be applied to the high weight genes from feature 66 and feature 53 to identify GO terms in each feature.

Overrepresentation pathway analysis is a technique for interpreting the GO terms/functional processes of a list of genes in a feature, also called a gene set. GO terms are grouped into three categories: molecular function (describing the molecular activity of a gene), biological process (describing the larger cellular role carried out by the gene, in coordination with other genes), and cellular component (describing the location in the cell where the gene product, i.e. a protein, executes its function).

Each gene can be described (annotated) with multiple terms. Overrepresentation pathway analysis identifies the GO terms in a gene set that are overrepresented (hence the name) compared to random chance.

Each overrepresented term is reported with a p-value. The closer the p-value is to zero, the stronger the evidence that the GO terms represented in a feature (gene set) are not random and hold biological relevance.
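Under the hood, such a test is typically a hypergeometric test: if n of the M genes in the background carry a given GO term, how surprising is it to find k carriers among the N genes of a high weight gene set? A sketch with SciPy, where all of the counts are made up for illustration:

```python
from scipy.stats import hypergeom

M = 5000  # genes in the background set
n = 40    # background genes annotated with the GO term (illustrative)
N = 50    # genes in the feature's high weight gene set (illustrative)
k = 7     # high weight genes annotated with the GO term (illustrative)

# P(X >= k): the chance of seeing at least k annotated genes at random.
p_value = hypergeom.sf(k - 1, M, n, N)
print(f"GO term overrepresentation p-value: {p_value:.2e}")
```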

From the analysis of high gene weights in feature 66, GO terms overrepresented in that gene set implicated functions related to cholesterol, ethanol, and lipid metabolism.

[Figure: overrepresented GO terms]

Identifying GO terms that are overrepresented among the highest-weight genes of a feature can signal important processes carried out by a set of expressed genes.

If a particular set of genes is sampled from a cancerous tumour, identifying which genes have the highest weights (are most active) and identifying their GO terms can point to aberrations in gene expression pathways that contribute to cancerous tumours.

For example, if the high weight genes in feature 66 (the metastatic SKCM feature) were strongly associated with a GO term relating to cholesterol, that could mean a pathway or process dysfunction arose from the expression of genes that dictate changes in cholesterol levels. Such changes would be a significant factor in cell function and could contribute to creating cancerous tumours.

Using this technology, researchers can identify target pathways and the products of expressed genes (proteins) that specifically contribute to the creation of cancerous tumours. The VAE was able to separate and construct encoded features of the 5000-gene data that captured patterns within the dataset, such as patient sex and SKCM tumour type. We were able to find this out by extracting the weights of the decoder network, because through training the VAE model, the decoder weights learn patterns within the data.

Key Takeaways

1. Generative models such as variational autoencoders are able to efficiently compress data into a lower-dimensional space while still maintaining the relevant features that help the decoder network reconstruct the input data.

2. The weights in a decoder network capture significant and important biological patterns within the data, and these patterns can be extracted.

3. Using overrepresentation analysis to identify GO terms among the high weight genes can signal dysfunctions in biological processes.

