The Mystery of the Origin — Cancer Type Classification using Fast.AI Library

Alena Harley
Towards Data Science
Oct 30, 2018



Chapter 1. The problem — tree without its roots

Approximately 15% of cancers metastasize, i.e. cancer cells break away from where they first formed (the primary site, or tissue of origin) and travel through the blood or lymph system to form new metastatic tumors. Determining the primary site of origin for metastatic tumors is one of the open problems in cancer treatment, because the efficacy of treatment often depends on the cancer's primary site of origin.

Cancer classification using point mutations in tumors is challenging, mainly because the data is very sparse. Many tumors have only a handful of mutations in coding regions, and many mutations are unique.

It has previously been demonstrated that classifiers relying on the point mutations in a tumor achieve limited accuracy: for example, the DeepGene algorithm reaches 64.9% on 12 tumor classes. Tumor classification accuracy can be greatly improved (~90% on 33 tumor classes) when gene expression data is available. However, this additional data is often not readily available in a clinical setting. Thus, accurate computational methods that can predict tumor class from DNA point mutations alone, without relying on additional gene expression data, are of great interest.

Chapter 2. The solution = Embedding + Transfer Learning and Fine-Tuning

So what is the solution?

As the quantum physicist Niels Bohr noted: “Every great and deep difficulty bears in itself its own solution. It forces us to change our thinking in order to find it.”

Let’s examine the difficulties we are facing:

  1. Representation of the data — the current representation of the data doesn’t allow us to use pre-trained deep neural networks that perform very well on image data sets. Unfortunately, in cancer genomics, training data is scarce, and approaches such as data augmentation are not applicable. Only 9,642 samples spread across 29 classes are available from The Cancer Genome Atlas (TCGA).
  2. Tumor point mutation data is sparse even when summarized at the gene level. One interesting observation from cancer biology is that cancer mutations in genes belonging to the same pathway are often mutually exclusive. Below is an example of the ‘hallmark’ processes (pathways) affected in cancer. Pathways are listed in blue; the image was adapted from this paper.

So, why not encode point mutation data using pathways? But how? By training our own Gene2Vec embedding using information about gene membership in pathways.

Here is a preview of how well this works, without reading the details: 78.2% accuracy on 29 tumor classes, relying on DNA point mutations only.

Chapter 3. Step-by-step ‘how to’

3.1 Data and its pre-processing: I downloaded TCGA Mutation Annotation Format (MAF) files from the Genomic Data Commons Portal. I removed silent mutations and retained only genes with Homo sapiens genome assembly GRCh38 (hg38) annotations. The dataset was split per class: 80% of the samples within each of the 29 tumor types were used for training and 20% for testing.
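The per-class 80/20 split can be sketched in plain Python. The sample IDs and labels here are hypothetical placeholders; the original work presumably used its own tooling:

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, train_frac=0.8, seed=42):
    """Split samples 80/20 within each tumor type, as described above.
    `samples` and `labels` are parallel lists (hypothetical names)."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    rng = random.Random(seed)
    train, test = [], []
    for label, members in by_class.items():
        rng.shuffle(members)
        cut = int(len(members) * train_frac)  # 80% of this class
        train += [(s, label) for s in members[:cut]]
        test += [(s, label) for s in members[cut:]]
    return train, test

# Toy example: 50 samples each of two hypothetical tumor types
train, test = stratified_split([f"s{i}" for i in range(100)],
                               ["A"] * 50 + ["B"] * 50)
```

Splitting within each class keeps the class proportions identical in the training and test sets, which matters when some of the 29 cohorts are small.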

MutSigCV was used to identify significantly mutated genes among the non-silent mutations detected in each tumor type's training set. This let me extract important features from the very sparse dataset. MutSigCV detects genes mutated more often than expected by chance, taking into account covariates that include a given gene's base composition, its length, and the background mutation rate. I was left with 1,348 unique significantly mutated genes.
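The core idea of "more mutations than expected by chance" can be illustrated with a toy binomial tail test. MutSigCV itself is far more sophisticated (it models per-gene covariates and patient-specific rates); this is only a minimal sketch of the underlying significance question:

```python
from math import comb

def excess_mutation_pvalue(observed, trials, background_rate):
    """Probability of seeing at least `observed` mutated samples out of
    `trials` if mutations arise at the background rate alone: the upper
    tail of a Binomial(trials, background_rate) distribution."""
    return sum(comb(trials, k)
               * background_rate**k
               * (1 - background_rate)**(trials - k)
               for k in range(observed, trials + 1))

# A gene mutated in 9 of 10 samples under a 1% background rate is
# far more mutated than chance would predict (tiny p-value).
p = excess_mutation_pvalue(9, 10, 0.01)
```

Genes with a small enough tail probability (after multiple-testing correction, which MutSigCV also handles) would be kept as "significantly mutated" features.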

To learn a biologically relevant embedding of the data, I trained a Gene2Vec embedding. I used the database of all known pathways, MSigDB version 6.2, containing 17,810 pathways. In the spirit of Word2Vec, I mapped pathway-similar genes to nearby points, assuming that genes appearing in the same pathway contexts share biological function. I used the standard skip-gram model when defining Gene2Vec.
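The pathway-as-sentence idea can be illustrated with a minimal skip-gram pair generator. In practice a library such as gensim's `Word2Vec` would train the actual embedding; the pathways below are toy examples, not real MSigDB entries, and treating a gene set's member list as an ordered sentence is a simplification, since gene order within a pathway is arbitrary:

```python
def skipgram_pairs(pathways, window=2):
    """Generate (target, context) gene pairs for skip-gram training.
    Each pathway is treated as a 'sentence' of member genes, so genes
    sharing pathway membership become each other's contexts."""
    pairs = []
    for genes in pathways:
        for i, target in enumerate(genes):
            lo, hi = max(0, i - window), min(len(genes), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, genes[j]))
    return pairs

# Toy pathways (hypothetical gene sets)
pathways = [["KRAS", "PTEN", "TP53"], ["TP53", "APC", "MSH6"]]
pairs = skipgram_pairs(pathways, window=2)
```

Training a skip-gram model on such pairs pushes genes that co-occur in pathways toward nearby points in the embedding space, which is exactly the property exploited in the next step.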

3.2 Transformation of mutation data into images:
I then extracted the learned Gene2Vec embeddings for the 1,348 significantly mutated genes in our training set; this step produced a square matrix. I used a spectral clustering algorithm to create visual structure in the embedding matrix. Spectral clustering is a technique for grouping N data points in an I-dimensional space into several clusters. Training and test samples were then encoded using the spectrally clustered gene embedding. The image on the left is an example of an embedding for a stomach cancer sample. The image below is a t-distributed stochastic neighbor embedding (t-SNE) visualization of the gene embedding for the 1,348 significantly mutated genes. Genes participating in the same cancer pathways have been placed closer to one another in their representation, e.g. KRAS and PTEN (colorectal cancer), and TP53, APC and MSH6 (DNA mismatch repair), are closer together than other genes.
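To show the spectral idea concretely, here is a numpy-only sketch (not necessarily the exact algorithm used in the original work) that reorders genes by the Fiedler vector of a similarity-graph Laplacian, so that similar genes land next to each other and the embedding matrix gains visible block structure:

```python
import numpy as np

def spectral_order(embedding):
    """Reorder rows of a (genes x dims) embedding matrix so that
    similar genes end up adjacent, by sorting along the Fiedler vector
    (second-smallest eigenvector) of a graph Laplacian built from
    cosine similarities."""
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    unit = embedding / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T                       # cosine similarities
    W = np.clip(sim + 1.0, 0.0, None)         # shift to non-negative weights
    L = np.diag(W.sum(axis=1)) - W            # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)      # ascending eigenvalues
    fiedler = eigvecs[:, 1]
    return np.argsort(fiedler)

# Two synthetic "gene clusters" in an 8-dimensional embedding space
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (5, 8)) + 1.0,
                 rng.normal(0, 0.1, (5, 8)) - 1.0])
order = spectral_order(emb)
```

Sorting by the Fiedler vector is a classic one-dimensional relaxation of the graph-partitioning problem, so rows from the same cluster come out contiguous in `order`.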

3.3 Transfer learning and fine-tuning — Fast.AI:
I used the pre-trained weights of a ResNet34 model trained on ImageNet as the initialization for the target task of tumor classification using our tumor image embeddings. Images were re-scaled to 512x512 and normalized to match the mean and standard deviation of ImageNet images; batch size was set to 32 to fit my GTX 1070 Ti GPU.
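The normalization step can be sketched with numpy. The per-channel statistics below are the standard ImageNet values used by torchvision and fastai; in the actual pipeline fastai's transforms would apply this (plus the resize) for you:

```python
import numpy as np

# Standard ImageNet channel statistics (RGB)
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(img):
    """Normalize an HxWx3 float image with values in [0, 1] to ImageNet
    statistics, so the pre-trained ResNet34 sees inputs distributed like
    the images it was originally trained on."""
    return (img - IMAGENET_MEAN) / IMAGENET_STD

out = normalize(np.full((4, 4, 3), 0.5))
```

Matching the pre-training statistics is what makes transfer learning behave well: the frozen early layers expect inputs on this scale.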

During the first stage of fine-tuning, all layers of the ResNet34 except the last, custom fully connected layer were frozen. The learning rate was chosen to be 0.01 using the learning rate finder; see Leslie Smith's paper and its implementation in the Fast.AI repo. A slanted triangular learning rate schedule was used for 10 cycles. Accuracy achieved during the first stage was 73.2%.
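The slanted triangular schedule (introduced in Howard & Ruder's ULMFiT paper and implemented in fastai) can be written down directly: a short linear warm-up to the peak rate followed by a long linear decay. The `cut_frac` and `ratio` defaults below follow the paper; the original experiments may have used different values:

```python
def stlr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate: `t` is the current iteration,
    `T` the total number of iterations. The rate rises linearly from
    lr_max/ratio to lr_max over the first cut_frac of training, then
    decays linearly back."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut                                   # warm-up phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Peak learning rate of 0.01 is hit at iteration 10 of 100
lr_peak = stlr(10, 100)
```

The short warm-up lets the randomly initialized head adapt before the peak rate, and the long decay refines the weights without overshooting.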

In the second stage, discriminative fine-tuning was used with learning rates ranging from 0.000001 to 0.001; these were also determined using the learning rate finder. Discriminative fine-tuning splits the layers of the deep neural network into groups and applies a different learning rate to each group, since different layers should be fine-tuned to different extents: the earliest residual blocks get the smallest learning rate, and the fully connected layer gets the largest. In stage two I again used the slanted triangular learning rate schedule, for 12 cycles. Accuracy achieved during the second stage was 78.3%.
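Spreading the rates across layer groups can be sketched as a geometric interpolation between the two endpoints. This mirrors how fastai expands a `slice(lr_min, lr_max)` across its layer groups, but it is a sketch of the behaviour, not the library's exact code:

```python
def discriminative_lrs(lr_min=1e-6, lr_max=1e-3, n_groups=3):
    """Assign a geometrically spaced learning rate to each layer group:
    smallest for the earliest residual blocks, largest for the
    fully connected head."""
    if n_groups == 1:
        return [lr_max]
    step = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * step**i for i in range(n_groups)]

# With the rates from the text and three groups:
lrs = discriminative_lrs(1e-6, 1e-3, 3)
```

Each group's parameters are then handed to the optimizer with its own rate (in PyTorch terms, one parameter group per layer group), so early general-purpose features move slowly while the task-specific head moves quickly.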

Here is the confusion matrix for our classifier:

Chapter 4. A few observations

I observed that our mis-classifications occur primarily within the same organ systems, e.g. between ovarian serous cystadenocarcinoma (OV) and breast carcinoma (BRCA).

I also observed that ovarian serous cystadenocarcinoma (OV) was the class with the most errors. This is actually not surprising, since only 6 genes were determined to be significantly mutated in this cohort, compared to a larger number of genes for other cohorts.

But my most important observation was that Fast.AI library allows state-of-the-art transfer learning and fine-tuning. Given the right representation of the data, it becomes very easy to build state-of-the-art classifiers: here I reduced the previous state-of-the-art error for this problem by more than 30% while at the same time discriminating over many more classes. Thanks Jeremy, Rachel and Fast.AI!

I am really looking forward to performing more “transfer of knowledge” from Jeremy and Rachel to myself (:-), discriminatively fine-tuning what I learn to hack on other important and interesting problems!

If you have any questions about what is described above, find me on Twitter: @alenushka
