Distributed Biomedical Text Mining using PySpark for Classification of Cancer Gene Mutations At-scale – Part I: EDA & Pointwise Mutual Information Analysis

Distributed machine learning using Apache Spark for large-scale classification of cancer tumor gene mutations

In this two-part post, I will share my learning from a term research project I worked on in a graduate course on Distributed Computing. I implemented a machine learning algorithm for classifying cancer tumor gene mutations using PySpark, the Python API for Apache Spark. In this part, I will describe the Exploratory Data Analysis (EDA) along with a distributed implementation of the Natural Language Processing (NLP) concept of Pointwise Mutual Information (PMI).

I am sharing only certain parts of the code in this article, as including the whole code would make it too lengthy. Please check out the GitHub repository for the full code. I have also prepared a 5-minute video that you can find [here](https://www.youtube.com/watch?v=LRp75Ku8Hc4). Here’s the link to Part II if you want to skip ahead to the distributed ML section.

Introduction

One of the main reasons why cancer is so hard to understand, and therefore treat, is the sheer vastness and complexity of tumor gene mutations. There can be several thousand genes in tumor cells, and each of these genes can have several thousand known mutations along with a huge number of yet undiscovered ones. Not all mutations are cancerous, and many of them are benign. After all, genes keep mutating as a part of natural evolution, and not every mutation is deleterious to the organism. In order to develop effective cancer drugs, a drug researcher needs to know the nature of the majority of mutations in a tumor.

However, the process of classifying a tumor gene mutation is manual, painstaking, and time-consuming. For each mutation that is discovered in a tumor specimen in a lab, a researcher has to go through several pages of medical literature to categorize the discovered mutation into the appropriate class. This process of human-intermediated information retrieval from medical literature significantly slows down the pace of knowledge growth. If a researcher spends say 1 hour to sequence a mutation and 5 hours to classify it, the research community incurs a significant opportunity cost as the time spent in classifying could have been better spent finding new mutations or researching cures for malignant mutations.

Biomedical Text Mining is a suite of computational techniques that have been developed for extracting insights and actionable information from large biomedical research literature datasets.

Biomedical Literature has witnessed rapid growth ever since the advent of the internet and especially after new technologies such as Next Generation Sequencing (NGS) gained widespread adoption in labs. The figure below shows the annual growth in the number of publications in cancer research in a select group of countries.

Country-wise Growth in the Number of Publications in Cancer Research (Source: Elsevier (Page 11))

As a result, there is a growing need for computing techniques to process this data and derive insights from it to help medical researchers accelerate their research.

Now I will describe the dataset that I worked on in this study.

Dataset

I used the Personalized Medicine Dataset that was compiled and released by the Memorial Sloan Kettering Cancer Center (MSKCC) for a Kaggle Competition it hosted in 2017. This dataset contains research papers from various medical journals about studies on tumor gene mutation classification. The training dataset has 3321 such papers. Each paper is about a particular gene and a mutation associated with it. Expert cancer researchers have annotated these mutations into one of nine classes (1–9). As I did not have the labels for the test dataset, I split the original training dataset 80:20 into new training and test sets and used the 80% portion for training.

While it is not stated in the Kaggle Competition what these classes of mutations represent, I found from an article in an edited compilation¹ that they could very likely mean the following:

Meaning of Mutation Classes (Image by Author)

The figure below shows the structure of the dataset:

Format of the Training Dataset (Image by Author)

For example, the paper with ID 1 is about the CBL gene and its variation (mutation) W802*. This mutation belongs to class 2. The starting text of the paper is displayed in the column with the heading "Text". The initial few lines of this paper are shown in the figure below:

Initial Sentences of a Research Paper (Image by Author)

As we can observe, it is a dense scientific paper that contains several biological terms and overall, it contains 5757 words. The test dataset has the same structure as the training dataset.

Challenge

Let’s assume that a researcher has a reading speed of 300 words per minute and that she must read 5 such papers before she can classify a mutation. At roughly 5 × 5757 ≈ 29,000 words, that is about an hour and a half to two hours of nonstop reading, even without considering the fact that research papers cannot be read like novels. What if a tumor gene contains 10 different mutations, which is a modest number given how many mutations can be present?

If we can come up with an algorithm that can "read" massive amounts of biomedical literature and learn the associations between keywords in the papers and the classification of the mutations that the papers describe, we can significantly reduce the time that researchers spend manually annotating new mutations. For instance, if an algorithm can annotate a mutation in say 1 second (which is quite a long time and can be reduced significantly given enough computational resources) with 80% accuracy, such a method can be immensely useful for cancer researchers.

Idea

For this study, I worked on two ideas:

  1. As an exploratory step, I examined whether the frequent co-occurrence of two mutations of a gene in the literature indicates a higher likelihood that they belong to the same class, and vice versa, i.e. whether a lower-than-average co-occurrence of two mutations indicates a higher likelihood that they belong to different classes. I used PMI, a well-established NLP concept, for this purpose.
  2. I trained a distributed multinomial logistic regression model using Spark MLlib and used it to classify the mutations in the test dataset.

I will describe the first idea in this part. Before that, here’s the rationale behind using Distributed Computing:

Rationale behind using Distributed Computing

There are two classes of algorithmic challenges: those that can be split into smaller parts, each of which can be dealt with separately, and those that must be dealt with as a whole and therefore cannot be split. For instance, to add all the numbers in a large list, we can split the list into several sub-lists, add the numbers in these sub-lists separately, and then add these sub-sums to get the final result. On the other hand, to compute the average of a list of numbers, we need the global sum divided by the global count. We cannot get the correct average simply by splitting the list into sub-lists, computing sub-averages, and then averaging those sub-averages, unless we also keep track of how many numbers each sub-list contains (or every sub-list happens to hold the same number of elements).
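To make the splittable case concrete, here is a minimal PySpark sketch (the variable names are illustrative, not from the original code) that sums a list in parallel and then computes the average from the global sum and count:

```python
# Minimal sketch: summing a large list is trivially splittable with PySpark.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Spread one million numbers across 10 partitions (10 units of parallel work).
numbers = sc.parallelize(range(1, 1_000_001), numSlices=10)

# Each partition sums its own slice; reduce() combines the partial sums.
total = numbers.reduce(lambda a, b: a + b)

# The average still needs the global sum and the global count; averaging
# per-partition averages would only be correct if every partition held
# the same number of elements.
average = total / numbers.count()
print(total, average)  # 500000500000 500000.5
```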

For problems that can be split into parts, we can accelerate execution by assigning each part to a different machine, and this is where distributed computing comes into the picture. If we have 10 machines, we can give one-tenth of the problem to each machine and, ignoring coordination overhead, obtain the result in roughly one-tenth of the time it would take on a single machine; for many problems involving large datasets, processing on a single machine might not even be possible.

For instance, computing the PMI score of a pair of genetic mutations requires counting the co-occurrence of two mutations in the same paper, a task that is well-suited for being split into parts and executed in multiple machines.
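As a hedged illustration of this idea (not the exact code from the repository), the sketch below counts in parallel how many papers mention each of two mutations and how many mention both; `papers_rdd`, `mut_a`, and `mut_b` are placeholder names:

```python
# Sketch of distributed co-occurrence counting with PySpark RDDs.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Placeholder corpus; in the project this would hold the papers for one gene.
papers_rdd = sc.parallelize([
    "... discusses R1753T and C44Y ...",
    "... discusses R1753T only ...",
    "... an unrelated paper ...",
])
mut_a, mut_b = "R1753T", "C44Y"  # illustrative mutation names

def presence(paper):
    """Return (a_present, b_present, both_present) flags for one paper."""
    a = int(mut_a in paper)
    b = int(mut_b in paper)
    return (a, b, a * b)

# Each worker flags its own papers; reduce() adds up the per-paper flags.
n_a, n_b, n_ab = papers_rdd.map(presence).reduce(
    lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2])
)
n = papers_rdd.count()
p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n  # estimates of p(x), p(y), p(x, y)
```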

I will now discuss some EDA steps and their results to get a sense of the nature of the dataset:

Exploratory Data Analysis

  1. Finding entries that have no associated publication: After importing the libraries and creating a SparkContext, I checked whether there were missing values in the dataset. Out of the 3321 entries in the training dataset, 5 entries had no research paper associated with them, as shown in the two figures below:
Column-wise check for missing values (Image by Author)
Training Data Rows having no Associated Research Paper (Image by Author)

I filled these NaN values with the concatenated text containing the name of the gene and the associated mutation. For instance, for ID 1277, I replaced the NaN value in the TEXT column with ‘ARID5B Truncating Mutations’ as shown in the figure below:

Replacing NaN Value in the TEXT column with Gene-Mutation Concatenation (Image by Author)
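A hedged pandas sketch of this step is shown below; it assumes the training data has been merged into a DataFrame `train_df` with columns Gene, Variation, Class, and Text (the file name is hypothetical):

```python
import pandas as pd

# Hypothetical file name; the Kaggle data ships as separate variant and text
# files that first need to be merged on the ID column.
train_df = pd.read_csv("training_merged.csv")

# Column-wise count of missing values (5 rows have no Text).
print(train_df.isnull().sum())

# Fill missing Text with "<Gene> <Variation>", e.g. 'ARID5B Truncating Mutations'.
missing = train_df["Text"].isnull()
train_df.loc[missing, "Text"] = (
    train_df.loc[missing, "Gene"] + " " + train_df.loc[missing, "Variation"]
)
```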
  2. Determining the distribution of output classes in the training dataset: In order to check if there is any significant class imbalance, I plotted the number of instances belonging to each class in the training dataset and the result is shown in the plot below:
Distribution of mutation classes in the training and the test datasets (Image by Author)

Classes 7 and 4 are disproportionately represented in both the training and the test datasets.
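For reference, a minimal sketch of this class-distribution plot, assuming the `train_df` DataFrame from the previous sketch:

```python
import matplotlib.pyplot as plt

# Count how many training papers fall into each of the nine classes.
class_counts = train_df["Class"].value_counts().sort_index()

class_counts.plot(kind="bar")
plt.xlabel("Mutation class (1-9)")
plt.ylabel("Number of training instances")
plt.title("Distribution of mutation classes in the training set")
plt.show()
```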

3. Determining the Distribution of the Occurrence of Genes and their Mutations in Publications:

Next, I plotted the number of publications related to each of the 236 genes to check whether the distribution is relatively uniform or whether a few genes account for a large number of papers. The plot of gene index vs. the frequency of its occurrence in the research literature is shown below:

Histogram of Genes (Image by Author)

From the plot, it is clear that a small number of genes (~10) are the topic of a large share of the research literature. Since each paper corresponds to one mutation of a gene, this indirectly indicates that these are also the genes with the most annotated mutations.
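A minimal sketch of this gene-frequency check, again assuming `train_df`:

```python
# Number of papers (and therefore annotated mutations) per gene.
gene_counts = train_df["Gene"].value_counts()

# The handful of genes that dominate the literature.
print(gene_counts.head(10))
```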

In order to check if there were identical mutations that occur in more than one gene, I plotted a histogram of mutation index vs mutation count.

Histogram of Mutations (Image by Author)

There are general mutations like deletions, fusions, and truncations that are present across genes, but their exact nature depends on the particular gene in question. As expected, we see from the plot that the vast majority of mutations are unique (count of one), implying that they occur in only one gene.
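And the corresponding check for mutations, which shows that most mutation names appear exactly once:

```python
# How many rows each mutation name appears in across all genes.
mutation_counts = train_df["Variation"].value_counts()

# Fraction of mutation names that occur exactly once (tied to a single gene).
print((mutation_counts == 1).mean())

# The repeat offenders are generic categories such as deletions and fusions.
print(mutation_counts.head(10))
```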

I will now shortly discuss the concept of PMI before moving on to the code implementation.

Pointwise Mutual Information

Pointwise mutual information (PMI) is a measure of association used in information theory and statistics. The PMI of events x and y is given by:

PMI Formula (Image by Author)
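In text form, the standard definition is:

```latex
\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}
```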

where p(x) and p(y) are the marginal probabilities of events x and y respectively and p(x,y) is the joint probability of x and y. A positive PMI indicates that x and y are more likely to co-occur than they would be if they were independent. Similarly, a negative PMI indicates that x and y are less likely to co-occur.

Let’s consider an example. Say we have a corpus of 10 papers and two words, Alpha and Beta. Alpha occurs in 6 papers, Beta occurs in 4 papers, and they occur together in 3 papers. The PMI computation for this case is shown below:

PMI Computation Example (Image by Author)
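Working through these numbers (any logarithm base works as long as it is used consistently):

```latex
p(\text{Alpha}) = \tfrac{6}{10} = 0.6, \qquad
p(\text{Beta}) = \tfrac{4}{10} = 0.4, \qquad
p(\text{Alpha}, \text{Beta}) = \tfrac{3}{10} = 0.3

\mathrm{PMI} = \log \frac{0.3}{0.6 \times 0.4} = \log 1.25 > 0
\qquad (\approx 0.10 \text{ in base } 10,\; \approx 0.32 \text{ in base } 2)
```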

If Alpha and Beta are two mutations of a gene, a positive PMI score means they co-occur in the research literature more often than they would if they occurred independently, whereas a zero or negative score means they do not.

Distributed Computation of PMI of BRCA1 and CBL Gene Mutations

I considered two genes, BRCA1 and CBL, and computed the PMI scores of all the mutations of these two genes. There were 236 unique genes in total. I selected BRCA1 because it had the highest number of mutations, and I chose CBL at random to avoid any selection bias.

Here is the sequence of steps to compute the PMI values of genetic mutations:

  1. I created a list containing all the mutations of the BRCA1 gene and then removed general mutations like truncations, deletions, amplifications, etc. Then, I generated a list of mutation pairs for all the remaining unique mutations of this gene and labeled a pair as 1 if the constituent mutations were of the same class and as 0 if they were of different classes. For example, mutations R1753T and C44Y both belong to class 4, so I labeled this pair as 1. On the other hand, mutations R1753T and R1835P belong to classes 4 and 1 respectively, so I labeled this pair as 0.
  2. I combined all the documents associated with the BRCA1 gene into one document and computed the probability of occurrence of each mutation in this combined document. Distributed computing comes into the picture from this point onwards. PySpark divides the task of counting mutations by splitting this large document into sets of papers, sending these sets to different machines, and then combining the results via a reduce operation to yield the probabilities of occurrence of these mutations in the combined text. The S1655F mutation occurs most frequently in the combined document and has a probability of occurrence of 0.78 (meaning it occurs in 209 out of the 264 papers associated with the BRCA1 gene).
  3. Next, I implemented the distributed version of the PMI formula discussed above to compute the PMI values of all the pairs of BRCA1 mutations (see the sketch after this list). For instance, mutations G1743R and L1657P occur 85 times each individually and 62 times together, and their PMI value is 0.358.
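Below is a hedged PySpark sketch of steps 2 and 3 (not the exact code from the repository). It counts, for one pair of mutations, how many papers mention each mutation and how many mention both, and turns those counts into a PMI score; `brca1_papers`, `mut_x`, and `mut_y` are placeholder names.

```python
import math
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Placeholder corpus; in the project this RDD holds the papers associated with BRCA1.
brca1_papers = sc.parallelize([
    "paper mentioning G1743R and L1657P ...",
    "paper mentioning S1655F and G1743R ...",
    "paper mentioning L1657P ...",
])
mut_x, mut_y = "G1743R", "L1657P"  # illustrative mutation pair

def flags(paper):
    """Per-paper indicator counts: (x present, y present, both present, one paper)."""
    x = int(mut_x in paper)
    y = int(mut_y in paper)
    return (x, y, x * y, 1)

# Each worker flags its own papers; a single reduce() aggregates all four counts.
n_x, n_y, n_xy, n = brca1_papers.map(flags).reduce(
    lambda a, b: tuple(i + j for i, j in zip(a, b))
)

# PMI = log( p(x, y) / (p(x) * p(y)) ); base 10 here, any base works.
if n_xy > 0:
    pmi = math.log10((n_xy / n) / ((n_x / n) * (n_y / n)))
    print(f"PMI({mut_x}, {mut_y}) = {pmi:.3f}")
```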

Results

In the interest of brevity, I will discuss only the results for the BRCA1 gene; the trends I observed for the CBL gene were identical. I plotted PMI vs. classification similarity for all pairs of mutations, as shown in the figure below:

PMI vs Mutation Similarity for BRCA1 gene mutations (Image by Author)

There are two conclusions that we can draw from this plot:

  1. Contrary to what one might expect, there is no clear relationship between the PMI score and the similarity of mutations. There are mutation pairs with low PMI scores that belong to the same mutation class, and there are pairs with high PMI scores that belong to different mutation classes. So a high PMI score between two mutations is not a good predictor of their class similarity.
  2. Interestingly, the mutation pairs with the lowest and the highest PMI values belonged to the same mutation class. Why mutations that hardly ever occur together (i.e. low-PMI mutation pairs) should belong to the same class is a question of research interest, and further investigation could throw light on this unexpected finding.

Applications

We can use distributed PMI computations not just for detecting similarities in genetic mutations but in any area where we have a set of classes and multiple indicator words associated with each class. For instance, we can decide whether a news article is relevant to us by computing the PMI scores of a set of word pairs we care about and running many articles through the algorithm to find the few that rank highest on those scores. The approach also extends to sets of more than two words, and the distributed computation carries over readily to such cases.

That’s all for Part I! I hope you learned something useful from this article. In Part II, I will discuss my work on training a distributed Multinomial Logistic Regression model in Spark MLlib to classify mutations in unseen research papers.

(I am thankful to Namratesh for his post on this Dataset that helped guide the initial parts of my EDA.)

References

  1. An Automatic Classification of Genetic Mutations by Exploring Different Classifiers, Badal Soni et al, Enabling AI Applications in Data Science, Studies in Computational Intelligence, Springer
