Distributed Biomedical Text Mining using PySpark for Classification of Cancer Gene Mutations At-scale — Part II: Multinomial Logistic Regression

Distributed machine learning using Apache Spark for large-scale classification of cancer tumor gene mutations

Bharat Sethuraman Sharman
Towards Data Science


In Part I, I discussed Exploratory Data Analysis and applying Pointwise Mutual Information to mutation pairs to find out whether there was any correlation between PMI scores and mutation class similarities. In this part, I will discuss training a distributed multinomial logistic regression (MLR) model and applying it to the test dataset to determine the classes of mutations.

Again, I am sharing only certain parts of the code in this article for the sake of brevity. Please check out the GitHub link here for the full code. I have also prepared a 5-minute video that you can find here.

Before moving on to training the model in MLlib, there are a few pre-processing steps that are required:

Data Preprocessing

  1. Dealing with Missing Values: As we saw in Part I, there are 5 entries in the training dataset that do not have an associated research paper text. These have NaN values in the Text column. I replaced these NaN values with a combination of Gene + Mutation values.
  2. Performing Basic NLP Operations: I performed a few basic NLP operations on the research paper text (Text column of each row) to standardize the values across rows and to reduce the size of the dataset without losing valuable information. These operations were:

A. Replacing special characters like *, @, etc., in the text with a single space

B. Replacing multiple spaces with a single space

C. Converting all characters to lower case

D. Storing only those words from the text that are not stopwords. Words such as and, the, and such are known as stopwords. They are essential for making a sentence grammatically correct but, in most cases, do not carry useful information. For example, consider the sentence “The drug Aspirin has been used as a painkiller for a long time”. Which of the two words ‘Aspirin’ and ‘as’ is a keyword? Stopwords are numerous; removing them can cause some loss of meaning, but that loss is generally minimal, and the data compression obtained is enormous because stopwords are by far the most frequent words in any document.

The code snippet below performs these four operations:

Performing basic NLP Pre-Processing operations on the dataset (Image by Author)

This distributed implementation of the pre-processing steps took 8 seconds, compared with 30 seconds when I ran the same operations on a single machine.
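In outline, the NaN replacement and the four text-cleaning operations look roughly like the sketch below (the file path and the column names Gene, Variation, and Text are assumptions; the full code is in the GitHub repository). The same clean_text function can be applied in a distributed way by wrapping it in a Spark UDF.

```python
import re

import pandas as pd
from nltk.corpus import stopwords  # assumes the NLTK stopwords corpus has been downloaded

STOPWORDS = set(stopwords.words("english"))

def clean_text(text):
    """Apply the four basic NLP operations to one research-paper text."""
    text = re.sub(r"[^a-zA-Z0-9\n]", " ", str(text))  # A. special characters -> space
    text = re.sub(r"\s+", " ", text)                  # B. collapse multiple spaces
    text = text.lower()                               # C. lower-case everything
    # D. keep only the non-stopwords
    return " ".join(w for w in text.split() if w not in STOPWORDS)

# Hypothetical file path and column names (Gene, Variation, Text, Class)
train = pd.read_csv("training_merged.csv")

# 1. Missing values: replace NaN text with "Gene Variation"
missing = train["Text"].isnull()
train.loc[missing, "Text"] = train.loc[missing, "Gene"] + " " + train.loc[missing, "Variation"]

# 2. Basic NLP operations on every row
train["Text"] = train["Text"].apply(clean_text)

# Distributed variant: wrap the same cleaning function in a Spark UDF
from pyspark.sql import SparkSession, functions as F, types as T
spark = SparkSession.builder.appName("preprocess").getOrCreate()
train_sdf = spark.createDataFrame(train)
train_sdf = train_sdf.withColumn("Text", F.udf(clean_text, T.StringType())(F.col("Text")))
```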

3. Vectorizing Gene, Mutation (Variation) and Text Features: I created one-hot encodings for the genes, the mutations, and all the text in the training dataset. There were 65,946 unique words across all the articles in the training dataset.

Vectorizing Genes, Mutations and Words in Training Dataset Papers (Image by Author)
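One way to build such binary encodings is with scikit-learn’s CountVectorizer, using binary=True so that each entry records presence rather than count. The sketch below is illustrative, and the variable names are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Binary bag-of-words encoders: binary=True records presence (0/1) instead of counts
gene_vectorizer = CountVectorizer(binary=True)
variation_vectorizer = CountVectorizer(binary=True)
text_vectorizer = CountVectorizer(binary=True)

gene_onehot = gene_vectorizer.fit_transform(train["Gene"])
variation_onehot = variation_vectorizer.fit_transform(train["Variation"])
text_onehot = text_vectorizer.fit_transform(train["Text"])

print(text_onehot.shape[1])  # number of unique words across all training papers
```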

4. Combining All Encodings to Create a Consolidated Training Dataset:

For each row, I combined the gene, mutation, and text one-hot encodings to create a consolidated vector representing that information.

Creating a consolidated encoded training dataset (Image by Author)

Each row had 69,229 binary entries, as shown in the figure below:

Vectorized Representation of the Training Dataset (Image by Author)
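A minimal sketch of this step, assuming the three sparse one-hot matrices from the previous snippet, simply stacks them column-wise with scipy:

```python
from scipy.sparse import hstack

# Concatenate the gene, mutation, and text encodings column-wise,
# giving one consolidated binary feature vector per training example.
X_train = hstack([gene_onehot, variation_onehot, text_onehot]).tocsr()

print(X_train.shape)  # (number of rows, 69229) in my run
```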

5. Converting the Dataframe to Spark MLlib-friendly Format:

For training a distributed ML model, the pandas DataFrame shown in the figure above needs to be converted into a format that Spark MLlib can read. This is accomplished by the csvlibsvm1.py program that I have shared in the GitHub folder.
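Conceptually, the conversion writes each row as a class label followed by index:value pairs, which is the LIBSVM text format that Spark MLlib reads directly. The sketch below uses scikit-learn’s dump_svmlight_file as one way to produce such a file; it is a simplified stand-in for csvlibsvm1.py, not the exact code.

```python
from sklearn.datasets import dump_svmlight_file

y_train = train["Class"].values  # mutation class labels 1-9; the column name is an assumption

# LIBSVM text format: one line per example, "<label> <index>:<value> ..."
dump_svmlight_file(X_train, y_train, "train_libsvm.txt", zero_based=False)
```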

Training the MLR model

Here’s the code snippet that I used to train the MLR model. I set the hyperparameter values (regularization constant regParam=0.01 and elasticNetParam=1) that yielded good results after manually tweaking them and observing the results. A more systematic approach would be a grid search (GridSearchCV in scikit-learn, or CrossValidator with ParamGridBuilder in Spark MLlib), and I will update this post once I have done that and obtained better results.

Distributed Training of a Multinomial Logistic Regression Model in Spark MLLib (Image by Author)
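In outline, the training step looks roughly like the sketch below; the file path, split ratio, and maxIter value are assumptions, and the full code is in the GitHub repository.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("MLR-mutation-classification").getOrCreate()

# Load the LIBSVM file produced in the previous step
data = spark.read.format("libsvm").load("train_libsvm.txt")

# My test set was carved out of the training data; the split ratio here is an assumption
train_df, test_df = data.randomSplit([0.8, 0.2], seed=42)

mlr = LogisticRegression(
    family="multinomial",
    regParam=0.01,        # regularization constant used in this post
    elasticNetParam=1.0,  # L1 penalty
    maxIter=100,          # assumption; not stated in the post
)
model = mlr.fit(train_df)
predictions = model.transform(test_df)
```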

Results

I got an overall classification accuracy of 64% and a multinomial log loss of 1.12, which is comparable to the values reported by the best performers of the Kaggle competition, who achieved a multinomial log loss of 2.03. The comparison is not exact, because their test dataset was different from mine: I prepared my test dataset by splitting the training dataset. Nevertheless, MLR seems to work well for this dataset overall.
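These metrics can be computed directly in MLlib with MulticlassClassificationEvaluator (the logLoss metric requires Spark 3.0 or later); the sketch below assumes the predictions DataFrame from the training snippet.

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

accuracy = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
).evaluate(predictions)

# logLoss needs the probability column produced by the logistic regression model
log_loss = MulticlassClassificationEvaluator(
    labelCol="label", probabilityCol="probability", metricName="logLoss"
).evaluate(predictions)

print(f"Accuracy: {accuracy:.2f}, multinomial log loss: {log_loss:.2f}")
```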

Here’s the Recall Matrix:

Recall Matrix (Image by Author)

It is encouraging to see that the model can predict the actual classes well. The 2–7, 3–7, and 8–7 mislabeling is significant and needs to be investigated further. This could be due to a high bias that leads the model to classify the minority classes (2, 3, 8) as the majority class (7).

The Precision Matrix is shown below:

Precision Matrix (Image by Author)

The model has good precision overall, as can be seen from the quite high values on the diagonal. Precision for classes 3 and 5 is below par, which points to a direction for further work.

The Confusion Matrix is shown below:

Confusion Matrix (Image by Author)

Once again, the model has good accuracy in predicting mutation classes. There is significant 2–7 and 1–4 confusion that needs to be investigated further.
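For reference, all three matrices can be derived from a single confusion matrix over the test-set predictions: row-normalizing gives the recall matrix and column-normalizing gives the precision matrix. The sketch below uses scikit-learn and may differ in details from the plotting code in the repository.

```python
from sklearn.metrics import confusion_matrix

# Pull the (true, predicted) label pairs out of the Spark predictions DataFrame
pairs = predictions.select("label", "prediction").collect()
y_true = [int(row["label"]) for row in pairs]
y_pred = [int(row["prediction"]) for row in pairs]

cm = confusion_matrix(y_true, y_pred)                    # raw counts
recall_matrix = cm / cm.sum(axis=1, keepdims=True)       # rows (true classes) sum to 1
precision_matrix = cm / cm.sum(axis=0, keepdims=True)    # columns (predicted classes) sum to 1
```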

Conclusion

I found that distributed computing can significantly accelerate biomedical text mining, which is especially relevant for Big Data. There are several areas of improvement that I can think of: improving precision and recall further via hyperparameter tuning; using deep learning (especially BERT-based models); understanding what the model learns by examining the learned coefficients; coming up with better ways of representing the data; and applying this method to other areas of biomedical text mining to examine its performance. I will post articles on a few of these topics in the future as and when I have interesting results to share.

Hope you found this two-part series useful. Feel free to reach out if you want to know more about any part of this study or want to collaborate with me.

(Again, I am thankful to Namratesh for his post on this dataset, which helped guide the initial parts of my data preprocessing.)
