Application of Machine Learning Algorithms in Modeling the Role of the Microbiome in Colorectal Cancer Diagnosis and Therapy: Part 1

Introduction: bioinformatics framework design and methodology overview

Published in Towards Data Science, Dec 25, 2022 (13 min read)

After seven years of intensive and dedicated research work, I am wrapping up this year with a technical contribution to the fields of Applied Biosciences and Bioengineering. Working with an extraordinary group of microbiologists, biologists and bioinformatics scientists, my engineering contribution was designing and implementing high-performance machine learning classification algorithms for understanding the colorectal cancer drug resistance mechanism and carcinogenesis. Looking back, I welcomed the challenge of studying the fields of Machine Learning (ML) and Artificial Intelligence (AI) which were only buzzwords to me then. However, besides getting familiar with ML and AI, the epilogue of this story was the successful accomplishment of my PhD studies in Computer Science and Engineering in the field of Bioinformatics.

Considering the relatively long research period during the studies, I decided to summarize the main action points and other practical and scientific perspectives in a series of consecutive articles based on my dissertation titled “Application of Machine Learning algorithms in modelling and understanding the role of the Microbiome in the Colorectal Cancer diagnosis and therapy”. The aim is to briefly explain my contribution to predictive modelling in healthcare: designing and developing a comprehensive bioinformatics framework and machine learning pipeline for deep microbiome data analysis and interpretation. Targeting the gut microbiota, I aimed to provide high-performing machine learning models and a methodology that help clinicians efficiently analyze resistant patients’ microbiome diversity in order to address tumor proliferation, newly developed adenoma, inflammation promotion, and potential DNA damage.

This introductory article covers the scientific background and presents the methodology design for two different case studies: the colorectal cancer drug-resistance mechanism and carcinogenesis. The following articles will then work through each of these cases in practical detail.

* Note: It is worth mentioning here that this series of articles is based on the study titled “Understanding the Role of the Microbiome in Cancer Diagnostics and Therapeutics by Creating and Utilizing ML Models”, previously published in the MDPI Applied Sciences journal. Therefore, all references and cross-references related to the scientific background and the relevant literature used for the biological interpretation of the results can be explicitly found there.

Introduction

The microbiome is often called the second human genome, since its genes and genetic potential amount to approximately 200 times the number of genes in the human genome. Moreover, there are ten times more microbial cells in the human gut microbiota than human cells in the whole body. These 100 trillion microbes represent as many as 7,000 different species and weigh approximately 2 kilograms, making them a good base for scientific research and investigation.

Colorectal cancer (CRC), in turn, is one of the most common malignant tumors, ranking in the top three causes of cancer-related death worldwide. The number of new cancer cases is estimated at 19.3 million, of which 10.0% are colorectal cancer (according to the official statistics from 2020/2021). Similarly, out of 10 million cancer deaths, approximately 9.4% are due to CRC. The high mortality rate of CRC patients may be due to many genetic and environmental factors. One of them is that the treatment of patients with colorectal cancer can be unreliable because of the influence of the gut microbiota.

The bacteria have well-known functions in the human organism and tend to live in symbiosis with it through the production and fermentation of metabolites. Moreover, these bacteria actively participate in the immune system response. Disruption of the microbiome in the colon may cause inflammation and thereby promote the development of colorectal cancer. Numerous scientific studies have verified that gut microbiota can alter CRC susceptibility and progression, since gut microbiota can impact colorectal carcinogenesis. Additionally, it is well known that the microbiome can influence metabolic pathways, modulate anticancer drug efficacy, and cause drug resistance.

Recent scientific work has highlighted the potential of machine learning (ML) algorithms for building data-driven frameworks and experimental setups that go beyond traditional biostatistical methods. These approaches target the microbiota with diverse strategies and open new opportunities for therapies tailored to individual patients. In this context, classical supervised and unsupervised learning on the one hand, and multi-layer artificial neural networks or deep learning (DL) on the other, both under the umbrella of artificial intelligence (AI), are treated as two different subfields for analyzing gut microbiota insights regarding cancer development and potential therapeutic effects.

The aim was to design and develop a comprehensive bioinformatics framework and ML pipelines of a two-phase methodology for modelling and interpreting the key biomarkers that can play a significant role in understanding the drug-resistant mechanism and carcinogenesis for patients diagnosed with colorectal cancer. This framework would also identify important aggregated bacterial biomarkers that jointly contribute to the predictive character of the machine learning models, followed by interpretation and extraction of data knowledge and semantics about the biological role, activity and properties of the most significant features (in this case, bacteria).

Data

The intention was to re-analyze a publicly available microbiome dataset to assess the critical influence of bacterial species present in the human gut that can cause chemotherapy resistance or influence CRC carcinogenesis.

Dataset Demographics

The raw dataset and clinical metadata information are part of the “Gut microbiota in patients after surgical treatment” article previously published in the Environmental Microbiology journal. The data was obtained by sequencing the V3-V4 region of the 16S ribosomal RNA gene amplified from the individuals’ fecal samples. The authors note that the study is cross-sectional, meaning that the pre-operative and post-operative fecal samples were not collected from the same CRC patients. In total, the data comprises 116 individual microbiome samples: 23 from patients diagnosed with tubular adenoma (19.8%), 15 from CRC patients before the operation (12.9%), 47 CRC post-operative samples (40.5%), and 31 healthy control samples (26.7%). The general data overview is presented in the image below.

Considering the clinical metadata about the follow-up surgical resection in the interval from 6 to 36 months, I divided the CRC post-operative samples into two distinct case studies visually described in the following image.

Image by Author - CRC drug-resistance and carcinogenesis case studies data

The first case study compared a group of 21 samples from patients with newly developed adenoma, labelled as resistant, with a group of the remaining 26 samples from patients with a clean intestine, labelled as not resistant. The second case study compared a group of 23 samples from patients with pre-operative tubular adenoma with the group of 21 samples from patients diagnosed with a post-operative newly developed adenoma.
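The two case-study subsets can be sketched with pandas as follows. This is only an illustration built from the group sizes quoted above: the metadata table, the `group` labels, and the `target` column are hypothetical names, not the ones used in the original dataset.

```python
import pandas as pd

# Hypothetical metadata table rebuilt from the group sizes quoted in the text.
groups = (
    ["tubular_adenoma"] * 23
    + ["crc_pre_operative"] * 15
    + ["post_op_new_adenoma"] * 21
    + ["post_op_clean"] * 26
    + ["healthy_control"] * 31
)
metadata = pd.DataFrame(
    {"sample_id": [f"S{i:03d}" for i in range(len(groups))], "group": groups}
)

# Case study 1 (drug resistance): newly developed adenoma vs. clean intestine.
case1 = metadata[metadata["group"].isin(["post_op_new_adenoma", "post_op_clean"])].copy()
case1["target"] = (case1["group"] == "post_op_new_adenoma").astype(int)

# Case study 2 (carcinogenesis): pre-operative tubular adenoma vs.
# post-operative newly developed adenoma.
case2 = metadata[metadata["group"].isin(["tubular_adenoma", "post_op_new_adenoma"])].copy()
case2["target"] = (case2["group"] == "post_op_new_adenoma").astype(int)

print(len(metadata), len(case1), len(case2))
```

Both subsets are small binary classification problems, which is why the later modelling steps focus on binary classifiers.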

Taxonomic Analysis

The initial raw study data was released in December 2018. Because bacterial references, and even the taxonomies themselves, change constantly, we regenerated the operational taxonomic units (OTUs) and improved the taxonomic precision by removing the adapter and barcode sequences and the amplicon sequence primer sets for the V3-V4 regions. We performed this step using the BBMap (v38.90) tool. We then reannotated the raw reads against the updated bacterial references to avoid taxonomic bias in the data. For this, we generated the OTUs using the DADA2 and phyloseq packages in the R 4.0 analytical platform with the SILVA 138.1 16S reference database (latest reference database update on 27 August 2020).

Data Preprocessing and Transformation

As a result, we identified a total of 3603 ASVs, phylogenetically defined at several levels (Kingdom, Phylum, Class, Order, Family, Genus, and Species). We generated the reference dataset with a simple inner join on the ASV identifier. By pivoting the table, we then arranged the ASVs by their count values across the different samples.

The data was organized according to the following ER diagram:

Image by Author - Data structure general overview

Because species-level information was insufficient, we further analyzed and processed the microbial composition at the genus level. After filtering, handling missing information (N/A values), and aggregating/merging the data based on ASV naming and abundance, we reduced the working dataset to 259 unique bacteria at the genus level distributed across 116 microbiome samples, including the clinical metadata.

This process is presented visually in the flow diagram below.

Image by Author - Data preprocessing and transformation flow
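As a rough sketch of the join, pivot, and genus-level aggregation steps described above, the following pandas snippet works on a tiny made-up example. The table layouts and the column names (`sample_id`, `asv_id`, `count`, `genus`) are assumptions for illustration and do not reflect the actual schema used in the study.

```python
import pandas as pd

# Hypothetical long-format count table: one row per (sample, ASV) pair.
counts = pd.DataFrame({
    "sample_id": ["S1"] * 4 + ["S2"] * 4,
    "asv_id": ["ASV1", "ASV2", "ASV3", "ASV4"] * 2,
    "count": [10, 5, 2, 1, 0, 7, 4, 3],
})

# Hypothetical taxonomy table keyed by the same ASV identifier.
taxonomy = pd.DataFrame({
    "asv_id": ["ASV1", "ASV2", "ASV3", "ASV4"],
    "genus": ["Bacteroides", "Bacteroides", "Prevotella", None],
})

# Inner join on the ASV identifier attaches the taxonomy to each count.
merged = counts.merge(taxonomy, on="asv_id", how="inner")

# Drop ASVs without a genus-level annotation (the N/A handling step).
merged = merged.dropna(subset=["genus"])

# Pivot to a samples x genera matrix, summing ASVs that share a genus.
genus_table = merged.pivot_table(
    index="sample_id", columns="genus", values="count",
    aggfunc="sum", fill_value=0,
)
print(genus_table)
```

Summing counts within a genus is one possible aggregation choice; other rules (for example, keeping the most abundant ASV per genus) would change the resulting table.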

As described before, the data were additionally divided into two separate case studies related to the CRC immunotherapy effect and histology-based carcinogenesis, visually presented in the following diagrams:

Image by Author - CRC therapy-resistance and carcinogenesis case studies data

Data Normalization and Scaling

Machine learning algorithms can be biased when applied to data that is not scaled or normalized. In this case, the data represents bacterial presence across the samples, with values that vary significantly from one bacterium to another. Without scaling or normalization, features with large values can dominate those with small ones.

For this purpose, I initially tried the Scikit-learn StandardScaler (removing the mean and scaling to unit variance), the MinMaxScaler (scaling to a given range, here 0.0 to 1.0), and the KNIME Z-Score (Gaussian) linear normalization. I applied the centering and scaling methods independently to the training and test datasets on each feature, computing the relevant statistics (mean and standard deviation) from the samples in the given dataset and using them in the transform step. Additionally, I calculated two data consistency coefficients under the different scaling/normalization methods: Cronbach’s alpha, as a reliability measure of internal consistency (feature correlation), and Cohen’s kappa, as a measure of inter-rater data reliability.
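The two Scikit-learn scalers mentioned above can be sketched as follows. The data here is synthetic, and the snippet shows one common convention, assumed for illustration rather than taken from the study: the scaling statistics are learned on the training split and then reused on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for genus-level counts: 116 samples x 20 genera,
# with very different typical abundances per genus.
lam = rng.uniform(1, 200, size=20)
X = rng.poisson(lam=lam, size=(116, 20)).astype(float)
y = rng.integers(0, 2, size=116)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Z-score scaling: mean and standard deviation come from the training split.
standard = StandardScaler().fit(X_train)
X_train_std = standard.transform(X_train)
X_test_std = standard.transform(X_test)

# Min-max scaling to the range [0.0, 1.0], again fitted on the training split.
minmax = MinMaxScaler(feature_range=(0.0, 1.0)).fit(X_train)
X_train_mm = minmax.transform(X_train)
X_test_mm = minmax.transform(X_test)

print(X_train_std.mean(axis=0).round(3)[:5], X_train_mm.max(axis=0)[:5])
```

Note that test-split values can fall slightly outside [0.0, 1.0] after min-max scaling, because the minimum and maximum are taken from the training split only.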

Methodology — Bioinformatics Framework Design

After data preprocessing and transformation, I proceeded with the design and implementation of the methodology workflow generalized in the diagram below.

Image by Author - General overview of the methodology design

Using the dataset described above, I applied machine learning and statistics in a supervised learning setting to examine the biological features and model the drug-resistance mechanism. In supervised learning, the computer program ‘learns’ from labelled reference data and then makes predictions (binary or multi-class) on previously unseen structured or unstructured data. The data consists of 259 unique bacteria at the genus level distributed across 116 microbiome samples, with each bacterium described by its count value per sample. An additional categorical target column was introduced, which provides the pre-operative and post-operative medical assessment information derived from the metadata (including the record of the samples’ histology and treatment).

The main methodology and framework diagram flow is visually presented within the following image:

Image by Author - Study methodology diagram for modeling and interpreting the key biomarkers that play a significant role in understanding the drug-resistant mechanism and carcinogenesis for CRC patients

ML Modelling Screening Phase

In the ML screening phase, labelled ‘algorithm benchmark analysis’, I ran a set of supervised ML algorithms to identify the most promising approach, as determined by the highest accuracy. Identifying the most trustworthy base algorithm opened the way to more advanced, related supervised algorithms that could improve accuracy and offer an understandable way of interpreting the contributions to the model’s predictions. Given the binary classification study design, I tried several well-known, industry-standard algorithms: Naïve Bayes, Logistic Regression, K-Nearest Neighbor, Support Vector Machine with Principal Component Analysis (PCA), and Decision Tree. As a fundamental reference point, I assumed that all features could be important and play a significant role in understanding the drug resistance mechanism. However, since feature dimensionality often directly affects the performance metrics of ML algorithms, I decided to reduce and semantically interpret the input set by splitting the modeling process into two subsequent stages.
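A benchmark of this kind could look roughly like the sketch below. It uses synthetic data from `make_classification` and default-ish settings; the cross-validation scheme, the scaling steps inside the pipelines, and the number of PCA components are all assumptions made for the example, not the configuration used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a small, wide microbiome dataset (47 samples).
X, y = make_classification(
    n_samples=47, n_features=50, n_informative=8, random_state=0
)

# Candidate algorithms from the screening phase, with illustrative settings.
candidates = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "K-Nearest Neighbor": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "SVM with PCA": make_pipeline(StandardScaler(), PCA(n_components=10), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Stratified 5-fold cross-validated accuracy for each candidate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

With so few samples, the accuracy estimates are noisy, so a ranking like this is best read as a coarse screen rather than a final verdict on any algorithm.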

ML Modelling Main Phase

Based on the decision tree approach’s performance metrics, I explored ensemble-based algorithms (the Scikit-learn Random Forest Classifier in Python and the Tree Ensemble Learner in KNIME), which build multiple decision trees and take advantage of majority voting across the trees. In selecting the machine learning algorithms, I focused on maximizing accuracy together with the overall sensitivity and specificity metrics. Thus, I simulated and optimized different ML models in both development environments, applying different dataset-splitting strategies and scaling and normalization techniques.

I have also performed an algorithm hyperparameter tuning for n_estimators, max_depth, and max_features parameters using the RandomizedSearchCV and GridSearchCV functionalities in Python Scikit-learn.
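A randomized search over those three parameters might look like the following minimal sketch. The candidate values, the number of iterations, and the cross-validation setup are placeholders chosen for the example, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data; the real input would be the genus-level table.
X, y = make_classification(
    n_samples=47, n_features=50, n_informative=8, random_state=0
)

# Illustrative search space for the three tuned parameters.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 3, 5, 10],
    "max_features": ["sqrt", "log2", 0.5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                      # number of sampled parameter combinations
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`GridSearchCV` takes the same estimator with a `param_grid` argument and evaluates every combination, which is feasible once the randomized search has narrowed the ranges.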

Accordingly, I used the narrowed output of the first stage as the input for the second modeling iteration, considering the significance and potential relevance of specific bacterial abundances. The approach aimed at a more in-depth analysis of the data, the models’ behavior, and improvements in the performance metrics, in an attempt to recognize and confirm the biomarker potential of a particular bacterial genus or group of genera.

The first phase was additionally followed by statistical and non-parametric data testing and analysis to examine the abundance within the different classes and find more data insight for further biological evaluations and findings. On the other hand, the second stage was designed following the same modeling approach as the first one, considering that the input features scope consists only of the most significant features determined in the previous step.

Extracting the highly contributing features

After completing the main modelling phase, I compared the models from both case studies to analyze the relevance of the input data and identify the features with the most predictive power. In the context of microbiome analysis, I treated the crucial features as the most informative ones, that is, the bacteria with the greatest potential for describing and understanding CRC carcinogenesis and the drug resistance mechanism. In this study, I used the random forest algorithm’s built-in feature importance, the permutation method, and feature importance computed with SHAP values. The Tree Ensemble Learner in KNIME provides a statistics table on the attributes of the different decision trees (output ports). Using statistics nodes, I therefore developed a component that calculates attribute importance from the splits at the root, first, and second levels of the trees.

I compared the most relevant variables defined and extracted from both environments to provide narrowed feature sets. Therefore, I have additionally analyzed this set of features and referenced it as a set of crucial features that play an important role in understanding the tumor proliferation mechanism impact on the reference gut microbiome dataset. This machine learning analysis assumed that high model accuracy directly influences the trustworthiness of the computed variable importance.
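Two of the importance measures named above, the built-in random forest importance and permutation importance, can be compared as in this sketch. It runs on synthetic data with made-up feature names (`genus_0`, `genus_1`, ...), and the top-10 cut-off is an arbitrary choice for the example; SHAP values are left out to keep the snippet dependency-free.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical genus names as feature labels.
X, y = make_classification(
    n_samples=116, n_features=30, n_informative=5, random_state=0
)
feature_names = np.array([f"genus_{i}" for i in range(X.shape[1])])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Built-in (impurity-based) importance, computed from the training data.
builtin_rank = np.argsort(model.feature_importances_)[::-1]

# Permutation importance: accuracy drop when a feature is shuffled,
# measured here on the held-out split.
perm = permutation_importance(
    model, X_test, y_test, n_repeats=20, random_state=0
)
perm_rank = np.argsort(perm.importances_mean)[::-1]

# Features ranked in the top 10 by both methods form a narrowed candidate set.
top_k = 10
shared = set(feature_names[builtin_rank[:top_k]]) & set(
    feature_names[perm_rank[:top_k]]
)
print(sorted(shared))
```

Intersecting the rankings from several methods is one simple way to narrow a feature set; it tends to be conservative, since a feature must rank highly under every method to survive.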

Statistical Analysis

As mentioned above, I carried out statistical and non-parametric data testing and analysis to examine the abundance within the different classes and gain more insight for further biological evaluations and findings. I initially used the Mann-Whitney U (Wilcoxon rank-sum) test to calculate the U value and p-value, along with the mean and median ranks, between the assigned classes in the microbial population of the dataset. Using R and KNIME statistics nodes, I also computed the corresponding p-values to detect the features with significantly different abundance levels between the defined groups. I additionally applied the Bonferroni and Benjamini-Hochberg p-value adjustments. Since the Bonferroni method, which controls the family-wise error rate (significance cut-off at α/n, where α = 0.05), was too strict statistically, I continued the analysis using Benjamini-Hochberg’s p-adjustment with a false discovery rate threshold of 0.15. After calculating the p-values, I ranked the features by importance and sorted them against the threshold of p-values < 0.05.
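The testing and adjustment steps can be sketched in Python with SciPy and NumPy (the study itself used R and KNIME nodes for this part). The abundance data below is simulated, with three genera deliberately shifted between groups, and the two adjustment functions are small hand-written helpers rather than library calls.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
n_genera = 40

# Simulated abundances: 21 "resistant" and 26 "non-resistant" samples.
resistant = rng.poisson(20, size=(21, n_genera)).astype(float)
non_resistant = rng.poisson(20, size=(26, n_genera)).astype(float)
non_resistant[:, :3] += 15  # three genera with a genuine abundance shift

# One two-sided Mann-Whitney U test per genus.
p_values = np.array([
    mannwhitneyu(
        resistant[:, j], non_resistant[:, j], alternative="two-sided"
    ).pvalue
    for j in range(n_genera)
])


def bonferroni(p):
    """Bonferroni-adjusted p-values: multiply by the number of tests."""
    return np.minimum(p * len(p), 1.0)


def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted p-values (false discovery rate control)."""
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted


p_bonf = bonferroni(p_values)
p_bh = benjamini_hochberg(p_values)

# Genera passing the false discovery rate threshold of 0.15.
significant = np.flatnonzero(p_bh < 0.15)
print(significant)
```

Benjamini-Hochberg adjusted p-values are never larger than the Bonferroni ones, which is why the procedure retains more candidate genera at the same nominal threshold.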

Aggregated/Joint Features contribution analysis

Additionally, I went further and established a more operational way of defining predictability through the sequence of regions corresponding to each decision tree model. Because the random forest classifier depends on a random state and the algorithm is stochastic by nature, I developed a custom component for building and evaluating 2500 classifiers with different random state initializations.
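The idea of evaluating many classifiers that differ only in their random state can be illustrated as follows. This is a scaled-down sketch on synthetic data: it trains 25 forests instead of 2500, and the train/test split and forest size are arbitrary assumptions rather than the settings of the custom component.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one of the case-study datasets.
X, y = make_classification(
    n_samples=47, n_features=30, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

n_runs = 25  # the text describes 2500 runs; kept small here for speed

# Train one forest per seed and record its held-out accuracy.
accuracies = np.array([
    RandomForestClassifier(n_estimators=50, random_state=seed)
    .fit(X_train, y_train)
    .score(X_test, y_test)
    for seed in range(n_runs)
])

best_seed = int(np.argmax(accuracies))
print(f"mean={accuracies.mean():.3f}, std={accuracies.std():.3f}, "
      f"best seed={best_seed} ({accuracies[best_seed]:.3f})")
```

Looking at the spread of accuracies across seeds, and not only at the best run, gives a sense of how much of a model's score is due to the random initialization.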

I wrapped up this process by incorporating a joint feature contribution analysis, which provides a deeper view of symbiotic bacteria through feature correlation and interaction in the final model predictions. For this, I used the treeinterpreter library (v0.2.3), applying its aggregated contributions convenience method to the best-performing second-phase predictive model. To interpret the entire trajectory of contributions, I extracted the specific combinations of features that make significant individual and joint contributions to predictions of the resistance class. Decomposing the features’ contributions along the prediction path of the algorithm resulted in aggregated contributions, which can better explain the impact of a set of correlated bacteria on the drug-resistance mechanism and carcinogenesis.

Bacterial Abundance and Cell cycle integration analysis

The bacteria have well-known functions in the human organism and tend to live in symbiosis by production and fermentation of metabolites — actively participating in the immune system response. Thus, due to their enzymatic effects, every bacterium influences different biological pathways and drug metabolism. I extracted the most informative features as input into pathway analysis for a profound understanding of their biological role and activity. I have done this using the OTUs to create potential metabolite profiling by applying the workflow inherent to the iVikodak tool, a meaningful bioinformatics tool and framework for profiling, analyzing, comparing and visualizing the 16S-based functional potential of microbial communities.

IDEs and Tools

The general overview of the list of IDEs and tools I used for the analysis and ML modeling is the following:

- KNIME Analytics Platform, version 4.3.1

- KNIME Database, version 4.3.1

- Python, version 3.9.0

- Jupyter Notebook, version 6.0.3 (Anaconda, version 4.9.2)

- R, version 4.0.4

- Scikit-learn, version 0.23.1

- Pandas (v1.0.5), Numpy (v1.18.5), Matplotlib (v3.2.2), Seaborn (v0.10.1), Pingouin (v0.3.9)

- KNIME extensions: KNIME Ensemble Learning Wrappers v4.3.0, KNIME Excel Support v4.3.1, KNIME Extension for Chromium Browser v83.0.4103, KNIME R Statistics Integration v3.4.2, KNIME JavaScript Views v4.3.0, KNIME Modular PMML Models v4.3.0, KNIME Optimization extension v4.3.0, KNIME Statistics Nodes v4.3.0, KNIME XGBoost Integration v4.3.1

Thank you for reading this introductory content, which I strongly believe is clear and generalized in a way of understanding the core concepts related to the bioinformatics field for microbiome analysis. As mentioned in the beginning, I will continue elaborating the results and biological interpretation in the following articles. Stay tuned.

Part 2 - Bioinformatics Framework design and Methodology - Machine Learning Modelling Results for the colorectal cancer drug-resistance mechanism

Part 3 - Bioinformatics Framework design and Methodology - Machine Learning Modelling Results for understanding the colorectal cancer carcinogenesis

Originally published at https://www.linkedin.com.