Machine Learning in Breast Cancer Research

How explainable machine learning can be applied for pattern recognition in breast cancer research

Jarrett Evans
Towards Data Science

Photo by Tara Winstead from Pexels

Over the past few years, there has been a lot of hype around how machine learning will transform many industries, and health care is one of those mentioned most often. In this post, we will explore one kind of impact it can have by examining an article published in the journal Nature Machine Intelligence in 2021: Morphological and molecular breast cancer profiling through explainable machine learning.

This post will have three sections: Purpose of the study, Methods / Results of the study, and Why this matters for Oncology and health care.

Purpose

Historically, the various profiling techniques used in oncology research have been poorly integrated, which leaves a disconnect between the molecular data and the morphological data that describe a tumor. The researchers in this study set out to use machine learning to bridge the gap between these two types of profiling. If successful, this approach can help generate hypotheses about connections between cell types and molecular properties.

Methods / Results

Machine Learning Algorithm

The machine learning method used in the study is based on a technique called layer-wise relevance propagation (LRP), a type of explainable machine learning algorithm. Instead of only giving an output for a particular input, the approach highlights the input features that mattered most for that output. For images, the highlighted features are pixels. The result is a heatmap over the input image showing which pixels had the greatest influence on the output: pixels drawn with high color intensity are those with high relevance scores for the given classification. The basic equation for this process can be denoted as:

$$R_j = \sum_k \frac{z_{jk}}{\sum_{j'} z_{j'k}} R_k$$

Here j and k index neurons in two consecutive layers of the neural network, with j in the lower layer. z_jk represents how much neuron j contributed to making neuron k relevant, and R_j and R_k are the relevance scores of the two neurons. The denominator sums the contributions of all lower-layer neurons j' to neuron k, so the relevance arriving at k is fully redistributed to the layer below.
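As a rough illustration of the redistribution rule described above, here is a minimal pure-Python sketch on a tiny fully connected network. The network, its weights, and the helper names (dense, relu, lrp_dense) are made up for this example and are not taken from the study. Biases are left out so that relevance is conserved exactly, and z_jk is assumed to be the activation of neuron j times the weight connecting it to neuron k, as in the basic rule from the overview paper.

```python
def dense(a, W):
    """Forward pass of a bias-free dense layer; W[j][k] links input j to output k."""
    return [sum(a[j] * W[j][k] for j in range(len(a))) for k in range(len(W[0]))]


def relu(v):
    return [max(0.0, x) for x in v]


def lrp_dense(a, W, R_upper, eps=1e-12):
    """Redistribute the relevance of layer k (R_upper) onto layer j (activations a)."""
    n_in, n_out = len(W), len(W[0])
    # z[j][k]: contribution of neuron j to neuron k (assumed to be a_j * w_jk).
    z = [[a[j] * W[j][k] for k in range(n_out)] for j in range(n_in)]
    # Total contribution received by each upper-layer neuron k.
    totals = [sum(z[j][k] for j in range(n_in)) for k in range(n_out)]
    R_lower = []
    for j in range(n_in):
        r = 0.0
        for k in range(n_out):
            if abs(totals[k]) > eps:  # skip neurons that received no contribution
                r += z[j][k] / totals[k] * R_upper[k]
        R_lower.append(r)
    return R_lower


# Toy network: 3 inputs -> 2 hidden ReLU neurons -> 1 output (illustrative weights).
x = [1.0, 2.0, 0.5]
W1 = [[0.5, -1.0], [0.25, 0.5], [1.0, 1.0]]
W2 = [[1.0], [2.0]]

hidden = relu(dense(x, W1))
output = dense(hidden, W2)

# Start from the output score and propagate relevance back, layer by layer.
R_hidden = lrp_dense(hidden, W2, output)
R_input = lrp_dense(x, W1, R_hidden)

print("output score:", output[0])
print("input relevance:", R_input)
print("sum of input relevance:", sum(R_input))
```

The relevance scores of the three inputs add up to the output score, which is the conservation property that makes the resulting heatmap readable as a breakdown of the prediction. For an image model, each input would be a pixel and R_input would be the heatmap.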

LRP computes attributions that explain the total contribution of an input feature to the output, rather than the sensitivity of the output to a variation in the input that you would get from attention heatmaps. For a more technical explanation of how layer-wise relevance propagation works, please refer to the following paper: Layer-Wise Relevance Propagation: An Overview.
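The difference between contribution and sensitivity is easiest to see on a linear model. The sketch below is a toy illustration, not code from the study: it assumes a bias-free model f(x) = sum of w_i * x_i, for which the sensitivity of the output to each input is simply the weight w_i, while the contribution of each input (which is what the basic LRP rule reduces to in this case) is w_i * x_i. The weights and inputs are invented.

```python
def sensitivity(w, x):
    """Gradient of f(x) = sum(w_i * x_i) with respect to each input: just w_i."""
    return list(w)


def contribution(w, x):
    """Share of the output f(x) that each input is responsible for: w_i * x_i."""
    return [wi * xi for wi, xi in zip(w, x)]


w = [2.0, -1.0, 0.5]   # illustrative weights
x = [0.0, 3.0, 4.0]    # illustrative input

f_x = sum(wi * xi for wi, xi in zip(w, x))
sens = sensitivity(w, x)
contrib = contribution(w, x)

print("f(x):", f_x)
print("sensitivity:", sens)
print("contribution:", contrib)
```

The first input has the largest sensitivity, yet it contributes nothing to this particular output because its value is zero. The contributions also add up to f(x), whereas the sensitivities do not depend on the input at all.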

Application of the Algorithm

Now that we have an overview of how layer-wise relevance propagation works, we will explore how the research team applied it to morphological and molecular breast cancer data.

The team first created an image database (Berlin Cancer Image Base, B-CIB) with annotated patches of microscopy image data. Using LRP, the team was able to distinguish cancer cells from tumor-infiltrating lymphocytes (TiLs). TiLs are lymphocytic cells that can infiltrate tumor tissue and recognize and kill cancer cells. Their density can be used as a feature for predicting patient survival, and LRP provides a visual representation of that density.

The researchers then predicted molecular features using morphological image data as input for the algorithm. Essentially, they tried to derive insights into the molecular properties of a patient's tumor by scanning an image. The training data combined image data with molecular profiling data. Given the intent of this part of the study and the information available in the training dataset, manual spatial annotation was not required: the algorithm could identify patterns within the images by being fed the molecular data together with the morphological data during training. To reduce the dimensionality of the classification task, they used a high vs. low classification approach for the different molecular features. Given a morphological image as input, the algorithm predicted whether the expression of a molecular feature was high or low. Two genes that scored high for levels of expression were CDH1 and TP53.
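The high vs. low framing turns a continuous measurement into a binary label that a classifier can be trained on. Here is a minimal sketch of that idea. The median split is an assumption made for illustration (this post does not state the threshold the authors used), and the expression values and function name are invented.

```python
from statistics import median


def binarize_high_low(values):
    """Label each value 1 ("high") or 0 ("low") relative to the cohort median.

    The median threshold is an assumption for this sketch, not the study's rule.
    """
    threshold = median(values)
    labels = [1 if v > threshold else 0 for v in values]
    return labels, threshold


# Hypothetical expression levels of one molecular feature across four patients.
expression = [1.0, 5.0, 3.0, 8.0]
labels, threshold = binarize_high_low(expression)

print("threshold:", threshold)
print("labels:", labels)
```

Each image patch from a patient would then inherit that patient's label, so the network learns to predict "high" or "low" from morphology alone without any manual spatial annotation.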

Importance of the CDH1 and TP53 Genes

It makes sense that the algorithm produced a high expression score for these molecular features, given prior knowledge of their role in breast cancer.

TP53 is a tumor suppressor gene that regulates cell division. When this gene is mutated, excessive cell division can occur and tumors can form.

Mutations of CDH1 have been linked to cancer progression through increased proliferation (an increase in cell numbers) and metastasis (the development of cancer growths away from the place of origin). The protein produced by CDH1 is E-cadherin, which is critical for cell adhesion, the process that keeps cells together. It is thought that a genetic change in this gene can allow cancer cells to detach more easily from the primary site of the tumor, leading to metastasis.

The team also experimented with predicting the spatial localization of molecular features. This prediction shows spatial regions statistically associated with the expression of certain molecular features. That information can be used to form hypotheses about how relevant specific components of the tumor microenvironment are to features of the molecular tumor profile, which drives the discovery of potential links between the histological features visible in the image and the molecular features that may be expressed. These links can in turn yield candidate lists of molecular features related to breast cancer.

Ultimately, the machine learning approach revealed spatial and morphological features that are statistically associated with the expression of various molecular features. This information was displayed through a heatmap showing that, in most cases, molecular features have specific associations with the morphological groups tested for: cancer cells, TiLs, and stroma (support cells).
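One simple way to picture an association between a molecular feature and a morphological group is to average the relevance heatmap over the pixels belonging to each group. The sketch below is only an illustration of that idea under stated assumptions: it presumes a per-pixel label map is available, it is not the statistical procedure used in the study, and the relevance values, labels, and function name are invented.

```python
def mean_relevance_by_class(relevance, labels):
    """Average the relevance scores over the pixels of each morphological class.

    relevance and labels are 2D lists of the same shape (hypothetical inputs).
    """
    totals, counts = {}, {}
    for rel_row, label_row in zip(relevance, labels):
        for score, label in zip(rel_row, label_row):
            totals[label] = totals.get(label, 0.0) + score
            counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}


# Toy 2x3 heatmap for one molecular feature, with a matching label map.
relevance = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.0],
]
labels = [
    ["cancer", "cancer", "stroma"],
    ["cancer", "til", "stroma"],
]

summary = mean_relevance_by_class(relevance, labels)
print(summary)
```

In this toy case the relevance concentrates on the pixels labeled as cancer cells, which would suggest a hypothesis that this molecular feature is tied to the cancer cells rather than to the surrounding TiLs or stroma.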

To test the validity of the results, the team used a technique called immunohistochemical (IHC) staining and compared the stained images with the heatmaps generated by the machine learning algorithm. They used a quadrat test to show the associations between the two. A quadrat test measures the spatial randomness of a point pattern: it divides the area into cells (quadrats), counts the points in each, and applies a chi-squared test to those counts. Overall, the IHC patterns reflected the patterns predicted by the machine learning algorithm, which supports the validity of the approach.
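To make the quadrat test more concrete, here is a minimal standard-library sketch of the generic textbook version: divide the area into a grid, count the points per cell, and compare the counts with the equal counts expected under complete spatial randomness using a chi-squared statistic. The grid size, point patterns, and function names are assumptions for illustration, and the study's exact setup may differ. The p-value helper uses a closed form that is exact only for an even number of degrees of freedom, which is why a 3 x 3 grid (8 degrees of freedom) is used.

```python
import math


def quadrat_counts(points, nx, ny, width, height):
    """Count how many points fall in each cell of an nx-by-ny grid."""
    counts = [0] * (nx * ny)
    for x, y in points:
        ix = min(int(x / width * nx), nx - 1)   # clamp points on the far edge
        iy = min(int(y / height * ny), ny - 1)
        counts[iy * nx + ix] += 1
    return counts


def chi2_sf_even_df(stat, df):
    """Upper-tail probability of a chi-squared variable; exact for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = stat / 2.0
    term, total = 1.0, 0.0
    for i in range(df // 2):
        total += term
        term *= half / (i + 1)
    return math.exp(-half) * total


def quadrat_test(points, nx, ny, width=1.0, height=1.0):
    """Return (chi-squared statistic, degrees of freedom, p-value)."""
    counts = quadrat_counts(points, nx, ny, width, height)
    expected = len(points) / len(counts)  # equal counts under spatial randomness
    stat = sum((c - expected) ** 2 / expected for c in counts)
    df = len(counts) - 1
    return stat, df, chi2_sf_even_df(stat, df)


# 18 points packed into one corner of the unit square (strongly clustered).
clustered = [(0.10 + 0.01 * i, 0.10) for i in range(18)]
# 18 points spread evenly, two at the center of each of the 9 quadrats.
even = [((ix + 0.5) / 3, (iy + 0.5) / 3) for ix in range(3) for iy in range(3)] * 2

stat_c, df_c, p_c = quadrat_test(clustered, 3, 3)
stat_e, df_e, p_e = quadrat_test(even, 3, 3)

print("clustered:", stat_c, df_c, p_c)
print("even:", stat_e, df_e, p_e)
```

A large statistic with a very small p-value, as in the clustered pattern, indicates that the points are concentrated in particular regions rather than scattered at random.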

Why This Matters for Oncology and Health Care

The study we looked at today is an excellent example of how machine learning can be used in cancer research. It will not necessarily replace tools already in place, but it can complement them. One use case in oncology research is hypothesis generation from new patterns that emerge through a machine learning-based approach. The hypotheses that could come out of the study explored in this article would mostly concern newly discovered relationships between non-spatial molecular features and spatial information. These relationships can support more refined tumor grading and new candidate lists for potential targeted therapies.

In an industry where pattern recognition is crucial, machine learning can act as a guiding light for pushing discovery further.

Layer-Wise Relevance Propagation: An Overview — Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Illustration-of-the-LRP-procedure-Each-neuron-redistributes-to-the-lower-layer-as-much_fig2_335708351 [accessed 17 Aug, 2022]

Binder, A., Bockmayr, M., Hägele, M. et al. Morphological and molecular breast cancer profiling through explainable machine learning. Nat Mach Intell 3, 355–366 (2021). https://doi.org/10.1038/s42256-021-00303-4

MedlinePlus [Internet]. Bethesda (MD): National Library of Medicine (US); [updated 2020 Jun 24]. CDH1 gene; [updated 2017 Aug 1; reviewed 2018 Jul 1; cited 2022 Aug 17]; Available from: https://medlineplus.gov/genetics/gene/cdh1/#conditions
