
Machine Learning in Medicine – Journal Club II

A Critical Appraisal of the Use of Machine Learning Techniques in Clinical Literature

Beware of claims comparing apples to oranges, or airliners to automobiles.

Photo by Daniel Fazio on Unsplash

Introduction

Welcome to another entry of the Machine Learning in Medicine Journal Club. As a reminder, the goal of this journal club is to help readers develop the knowledge and skills necessary to digest and critique biomedical journal articles that use machine learning techniques. Although some of its terminology and concepts may sound strange, machine learning is not magic, and we should evaluate findings derived from these techniques with the same level of skepticism as any other clinical research. For the purpose of this journal club, we will focus not on the content or clinical implications of the research, but on the methods and technical details related to machine learning. Along the way, we hope to highlight some common pitfalls and mistakes and to dispel some common misconceptions about machine learning.

Article

Guo A, Mazumder NR, Ladner DP, Foraker RE. Predicting mortality among patients with liver cirrhosis in electronic health records with machine learning. PLoS One. 2021 Aug 31;16(8):e0256428. doi: 10.1371/journal.pone.0256428. PMID: 34464403; PMCID: PMC8407576.

To carry on the theme of cirrhosis from our last Journal Club entry, we chose this recently published paper from PLoS One because its authors aimed to answer the same question posed by the previous article using similar techniques, yet reached the opposite conclusion. We should try to understand what these authors did differently that led to a different result.

PLoS One is a peer-reviewed, open-access online journal founded with the aim of making scientific publications freely accessible to all. The journal publishes research from all areas of science, engineering, medicine, and the social sciences. Although the journal does not publish its impact factor, it is regarded as a highly reputable and reasonably selective journal in the field of medicine.

Background

Patients with liver cirrhosis often have subclinical disease with very few symptoms when they are first diagnosed. The disease typically follows a slow, indolent course for many years, but in a small proportion of patients it may progress, and these patients can decompensate rapidly without warning. Being able to predict the mortality risk of an individual patient is clinically relevant for preventing decompensation events and for triaging inpatient and outpatient management.

The Model for End-Stage Liver Disease (MELD) score, originally developed to predict 3-month mortality in patients undergoing elective transjugular intrahepatic portosystemic shunts (TIPS), has been extensively studied and externally validated as a predictor of mortality in patients with cirrhosis.

MELD score. Image by Author.
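For readers who cannot view the figure, the commonly cited form of the MELD score (with each laboratory value clamped to its allowed range, e.g., a floor of 1.0 for each component) is approximately:

MELD = 3.78 × ln(serum bilirubin [mg/dL]) + 11.2 × ln(INR) + 9.57 × ln(serum creatinine [mg/dL]) + 6.43

The MELD-Na variant referenced later in the article further adjusts this score for serum sodium.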

Despite being the most widely used objective measure of liver disease severity, the MELD score is far from perfect. Its discriminatory power at low scores (<15) is very poor, as demonstrated by the nearly flat mortality curve in Figure 5 of [this paper](https://pubmed.ncbi.nlm.nih.gov/31394020/). We also know that the discriminatory power of the MELD score depends on the underlying etiology of liver disease, as shown in Table 1 of this paper, as well as on other clinical variables not captured by the score. Because the original model was only validated to predict 90-day mortality, its performance in predicting long-term mortality is uncertain.

Hypothesis and Specific Aims

In this article, the authors hypothesize that machine learning models can outperform the MELD score in predicting short-term and long-term mortality in patients with cirrhosis.

The specific aim of the paper is to use machine learning to predict 90-, 180-, and 365-day mortality in a cirrhotic cohort from a large academic liver transplant center.

Methods

  • Study format – single-center retrospective observational cohort study.
  • Study population – subjects included in the hospital electronic medical record system.
  • Inclusion criteria – adult patients with an initial diagnosis code of liver cirrhosis added between 1/1/2012 and 12/31/2019.
  • Exclusion criteria – None.
  • Study period – 365 days from the first occurrence of any liver cirrhosis-related diagnosis code.
  • Outcome – all-cause mortality at 90 days, 180 days, and 365 days.
  • Features – demographics, laboratory data measured at or before the time of first diagnosis code, and information related to initial diagnosis encounter (e.g., encounter type, exact diagnosis, etc.)

Data Preprocessing

Unlike the previous article, the authors of this article described the data preprocessing steps in much greater detail. Features with more than 10% missing values were not included in the models. Otherwise, multiple imputation (presumably sklearn.impute.IterativeImputer) was used for numerical features, while mode imputation was used for categorical features. The authors also built separate models using mean imputation instead of multiple imputation and provided the results in the supplementary content of the article.
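The article does not include the preprocessing code, so the following is a minimal sketch of how this step might look in scikit-learn. The DataFrame `df` and the column lists `numeric_cols` and `categorical_cols` are hypothetical placeholders, and the imputer settings are assumptions rather than the authors' actual configuration.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to use IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

def preprocess(df: pd.DataFrame, numeric_cols, categorical_cols, max_missing=0.10):
    # Drop any feature with more than 10% missing values
    keep_num = [c for c in numeric_cols if df[c].isna().mean() <= max_missing]
    keep_cat = [c for c in categorical_cols if df[c].isna().mean() <= max_missing]

    # "Multiple" imputation for numerical features (IterativeImputer implements
    # a MICE-style round-robin regression on the other features)
    df[keep_num] = IterativeImputer(random_state=0).fit_transform(df[keep_num])

    # Mode imputation for categorical features
    df[keep_cat] = SimpleImputer(strategy="most_frequent").fit_transform(df[keep_cat])

    return df[keep_num + keep_cat]
```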

The authors also acknowledged that the training dataset is highly imbalanced because the overall mortality rate is low. To account for this class imbalance, the authors employed the Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic samples of the minority class for use during the training phase of the models.

Synthetic Minority Over-sampling Technique. Image by Author.
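The paper does not name the SMOTE implementation used, so here is a minimal sketch using the imbalanced-learn package, assuming `X` is the imputed and numerically encoded feature matrix and `y` is the binary mortality label. The key point is that resampling is applied only to the training split.

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Hold out a validation split that keeps the original (imbalanced) class distribution
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Generate synthetic minority-class samples from the training split only
X_train_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
```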

Model Development

Three machine learning techniques were used in this article: logistic regression, random forest, and a deep neural network.

Logistic Regression (LR)

  • Presumably implemented using sklearn.linear_model.LogisticRegression with L2 regularization and an inverse regularization strength of 1.0.
  • Regularization optimization: L1 and L2 regularization with 10 different regularization strengths between 0 and 4 (one possible search is sketched below).
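A minimal sketch of what such a hyperparameter search might look like, assuming GridSearchCV and the SMOTE-resampled training data from the sketch above; the exact grid values are assumptions, since they are not reported in the article (and an inverse regularization strength of exactly 0 is not valid).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

lr_param_grid = {
    "penalty": ["l1", "l2"],
    "C": np.linspace(0.4, 4.0, 10),  # 10 strengths in (0, 4]; the authors' exact values are not reported
}
lr_search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),  # liblinear supports both L1 and L2
    lr_param_grid, scoring="roc_auc", cv=5,
)
lr_search.fit(X_train_res, y_train_res)
```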

Random Forest (RF)

  • Presumably implemented using sklearn.ensemble.RandomForestClassifier with 500 estimators and the number of features considered at each split set to the square root of the total number of features. All other hyperparameters are presumed to be set to their default values.
  • Tree count optimization: 200, 500, and 700 estimators
  • Maximum feature count optimization: auto, square root, and log2 (see the sketch below).
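A comparable sketch for the random forest search, again an assumption rather than the authors' actual code; note that the "auto" option mentioned in the article is equivalent to the square-root setting for classifiers and has been removed from recent scikit-learn releases.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_param_grid = {
    "n_estimators": [200, 500, 700],
    "max_features": ["sqrt", "log2"],  # "auto" == "sqrt" for classifiers in older scikit-learn versions
}
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    rf_param_grid, scoring="roc_auc", cv=5, n_jobs=-1,
)
rf_search.fit(X_train_res, y_train_res)
```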

Deep Neural Network (DNN)

  • Presumably implemented using keras.models.Sequential and keras.layers.Dense with 4 hidden layers and 128 nodes in each layer. A sigmoid function was used in the output layer and the ReLU function was used in the hidden layers. Binary cross-entropy was used as the loss function. The Adam optimizer was used with a mini-batch size of 512. (A minimal sketch follows this list.)
  • Network depth optimization: 2 to 8 layers
  • Layer dimension optimization: 128 and 256 nodes
  • Batch size optimization: 64, 128, and 512.
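
Based on that description, a minimal Keras sketch of the presumed architecture might look like the following; the epoch count and the tracked metric are assumptions, since neither is reported in the article.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(n_features, n_hidden_layers=4, width=128):
    # Multilayer perceptron: ReLU hidden layers, sigmoid output for binary classification
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    for _ in range(n_hidden_layers):
        model.add(layers.Dense(width, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[keras.metrics.AUC(name="auc")])
    return model

dnn = build_dnn(n_features=X_train_res.shape[1])
dnn.fit(X_train_res, y_train_res, batch_size=512, epochs=50,  # epoch count assumed
        validation_data=(X_val, y_val))
```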

Model performance was evaluated based on the area under the receiver operating characteristic (ROC) curve (AUC), overall accuracy, and F1 score using 5-fold cross-validation.
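The article does not state whether SMOTE was re-applied within each cross-validation fold. The sketch below shows one way that 5-fold evaluation could be wired up so that resampling only ever touches the training folds, using an imbalanced-learn pipeline and any of the tuned models from the sketches above.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate

# Resampling inside the pipeline is applied to the training folds only;
# each held-out fold keeps its original class distribution.
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", lr_search.best_estimator_)])  # or rf_search.best_estimator_, etc.

scores = cross_validate(pipe, X, y, cv=5, scoring=["roc_auc", "accuracy", "f1"])
print({k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})
```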

Results

The final analysis included 34,575 subjects with a mean age of 60.5 years, an even gender distribution, and a mostly (77.5%) white population. The mean MELD-Na score on admission was 11.5. The 90-, 180-, and 365-day mortality rates were 5.2%, 6.4%, and 8%, respectively.

In the end, 41 features were included in the full models. The authors also created partial models using only sodium, creatinine, total bilirubin, and INR (the components of the MELD-Na score).

Image by Aixia Guo et al. PLoS One. Reproduced under Creative Commons Attribution License.

Figure 2 of the article shows the performance of the three full models and the three partial models in predicting 90-, 180-, and 365-day mortality. Astute readers will immediately notice that Figures 2a, 2b, and 2c appear almost identical to one another. The AUC values for each model are also suspiciously close across all three outcomes.

Based on the F1 scores, the authors argued that the full RF model and the full DNN model perform better than the full LR model across all three outcomes.

F1 scores of the three models. Image by Author.

Using the same performance metrics, the authors also argued that the three full models outperform the three partial models. The authors further investigated the contribution of each feature by computing feature importance from the three full models. The authors concluded that while the MELD-Na variables were quite important, other features and laboratory values also "play an important role in the predictions."
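The article does not describe how feature importance was computed for each model type. One model-agnostic way to do it is permutation importance, sketched below under the assumption that `feature_names` holds the 41 feature labels; this is an illustration, not the authors' method.

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure the drop in validation AUC
result = permutation_importance(
    rf_search.best_estimator_, X_val, y_val,
    scoring="roc_auc", n_repeats=10, random_state=0,
)
ranked = sorted(zip(feature_names, result.importances_mean),
                key=lambda pair: pair[1], reverse=True)
print(ranked[:10])  # the ten most influential features
```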

Article Critique

We will begin our critique by commending the authors on the things that were done right. Unlike the previous article, this article provided much more detail in the methods section, allowing reviewers and readers to reproduce the models. The authors even provided contact information for requesting a copy of the original dataset. The only thing that could have made it better would have been to make the original code for generating the models available in the supplementary materials. Without the original code, we have to make certain assumptions about how these models were generated, such as which hyperparameters were left at their default values.

We should also commend the authors for describing the optimizations that were performed on the models. It shows that the authors made a reasonable effort to maximize the performance of their models and that the hyperparameters were chosen based on systematic analysis, instead of being plucked out of thin air. This is especially important for the DNN model because the performance of an artificial neural network can vary significantly depending on the hyperparameters being used. If we were to be nitpicky, we might ask for additional optimization of the tree depth hyperparameter of the RF model, which can often yield further performance improvement, and of the Adam optimizer settings, which can also affect the model’s performance. Ideally, the authors would include the results of these optimizations in the supplementary materials of the article to show that the final hyperparameter values were indeed associated with the peak performance of each model.

Finally, we should commend the authors for making an effort to account for class imbalance by employing the SMOTE technique during the training phase. SMOTE and related over-sampling techniques, such as ADASYN, Borderline-SMOTE, and SVM-SMOTE, are widely accepted approaches for handling class imbalance and have been shown to improve learning.

Model Comparisons

Our primary criticism of this article stems from one sentence in the conclusion section of the paper that reads "machine learning and deep learning models outperformed the current standard of risk prediction among patients with cirrhosis". The authors claimed that their machine learning models outperform the MELD-Na score, the current standard of risk prediction in cirrhotic patients, in predicting short-term and long-term mortality. The introduction section of the paper also made several references to the limitations of the MELD-Na score. If we consider the methods and results sections carefully, however, we realize that the study was never designed to directly compare the performance of the machine learning models to that of the MELD-Na score. What the authors did was compare the full models, containing 41 features, against the partial models, containing only 4 features. Unless the additional 37 features were all made up of randomly generated values, it is no surprise that the full models would perform better than the partial models. The only way to support the authors’ claim that machine learning models outperform the MELD-Na score would have been to apply the models to one of the original datasets that validated the MELD score as a mortality predictor in cirrhotic patients. Unfortunately, this was not done in this paper. The authors may claim that their full models perform reasonably well based on the F1 score, but they cannot claim that the models outperform the MELD-Na score unless they can show that their partial models are equivalent to the MELD-Na score.

We also take exception to the authors’ claim that the RF model and the DNN model outperform the LR model based on the F1 scores. We will discuss the issue of performance metrics later, but we must first highlight the futility of directly comparing the performance of different modeling techniques, which has unfortunately become quite a common occurrence in the biomedical literature these days, such as [this](https://pubmed.ncbi.nlm.nih.gov/30049182/), [this](https://pubmed.ncbi.nlm.nih.gov/32744882/), and this. Decision tree models, such as random forest and gradient boosting machines, are limited by the fact that each decision point can only consider one feature at a time. Logistic regression models are limited by the assumed linear relationship between the feature variables and the log-odds of the outcome variable. Neural networks are not restricted by these limitations. Given a sufficient number of neurons and layers, an artificial neural network can, in theory, approximate any continuous function. It is, therefore, not at all surprising that a properly tuned neural network can outperform models developed with other machine learning techniques. There are, however, many practical considerations other than raw accuracy that we must take into account when choosing a machine learning technique. For instance, a logistic regression model can be concisely captured by a mathematical formula and implemented easily in any production environment, whereas a random forest model or neural network must be implemented in a programmable environment. Training a neural network also requires significantly more computational power and, in general, a larger training dataset than decision tree models. Publishing a paper stating that a neural network performs better than a logistic regression model in predicting a certain clinical outcome after being trained on the same dataset is no more meaningful than stating that an airliner can travel faster between two places on Earth than an automobile. We, as a community, need to stop making these pointless comparisons.

Performance Metrics

Let’s move on to the use of performance metrics in this paper. As we discussed in our previous journal club entry, accuracy and the F1 score (as well as sensitivity/recall/true positive rate, specificity/selectivity/true negative rate, precision/positive predictive value, and negative predictive value) are point estimates of a classification model’s performance. Their values depend on the particular threshold being used for classification. The appropriate threshold is typically chosen by balancing the cost of a correct prediction against that of an incorrect prediction. The area under the curve (AUC) measures a model’s performance over a range of classification thresholds and is, therefore, an assessment of the intrinsic discriminatory power of the model. The authors of this paper correctly reported the AUC as the primary performance metric in Figure 2, but unfortunately missed the fact that they should have used the area under a precision-recall curve (AUPRC) instead of the area under a ROC curve (AUROC). Remember that the ROC curve can paint a misleadingly rosy picture when classes are imbalanced, whereas the precision-recall curve focuses on the minority class. Since the SMOTE technique was only applied to the training dataset, and not the validation dataset, using the AUROC as a performance measure would result in a skewed estimate dominated by the negative outcomes. While we have already pointed out the futility of making performance comparisons between different machine learning techniques, we should also point out that making such a comparison based on F1 scores, without specifying how the optimal F1 score for each model was determined, is even worse. Remember that the F1 score of a model can vary significantly along the precision-recall curve, and the default threshold of 0.5 is often not the optimal threshold.

Class Imbalance and Performance Metrics. Image by Author.
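As a concrete illustration, here is a minimal sketch of how the AUPRC and the threshold dependence of the F1 score could be examined on the untouched validation split. It assumes the `dnn` model and validation data from the sketches above; for the scikit-learn models, `predict_proba(X_val)[:, 1]` would be used instead.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Predicted probabilities on the imbalanced validation split
y_score = dnn.predict(X_val).ravel()

# Area under the precision-recall curve (average precision)
auprc = average_precision_score(y_val, y_score)

# F1 varies along the precision-recall curve; the default 0.5 cut-off is rarely optimal
precision, recall, thresholds = precision_recall_curve(y_val, y_score)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])  # the final precision/recall pair has no associated threshold
print(f"AUPRC = {auprc:.3f}; best F1 = {f1[best]:.3f} at threshold {thresholds[best]:.3f}")
```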

Network Architecture

Finally, we would like to comment on the neural network model used in this article. The authors used a multilayer perceptron (probably a more accurate and less gimmicky description than deep neural network) containing 4 hidden layers, each with 128 nodes. That is quite a large network for a simple binary classification task. The authors stated in the methods section that they tested two layer widths (128 and 256 nodes) during their optimization process, but it is unclear why they chose 128 nodes as the starting point. While there are no strict rules on neural network architecture, a common rule of thumb is to set the number of neurons in each hidden layer somewhere between the size of the input layer (41 in this case) and the size of the output layer (1 in this case). We wonder whether the authors could have achieved similar (or even better) results with a more efficiently designed network. The authors also did not appear to have performed any optimization of the Adam optimizer settings or the epoch count, both of which can significantly affect the performance of the model. The authors should also specify the normalization and encoding schemes used to preprocess numerical and categorical values in the methods section, because these are necessary for reviewers and readers to reproduce the model.
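Since the normalization and encoding schemes are not reported, the sketch below shows one common (assumed) choice, reusing the hypothetical `numeric_cols`, `categorical_cols`, and `df` from the earlier sketches; the authors' actual scheme may well differ, which is exactly why it should be stated in the paper.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One plausible scheme: z-score the numerical features, one-hot encode the categoricals
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X_processed = preprocessor.fit_transform(df)
```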

Summary

This article attempted to address the shortcomings of the MELD-Na score, namely its poor discriminatory power at low ranges and its inability to predict long-term mortality, using machine learning techniques. The choice of a logistic regression model, random forest model, and artificial neural network was appropriate given the binary classification task. The authors provided sufficient details in the method section to help reviewers and readers reproduce the models. The authors took into account the issue of class imbalance by implementing the SMOTE technique. Unfortunately, the overall conclusion of the article that machine learning models outperform the MELD-Na score in predicting mortality was not actually supported by the results of this study. The article also made several irrelevant and unsubstantiated comparisons regarding the performance of the three modeling techniques based on presumably suboptimal F1 scores. Overall, the article was able to successfully demonstrate the use of machine learning models in predicting short-term and long-term mortality in cirrhotic patients with reasonable discriminatory power. Other researchers should attempt to externally validate these models by applying them to other datasets.

Learning Points

  • A full model (a model constructed with all available features) will always perform at least as well as a partial model (a model constructed with only a subset of features), provided both models are properly optimized.
  • The choice of machine learning modeling technique is based on many factors in addition to the discriminatory power of the resultant model. Comparing the performance of different models constructed using different techniques but trained on the same dataset does not add any scientific value.
  • Certain performance metrics, such as the area under the ROC curve and accuracy, can be adversely affected by class imbalance, whereas others, including the area under the precision-recall curve, balanced accuracy, and the F1 score, are better suited to imbalanced data.
