Chronic Kidney Disease Prediction: A Fresh Perspective

Utilizing SHAP to build an interpretable model that is consistent with medical literature

Diksha Sen Chaudhury
Towards Data Science

Photo by Robina Weermeijer on Unsplash

Introduction

The kidneys remove wastes, toxins, and excess fluid from the blood, and their proper functioning is crucial for good health. Chronic Kidney Disease (CKD) is a condition in which the kidneys cannot filter blood as well as they should, so fluid and waste build up in the blood. In the long term, this can lead to renal failure. [1] CKD affects more than 10% of the global population and is predicted to be the fifth highest cause of years of life lost globally by 2040. [2]

In this article, my objective is not to build the most accurate model for predicting the occurrence of CKD in patients. Instead, it is to check whether the best model developed using standard machine learning algorithms is also the most meaningful model according to medical literature. To do this, I use SHAP (SHapley Additive exPlanations), a game-theoretic approach for explaining the output of an ML model.

What does Medical Literature say?

Medical literature has associated the development and progression of CKD with a few key risk factors and clinical findings.

  1. Diabetes mellitus and Hypertension: Diabetes and hypertension are two of the most important risk factors associated with CKD. In a study conducted in the USA from 2011–2014, the prevalence of CKD (stages 3–4) was found to be 24.5% in diabetics, 14.3% in prediabetics, and 4.9% in non-diabetics. In the same study, the prevalence of CKD was observed to be 35.8% in hypertensive individuals, 14.4% in prehypertensive individuals, and 10.2% in non-hypertensive individuals. [2]
  2. Decreased hemoglobin and red blood cell levels: The kidneys produce a hormone called erythropoietin (EPO), which helps in the production of red blood cells. In CKD, the kidneys are unable to produce sufficient EPO, leading to the development of anemia, i.e., a drop in the level of red blood cells and thereby hemoglobin in the blood. [3]
  3. Increased serum (blood) creatinine: Creatinine is a waste product of normal muscle and protein breakdown, and excess is removed from the blood via the kidneys. In CKD, the kidney is unable to effectively remove the excess creatinine, leading to high levels in the blood. [4]
  4. Decreased urine specific gravity: The specific gravity of urine is an indicator of how well the kidney can concentrate urine. Patients suffering from CKD have decreased urine specific gravity since the kidneys lose their ability to effectively concentrate urine. [5]
  5. Hematuria and Albuminuria: Hematuria and Albuminuria refer to the presence of red blood cells and albumin in urine respectively. Normally, the filters in the kidneys prevent blood and albumin from entering the urine. However, impairment to these filters can cause blood (or red blood cells) and albumin to enter the urine. [6][7]

The Dataset

The dataset used for this article is the ‘Chronic Kidney Disease’ dataset available on Kaggle, initially provided by UCI under their ML repository. It consists of data from 400 patients, including 24 features and 1 binary target variable (CKD absent = 0, CKD present = 1). A detailed description of the features can be found here.

Data Preprocessing

The CKD dataset had a lot of missing values that needed to be imputed before further analysis. This plot shows a visual representation of the missing data, with the yellow lines indicating missing values in that column.

Visual representation of the missing data (indicated by the yellow lines)

The missing values were imputed in the following ways:

  1. For numerical features, missing values were filled in using the median rather than the mean. These columns contain outliers, and the mean is sensitive to outliers whereas the median is largely unaffected by them, which makes the median a better measure of the central value here.
  2. The categorical features ‘rbc’ and ‘pc’ were missing 38% and 16.25% of their data respectively. Since this is a large chunk of missing data, the missing values were filled in as ‘unknown’. Using the mode here would not be the best decision as it would be a bit risky to categorize such a large group of observations into the same category.
  3. All other categorical features were missing less than or equal to 1% of their data. Thus, the missing values were filled in using their respective modes.
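The three imputation rules above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code from my notebook: the helper name `impute`, the toy records, and the choice of 'htn' as the mode-filled column are assumptions made for the example. The records are made up and are not taken from the CKD dataset.

```python
from collections import Counter
from statistics import median


def impute(rows, numeric_cols, unknown_cols, mode_cols):
    """Return a copy of rows (a list of dicts) with None values filled in."""
    filled = [dict(r) for r in rows]  # leave the input untouched

    # Rule 1: numerical features are filled with the column median.
    for col in numeric_cols:
        fill = median(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = fill

    # Rule 2: heavily missing categorical features get their own category.
    for col in unknown_cols:
        for r in filled:
            if r[col] is None:
                r[col] = "unknown"

    # Rule 3: sparsely missing categorical features are filled with the mode.
    for col in mode_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        fill = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[col] is None:
                r[col] = fill

    return filled


# Toy records (made up for illustration, not real patients).
rows = [
    {"sc": 1.0, "rbc": "normal", "htn": "no"},
    {"sc": 1.2, "rbc": None, "htn": "no"},
    {"sc": None, "rbc": "abnormal", "htn": None},
    {"sc": 15.0, "rbc": None, "htn": "yes"},  # 15.0 is an outlier
]
clean = impute(rows, numeric_cols=["sc"], unknown_cols=["rbc"], mode_cols=["htn"])
print(clean[2])
```

In the toy data, the single outlier (15.0) would pull the mean of 'sc' up to about 5.7, while the median stays at 1.2, which is why the median is the safer fill value for skewed lab measurements.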

Building the Model and Checking the Interpretability Using SHAP

After filling in the missing values, the data was split into train and test sets (70–30 split) and a simple Random Forest classifier was trained on the training set. The test accuracy was 100%, i.e., the model correctly classified every patient in the test set, none of whom it had seen during training. The confusion matrix is shown below.

Confusion matrix generated when the model was run on test data
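For readers who want to see how a confusion matrix and the accuracy relate, here is a small standard-library sketch. The function names and the toy label lists are assumptions for illustration only; they are not the predictions of the Random Forest model. Rows index the actual class and columns the predicted class, with 0 = CKD absent and 1 = CKD present.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def accuracy(matrix):
    """Share of predictions on the diagonal (correct classifications)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels (made up): a perfect classifier puts every count on the diagonal.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm, accuracy(cm))
```

An accuracy of 100% therefore corresponds to a matrix whose off-diagonal cells (false positives and false negatives) are both zero.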

Now of course we have a great classification model. But what if we were interested in interpretability, i.e., how each feature contributes positively or negatively to the prediction? What are the most important features that drive the predictions? Are the results in accordance with clinical findings? These are questions that SHAP can help us answer.

SHAP is a mathematical approach based on game theory that can be used to explain the prediction of any ML model by calculating the contribution of each feature to the prediction. It can help us determine the most important features driving the prediction and the direction in which they influence the target variable. [8] A SHAP explainer was applied to the test data and a global feature importance plot was generated as shown below.

Global feature importance plot generated using SHAP
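To make the game-theoretic idea behind SHAP concrete, the sketch below computes exact Shapley values for a tiny, hypothetical risk score by averaging each feature's marginal contribution over every ordering of the features. It is not the algorithm the SHAP library uses for tree models, and `toy_risk`, its thresholds, and the patient and reference values are all made up for illustration; they are not clinical cut-offs and not the model from this article.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance.

    A feature that is 'absent' from a coalition keeps its baseline value.
    The cost grows factorially with the number of features, so this is
    only practical for toy examples.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)      # start with every feature absent
        previous = predict(current)
        for name in order:            # switch features on one at a time
            current[name] = instance[name]
            value = predict(current)
            totals[name] += value - previous  # marginal contribution
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_risk(x):
    """Hypothetical risk score with made-up thresholds."""
    score = 0.1
    if x["hemo"] < 12:
        score += 0.4
    if x["sg"] < 1.015:
        score += 0.3
    if x["hemo"] < 12 and x["sg"] < 1.015:
        score += 0.1  # interaction between the two features
    return score


patient = {"hemo": 9.5, "sg": 1.010}     # made-up patient
reference = {"hemo": 15.0, "sg": 1.020}  # made-up baseline
phi = shapley_values(toy_risk, patient, reference)
print(phi)
```

The two contributions add up to the difference between the patient's score and the baseline score, and the interaction term is shared equally between the two features. This additivity is what lets SHAP split a single prediction into per-feature contributions.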

The top three features driving the prediction are hemoglobin levels (‘hemo’), the specific gravity of urine (‘sg’), and whether the patient had red blood cells in their urine (‘rbc_normal’). Since the feature importance is calculated by taking the mean of the absolute SHAP value for that feature over all given samples, the plot only provides information regarding the order of importance and not the direction of influence. Let us produce a more informative plot that encapsulates both these objectives.

Beeswarm plot generated using SHAP

This beeswarm plot is a great way to show how the top features in a dataset impact the model’s prediction. Each dot represents one patient in the test data. Its horizontal position is the SHAP value of that feature for that patient, i.e., how strongly and in which direction the feature pushed the prediction, and its color encodes the value of the feature itself: pink for high values and blue for low values. Now that we know the top features driving the prediction, let us see if their direction of influence is in accordance with the clinical findings presented earlier in this article.

  1. The presence of diabetes mellitus (‘dm_yes’) and hypertension (‘htn_yes’) is associated with the presence of CKD. This matches the clinical findings, although one might expect them to rank higher in global importance since they are major risk factors associated with CKD.
  2. Low hemoglobin levels (‘hemo’), low packed cell volume (‘pcv’: the volume percentage of red blood cells in the blood), and low red blood cell count (‘rc’) are associated with CKD. This also matches clinical findings, as patients suffering from CKD are unable to produce sufficient levels of RBCs.
  3. Having a low urine specific gravity (‘sg’) is associated with CKD, which can be explained clinically as the kidneys lose their ability to concentrate urine.
  4. Having high albumin in the urine (‘al’) and high serum creatinine (‘sc’) levels are associated with CKD, which is in accordance with clinical findings as the kidneys lose their ability to filter blood effectively.
  5. The presence of red blood cells in urine or abnormal urine (‘rbc_normal’; a binary categorical feature where value = 1 suggests normal urine with no RBCs and value = 0 suggests abnormal urine which might contain RBCs) is associated with CKD. This supports clinical findings as hematuria is more commonly found in patients suffering from CKD.

In summary, the top features and their directions of influence on prediction are in accordance with medical literature.
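Both quantities discussed above, the global importance (mean absolute SHAP value) and the direction of influence, can be derived from a table of per-patient SHAP values. The sketch below shows one way to do that with the standard library. The SHAP values and feature values are made-up numbers, and summarising direction by the sign of the covariance between feature value and SHAP value is a simplification I chose for the example, not something the SHAP library reports.

```python
from statistics import mean


def global_importance(shap_rows):
    """Mean absolute SHAP value per feature, most important first."""
    features = list(shap_rows[0])
    scores = {f: mean(abs(row[f]) for row in shap_rows) for f in features}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def direction(feature, value_rows, shap_rows):
    """Sign of the covariance between feature values and SHAP values.

    'positive' means higher feature values push the prediction up,
    'negative' means higher feature values push it down.
    """
    values = [row[feature] for row in value_rows]
    contribs = [row[feature] for row in shap_rows]
    v_bar, c_bar = mean(values), mean(contribs)
    cov = sum((v - v_bar) * (c - c_bar) for v, c in zip(values, contribs))
    if cov > 0:
        return "positive"
    if cov < 0:
        return "negative"
    return "flat"


# Made-up feature values and SHAP values for four patients.
value_rows = [
    {"hemo": 9.0, "sc": 4.0},
    {"hemo": 10.5, "sc": 2.5},
    {"hemo": 15.0, "sc": 0.9},
    {"hemo": 16.0, "sc": 1.0},
]
shap_rows = [
    {"hemo": 0.30, "sc": 0.15},
    {"hemo": 0.20, "sc": 0.10},
    {"hemo": -0.25, "sc": -0.05},
    {"hemo": -0.30, "sc": -0.05},
]

ranking = global_importance(shap_rows)
print(ranking)
for name, _ in ranking:
    print(name, direction(name, value_rows, shap_rows))
```

In this toy table, 'hemo' ranks first and has a negative direction (low hemoglobin pushes towards CKD), while 'sc' has a positive direction (high serum creatinine pushes towards CKD), mirroring the pattern read off the beeswarm plot.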

Conclusion

This article has two main takeaways:

  1. Medical literature has associated the development and progression of CKD with the same top features that the ML model uses to classify whether a patient is predicted to have CKD.
  2. The direction in which these top features influence the target variable supports clinical findings, suggesting that the model is not only 100% accurate on the test set but also medically meaningful, with predictions that can be interpreted feature by feature.

One possible limitation of this study is the small sample size. Once more data is available, the model should be tested on a larger pool of patients to check whether it continues to perform with high accuracy. It would also be interesting to see if the order of importance of the features changes for a larger group of patients.

In the medical field, the most accurate model may not always be the most meaningful model. In this study, SHAP was used to check whether our model is in accordance with medical literature. The advantage of the resulting model is that it is not only highly accurate on this dataset but also easily interpretable and supported by clinical findings. A model like this could be useful in telemedicine, where it could help flag patients who are likely to have CKD for further testing. Future studies could look into individual observations to see which features are driving the prediction at an individual level.

The code for this project can be found here. All images in the body of this article have been generated by me via Google Colab.

References

License for the Original Dataset: L. Rubini, P. Soundarapandian and P. Eswaran, Chronic_Kidney_Disease (2015), UCI Machine Learning Repository (CC BY 4.0)

‘Chronic Kidney Disease’ Dataset on Kaggle: https://www.kaggle.com/datasets/mansoordaku/ckdisease

Original SHAP Documentation: https://shap.readthedocs.io/en/latest/api_examples.html#plots

[1] Chronic Kidney Disease Basics (2022), Centers for Disease Control and Prevention

[2] C.P. Kovesdy, Epidemiology of chronic kidney disease: an update 2022 (2022), Kidney International Supplements

[3] H. Shaikh, M.F. Hashmi and N.R. Aeddula, Anemia of Chronic Renal Disease (2023), National Library of Medicine

[4] Serum (blood) creatinine (2023), National Kidney Foundation

[5] J.A. Simerville, W.C. Maxted and J.J. Pahira, Urinalysis: A Comprehensive Review (2005), American Family Physician

[6] P.F. Orlandi, et al., Hematuria as a risk factor for progression of chronic kidney disease and death: findings from the Chronic Renal Insufficiency Cohort (CRIC) Study (2018), BMC Nephrology

[7] Albuminuria (2016), National Institute of Diabetes and Digestive and Kidney Diseases

[8] R. Bagheri, Introduction to SHAP Values and their Application in Machine Learning (2022), Towards Data Science


Undergraduate at The University of British Columbia | Studying Biochemistry and Statistics | Exploring Applications of Data Science in Healthcare