
Developing machine learning models for non-invasive digital health wearables

A review on feature selection, biomarker identification, and disease prediction

Photo by FLOUFFY on Unsplash

Introduction

Precision medicine, combined with big health data, has created an opportunity in health monitoring that lets people take more control of their health. Digital health wearables like smartwatches add a powerful dimension to health monitoring. Instead of relying on snapshots of information obtained during doctor's visits and scheduled check-ups, digital health wearables make it possible to monitor health continuously. With this technology, healthcare providers can gain a more complete insight into a patient's health and disease progression.

There is a race to create the ultimate smartwatch. Soon, the smartwatch will not only monitor blood pressure, oxygen levels, body temperature, physical activity, sleep patterns and glucose, but will also predict several disease conditions.

An increase in such capabilities will mean an increase in personal health data. The development of wearable devices relies on identifying and developing sensors that can detect biomarkers. The data obtained can then be used to train robust machine learning (ML) algorithms. This project uses machine learning models to select features, identify biomarkers and predict diabetes mellitus. It also aims to illustrate the process of developing robust machine learning models for digital health wearables.

Data and model overview

This section provides an overview of the global COVID-19 patient admission health data: the dataset, the features, and the predictive models.

Dataset: The dataset used for training and evaluating the model consists of routine health records from COVID-19 patient admissions around the world. It was assembled to predict diabetes and to use diabetes status as an indicator of COVID-19 severity. The raw training data contained 180 features and 150,000 observations. Columns with more than 40,000 missing values, along with unique-identifier columns, were removed, leaving a dataset of 100 features.
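To illustrate this cleaning step, here is a minimal sketch in pandas; the file name, missing-value threshold and identifier column names are assumptions for illustration, not the exact code used in the project.

```python
import pandas as pd

# Load the admissions data (hypothetical file name).
df = pd.read_csv("covid_admissions.csv")

# Drop columns with more than 40,000 missing values.
missing_counts = df.isna().sum()
df = df.drop(columns=missing_counts[missing_counts > 40_000].index)

# Drop unique-identifier columns (hypothetical names).
df = df.drop(columns=["encounter_id", "patient_id", "hospital_id"], errors="ignore")
```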

Target: Diabetes mellitus is the prediction target. The classes were imbalanced: 78.4% of patients were non-diabetic and 21.7% were diabetic. A model trained on the imbalanced data predicted non-diabetics with about 85% accuracy but diabetics with only about 65% accuracy. The classes were therefore rebalanced by under-sampling the non-diabetic class, leaving 56,302 observations.
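A simple way to under-sample the majority class is with pandas. The sketch below assumes the target column is named diabetes_mellitus with 1 marking diabetics, which may differ from the project's actual column name.

```python
# Split by class and down-sample the non-diabetic majority to match the minority.
diabetic = df[df["diabetes_mellitus"] == 1]
non_diabetic = df[df["diabetes_mellitus"] == 0].sample(n=len(diabetic), random_state=42)

# Recombine and shuffle; the article reports 56,302 observations after this step.
balanced = pd.concat([diabetic, non_diabetic]).sample(frac=1, random_state=42)
```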

Predictor variables

The hospital data contained 100 features, which were reduced to 49 after feature engineering and data wrangling. The features fall into five groups: ICU status, disease status, blood chemistry, biometrics and physiology.

The resulting dataset was then split into training and validation sets at a ratio of 80:20, stratified on the target variable.
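With scikit-learn, the stratified 80:20 split looks like this (column names carried over from the sketches above):

```python
from sklearn.model_selection import train_test_split

X = balanced.drop(columns=["diabetes_mellitus"])
y = balanced["diabetes_mellitus"]

# 80:20 split, stratified so both sets keep the same class balance.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
```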

Predictive model

Random forest and XGBoost classifier models were used to predict diabetes mellitus. Feature importances for the random forest classifier were computed using ELI5. Features with an importance of zero or below were removed because they did not contribute to the model, leaving 39 features. These features and the target were then used to train the XGBoost classifier.

Table 1: Feature importances (top 14 features) (Table by Author)
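A hedged sketch of this modelling step, using ELI5's permutation importance with a random forest and then XGBoost on the reduced feature set; the hyperparameters here are placeholders rather than the project's tuned values.

```python
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Fit a random forest and score each feature with permutation importance.
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
perm = PermutationImportance(rf, random_state=42).fit(X_val, y_val)
eli5.show_weights(perm, feature_names=X_val.columns.tolist(), top=14)

# Keep only features with positive importance, then train the XGBoost classifier.
keep = X_train.columns[perm.feature_importances_ > 0]
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, eval_metric="logloss")
xgb.fit(X_train[keep], y_train)
```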

Performance metrics

Accuracy, precision and recall: The baseline accuracy, which equals the majority class percentage after resampling, was 50%. The accuracy of the XGBoost classifier model was 75%. The recall and precision for the diabetic class (1) were 0.74 and 0.75, respectively, as shown in the classification report below.

Classification report for the XGBoost classifier model (Image by Author)
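These metrics can be reproduced with scikit-learn's classification report, assuming the model and validation split from the sketches above:

```python
from sklearn.metrics import accuracy_score, classification_report

y_pred = xgb.predict(X_val[keep])
print("Accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred, target_names=["non-diabetic (0)", "diabetic (1)"]))
```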

AUC-ROC

The AUC-ROC score of the XGBoost model is 0.82, so the model is good at distinguishing diabetic from non-diabetic observations. The diagram below shows the ROC curve. Lowering the classification threshold to 0.38 increases recall for class 1 (diabetic) from 74% to 84%, while accuracy drops to 0.70 at this threshold. For disease prediction, recall is especially important because the goal is to identify everyone who has the condition, even at the cost of some false positives.

ROC curve (Image by Author)
Setting the threshold probability to 0.38 to increase recall (Image by Author)
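A sketch of the AUC computation and the lowered decision threshold, continuing with the assumed names from the earlier sketches:

```python
from sklearn.metrics import roc_auc_score, roc_curve, recall_score, accuracy_score

# Predicted probability of the diabetic class on the validation set.
proba = xgb.predict_proba(X_val[keep])[:, 1]
print("AUC-ROC:", roc_auc_score(y_val, proba))

# Lower the decision threshold from 0.5 to 0.38 to trade precision for recall.
y_pred_038 = (proba >= 0.38).astype(int)
print("Recall (class 1):", recall_score(y_val, y_pred_038))
print("Accuracy:", accuracy_score(y_val, y_pred_038))

# Points for plotting the ROC curve shown above.
fpr, tpr, thresholds = roc_curve(y_val, proba)
```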

Partial dependence plots

The partial dependence plot (PDP) for glucose in the diabetes mellitus prediction is shown in the graph below. The PDP shows that the predicted probability decreases as glucose rises from 50 mg/dL to about 100 mg/dL, then increases up to about 250 mg/dL before levelling off.

(Image by Author)
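A partial dependence plot like the one above can be produced with scikit-learn's inspection module (version 1.0 or later); the glucose column name "d1_glucose_max" below is an assumption for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# One-way partial dependence of the predicted probability on glucose.
PartialDependenceDisplay.from_estimator(xgb, X_val[keep], features=["d1_glucose_max"])
plt.show()
```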

Shapley values

The figure below shows Shapley values for a 68-year-old patient. The model output of 0.19 classifies this person as diabetic because it is below the threshold of 0.5; in this formulation, outputs above 0.5 correspond to non-diabetic predictions. The features that lower the output, and hence increase the chance of being classified as diabetic, include glucose, BMI, gcs_unable_apache, haemoglobin and creatinine. The features that raise the output, and hence increase the chance of being classified as non-diabetic, include age, icu_los_days and WBC.

Shapley results of a 68-year-old individual (Image by Author)
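A force plot like the one above can be generated with the shap library; the row index below is illustrative rather than the specific patient shown.

```python
import shap

# TreeExplainer works directly with the trained XGBoost model.
explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_val[keep])

# Force plot for a single validation-set observation.
i = 0
shap.force_plot(
    explainer.expected_value, shap_values[i, :], X_val[keep].iloc[i, :], matplotlib=True
)
```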

Findings

Finding one: Healthcare providers used diabetes status as an indicator of how severe a patient's COVID-19 disease was likely to be. Blood glucose level contributes the most to this prediction. Digital health products that can measure glucose levels non-invasively therefore offer consumers an opportunity not only to manage their blood sugar but also to gauge the likely severity of diseases caused by viruses like COVID-19.

Finding two: Digital health wearables allow healthcare providers to gain a more complete insight into their patients' health and enable people to take a proactive approach to their own health. As seen in this prediction, the top two predictors of diabetes are glucose and BMI, both of which can be changed through lifestyle adjustments.

Finding three: The biomarker haemoglobin, the fourth-largest contributor to the diabetes prediction, can be measured non-invasively using sensors. This biomarker is routinely used in healthcare for detecting anaemia. Mining health data with machine learning can help re-purpose such biomarkers for the prediction of other diseases, making digital health wearables more robust.

Conclusion

Machine learning models are instrumental in biomarker research for digital health wearables and increase the robustness of sensors by re-purposing features for the prediction of multiple conditions.

Find the GitHub repo here: https://github.com/Fellylove/Biotech/blob/master/BUILDWEEK2_SBurris.ipynb


