Understanding Patient Hospital Stays: A Classification and Clustering Analysis in R

An application of machine learning methods to patient hospitalization records

Duncan W.
Towards Data Science


Co-author: Sophie Courtemanche-Martel


The effective management of patient hospital stays is one of the most challenging yet paramount priorities of modern healthcare systems. Managing patient admissions and stays is currently estimated to cost the US over $377.5 billion each year [1], and long-term hospital stays risk overwhelming hospitals with limited capacity and resources. Research demonstrates that longer stays can increase the probability of hospital-acquired conditions and infections [2], while decreasing a patient's length of stay (LOS) allows hospitals to allocate resources more effectively and decreases the social and economic burden borne by patients.

Machine learning and data mining techniques have been increasingly used to draw insights from healthcare data, which can aid public health agencies and hospitals in managing human and financial resources. As such, there is significant value in being able to apply advanced analytics techniques to patient hospital records — such as using patient data to predict the length of a hospital stay, or to better understand the utilization of hospital resources by different patient groups.

Objective

The Medical Information Mart for Intensive Care (MIMIC) database is one source of patient data which provides a comprehensive stream of patient hospitalization records that have proven useful in past descriptive, predictive, and hypothesis-driven studies. Developed by the MIT Laboratory for Computational Physiology, the database contains anonymized entries for roughly 60,000 intensive care unit admissions at the Beth Israel Deaconess Medical Center, a teaching hospital of Harvard Medical School located in Boston, Massachusetts.

For this analysis, we used data extracted from the MIMIC database to build two models in order to generate both predictive and exploratory insights regarding patient hospital stays. We first built a machine learning classification model to predict the categorical length of a patient’s hospital stay, given a patient’s observable characteristics at time of admission. Then, we used unsupervised learning techniques to cluster patients based on the number of various patient-caretaker interactions — such as procedures, inputs taken, and drugs prescribed — which can quantify the amount of human or physical resources used by a patient during their stay.

By combining these models, our aim is to explore both characteristics inherent to a patient’s background and condition upon admission, and attributes associated with a patient’s treatment process and resource consumption. These indicators can be used to benchmark and increase the effectiveness of medical service provision, enable the proactive allocation of critical healthcare resources, and potentially reduce the length of unnecessary stays.


Note that the models were built in R; the full project code is available on GitHub (linked at the end of this article).

Data Description

The MIMIC-III database contains more than 50,000 stays for adult patients and 8,000 neonatal patients recorded between June 2001 and October 2012.

For the purposes of the analysis, we utilized a subset of attributes from the MIMIC-III database compiled by Dr. Alexander Scarlat. This dataset contains information on patient demographics, admission details, and patient condition, as well as the average daily number of various patient-caretaker interactions. In addition, the data includes the total length of a patient's stay and whether or not the patient passed away during their stay.
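As a concrete starting point, loading and inspecting the extracted dataset in R might look like the sketch below. The file name and column names are illustrative assumptions, not the exact ones from the linked repository.

```r
# Hypothetical loading step: read the pre-extracted MIMIC-III subset.
# The file name and column names are illustrative assumptions.
mimic <- read.csv("mimic3_subset.csv", stringsAsFactors = TRUE)

str(mimic)              # inspect attribute names and types
summary(mimic$LOSdays)  # distribution of the raw length-of-stay variable
```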

Here’s a subset of the attributes:

Taking a Look at the Data

Even though the pre-prepared dataset extracted from the MIMIC-III database was relatively clean, we still found it necessary to perform further preprocessing to facilitate the analysis. Class variations of the same category, such as "GI" vs. "Gastrointestinal", were regrouped into a single class, and several high-cardinality categorical variables were reclassified to reduce the number of classes. Our key variable of interest, patient Length of Stay (LOS), ranged from 0 to 294 days. To categorize the target for the classification task, we regrouped LOS into three classes with a comparable number of observations in each (a sketch of this recoding in R follows the list):

  • Short stays: 0–5 days
  • Medium stays: 6–10 days
  • Long stays: greater than 10 days
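As referenced above, here is a minimal sketch of the recoding in R, assuming the raw length-of-stay column is named LOSdays (the name is an assumption):

```r
# Bin the continuous LOS (0-294 days) into three roughly balanced classes.
# The `mimic` data frame and LOSdays column are illustrative assumptions.
mimic$los_class <- cut(
  mimic$LOSdays,
  breaks = c(-Inf, 5, 10, Inf),        # (0-5], (5-10], (10+]
  labels = c("short", "medium", "long")
)

table(mimic$los_class)  # check that the classes are comparably sized
```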

By looking at the various patient attributes, we see below that older patients more frequently experienced medium to long stays, whereas younger patients had the shortest hospital stays. While there is considerable variation across patients, it also seems that white patients tend to be older overall when compared to all other ethnicities.

Patient length of stay (LOS) by age, across various patient ethnicities. Figure by authors.
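For reference, a figure along these lines can be sketched with ggplot2, assuming age and ethnicity columns alongside the los_class factor created earlier (column names are illustrative):

```r
library(ggplot2)

# Age distribution within each LOS class, faceted by ethnicity
ggplot(mimic, aes(x = los_class, y = age)) +
  geom_boxplot() +
  facet_wrap(~ ethnicity) +
  labs(x = "Length of stay class", y = "Age at admission")
```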

We also see that most patients were covered by either Medicare or private insurance, and the majority of patients were white. The most common admission diagnosis was for childbirth, which might explain the high number of emergency admissions but low mortality rates observed.

Distributions of various patient characteristics. Figure by authors.
Frequency of patient admissions by admission diagnosis. Figure by authors.

Model Building

Classification Model

After data preparation, our first task was to predict the length of a patient’s hospital stay — as either short (0–5 days), medium (6–10 days), or long term (more than 10 days).

Feature selection

After eliminating invalid predictors (i.e., variables that can only be observed once the patient has been admitted to the hospital), we were left with only categorical variables, with the exception of age. To perform feature selection, we conducted Chi-Squared tests of independence, a hypothesis test used to establish whether a significant relationship exists between two categorical variables. Then, for the features identified as significantly related to LOS, we used Bonferroni-adjusted post-hoc tests to perform pairwise comparisons and identify which specific categories of each significant predictor were significantly related to LOS. This left us with the following variables, which were dummified prior to classification (a sketch of these tests in R follows the feature list):


List of features and classes selected for classification.
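A minimal sketch of this feature-selection step, assuming the mimic data frame from earlier and illustrative column names (not the authors' exact variable set):

```r
# Chi-squared test of independence between each candidate predictor
# and the LOS class. Column names are illustrative assumptions.
candidates <- c("admission_type", "admission_location", "ethnicity",
                "marital_status", "religion")
pvals <- sapply(candidates, function(v) {
  chisq.test(table(mimic[[v]], mimic$los_class))$p.value
})
names(pvals[pvals < 0.05])  # predictors significantly related to LOS

# Bonferroni-adjusted post-hoc tests: pairwise chi-squared tests
# between the category levels of one significant predictor
levs  <- levels(factor(mimic$admission_type))
pairs <- combn(levs, 2, simplify = FALSE)
posthoc <- sapply(pairs, function(p) {
  sub <- mimic[mimic$admission_type %in% p, ]
  chisq.test(table(factor(sub$admission_type), sub$los_class))$p.value
})
p.adjust(posthoc, method = "bonferroni")

# Dummify the retained categorical predictors for modeling
X <- model.matrix(~ admission_type + ethnicity - 1, data = mimic)
```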

Model Training

To predict the length of a patient's stay, we decided to explore the following three classification models and compare their relative performance:

  • Multinomial Logistic Regression (MLR)
  • Random Forest (RF)
  • Gradient Boosting Machine (GBM)

In simple terms, the models we tested generate a LOS prediction by calculating the probability that a patient record belongs to each LOS class (short/medium/long), and assigning the patient the class with the highest predicted probability.

MLR does this using maximum likelihood estimation: it defines a likelihood function for the conditional probability of observing a given outcome under an assumed probability distribution, then searches for the coefficients that maximize that likelihood. RF and GBM, on the other hand, are ensemble learning methods that construct a number of decision trees and aggregate their individual predictions into a final LOS class. The key difference between the two is how the trees are grown: RF constructs many full trees in parallel, each with low individual bias (error) but potentially high prediction variance, and averages across them to reduce that variance. GBM instead grows many small trees sequentially; each small tree has high bias on its own, but learns from and corrects the errors of the trees before it, so the ensemble's overall bias is progressively reduced.
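As a hedged sketch of how the three models might be trained and compared in R (using the caret wrapper for convenience; the data frame names and tuning setup are assumptions, not necessarily the authors' exact configuration):

```r
library(caret)  # wraps nnet, randomForest, and gbm as back-ends

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

# Multinomial logistic regression
mlr_fit <- train(los_class ~ ., data = train_df,
                 method = "multinom", trControl = ctrl, trace = FALSE)

# Random forest: many full trees grown independently, then averaged
rf_fit <- train(los_class ~ ., data = train_df,
                method = "rf", trControl = ctrl)

# GBM: many shallow trees grown sequentially, each correcting the
# errors of its predecessors
gbm_fit <- train(los_class ~ ., data = train_df,
                 method = "gbm", trControl = ctrl, verbose = FALSE)

# Compare accuracy on a held-out test set
preds <- list(MLR = predict(mlr_fit, test_df),
              RF  = predict(rf_fit,  test_df),
              GBM = predict(gbm_fit, test_df))
sapply(preds, function(p) mean(p == test_df$los_class))
```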

K-Means Clustering Model

To complement the classification task, we also decided to explore unsupervised methods by clustering patients by the average daily number of various patient-caretaker interactions, as a measure of the quantity of human or physical resources utilized on patients during their stay. To simplify the process, we targeted a priority group of 1,349 "emergency" patient admissions under the following five most common admission diagnoses: gastrointestinal bleed, coronary artery disease, pneumonia, sepsis, and congestive heart failure.

We subsequently selected eight quantitative variables to cluster patients by, which are measured on an average daily scale:

Selected features for K-Means clustering model.

Because the variables span different ranges, we first scaled them using min-max normalization before applying the K-Means clustering method. After assigning each patient to a cluster, we applied Principal Component Analysis (PCA) to the dataset for dimensionality reduction, which let us visualize the clusters along the top two principal components.
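A minimal sketch of this pipeline, assuming interactions is a data frame holding the eight average-daily interaction counts (the object name is an assumption):

```r
# Min-max normalization so every variable shares a 0-1 range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
scaled  <- as.data.frame(lapply(interactions, min_max))

# K-means with three clusters; nstart repeats the algorithm from
# multiple random initializations and keeps the best solution
set.seed(42)
km <- kmeans(scaled, centers = 3, nstart = 25)

# Share of total variation explained by between-cluster separation
km$betweenss / km$totss

# PCA for visualization: project patients onto the top two components
pca <- prcomp(scaled)
plot(pca$x[, 1], pca$x[, 2], col = km$cluster,
     xlab = "PC1", ylab = "PC2",
     main = "Patient clusters by daily patient-caretaker interactions")
```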

Classification of patients by length of hospital stay

From our classification models, we can see that the Gradient Boosting Machine marginally outperformed the other two models tested. Since GBM uses an ensemble approach that grows classification trees sequentially, with each tree using knowledge gained from previous trees to improve aggregate performance, it generally outperforms simple logistic regression and random forest. However, the overall performance of all three models was poor; in fact, we weren't able to obtain an accuracy much higher than that of a random guess.

Accuracy of models used to predict patient Length of Stay (LOS).

These results suggest that it is difficult to predict the length of a hospital stay based on patient characteristics, diagnosis, and admission circumstances alone. This is unsurprising, given that the length of a hospital stay may be determined by many factors exogenous to the patient, such as the availability of physicians and specialized equipment, the effectiveness of managing patient cases, or cases where the treatment of an initial condition leads to the discovery of additional conditions requiring treatment.

We also generated confusion matrices to further break down our results and examine the precision and recall of the models.

As a quick refresher: Recall measures a classifier's completeness, i.e., the proportion of all true short, medium, or long stays that the model correctly identifies as such. Precision measures exactness, i.e., the proportion of observations predicted as a given LOS class that truly belong to that class.

Confusion matrix comparisons of MLR, RF, and GBM models.
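These metrics are straightforward to tabulate in R. Reusing the test-set predictions from the modeling sketch above, caret's confusionMatrix reports the confusion table along with per-class precision and recall:

```r
library(caret)

# Confusion matrix for the GBM predictions from the earlier sketch
cm <- confusionMatrix(data = preds$GBM, reference = test_df$los_class)

cm$table                                # predicted vs. actual counts
cm$byClass[, c("Precision", "Recall")]  # per-class precision and recall
```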

The confusion matrices reveal that all models performed best at identifying short stays. However, this appears to be due to the models' tendency to overwhelmingly predict short stays over any other class, as they were much less likely to identify medium and long-term stays when those did occur. Relying on these models, hospitals would most likely underestimate the duration of patient stays, which could lead to inadequate preparation of beds, staff, and resources to handle realized capacity needs. Given the high stakes of acting on misleading predictions when allocating healthcare resources, we can infer that a patient's admission characteristics alone do not give hospitals enough information to make LOS-related decisions.

Clustering of patients by patient-caretaker interactions

Next, let's look at the results from the clustering model. With three clusters, the between-cluster sum of squares accounted for 64.5% of the total sum of squares; in other words, the three clusters explain roughly two-thirds of the total variation among patients. The plot of clusters along the top two principal components reveals that cluster 3 represents the vast majority of patients, who experienced comparably lower average daily numbers of patient-caretaker interactions with a relatively low degree of variation. Patients assigned to cluster 2 show higher variation in patient-caretaker interactions, while cluster 1 represents patients with the highest overall number of, and variation in, patient-caretaker interactions.

Results of the K-Means clustering of patients by patient-caretaker interactions, plotted by the first two principal components. Figure by authors.

Now, let's break down each cluster by looking at the relative frequency of patient characteristics within it. Here, we see that patients in cluster 1 were most likely to have been diagnosed with sepsis, had the shortest stays, and died most frequently. This observation is logical, given that sepsis is classified as a medical emergency that can rapidly progress to septic shock, triggering tissue damage and organ failure, and is oftentimes fatal [3]. Although we aren't able to assume causality, it is possible that the high variation in patient-caretaker interactions in this cluster is driven by the dominance of high-fatality conditions, whereby a patient either died before treatment or died while being treated.

Relative frequency of patient traits clustered by patient-caretaker interactions. Figure by authors.

We found that the majority of patients in clusters 2 and 3 had a much smaller and less variable number of patient-caretaker interactions, and longer lengths of stay. Compared to cluster 1, it could be that the number of interactions is initially high immediately following admission but eventually plateaus once the patient is stabilized. Patients in these groups were also less likely to die, and were more likely to be diagnosed with conditions such as gastrointestinal bleed, coronary artery disease, and congestive heart failure. These conditions have varying levels of severity, but are less likely to be immediately life-threatening.

Overall, we were able to identify a few key traits associated with the number of patient-caretaker interactions, and these mainly concerned a patient's condition rather than their demographic traits. The distribution of clusters suggests that a small number of patients may require more resources on an average daily basis, while the majority of patients utilized a similar amount of resources. However, the overlap between clusters nonetheless suggests that patient-caretaker interactions may be influenced by factors outside the scope of this analysis, such as patient history and condition severity, and vary widely on a case-by-case basis.

Limitations

Model interpretation and patient cost quantification

Applying predictive analytics to the healthcare setting can be quite challenging, given that the modeling of patient outcomes must take into consideration both the generalizability of models and the substantial variation in treatment costs and outcomes even amongst similar patient groups. Patients are admitted with conditions requiring varying treatment procedures, and into different units with varying capacities and resources. Since the data was also collected over an 11-year period, restructuring of hospital programs, units, and divisions may have occurred over time. Therefore, to quantify the cost of a patient's LOS, it may be useful to segment patients by diagnosis, admission location, and time period.

While we tried to group patients by patient-caretaker interactions to measure patient resource consumption, it is also important to keep in mind that an interaction such as taking a patient’s vitals is considerably less costly than performing a surgical procedure, making it difficult to quantify the dollar-cost of each patient group based on these results alone.

Sample bias

It is important to recognize that the data used for the analysis consists of admissions to a single hospital. These patients were also predominantly white, covered by private insurance or Medicare (federal health coverage for patients 65 years and older), and had a median age of 59 years. Interestingly, we found that 70% of patients were classified as emergency room admissions, and over 10% of admitted patients ultimately died, compared to the US average of 0.77–1.48 deaths per 1,000 emergency room admissions [4]. Since the patient profile of the MIMIC-III dataset is not a representative sample of the demographic composition of US patients, care needs to be taken not to extrapolate the results to the wider US population.

Feature validity

One of the major limitations of our classification model is the lack of comprehensive and quantitative measures. In fact, we were only able to use six categorical predictors (admission type, admission location, admission diagnosis, religion, marital status, and ethnicity), one binary predictor (gender), and only one continuous predictor (age). Including other critical metrics recorded at the time of admission, such as Body Mass Index (BMI), heart rate, temperature, and nervous reflexes, as well as general indicators such as pain levels and pre-existing medical conditions, would potentially improve the predictive power of the classification models.

Final Thoughts

One of the central aims of this analysis was to use machine learning models to better understand the underlying patterns of patient hospitalizations and subsequent Length of Stay (LOS). While the results of the models demonstrated that it might be difficult to predict a patient's LOS based on admission characteristics alone, we were able to cluster patients by patient-caretaker interactions to successfully identify three groups of patients on the basis of hospital resource utilization, and to understand the characteristics, diagnoses, and patterns of mortality associated with each group.

Healthcare analytics shows great potential for contributing to advances in healthcare resources management, prognostic and diagnostic analysis, and even for the early detection of disease and disability. Yet, in order to operationalize predictive models for such tasks, data collection, algorithm tuning, and model interpretation must be undertaken in a manner which ultimately furthers the interest of the individuals, communities, and populations they are designed to serve. These models can have large benefits, but must be approached with care when human lives are on the line.

Project Code:

Github: https://github.com/duncan-wang/LOS-prediction

References:

[1] Health Catalyst (2016), "Patient-Centered LOS Reduction Initiative Improves Outcomes, Saves Costs".

[2] Hassan M., David K. (2006), “Hospital length of stay and probability of acquiring infection,” International Journal of Pharmaceutical and Healthcare Marketing, vol. 4, no. 4, pp. 324–338.

[3] Centers for Disease Control and Prevention (CDC) (2020), "What is sepsis?".

[4] Shmerling, R. (2018), “Where people die.” Harvard Health Blog.

MIMIC-III Database:

MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35. Available at: http://www.nature.com/articles/sdata201635

Please feel free to reach out to us on LinkedIn if you want to share any thoughts!
