Combining Medicine and Data Science to Predict Heart Disease

Diksha Sen Chaudhury
Towards Data Science
9 min read · Sep 8, 2020


Introduction

The purpose of this project is to combine the principles of data science and medicine to develop a model that can predict heart disease. The advantage of such a model is that it is easily interpretable and consistent with medical literature, unlike many machine learning models whose results are difficult to interpret. This approach helped me build a model that, by screening just 34% of the population, can predict the occurrence of heart disease with an 84% hit rate.

According to the WHO, cardiovascular diseases (CVDs), which include heart disease, claim an estimated 17.9 million lives each year, about 31% of all deaths worldwide. This makes CVDs the number one cause of death globally. [1]

Now, what if we could build a meaningful model that could predict the likelihood of heart disease in a patient, just based on a few parameters? The word ‘meaningful’ here is very important. We don’t necessarily want a model that will give us the highest accuracy rate, but rather one which incorporates significant features and can be explained from a medical point of view. For this project, I used Google Colab to develop my models.

Dataset

I worked with the ‘Heart Disease Cleveland UCI’ dataset from Kaggle (originally posted as the ‘Heart Disease Data Set’ in the UCI Machine Learning Repository). The Kaggle dataset contains records for 297 patients, with 13 features and 1 binary target variable called ‘condition’ (0 = heart disease absent, 1 = heart disease present). A detailed description of all 14 attributes has been included here.
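For reference, here is a minimal sketch of loading the data with pandas (the file name comes from the Kaggle page; the local path is an assumption):

```python
import pandas as pd

# Load the Kaggle CSV; the file name is from the dataset page,
# and a local working-directory path is assumed
df = pd.read_csv("heart_cleveland_upload.csv")

print(df.shape)                          # (297, 14): 13 features + 'condition'
print(df["condition"].value_counts())    # 0 = heart disease absent, 1 = present
```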

Step 1: What Does Medical Literature Have to Say?

Medical research identifies 5 factors as the most influential in predicting heart disease.

  • Age: Increasing age adds to the risk of developing heart disease. [2]
  • Sex: Males are at a higher risk of heart disease than pre-menopausal females; the risk is comparable between males and post-menopausal females. [3]
  • Serum cholesterol levels: Increased serum cholesterol levels contribute to the development of heart disease. [4]
  • Blood pressure: Hypertension, or high blood pressure, is a major risk factor for the development of heart disease. [5]
  • Chest pain: Approximately 25–50% of patients with heart disease suffer from silent myocardial ischemia (SMI), meaning they feel no chest discomfort. Hence, even an absence of chest pain can indicate the presence of heart disease. [6]

Luckily, all 5 factors above are included as variables in our dataset! Let’s take a quick look at how they are distributed.

Figure 1. Top Row: Distributions of Age, Sex and Cholesterol (chol) respectively; Bottom Row: Distributions of Resting Blood Pressure (trestbps) and Chest Pain (cp) respectively
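As a rough sketch, histograms like these can be produced directly from the dataframe (the column names are from the dataset; the plotting choices are my assumptions):

```python
import matplotlib.pyplot as plt

# Histograms of the five medically important features
features = ["age", "sex", "chol", "trestbps", "cp"]
fig, axes = plt.subplots(1, 5, figsize=(20, 4))
for ax, col in zip(axes, features):
    ax.hist(df[col], bins=20)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```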

Observations

  • Ages of the patients are widely distributed, with an average of about 55 years.
  • Overall, the study includes more males (class = 1) than females (class = 0). Among males, more patients have heart disease than not; among females, far fewer do. This is consistent with males being more prone to heart disease than females.
  • Serum cholesterol levels are widely distributed, with an average of about 250 mg/dl.
  • The distribution of resting blood pressure is rather irregular, with an average of about 130 mm Hg.
  • Among those who suffer from heart disease, most patients are asymptomatic (class = 3). Hence, the data supports medical literature. Chest pain in the forms of typical angina (class = 0), atypical angina (class = 1), and non-anginal pain (class = 2) are mostly reported by patients who do not have heart disease.

Step 2: Running a Logistic Regression Model with All 13 Features

My first instinct was to run a Logistic Regression model with all 13 features to check whether the most inclusive model is also the most medically meaningful one. I chose logistic regression since it can be easily explained. The model gave me an overall 86.7% accuracy on the test data and was able to correctly predict the presence of heart disease (predicted class = 1) in 82.9% of patients. Pretty good, right? Well, I was quite disappointed when I looked at the coefficients of each variable.
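Here is a minimal sketch of how such a model might be fit with scikit-learn (the article does not specify the split; the 70/30 split and random_state below are assumptions, chosen so the test set has 90 patients as reported later):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

# Separate the 13 features from the binary target
X = df.drop(columns="condition")
y = df["condition"]

# 70/30 split (an assumption); 30% of 297 patients gives a 90-patient test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit logistic regression on all 13 features
model13 = LogisticRegression(max_iter=1000)
model13.fit(X_train, y_train)

y_pred = model13.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))        # ~86.7% reported
print("Recall (class 1):", recall_score(y_test, y_pred))  # ~82.9% reported
print(dict(zip(X.columns, model13.coef_[0])))             # inspect coefficient signs
```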

Figure 2. Results of 13-feature Logistic Regression Model

The first coefficient (-0.03121) corresponds to the age variable. Just a few moments ago, we saw that medical literature says increasing age adds to the risk of developing heart disease. In that case, shouldn’t age have a positive coefficient? Similarly, the coefficient for ‘fbs’ or fasting blood sugar is negative (-0.41832). According to medical literature, when fbs > 100 mg/dL (in this dataset, class = 1 if fbs > 120 mg/dL and class = 0 otherwise), the risk of heart disease greatly increases [7]. Hence, the sign of this coefficient should ideally be positive as well.

From a medical perspective, it would be incorrect to accept a model that doesn’t get the signs of its coefficients right, even if it is highly accurate.

Step 3: Running a New Logistic Regression Model in Accordance with Medical Literature

Now that I had built a model purely based on machine learning, I decided to try a model that would be meaningful instead. I wanted to construct a model using a subset of the 13 features, keeping only those whose signs conformed to medical literature. This left me with a logistic regression model built on the 5 most important features according to medical literature: age, sex, serum cholesterol levels (chol), resting blood pressure (trestbps), and chest pain (cp).

The model gave me an overall 74.4% accuracy on the test data and was able to correctly predict the presence of heart disease (predicted class = 1) in 75.6% of patients. The coefficients of the features and intercept of the model are shown below.
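Continuing the sketch above, refitting on the 5-feature subset might look like this:

```python
# Refit using only the five medically supported features
med_features = ["age", "sex", "chol", "trestbps", "cp"]
model5 = LogisticRegression(max_iter=1000)
model5.fit(X_train[med_features], y_train)

y_pred5 = model5.predict(X_test[med_features])
print("Accuracy:", accuracy_score(y_test, y_pred5))        # ~74.4% reported
print("Recall (class 1):", recall_score(y_test, y_pred5))  # ~75.6% reported
print("Intercept:", model5.intercept_)
print(dict(zip(med_features, model5.coef_[0])))            # all signs should be positive
```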

Figure 3. Results of 5-feature Logistic Regression Model

All the coefficients are positive, just as we expected! Although this model may not be the most accurate, it is meaningful and can be easily explained by any medical practitioner.

Step 4: How Can We Make the New Model More Reliable?

The new logistic regression model with a subset of variables is meaningful but falls short of the prediction power of the 13-feature model (82.9% vs 75.6%).

To overcome this shortcoming, we can complement it with another ML model that incorporates all 13 features. The combination will do a better job of predicting the presence of heart disease in patients.

I developed 5 ML models with all 13 features and their performance is summarized in the table below.
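The article names only Logistic Regression and Random Forest among the five, so the comparison loop below is a sketch under the assumption that other standard scikit-learn classifiers filled out the set:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Candidate models; apart from Logistic Regression and Random Forest,
# the specific choices here are assumptions
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.3f}, "
          f"recall={recall_score(y_test, pred):.3f}")
```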

Figure 4. Summary of the Results of all five 13-feature Models

Although Logistic Regression happens to be the most accurate model of the five, I consciously ignored it for the reasons outlined above (non-intuitive signs of its coefficients). Our best bet is to choose the Random Forest model, which gives an overall 83.33% accuracy on the test data and correctly predicts the presence of heart disease (predicted class = 1) in 78.05% of patients. The Random Forest (RF) feature importance chart is shown below.

Figure 5. Random Forest Model Feature Importance Chart

The interesting thing is that the top 4 variables according to RF are maximum heart rate achieved (thalach), chest pain (cp), the number of major vessels colored by fluoroscopy (ca), and heart defects (thal), while the important features from our logistic model (age, sex, cholesterol, resting blood pressure) are towards the bottom. This tells us that the Random Forest model should not be trusted on its own, but rather used in conjunction with our 5-feature logistic regression model for good prediction results.
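The feature ranking itself can be pulled straight from the fitted model (continuing the sketch above):

```python
# Rank the Random Forest feature importances (cf. Figure 5)
rf = models["Random Forest"]
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
# Per the article, expect thalach, cp, ca, and thal near the top
```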

Step 5: Combining the Two Models and Final Recommendations

This section describes how I combined both models’ predictions to come up with the best overall prediction. I created a table comparing the model predictions and the actual conditions for all patients truly suffering from heart disease in our test dataset. The notation I use for this section is also described below.

Figure 6. Notation Used
Figure 7. Table Comparing Combined Model Predictions and Actual Conditions for all Patients Truly Suffering from Heart Disease

Interesting! If we relied solely on our logistic model to correctly predict heart disease (LH), we would get a 31/44 or 70.5% hit rate. On the other hand, if we relied only on our Random Forest model (RH), we would get a 31/36 or 86.11% hit rate. The RF score is inflated by the 100% hit rate in class LLRH, which is probably an artifact of the small sample size in this class (only 5 patients).

If we rely on a combination of the two models (LHRH), we end up selecting only 31 patients (31/90 = 34% of our test sample), among whom the occurrence of heart disease is 26/31 or 84%. Hence, we have been able to improve the hit rate from 50% (a 50–50 chance of being detected with heart disease) to 84%, while remaining consistent with medical literature. This model is efficient and does a good job of prediction.
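A minimal sketch of this combination rule, reusing the fitted models from earlier (the exact 26/31 figure depends on the article’s split, which is not specified):

```python
# Flag a patient only when BOTH models predict heart disease (class LHRH)
log_pred = model5.predict(X_test[med_features])  # 5-feature logistic model
rf_pred = rf.predict(X_test)                     # 13-feature random forest

flagged = (log_pred == 1) & (rf_pred == 1)

n_flagged = int(flagged.sum())
hit_rate = y_test[flagged].mean()  # fraction of flagged patients who truly have disease
print(f"Flagged {n_flagged}/{len(y_test)} patients "
      f"({n_flagged / len(y_test):.0%}); hit rate {hit_rate:.0%}")
# The article reports 31/90 patients flagged (34%) with a 26/31 = 84% hit rate
```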

As for the classes where the logistic model and the RF model give conflicting predictions (LHRL and LLRH), more research is needed.

Conclusion

In the medical field, the most accurate model may not be the most meaningful, and vice versa. Models like these are an example of how we can combine the principles of data science, in the form of machine learning models, with medical literature to give us the best possible result. The advantage of this model is that it is easily interpretable and consistent with medical literature. In predicting heart disease, it improves the hit rate from 50% to 84% while screening just 34% of the population. This model can be leveraged for telemedicine, particularly in underdeveloped countries with little access to cardiologists. Future endeavors could entail collaborating with cardiologists to test the model on other medical datasets and validate it.

The code for this project can be found here.

References

‘Heart Disease Cleveland UCI’ Dataset on Kaggle: https://www.kaggle.com/cherngs/heart-disease-cleveland-uci?select=heart_cleveland_upload.csv

Original Source of Dataset: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

[1] Cardiovascular Diseases WHO (2020), https://www.who.int/health-topics/cardiovascular-diseases/#tab=tab_1

[2] R. Dhingra and R.S. Vasan, Age as a Cardiovascular Risk Factor (2012), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3297980/

[3] J.L. Sullivan, Iron and the Sex Difference in Heart Disease Risk (1981), https://www.sciencedirect.com/science/article/abs/pii/S0140673681924636

[4] W.B. Kannel, W.P. Castelli, T. Gordon and P.M. McNamara, Serum Cholesterol, Lipoproteins, and the Risk of Coronary Heart Disease (1971), https://www.acpjournals.org/doi/abs/10.7326/0003-4819-74-1-1

[5] C. Rosendorff, H.R. Black, C.P. Cannon, B.J. Gersh, J. Gore, J.L. Izzo Jr, N.M. Kaplan, C.M. O’Connor, P.T. O’Gara and S. Oparil, Treatment of Hypertension in the Prevention and Management of Ischemic Heart Disease (2007), https://www.ahajournals.org/doi/full/10.1161/circulationaha.107.183885

[6] A.H. Ahmed, K.J. Shankar, H. Eftekhari, M.S. Munir, J. Robertson, A. Brewer, I.V. Stupin and S.W. Casscells, Silent myocardial ischemia: Current perspectives and future directions (2007), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2359606/

[7] C. Park, E. Guallar, J.A. Linton, D. Lee, Y. Jang, D.K. Son, E. Han, S.J. Baek, Y.D. Yun, S.H. Jee and J.M. Samet, Fasting glucose level and the risk of incident atherosclerotic cardiovascular diseases (2013), https://pubmed.ncbi.nlm.nih.gov/23404299/

