
Data Science for the Heart

Cardiovascular Disease Data Analysis and Predictive Modeling

Photo by Alex Lee on Unsplash

Introduction

The purpose of this data exploration and predictive analysis is to better understand which health factors affect a patient’s risk for Heart Disease. To accomplish this, an introduction to the data will be made, along with a graphical analysis of the health factors in the dataset. The predictive modeling process will be introduced, giving the background for the evaluation of the logistic regression predictive model. This evaluation will consist of reviewing performance metrics from the confusion matrix. Finally, an explanation of the model’s calculation will be given for a specific example to demonstrate how the factors drove the prediction.

Dataset Explanation

The Heart Disease Dataset selected for this project comes from the UCI Machine Learning Repository. The dataset consists of records for 461 patients, each describing the individual’s health factors and heart disease diagnosis. The 12 health factors in the dataset used in this project are outlined below.

1. Age – age of the patient in years
2. Sex – sex of the patient
  • 0 indicating Female
  • 1 indicating Male
3. CP – chest pain type of the patient
  • 1 indicating typical angina
  • 2 indicating atypical angina
  • 3 indicating non-anginal pain
  • 4 indicating an asymptomatic patient
4. TrestBps – resting blood pressure in mmHg
5. Chol – serum cholesterol in mg/dl
6. Fbs – fasting blood sugar
7. RestEcg – resting electrocardiographic results
  • 0 indicating normal
  • 1 indicating ST-T wave abnormality
  • 2 indicating probable or definite left ventricular hypertrophy
8. Thalach – maximum heart rate achieved
9. Exang – exercise-induced angina
  • 0 indicating no
  • 1 indicating yes
10. Oldpeak – ST depression induced by exercise relative to rest
11. Slope – the slope of the peak exercise ST segment
  • 1 indicating upsloping
  • 2 indicating flat
  • 3 indicating downsloping
12. Cardio – diagnosis of heart disease
  • 0 indicating absence
  • 1 indicating presence

For the analysis and predictive modeling, the data was processed so that the age, resting blood pressure, and serum cholesterol factors were binned into groupings.
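For readers who want to reproduce this preprocessing, the sketch below shows one way such groupings can be created with pandas. The file path, column names, and bin edges are assumptions for illustration (age grouped by decade, blood pressure grouped following the Harvard Health Publishing ranges discussed later, and cholesterol grouped following the Mayo Clinic guidelines); this is not the exact code used for this project.

```python
import pandas as pd

# Load the cleaned dataset (path and column names are assumed for illustration)
df = pd.read_csv("heart_disease_cleaned.csv")

# Age groupings by decade: 30s through 70s
df["age_group"] = pd.cut(
    df["age"], bins=[30, 40, 50, 60, 70, 80],
    labels=["30-39", "40-49", "50-59", "60-69", "70-79"], right=False
)

# Systolic blood pressure groupings (assumed cut points per Harvard Health Publishing)
df["trestbps_group"] = pd.cut(
    df["trestbps"], bins=[90, 120, 130, 140, 180, 201],
    labels=["90-119", "120-129", "130-139", "140-179", "180-200"], right=False
)

# Serum cholesterol groupings (assumed cut points per Mayo Clinic guidelines)
df["chol_group"] = pd.cut(
    df["chol"], bins=[0, 200, 240, 1000], labels=[1, 2, 3], right=False
)
```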

Data Analysis

The table above in figure 1 gives the basis for understanding how the health factors in the dataset are correlated with one another. In the age column, trestbps (resting blood pressure) and thalach (maximum heart rate achieved) are the factors most strongly correlated with age. As a patient’s age increases, their resting blood pressure tends to increase and their maximum heart rate achieved tends to decrease. The correlation between age and cardio is positive; however, it is not very strong, with a value of .176.

Of particular interest is the row in the table for cardio, the diagnosis of heart disease. This row shows the correlation between cardio and the other health factors. Some of the factors with the strongest correlations to cardio are cp, thalach, exang, oldpeak and slope, which were defined in the Dataset Explanation section. While age and sex do not correlate as strongly with the cardio factor, this is not to say that no conclusions can be drawn from them. Analyzing trends across groupings of both age and sex together reveals additional information on how the age and sex of a patient statistically contribute to the overall risk for cardiovascular disease.
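Figure 1 is a Pearson correlation table; a minimal sketch of how such a table can be computed with pandas is shown below, assuming the dataframe and lowercase column names from the preprocessing sketch above (illustrative only, not the exact code behind figure 1).

```python
# Pearson correlation between the health factors and the cardio diagnosis
factors = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "cardio"]
corr = df[factors].corr()

# Correlation of each factor with cardio, sorted from strongest positive down
print(corr["cardio"].sort_values(ascending=False))
```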

The data sample consists of 461 total patients, 124 of which are female and 337 of which are male. It is important to consider this sample size disparity between the patients’ sex while analyzing the data, as trends in the overall population will shift towards the larger group’s direction. The age range of the patients in the dataset spans from 30 to 79 years old. Figure 2 above represents the patient count by both sex and age.

Grouping by age in figure 3 above, the data shows that the cardiovascular disease percentage generally increases with the patient’s age. The one outlier is in the 70–79 year old grouping, where the percentage of patients with cardiovascular disease drops from 69.2% for patients in their 60s, to 58.8% for patients in their 70s.

There is a large imbalance in the cardiovascular disease percentage between males and females, shown in figure 4 above. Within the dataset, male patients are more than twice as likely to have cardiovascular disease, with 66.5% of males having the disease contrasted to only 29.8% of females.

Using the knowledge gained from viewing the data by sex and age individually, figure 5 below combines both factors to give a more detailed perspective. The data shows that both male patients and older patients are more likely to have cardiovascular disease, as figures 3 and 4 also displayed. Figure 5 further details that the relationship between age and cardiovascular disease holds true in males and females separately, and is not observed only for the sample population as a whole.

The one outlier is for females in their 70s, where the disease percentage drops from 41.7% for females in their 60s to zero percent. This drop is what contributed to the decline in the combined (female and male) disease percentage from the 60s to the 70s grouping noted in figure 3. For males, the correlation between age and a higher cardiovascular disease percentage holds true for all age groupings in the dataset.

For the one outlying female age grouping, it should be noted that the sample size for this group is the smallest out of all the groupings. Figure 2 shows that only one percent of the data sample, 5 of the 461 total patients, are females in their 70s. This is very low when compared to all other age and sex groupings, other than females in their 30s.
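The groupings behind figures 3 through 5 can be reproduced with simple pivot tables. The sketch below, again assuming the dataframe and column names used above, also reports the patient count in each cell so that small samples such as the five females in their 70s are easy to spot.

```python
# Figure 3-style view: disease percentage by age grouping
by_age = df.groupby("age_group")["cardio"].mean().mul(100).round(1)

# Figure 4-style view: disease percentage by sex (0 = female, 1 = male)
by_sex = df.groupby("sex")["cardio"].mean().mul(100).round(1)

# Figure 5-style view: disease percentage and patient count by age grouping and sex
combined = df.pivot_table(index="age_group", columns="sex",
                          values="cardio", aggfunc=["mean", "count"])

print(by_age, by_sex, combined, sep="\n\n")
```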

While age and sex are the most readily available health factors from the dataset, they are not the only measurements used in this data exploration. For example, blood pressure is an important factor in a patient’s risk for cardiovascular disease. Figure 6 below is split by sex and blood pressure groupings. The blood pressure groupings are based on the guidelines from Harvard Health Publishing.

For both females and males, a higher systolic blood pressure measurement is correlated with a higher chance of cardiovascular disease. The one outlier is for males in the [130,139] systolic blood pressure grouping, where the percentage of patients with the disease drops 0.3 percentage points, to 61.4%, from 61.7% in the previous [120,129] grouping.

The change in cardiovascular disease risk between groupings is fairly steady for females, while for males the change grows larger as the blood pressure groupings increase. The average change in disease risk between groupings is 12.73% for females and 10.85% for males.

Figure 6 continues to show the disparity between females and males in cardiovascular disease rates in the dataset. Males with a systolic blood pressure reading between [90,119], which is considered a normal measurement, are only 3.4% less likely to have cardiovascular disease than females in the highest blood pressure grouping of [180,200], which is considered a measurement for a patient in hypertensive crisis.

Dataset Resource

The cleaned dataset, as well as the pivot tables used to create the figures above, can be found at the following GitHub Repository. Please feel free to download the file to further slice the data into different visualizations. There are many additional columns representing health factors that were not specifically visualized in depth during this discussion, but are used in the logistic regression predictive model.

Predictive Model Process

The predictive model used in this exploration is the logistic regression model. Logistic regression is a valid model to use in this scenario because the dependent variable that is being predicted is either true or false, 1 or 0. A predicted value of true, or 1, indicates that the patient is predicted to have cardiovascular disease. A value of false, or 0, indicates that the patient is predicted to not have the disease. This prediction of the Cardio datapoint is based on the given inputs, which are the 11 other health factors in the dataset.
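For context, logistic regression produces this true/false prediction by passing a weighted sum of the input factors through the sigmoid function, which squeezes any number into a probability between 0 and 1. The sketch below illustrates that calculation with made-up coefficients; in practice the coefficients are learned from the training data.

```python
import math

def predict_probability(features, coefficients, intercept):
    """Logistic regression: sigmoid of the weighted sum of the inputs."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Toy example with three features; a probability above 0.5 would be classified as cardio = 1
p = predict_probability(features=[1, 0, 2], coefficients=[0.8, -1.2, 0.3], intercept=-0.5)
print(round(p, 3))  # ~0.711 with these made-up values
```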

The software used to build, train and test the logistic regression model was Orange, an open source Machine Learning and data visualization toolkit. Figure 7 shows a visualization detailing the setup of the model and the execution of the model’s performance evaluation.

To briefly summarize this visualization (a rough code equivalent follows the list):

  1. The dataset is loaded into the workspace, expressed by the Data File widget
  2. The data is fed into the Data Sampler which splits the data into two sets
  3. Set 1 is the Training Data, comprising around 85% of the total data
  4. Set 2 is the Testing Data, comprising around 15% of the total data
  5. The Training Data is fed into the Logistic Regression model widget to build the model
  6. The Logistic Regression model and the Testing Data are mapped to the Predictions widget, where the model is evaluated against the testing data
  7. A Confusion Matrix is generated to further break down the model’s performance
  8. The Explain Prediction widget maps the Logistic Regression model, along with the Training and Testing Data, to explain which features contribute the most, and how they contribute, to the prediction for a single instance
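Orange assembles this workflow visually through widgets. For readers who prefer code, the sketch below is a roughly equivalent pipeline in scikit-learn; it mirrors the steps above (an 85/15 split, a logistic regression fit on the training data, predictions and a confusion matrix on the testing data) but is not the exact Orange workflow used for this project.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Load the cleaned dataset (path assumed for illustration)
df = pd.read_csv("heart_disease_cleaned.csv")

# X holds the 11 input health factors, y holds the cardio diagnosis (steps 1-2)
X = df.drop(columns=["cardio"])
y = df["cardio"]

# Steps 3-4: split into roughly 85% training and 15% testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

# Step 5: build the logistic regression model on the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Steps 6-7: predict on the testing data and build the confusion matrix
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
```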

Model Evaluation Using the Confusion Matrix

The evaluation metric used to describe the model’s performance is the confusion matrix. The two possible classes in the confusion matrix are 0 and 1, or false and true. A predicted value of 0, or false, indicates that the model predicted that the patient does not have cardiovascular disease. A predicted value of 1, or true, indicates that the model predicted that the patient does have cardiovascular disease.

With this knowledge, the following terms can be defined:

  1. True Negatives (TN) – Cases where the model predicted 0, that the patient does not have cardiovascular disease, and the patient does not have the disease
  2. True Positives (TP) – Cases where the model predicted 1, that the patient has cardiovascular disease, and the patient does have the disease
  3. False Negatives (FN) – Cases where the model predicted 0, that the patient does not have cardiovascular disease, and the patient does have the disease
  4. False Positives (FP) – Cases where the model predicted 1, that the patient does have cardiovascular disease, and the patient does not have the disease

From these definitions, a perfect model will have every case fall in the True Negatives or True Positives classes, with no cases falling in the False Negatives or False Positives classes. Therefore, the goal when building the model should be to reduce the FN and FP cases.

For the specific case of predicting cardiovascular disease, reducing the FN class is of utmost importance because these are the cases in which a patient is predicted to not have the disease, but does have the disease. If making decisions regarding treatment based on this prediction, these patients may not receive the proper treatment or medication for cardiovascular disease.

The confusion matrix above displays the TN, TP, FN and FP values for the logistic regression model when evaluated on the testing data. In total, 77 patients comprise the testing data, which is approximately 16.7% of the total dataset. To evaluate the performance of the model, some additional terms will be defined, along with their corresponding values.

  1. Accuracy – How often the model is correct overall
  • (TN + TP)/Total = (37 + 26)/77 = .818
  2. Recall – How often the model predicts 1 when the actual value is 1
  • TP/(TP + FN) = 26/(26 + 6) = .813
  3. Precision – How often the model is correct when the predicted value is 1
  • TP/(TP + FP) = 26/(26 + 8) = .765
  4. F1 Score – The weighted harmonic mean of recall and precision
  • This score tends to be more useful than accuracy alone because it takes both FP and FN into account
  • 2 × (Recall × Precision)/(Recall + Precision) = 2 × (.813 × .765)/(.813 + .765) = .788
  5. Null Error Rate – How often the model would be incorrect if it always predicted 0
  • 0 is the majority class in the testing data for the confusion matrix above
  • (Actual 1)/Total = (FN + TP)/Total = (6 + 26)/77 = .416
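Plugging the counts from the confusion matrix above (TN = 37, FP = 8, FN = 6, TP = 26) into these formulas reproduces the reported values; the short check below can be run as-is.

```python
TN, FP, FN, TP = 37, 8, 6, 26
total = TN + FP + FN + TP  # 77 patients in the testing data

accuracy = (TN + TP) / total                         # 0.818
recall = TP / (TP + FN)                              # 0.813
precision = TP / (TP + FP)                           # 0.765
f1 = 2 * recall * precision / (recall + precision)   # 0.788
null_error_rate = (FN + TP) / total                  # 0.416 (error rate if always predicting 0)

print(accuracy, recall, precision, f1, null_error_rate)
```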

These metrics, through the evaluation of the confusion matrix, summarize the logistic regression model’s performance. From the Recall, it is seen that the model correctly caught 81.3% of patients who had cardiovascular disease. From the Precision, it is seen that when the model predicts that a patient has cardiovascular disease, that prediction is correct 76.5% of the time.

With a Null Error Rate of 41.6%, a model that always predicted 0, that the patient does not have cardiovascular disease, would be correct 58.4% of the time. This can be used as a baseline metric to compare the actual logistic regression model against. With an accuracy of 81.8%, a recall of 81.3%, a precision of 76.5%, and a F1 score of 78.8%, it is clear that the logistic regression model statistically outperforms the basic model that always predicts that the patient does not have the disease.

Prediction Explanation

Figure 10 is a prediction explanation view that explains the extent to which features contribute to a prediction for a single instance, based on the model. The target class in the explanation is 1, meaning that the model is evaluating whether or not the patient is predicted to have cardiovascular disease. If the feature contributions total above .5 the prediction will be 1 for cardiovascular disease. Similarly, if the feature contributions total below .5, the prediction will be 0 for the patient to not have cardiovascular disease.

From the figure, the .35 in the grey box is the probability calculated by the model, indicating that the patient is not predicted to have the disease. The features that contribute the most to this prediction are the patient’s sex, exercise-induced angina category, chest pain category, and cholesterol group category. The patient in this example is a female, specified by sex=0 = 1. In these cases, the categorical features are labeled with the format feature-name=feature-value = 0/1 (false/true).

The .29 in the blue bar related to the patient being female shows that this factor lowers the calculated probability of having the disease by 29%. Similarly, exang=0 = 1 demonstrates that the patient does not have exercise-induced angina, lowering the calculated probability of having the disease by 12%.

The .06 in the red bar is related to cholesterol group 3. This indicates that being in this cholesterol grouping raises the patient’s calculated probability for having cardiovascular disease by 6%. The cholesterol groupings are based on the guidelines from Mayo Clinic.
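Orange’s Explain Prediction widget produces these contributions internally. For a linear model such as logistic regression, a rough way to inspect how each feature pushes a single prediction toward or away from disease is to look at the coefficient multiplied by the feature value, which decomposes the log-odds. The sketch below, which assumes the fitted scikit-learn model and testing data from the earlier sketch, is only an approximation of that idea, not the exact method used by the widget.

```python
import numpy as np

# One patient from the testing data (assumes model and X_test from the earlier sketch)
instance = X_test.iloc[[0]]

# Contribution of each feature to the log-odds of cardio = 1
contributions = model.coef_[0] * instance.to_numpy()[0]
log_odds = model.intercept_[0] + contributions.sum()
probability = 1.0 / (1.0 + np.exp(-log_odds))

# Positive values push the prediction toward disease, negative values away from it
for name, value in sorted(zip(instance.columns, contributions), key=lambda t: -abs(t[1])):
    print(f"{name}: {value:+.3f}")
print(f"predicted probability of cardio = 1: {probability:.2f}")
```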

Conclusion

This data exploration and predictive analysis of heart disease has identified which health factors statistically affect a patient’s disease risk, in addition to the extent to which those factors contribute to the risk. Both a graphical analysis as well as a prediction explanation demonstrated that even a limited amount of health data from a patient provides a template to understand their risk for disease.

This understanding is supported by the evaluation in the confusion matrix, which resulted in an overall model accuracy of 81.8%. The recall of the model, how often the model predicts a patient to have the disease when they do have the disease, was 81.3%. The precision of the model, how often the model is correct when the predicted value is that the patient has the disease, was 76.5%. From this performance, it can be concluded that health factors contributing to heart disease can be identified and employed to better understand a patient’s disease risk.

