Developing a model for heart disease prediction using PyCaret

Jason Bentley
Towards Data Science
10 min read · Jul 29, 2020


Source: Adobe Stock (License #262173764)

Background

In a mature professional data science environment, you might have access to AWS, GCP, Azure or other platforms and software with tools for experiment set-up, tracking and logging. But what if you just want to get a pilot up and running fast, or you are doing a research project? This is where PyCaret (https://pycaret.org/), with its ability to run experiments quickly and systematically, might come in handy!

What is PyCaret? Well, its creators describe it as an “…open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.” Anything that lets you spend more time on content and maximizing impact has got to be good, right?

In the following example we will develop, evaluate, and inspect a model for predicting heart disease using PyCaret. As this is my first experience with PyCaret I will also provide some thoughts and first impressions at the end. As always in any example application involving health, the associations, inferences, and commentary in no way constitute medical advice or expertise. The notebook and other materials used to support this article are available to access on my Github (https://github.com/jason-bentley/medium_articles/tree/master/heart_disease_pycaret).

Data

The processed Heart Disease dataset (https://archive.ics.uci.edu/ml/datasets/heart+Disease) from the Cleveland Clinic contains 13 features and an outcome for 303 patients. The outcome, when dichotomized, indicates whether a patient had heart disease and is the target of interest. Overall, 45% of patients in the cohort had some degree of heart disease, so our target here is well balanced.
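For readers who want to follow along, here is a minimal sketch of loading the processed Cleveland file and dichotomizing the outcome. The file path and column names follow the UCI documentation; the 'target' and 'num' columns match those referenced in the setup() call shown later, while the 'target_c' column ignored there is not recreated here.

import pandas as pd

# Column names follow the UCI heart disease documentation
cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "heart-disease/processed.cleveland.data")

# Missing values are coded as '?' in the raw file
hd_df = pd.read_csv(url, header=None, names=cols, na_values="?")

# Dichotomize the outcome: any degree of heart disease (num > 0) vs. none
hd_df["target"] = (hd_df["num"] > 0).astype(int)

print(hd_df.shape)              # (303, 15)
print(hd_df["target"].mean())   # proportion with some degree of heart disease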

For our abbreviated exploratory data analysis (EDA), we will compare the features between those with and without heart disease and look at continuous features in more detail using plots (Figures 1 and 2).

Figure 1. The cohort table
Figure 2. Plots for continuous features
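As a rough illustration of the kind of grouped comparison behind Figure 1, a minimal sketch is below, assuming the hd_df frame from the loading sketch above; the exact formatting of the cohort table is not reproduced.

import pandas as pd  # assumes hd_df from the loading sketch above

# Compare continuous features between the two outcome groups (cf. Figure 1)
continuous = ["age", "trestbps", "chol", "thalach", "oldpeak"]
print(hd_df.groupby("target")[continuous].agg(["mean", "std"]).round(1))

# Cross-tabulate a categorical feature, e.g. chest pain type, against the outcome
print(pd.crosstab(hd_df["cp"], hd_df["target"], normalize="columns").round(2))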

The indications from the EDA are what we might expect. Patients with heart disease are more likely to be older, male, have asymptomatic chest pain, higher serum cholesterol, lower maximum heart rate, exercise induced angina, higher ST depression during exercise compared to rest, a flat or down-sloping peak exercise ST segment slope, affected major vessels, and a blood flow defect (fixed or reversible) identified by the thallium stress test.

A few observations from EDA that we need to consider for our pre-processing:

  1. A few missing values in one continuous feature (count of major vessels identified via fluoroscopy) and one categorical feature (thallium stress test blood flow to the heart).
  2. The resting electrocardiogram result of ST-T abnormality has only 4 cases in the entire cohort — this is a rare level which we could combine with another, or you can think of it as a low-variance feature. We might also want to combine levels of other features.
  3. Some potential outliers: a patient with no heart disease has a serum cholesterol level above 500, and two patients in the heart disease group have ST depression values > 5.

Experiment set-up

I combined some categories (typical and atypical angina in chest pain; ST-T abnormality and LV hypertrophy in resting ECG; flat or down-sloping in peak exercise ST segment slope; fixed defect and reversible defect in thallium stress test of blood flow) and trimmed serum cholesterol and ST depression during exercise to the 99.5th percentile to address points 2 and 3 above.
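A sketch of this pre-processing in pandas is below. The level labels and numeric codes follow the UCI documentation, and the exact recodings are illustrative rather than a verbatim copy of what was used for the figures above.

# Combine sparse levels (codes follow the UCI documentation; illustrative only)
hd_df["cp"] = hd_df["cp"].replace({1.0: "typical/atypical angina",
                                   2.0: "typical/atypical angina",
                                   3.0: "non-anginal",
                                   4.0: "asymptomatic"})
hd_df["restecg"] = hd_df["restecg"].replace({0.0: "normal",
                                             1.0: "ST-T abnormality / LV hypertrophy",
                                             2.0: "ST-T abnormality / LV hypertrophy"})
hd_df["slope"] = hd_df["slope"].replace({1.0: "upsloping",
                                         2.0: "flat/downsloping",
                                         3.0: "flat/downsloping"})
hd_df["thal"] = hd_df["thal"].replace({3.0: "normal", 6.0: "defect", 7.0: "defect"})

# Trim the two right-skewed continuous features at their 99.5th percentiles
for col in ["chol", "oldpeak"]:
    hd_df[col] = hd_df[col].clip(upper=hd_df[col].quantile(0.995))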

This modified data then forms the basis of the experiment via the setup() function. This function has an enormous array of inputs, as it creates a data processing pipeline covering many common tasks such as transformations, normalization, imputation and so on. A full listing can be found here (https://pycaret.org/preprocessing/).

In the experiment set-up we will use 65% of our data for training. To address missing values we set the imputation methods to median and mode for numeric and categorical features, respectively. When run, we first confirm the designation of categorical and numeric variables. The rest of the pipeline is then created, and the experiment set-up completes with a nice summary table that highlights aspects of the pipeline (Figure 3).

Figure 3. Experiment set-up: feature types are confirmed, then a summary table (truncated) for the pipeline is generated.

We can see that we have kept all observations from the original data (if we had opted to use PyCaret to filter outliers, the number of observations in the sampled data would be lower than in the original). With our 65% train / 35% test split we have 196 and 107 observations, respectively. There are 15 features in total (5 numeric and 10 categorical). The highlight in row 4 indicates that missing data was detected, and rows 13 and 14 indicate our choices for how these were imputed.

Importantly, this set-up object forms the basis of everything to follow. Many subsequent function calls for model development will have the results saved as part of the experiment object so they can be accessed at any time, but also so the entire process can be saved. For example, the experiment object contains separate items for the sampled, train and test feature matrices and target vector.
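As a rough illustration, recent PyCaret releases expose these items through the get_config utility (in the 1.x API used here, setup() instead returns them directly as a tuple); treat the exact keys as version-dependent.

from pycaret.classification import get_config

# Retrieve pieces of the experiment pipeline (PyCaret 2.x+)
X_train = get_config("X_train")   # transformed training features
X_test = get_config("X_test")     # transformed hold-out features
y_train = get_config("y_train")   # training target vector
y_test = get_config("y_test")     # hold-out target vector

print(X_train.shape, X_test.shape)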

Model development

With the set-up complete, to quickly identify a good set of candidate models (or model), we use compare_models(). This evaluates different classification models using default hyper-parameters and assesses performance in the training data with 10-fold cross-validation. There are 18 possible classifiers available, but for fun let’s restrict ourselves to a few boosting approaches (gradient, extreme and light), two flavors of SVM (linear, radial basis), and a random forest. The output of compare_models() is a nice summary of six common classifier metrics for those assessed (Figure 4).

Figure 4. Output of compare_models()

Ideally, we would take the top candidate models and, after tuning, choose a final option. However, for simplicity, given that our extreme gradient boosting model is a strong candidate on accuracy, let's proceed with it. If we want to look at the CV results for a selected classifier from the table above, we can use the create_model() function to create an object for a specific model.

Hyper-parameter tuning is done using the tune_model() function, which improved performance ever so slightly (Figure 5). For example, accuracy improved from 0.82 to 0.85. Note that, if you look under the hood, a randomized grid search with 10 iterations is being used. Of course, many more iterations could be used, or even a different approach such as hyperopt; however, for the purpose of this article we will move on to looking at the model.

Figure 5. Performance of candidate model after tuning
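If you did want to push the random search harder, a hedged sketch is below. Note that the tune_model() signature differs by version: the 1.x release used in this article accepts the estimator name as a string, while 2.x+ expects a model object from create_model(); n_iter controls the number of random-search iterations.

# Widen the random search beyond the default 10 iterations
# PyCaret 1.x style (as in this article): pass the estimator name as a string
tuned_model = tune_model('xgboost', fold=10, n_iter=50)

# PyCaret 2.x+ style: tune an already-created model object instead
# xgb = create_model('xgboost')
# tuned_model = tune_model(xgb, n_iter=50)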

The evaluate_model() function provides an interactive cell output with many common useful plots and summaries. Availability may depend upon the model you are evaluating, but the majority are common to all models and represent a good standard set of outputs we would want when evaluating a model. The interactive cell is shown below, where each option in a grey box is clickable in the notebook. Looking at the confusion matrix for our tuned model as an example output (Figure 6), we can see that in the test set we have 45 true negatives, 36 true positives, and 13 each of false positives and false negatives, which gives us an accuracy of 76%.

Figure 6. Confusion matrix for our tuned model from evaluate_model()
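To sanity-check those numbers outside the interactive widget, a minimal sketch using scikit-learn on the transformed hold-out set is below; it assumes the get_config objects from the earlier sketch and should reproduce the counts quoted above.

from pycaret.classification import get_config
from sklearn.metrics import confusion_matrix, accuracy_score

X_test, y_test = get_config("X_test"), get_config("y_test")
y_pred = tuned_model.predict(X_test)

print(confusion_matrix(y_test, y_pred))          # [[TN, FP], [FN, TP]] — cf. Figure 6
print(round(accuracy_score(y_test, y_pred), 3))  # (45 + 36) / 107 ≈ 0.757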

To understand which features are driving the model's heart disease predictions it is common to use SHAP values (https://shap.readthedocs.io/en/latest/). PyCaret allows us to create selected SHAP-based outputs to better understand our model with the interpret_model() function.

Figure 7. SHAP value summary plot

For our model the three top factors (Figure 7) are the number of affected major vessels, where the lowest value (0) reduces predicted risk; asymptomatic chest pain, which if present increases predicted risk; and a thallium stress test indicating normal blood flow (high value), which lowers predicted risk. We can also see a few factors that seemingly make no difference to predicted risk: high fasting blood sugar, chest pain (either anginal or non-anginal) and sex.

To dive a bit deeper, we can look at the feature contributions to the predicted risk score for a patient in our test set (Figure 8).

Figure 8. Prediction contributions for a high-risk patient

For this patient the high degree of ST-T depression, their maximum heart rate, a blood flow defect according to the thallium stress test, asymptomatic chest pain and 2 affected major vessels all contribute to a high predicted risk score (4.97, probability = 0.99) for heart disease compared to the base risk score (average model output = -0.21, probability = 0.45).
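The scores quoted here are on the log-odds scale used by the force plot; a quick sketch confirms the conversion to the probabilities reported above.

import numpy as np

def to_probability(log_odds):
    # Convert an XGBoost margin (log-odds) score to a probability
    return 1.0 / (1.0 + np.exp(-log_odds))

print(round(to_probability(-0.21), 2))   # base value   -> ~0.45
print(round(to_probability(4.97), 2))    # this patient -> ~0.99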

Using the predict_model() function we can check performance using the hold-out set. The performance is a bit worse than in training CV, so we might be over-fitting a little, but it could be worse (Figure 9). Plus, with more thorough hyper-parameter tuning we might overcome this as well.

Figure 9. Hold-out set performance for our selected model in comparison to default and tuned model performance in training data

Finally, to use this model in practice we might also want to divide the predicted likelihood of heart disease into groups so we can define sets of actions to take. One way to do this is to create a plot of heart disease incidence in the test group by risk decile of predictions. Where we put thresholds is context-dependent and needs to be done with expert clinical knowledge and an understanding of the benefits vs. costs. As this is an example, I am glossing over this. After eyeballing the plot, we will assume there are three distinct risk groups (low, medium and high), each of which can have a different set of follow-up actions.

Figure 10. Heart disease incidence in the test group by risk decile of predictions. Dashed line is overall risk in the cohort (0.45)
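A minimal sketch of building this decile plot is below; it assumes get_config returns the transformed hold-out features and target (PyCaret 2.x+) and uses the fitted booster's predicted probabilities directly.

import pandas as pd
from pycaret.classification import get_config

# Observed heart disease incidence by predicted-risk decile in the hold-out set
# (cf. Figure 10)
X_test, y_test = get_config("X_test"), get_config("y_test")

risk = pd.Series(tuned_model.predict_proba(X_test)[:, 1], index=X_test.index)
deciles = pd.qcut(risk, q=10, labels=False, duplicates="drop") + 1

incidence = y_test.groupby(deciles).mean()
print(incidence)
incidence.plot(kind="bar")   # one bar per risk decile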

Readers may have noticed a lack of Python code accompanying the above outputs. So after my own data preparation and generating a few custom figures above, how much PyCaret code was really needed to set up the data pipeline, use CV to identify a candidate model and tune it, then evaluate and interpret it, and finally assess performance on the hold-out test set? Well, here it is: 7 functions with minimal input!

# PyCaret classification API (brings setup, compare_models, tune_model, etc. into scope)
from pycaret.classification import *

# experiment set-up
exp1 = setup(hd_df,
             train_size=0.65,
             target='target',
             session_id=101,
             ignore_features=['num', 'target_c'],
             numeric_imputation='median',
             categorical_imputation='mode')

# find a good candidate model
compare_models(blacklist=['lr', 'lda', 'qda', 'et', 'catboost', 'ada', 'ridge',
                          'dt', 'nb', 'gpc', 'knn', 'mlp'],
               fold=10,
               turbo=False)

# tune the candidate model - you can skip create_model as this is called as part of tune_model anyway
tuned_model = tune_model('xgboost')

# evaluate
evaluate_model(tuned_model)

# SHAP-based feature importance and understanding of a single prediction
interpret_model(tuned_model)
interpret_model(tuned_model, "reason", observation=94)

# performance in hold-out test set
predict_model(tuned_model)

Summary

This was my first attempt at using PyCaret and, while there is still a decent chunk of documentation to get through, my first impression is that it is great! It really let me get into as much detail as I wanted quickly, and the documentation and available tutorials are helpful.

As we saw, experiment set-up, finding a model, tuning, evaluating, and inspecting were certainly low code. Low code of course does not mean skipping due diligence when it comes to data understanding, data cleaning and feature engineering, but the PyCaret approach means you can spend more time on these things.

Of course, with any low-code solution the trade-off is flexibility. So here are a few of my observations:

  1. In the experiment set-up you cannot choose pre-processing options to be applied to selected features only, and not all possibilities are available. For example, you cannot apply a normal transform to a single feature or set thresholds to trim features for outlier mitigation.
  2. You need to inspect the constructed pipeline object to understand the order in which pre-processing steps were performed. Functionality to generate a flow diagram representation of the pipeline could be a fantastic future addition.
  3. Flexibility to allow for different approaches to hyper-parameter tuning would be great, such as an easy way to access hyperopt. I did try a bit of experimentation on my own and it is possible (but clunky) to use hyperopt outside the experiment object and then pass those parameters back in to continue to the model summaries.
  4. Inclusion of functions for SHAP values is great, but this functionality does not extend to the full set of visualizations available through the standard SHAP library. For example, the multi-patient force plot, passing the link=’logit’ argument for a single force plot, or selecting which variable to use for the interaction term in a dependence plot are not available via interpret_model(). You can, however, take the model object and test data from the experiment object and do SHAP yourself (see the sketch after this list).
  5. What you can access via the experiment object was a little hit and miss, and you might need to search a little for what you want or check the actual functions (e.g., the parameter grid used by tune_model for your chosen estimator). However, these can be accessed, so custom summaries/visualizations of model development can be created, and hopefully in future updates getting what you need from under the hood will be a little more user-friendly.
  6. The structure leads the user to follow an appropriate and standardized workflow for developing predictive models.
  7. The ability to save both the final model and the entire experiment as separate objects is great for deployment and reproducibility.
  8. While I was able to extract information and create my own summaries of, for example, performance metrics across initial training, tuning and hold-out evaluation, functionality to create summary reports of this nature from the experiment object would be fantastic.
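As flagged in point 4, here is a minimal sketch of going straight to the shap library with the fitted booster and the transformed hold-out features; it assumes the tuned_model object and get_config utility from the earlier sketches, and the specific plots shown are just examples of what interpret_model() does not currently expose.

import shap
from pycaret.classification import get_config

X_test = get_config("X_test")

# Build SHAP values directly from the fitted XGBoost model
explainer = shap.TreeExplainer(tuned_model)
shap_values = explainer.shap_values(X_test)

# Multi-patient force plot, with outputs shown on the probability scale
shap.initjs()   # needed for force plots in a notebook
shap.force_plot(explainer.expected_value, shap_values, X_test, link="logit")

# Dependence plot with an explicitly chosen interaction feature
shap.dependence_plot("oldpeak", shap_values, X_test, interaction_index="thalach")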

In the future I hope to leverage PyCaret more for larger datasets where the in-built handling of high cardinality features could be incredibly helpful. As always comments, thoughts, feedback and discussion are very welcome.
