Exploratory Data Analysis on Stroke Dataset

Zhe Chyuan Yeap
Towards Data Science
6 min read · Dec 8, 2020


Introduction

Stroke is a critical global health problem. It has remained the second leading cause of death worldwide since 2000 [1] and is the third major cause of disability. Long-term disability severely affects people's productive lives [2]. As such, stroke poses a significant threat to global health.

This post aims to identify risk factors for stroke. The patient data was obtained from Kaggle. Methods to ascertain whether a variable is a risk factor are described, the results are visualised, and the discovered insights are discussed. The post ends with a conclusion and some ideas for future work.

Descriptive Data Analysis

Now, let’s dive deep into the dataset! First, we import the necessary Python libraries.
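The original code is not shown here; a typical set of imports for this kind of analysis (assuming pandas, NumPy and Matplotlib are available) might look like:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```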

Let’s load the downloaded CSV and explore the first 5 rows of the dataset.
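A sketch of the loading step is below. The Kaggle file name and the exact column names are assumptions here; a tiny inline CSV (a subset of the real columns) stands in for the downloaded file so the snippet is self-contained.

```python
import io
import pandas as pd

# Stand-in for the downloaded Kaggle file; in practice you would pass the
# CSV's file path to pd.read_csv directly.
csv_data = io.StringIO(
    "id,gender,age,hypertension,avg_glucose_level,bmi,smoking_status,stroke\n"
    "101,Male,67,0,228.7,36.6,formerly smoked,1\n"
    "102,Female,61,0,202.2,,never smoked,1\n"
    "103,Male,80,0,105.9,32.5,never smoked,0\n"
)
df = pd.read_csv(csv_data)

print(df.head())  # first rows of the dataset (Fig. 1)
df.info()         # dtypes and non-null counts (Fig. 2)
```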

Fig. 1: First 5 rows of the dataset.
Fig. 2: Summary of the dataset.

The dataset covered a total of 43,400 patients, with 11 attributes per patient. These included patients’ demographic data (gender, age, marital status, type of work and residence type) and health records (hypertension, heart disease, average glucose level measured after a meal, Body Mass Index (BMI), smoking status and whether the patient had experienced a stroke).

After getting this basic information, let’s determine how many records involve a patient who has had a stroke.
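This is a one-liner with `value_counts`. A minimal sketch on an illustrative frame (on the full dataset the same call returns 42,617 vs 783):

```python
import pandas as pd

# Illustrative stand-in for the full dataset's stroke column
df = pd.DataFrame({"stroke": [0, 0, 0, 1, 0, 1]})
print(df["stroke"].value_counts())
```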

Fig. 3: Breakdown of the stroke attribute.

In line with other healthcare datasets, this dataset was highly imbalanced: only 783 patients suffered a stroke, while the remaining 42,617 did not.

Before we can proceed further, we must preprocess the data in order to extract meaningful insights from it.

Data Preprocessing

  1. ID attribute

This attribute was used solely to identify patients and carried no other meaningful information. Hence, the entire column was removed.
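Dropping the column is straightforward with `DataFrame.drop` (assuming the column is named `id`, as in the Kaggle file):

```python
import pandas as pd

df = pd.DataFrame({"id": [101, 102], "age": [67, 61], "stroke": [1, 0]})
df = df.drop(columns=["id"])  # the identifier carries no predictive signal
print(df.columns.tolist())
```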

2. BMI attribute

From Fig. 2, 1,458 records were listed as NaN (not a number) in the BMI column. The first thought was to remove them, since they represented a small fraction of the dataset. Nevertheless, probing further revealed that 140 of these records belonged to patients who had suffered a stroke. This information was valuable, considering that only 783 patients in the dataset suffered a stroke. Hence, records with an empty BMI value were filled with the mean BMI instead.

Fig. 4: Records where patients suffered a stroke but had missing values in the bmi attribute.
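The inspection and imputation described above can be sketched as follows, on an illustrative frame (the column names `bmi` and `stroke` follow the Kaggle file):

```python
import pandas as pd

df = pd.DataFrame({
    "bmi": [36.6, None, 32.5, None],
    "stroke": [1, 1, 0, 0],
})

# Inspect missing-BMI records among stroke patients before deciding to drop them
missing_with_stroke = df[df["bmi"].isna() & (df["stroke"] == 1)]

# Impute with the column mean instead of discarding the rows
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())
```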

3. Smoking Status attribute

In addition, 13,292 records, or about 30.6% of the dataset, had missing values in the smoking status feature (from Fig. 2). This was a huge proportion of the dataset, so rather than dropping these records altogether, a new category named “not known” was created to hold them.
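Creating the new category is again a `fillna` call, here sketched on a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({"smoking_status": ["never smoked", None, "smokes", None]})
# Missing values become their own category instead of being dropped
df["smoking_status"] = df["smoking_status"].fillna("not known")
```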

4. Gender attribute

There were 11 patients categorized as ‘Other’ in the gender column. They were dropped because their number was insignificant relative to the size of the dataset (11 vs ~43K records).
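Filtering those rows out can be done with a boolean mask, for example:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Other", "Female"]})
# Keep everything except the 'Other' category
df = df[df["gender"] != "Other"].reset_index(drop=True)
```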

5. Normalize numerical attributes

In this dataset, there are 3 numerical attributes: age, average glucose level and BMI. Let’s normalize them to ensure they have equal weight when building a classifier. Note that new columns were created rather than replacing the initial columns, which preserved the original data.
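The exact scaling method is not specified in the text; a minimal sketch using min-max normalization, writing to new `_norm` columns so the originals survive:

```python
import pandas as pd

df = pd.DataFrame({"age": [20.0, 40.0, 60.0],
                   "avg_glucose_level": [80.0, 120.0, 200.0]})

for col in ["age", "avg_glucose_level"]:
    # New column preserves the original values
    df[col + "_norm"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
```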

6. Discretize numerical attribute

Apart from normalization, the numerical attributes were discretized into bins for visualization later on.

With that, we can (finally) move on to the exploratory data analysis.

Exploratory Data Analysis

  1. Numerical attributes

Let’s start by plotting the correlation matrix of the numerical attributes.
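One way to compute and render the matrix with pandas and Matplotlib (the tiny DataFrame here is illustrative, not the real data):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "age": [20, 35, 50, 65, 80],
    "avg_glucose_level": [85, 110, 95, 160, 140],
    "bmi": [21.0, 26.5, 28.0, 30.5, 29.0],
})
corr = df.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots()
im = ax.matshow(corr, vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
```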

Fig. 6: The correlation matrix

Insight #1: BMI and age seemed to be positively correlated, though the association was not strong.

For each numerical attribute, a histogram was plotted to discover any potential relationship between the variable and stroke. A function was created to avoid code duplication: it takes the name of a column and outputs the histogram.

In addition, 100% stacked bar charts were plotted for the same purpose. With a little tweaking, a similar function was created that takes the name of a column and outputs the 100% stacked bar chart.
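The two helper functions might look like the following sketch. The function and column names are assumptions; the key idea for the 100% stacked bar chart is `pd.crosstab(..., normalize="index")`, which makes each category's bar sum to 1:

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_histogram(df, col):
    # Overlaid histograms of the column, split by stroke outcome
    fig, ax = plt.subplots()
    for label, grp in df.groupby("stroke"):
        ax.hist(grp[col], alpha=0.5, label=f"stroke={label}")
    ax.set_xlabel(col)
    ax.legend()
    return fig

def plot_stacked_bar(df, col):
    # 100% stacked bar chart: share of stroke outcomes within each category
    shares = pd.crosstab(df[col], df["stroke"], normalize="index")
    ax = shares.plot(kind="bar", stacked=True)
    ax.set_ylabel("proportion")
    return shares

# Illustrative data
df = pd.DataFrame({
    "age_binned": ["young", "young", "old", "old"],
    "age": [20, 25, 70, 75],
    "stroke": [0, 0, 0, 1],
})
fig = plot_histogram(df, "age")
shares = plot_stacked_bar(df, "age_binned")
```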

a) Age

Fig. 7: Histogram (top) and 100% stacked bar chart (bottom) of the age variable.

The risk of experiencing a stroke increased as a patient’s age advanced.

Insight #2: Older patients were more likely to suffer a stroke than younger patients.

b) BMI

Fig. 8: Histogram (top) and 100% stacked bar chart (bottom) of the bmi variable.

The proportion of stroke sufferers was highest among patients with a BMI between 25 and 35, compared with the other BMI groups.

Insight #3: A higher BMI does not necessarily increase stroke risk.

c) Average glucose level

Fig. 9: Histogram (top) and 100% stacked bar chart (bottom) of the avg_glucose_level variable.

Fig. 9 shows that strokes occurred in some patients regardless of the average glucose level measured after a meal. Although no strokes were reported in the last two columns on the right, these columns represented only 3 patients, which is not significant. Nevertheless, a higher proportion of patients whose post-meal average glucose level exceeded 150 mg/dL (milligrams per decilitre) suffered a stroke. This observation can be explained by the presence of diabetes: a patient with a reading above 200 mg/dL was considered diabetic, and a reading between 140 and 199 mg/dL indicated pre-diabetes.

Insight #4: Diabetes is one of the risk factors for stroke occurrence, and prediabetic patients have an increased risk of stroke.

2. Categorical attributes

a) Hypertension, Heart disease

Fig. 10: 100% stacked bar charts for hypertension (top) and heart_disease (bottom).

Insight #5: A higher proportion of patients with hypertension or heart disease experienced a stroke, all else being equal.

b) Gender, Residence type

Fig. 11: 100% stacked bar charts for gender (top) and Residence_type (bottom).

Insight #6: Regardless of gender and residence type, patients had roughly the same likelihood of experiencing a stroke.

3. Age-related variables

a) Work type

Fig. 12: Record grouped by work type and age (top), & 100% stacked bar chart by work type (bottom).

Fig. 12 shows an interesting observation. At first glance, the proportion of patients who were self-employed and suffered a stroke was relatively higher than in other categories. However, this variable was highly associated with age.

Both the never-worked and children categories are fairly self-explanatory: almost no strokes were recorded there, owing to their lower average age. On the other hand, the mean age of self-employed patients was 59.3 years, the highest among all categories.
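Comparing mean age across work types is a standard groupby; a sketch on illustrative records (on the full dataset, the self-employed group's mean age is 59.3 years):

```python
import pandas as pd

df = pd.DataFrame({
    "work_type": ["children", "children", "Private", "Self-employed", "Self-employed"],
    "age": [6, 10, 40, 58, 62],
})
# Mean age per work type, highest first
mean_age = df.groupby("work_type")["age"].mean().sort_values(ascending=False)
print(mean_age)
```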

Insight #7: Work type variable was highly associated with age.

b) Ever married

Fig. 13: Record grouped by ever_married and age (top), & 100% stacked bar chart by ever_married variable (bottom).

The bottom chart of Fig. 13 shows a similar observation to the work type variable, while the top chart displays the stark difference in mean age between the two categories.

Insight #8: Marital status variable was highly associated with age.

Conclusion

There were a total of 8 insights found in the stroke dataset:

  1. BMI and age seemed to be positively correlated, though the association was not strong.
  2. Older patients were more likely to suffer a stroke than younger patients.
  3. A higher BMI does not necessarily increase stroke risk.
  4. Diabetes is one of the risk factors for stroke occurrence, and prediabetic patients have an increased risk of stroke.
  5. A higher proportion of patients with hypertension or heart disease experienced a stroke, all else being equal.
  6. Regardless of gender and residence type, patients had roughly the same likelihood of experiencing a stroke.
  7. The work type variable was highly associated with age.
  8. The marital status variable was highly associated with age.

Key Takeaways

In this post, EDA was performed on the stroke dataset. There are several key takeaways:

  1. Data preprocessing is a very important step. Do not jump straight to analysis or prediction while the data is dirty.
  2. Do not automatically drop all records that contain missing values; they may hold valuable information. Consider alternatives, e.g. replacing them with the mean or median value for a numerical attribute, or creating a new category for a categorical attribute.
  3. Not all insights are breakthroughs. Probe further; they may be highly associated with another variable after all.

References

[1] E. S. Donkor, Stroke in the 21st century: a snapshot of the burden, epidemiology, and quality of life (2018), Stroke research and treatment

[2] W. Johnson, O. Onuma, M. Owolabi and S. Sachdev, Stroke: a global response is needed (2016), Bulletin of the World Health Organization
