Exploratory Data Analysis on Stroke Dataset
Introduction
Stroke is a critical health problem globally. It has remained the second leading cause of death worldwide since 2000 [1]. Stroke is also the third leading cause of disability, and long-term disability severely curtails people’s productive lives [2]. As such, stroke poses a significant threat to global health.
This post aims to identify the risk factors for stroke. The patient data was obtained from Kaggle. The post describes the methods used to ascertain whether a variable is a risk factor, visualises the results and discusses the insights they reveal. It ends with a conclusion and some ideas for future work.
Descriptive Data Analysis
Now, let’s dive deep into the dataset! First, we import the necessary Python libraries.
Let’s load the downloaded csv and explore the first 5 rows of the dataset.
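A minimal sketch of the import and loading step. The file name `stroke_data.csv` is a placeholder, since the post does not name the downloaded file, and `load_stroke_data` is a hypothetical helper.

```python
import pandas as pd


def load_stroke_data(source):
    """Read the stroke CSV (a path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


# Placeholder path: replace with the location of the downloaded Kaggle file.
# df = load_stroke_data("stroke_data.csv")
# print(df.shape)   # (number of rows, number of columns)
# print(df.head())  # first 5 rows
```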
The dataset consisted of 10 metrics for a total of 43,400 patients. These metrics included patients’ demographic data (gender, age, marital status, type of work and residence type) and health records (hypertension, heart disease, average glucose level measured after a meal, Body Mass Index (BMI), smoking status and experience of stroke).
Now that we know the basics, let’s determine how many records involve a stroke.
Like many other healthcare datasets, this one was highly imbalanced. Only 783 patients had suffered a stroke, while the remaining 42,617 had not.
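One way to get these counts is sketched below. It assumes the target column is named `stroke` and holds 1 for a stroke and 0 otherwise; the helper `class_balance` and the toy frame are illustrative only.

```python
import pandas as pd


def class_balance(df, target="stroke"):
    """Return the count and percentage share of each value in the target column."""
    counts = df[target].value_counts()
    share = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": share})


# Toy frame standing in for the real data (1 = stroke, 0 = no stroke).
toy = pd.DataFrame({"stroke": [0, 0, 0, 1]})
print(class_balance(toy))
```

With the figures quoted above, 783 out of 43,400 works out to roughly 1.8% of patients.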
Before we can proceed further, we must preprocess the data, in order to extract meaningful insights from the dataset.
Data Preprocessing
1. ID attribute
This attribute served solely to identify patients and carried no other meaningful information. Hence, the entire column was removed.
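A short sketch of this step, assuming the identifier column is named `id`:

```python
import pandas as pd

toy = pd.DataFrame({"id": [101, 102], "age": [67.0, 34.0], "stroke": [1, 0]})

# Drop the identifier column; drop() returns a new DataFrame without "id".
toy = toy.drop(columns=["id"])
print(list(toy.columns))
```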
2. BMI attribute
From Fig. 1, 1,458 records were listed as NaN (not a number) in the BMI column. The first thought was to remove them, since they represented a small fraction of the dataset. On closer inspection, however, they included 140 records where the patient had suffered a stroke. This information was valuable, considering that only 783 patients in the whole dataset suffered a stroke. Hence, missing BMI values were replaced with the mean BMI.
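The check and the imputation described above could look like the sketch below. The column names `bmi` and `stroke` are assumptions, and the toy frame only stands in for the real data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bmi": [22.0, np.nan, 30.0, np.nan],
    "stroke": [0, 1, 0, 0],
})

missing = toy["bmi"].isna()
print(missing.sum())                            # records with no BMI
print((missing & (toy["stroke"] == 1)).sum())   # of those, how many had a stroke

# Mean imputation: Series.mean() skips NaN by default.
toy["bmi"] = toy["bmi"].fillna(toy["bmi"].mean())
```

Dropping the 140 records would have discarded about 18% of all stroke cases (140 of 783), which is why imputation is the safer choice here.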
3. Smoking Status attribute
In addition, 13,292 records, or about 30.6% of the dataset, had missing values in the smoking status feature (from Fig. 1). This was a huge proportion of the dataset. As such, a new category named “not known” was created to account for all these records, rather than dropping them altogether.
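A sketch of turning the missing values into their own category, assuming the column is named `smoking_status`; the other category labels in the toy frame are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"smoking_status": ["smokes", np.nan, "never smoked", np.nan]})

# Share of missing values, as a percentage of all records.
print(toy["smoking_status"].isna().mean() * 100)

# Treat "missing" as a category of its own instead of dropping the rows.
toy["smoking_status"] = toy["smoking_status"].fillna("not known")
print(toy["smoking_status"].value_counts())
```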
4. Gender attribute
There were 11 patients categorized as ‘Other’ in the gender column. They were dropped because they made up a negligible share of the dataset (11 vs ~43K records).
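Filtering out these rows might be done as follows, assuming the column is named `gender`:

```python
import pandas as pd

toy = pd.DataFrame({"gender": ["Male", "Female", "Other", "Female"]})

# Keep every row whose gender is not "Other", then renumber the index.
toy = toy[toy["gender"] != "Other"].reset_index(drop=True)
print(toy["gender"].value_counts())
```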
5. Normalize numerical attributes
In this dataset, there are 3 numerical attributes, i.e. age, average glucose level and bmi. Let’s normalize them so that they carry equal weight when building a classifier. Note that new columns were created rather than overwriting the initial columns, which preserved the original data.
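The post does not say which normalization was used. The sketch below assumes min-max scaling to the range 0 to 1 and writes the result to new columns; the `_norm` suffix and the helper name are hypothetical.

```python
import pandas as pd


def add_min_max_columns(df, columns, suffix="_norm"):
    """Add a min-max scaled copy of each column, leaving the originals intact."""
    out = df.copy()
    for col in columns:
        col_min, col_max = out[col].min(), out[col].max()
        # Assumes the column is not constant, otherwise this divides by zero.
        out[col + suffix] = (out[col] - col_min) / (col_max - col_min)
    return out


toy = pd.DataFrame({"age": [20.0, 40.0, 80.0]})
scaled = add_min_max_columns(toy, ["age"])
print(scaled)
```

Z-score standardization would be an equally reasonable choice; only the formula inside the loop would change.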
6. Discretize numerical attribute
Apart from normalization, the numerical attributes were also discretized into bins for visualization later on.
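Binning of this kind can be sketched with `pd.cut`. The bin edges and labels below are assumptions, since the post does not list the ones it used.

```python
import pandas as pd

# Hypothetical bin edges and labels; the post does not list the ones it used.
AGE_EDGES = [0, 20, 40, 60, 80, 100]
AGE_LABELS = ["0-19", "20-39", "40-59", "60-79", "80-99"]

toy = pd.DataFrame({"age": [5.0, 20.0, 67.0]})

# right=False makes each bin include its left edge and exclude its right edge,
# so an age of exactly 20 falls into "20-39".
toy["age_bin"] = pd.cut(toy["age"], bins=AGE_EDGES, right=False, labels=AGE_LABELS)
print(toy)
```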
With that, we can (finally) move on to the exploratory data analysis.
Exploratory Data Analysis
1. Numerical attributes
Let’s start by plotting the correlation matrix on the numerical attributes.
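A sketch of computing the matrix with pandas. The column names are assumed, and the toy values are invented purely to show the call.

```python
import pandas as pd

# Assumed column names for the three numerical attributes.
NUMERIC_COLUMNS = ["age", "avg_glucose_level", "bmi"]


def correlation_matrix(df, columns):
    """Pairwise Pearson correlation between the given numerical columns."""
    return df[list(columns)].corr()


toy = pd.DataFrame({
    "age": [20.0, 40.0, 60.0, 80.0],
    "avg_glucose_level": [110.0, 85.0, 150.0, 95.0],
    "bmi": [21.0, 24.0, 27.0, 30.0],
})
print(correlation_matrix(toy, NUMERIC_COLUMNS).round(2))
```

The resulting table can then be drawn as a heatmap with whichever plotting library is at hand.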
Insight #1: BMI and age appeared to be positively correlated, though the association was not strong.
For numerical attributes, histograms were plotted to discover any potential relationship between each variable and stroke. A function was created to avoid duplicating code. It takes in the name of a column and outputs the histogram.
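The original function is not shown, so the following is only a sketch of what such a helper might look like. It assumes a 0/1 `stroke` column and relies on pandas' built-in plotting, which needs matplotlib installed; the name `plot_histogram` is hypothetical.

```python
import pandas as pd


def plot_histogram(df, column, target="stroke", bins=20):
    """Overlay histograms of `column` for patients with and without a stroke."""
    ax = None
    for value, label in [(0, "no stroke"), (1, "stroke")]:
        # Each call draws onto the same axes so the two groups overlap.
        ax = df.loc[df[target] == value, column].plot(
            kind="hist", bins=bins, alpha=0.6, label=label, ax=ax
        )
    ax.set_xlabel(column)
    ax.set_ylabel("number of patients")
    ax.legend()
    return ax


# Example call on the loaded data (column name assumed):
# plot_histogram(df, "age")
```

Because strokes are rare in this dataset, the stroke histogram will be dwarfed by the other one on a shared axis, which is one reason a percentage-based chart is a useful companion.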
In addition, 100% stacked bar charts were plotted to discover any potential relationship between each variable and stroke. With a little tweak, a new yet similar function was created to avoid duplicating code. It takes in the name of a column and outputs the 100% stacked bar chart.
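A sketch of how a 100% stacked bar chart can be built: first compute, for each category, the percentage of patients with and without a stroke, then stack those percentages. The function names and the `hypertension` toy column are assumptions, not the post's original code.

```python
import pandas as pd


def stroke_share(df, column, target="stroke"):
    """Percentage of each target value within every category of `column`."""
    # normalize="index" makes every row sum to 1, hence to 100 after scaling.
    return pd.crosstab(df[column], df[target], normalize="index") * 100


def plot_stacked_bar(df, column, target="stroke"):
    """Draw the shares as a 100% stacked bar chart, one bar per category."""
    ax = stroke_share(df, column, target).plot(kind="bar", stacked=True)
    ax.set_ylabel("percent of patients")
    return ax


toy = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 1, 1],
    "stroke":       [0, 0, 0, 1, 0, 1],
})
print(stroke_share(toy, "hypertension"))
```

Normalizing per category is what makes groups of very different sizes comparable, which matters in a dataset this imbalanced.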
a) Age
The risk of experiencing a stroke increased with the patient’s age.
Insight #2: Older patients were more likely to suffer a stroke than younger patients.
b) BMI
Patients with a BMI between 25 and 35 had the highest percentage of strokes of any BMI group.
Insight #3: A higher BMI does not increase stroke risk.
c) Average glucose level
Fig. 9 shows that strokes occurred in some patients regardless of the average glucose level measured after a meal. Although no strokes were reported in the last two columns on the right, these columns represented only 3 patients, which is too few to be significant. Nevertheless, a higher proportion of patients whose average post-meal glucose level exceeded 150 mg/dL (milligrams per decilitre) suffered a stroke. This observation can be explained by the presence of diabetes. Diabetes is indicated in patients with a reading of more than 200 mg/dL, and prediabetes in patients with a reading between 140 and 199 mg/dL.
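The thresholds quoted above can be turned into groups as sketched below. The column name `avg_glucose_level` is assumed, the toy values are invented, and placing a reading of exactly 200 mg/dL in the diabetes group is an assumption, since the text leaves that boundary open.

```python
import numpy as np
import pandas as pd

# Thresholds from the text, in mg/dL. Putting exactly 200 in "diabetes" is an assumption.
GLUCOSE_EDGES = [-np.inf, 140, 200, np.inf]
GLUCOSE_LABELS = ["normal", "prediabetes", "diabetes"]

toy = pd.DataFrame({
    "avg_glucose_level": [90.0, 140.0, 199.0, 200.0, 250.0],
    "stroke": [0, 0, 1, 0, 1],
})

# right=False: each group includes its lower threshold and excludes its upper one.
toy["glucose_group"] = pd.cut(
    toy["avg_glucose_level"], bins=GLUCOSE_EDGES, right=False, labels=GLUCOSE_LABELS
)

# The mean of a 0/1 column is the share of patients with a stroke in each group.
stroke_rate = toy.groupby("glucose_group", observed=False)["stroke"].mean() * 100
print(stroke_rate)
```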
Insight #4: Diabetes is one of the risk factors for stroke occurrence and prediabetes patients have an increased risk of stroke.
2. Categorical attributes
a) Hypertension, Heart disease
Insight #5: A higher proportion of patients with hypertension or heart disease experienced a stroke, all else being equal.
b) Gender, Residence type
Insight #6: Regardless of gender and residence type, patients had the same likelihood of experiencing a stroke.
3. Age-related variables
a) Work type
Fig. 12 shows an interesting observation. At first glance, the proportion of self-employed patients who suffered a stroke was relatively higher than in other categories. However, this variable was highly associated with age.
Both the never worked and children categories are fairly self-explanatory: almost no strokes were recorded, owing to the lower average age. On the other hand, the mean age of self-employed patients was 59.3 years, the highest among all categories.
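Checking for this kind of confounding comes down to a group-by on age. The sketch below assumes columns named `work_type` and `age`; the category labels and ages in the toy frame are invented.

```python
import pandas as pd


def mean_age_by(df, column):
    """Mean age per category of `column`, highest first."""
    return df.groupby(column)["age"].mean().sort_values(ascending=False)


toy = pd.DataFrame({
    "work_type": ["children", "children", "Private", "Self-employed", "Self-employed"],
    "age": [4.0, 10.0, 40.0, 55.0, 65.0],
})
print(mean_age_by(toy, "work_type"))
```

The same helper can be pointed at any other categorical column to see whether age explains an apparent difference in stroke rates.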
Insight #7: The work type variable was highly associated with age.
b) Ever married
The bottom chart of Fig. 13 shows a similar observation to the work type variable. The top chart, however, displays the stark difference in mean age between the two categories.
Insight #8: The marital status variable was highly associated with age.
Conclusion
A total of 8 insights were found in the stroke dataset:
- BMI and age appeared to be positively correlated, though the association was not strong.
- Older patients were more likely to suffer a stroke than younger patients.
- A higher BMI does not increase stroke risk.
- Diabetes is one of the risk factors for stroke occurrence and prediabetes patients have an increased risk of stroke.
- A higher proportion of patients with hypertension or heart disease experienced a stroke, all else being equal.
- Regardless of gender and residence type, patients had the same likelihood of experiencing a stroke.
- The work type variable was highly associated with age.
- The marital status variable was highly associated with age.
Key Takeaways
In this post, EDA was performed on a stroke dataset. The key takeaways are as follows:
- Data preprocessing is a very important step. Do not jump straight to analysis or prediction while the data is dirty.
- Do not automatically drop all records that contain missing values. They may contain valuable information. Consider alternatives, e.g. replacing them with the mean or median value for a numerical attribute, or creating a new category for a categorical attribute.
- Not every insight is a breakthrough. Probe further: an apparent effect may turn out to be highly associated with another variable after all.
References
[1] E. S. Donkor, Stroke in the 21st century: a snapshot of the burden, epidemiology, and quality of life (2018), Stroke Research and Treatment
[2] W. Johnson, O. Onuma, M. Owolabi and S. Sachdev, Stroke: a global response is needed (2016), Bulletin of the World Health Organization