Basic Medical Data Exploration / Visualization — Heart Diseases

Jae Duk Seo
Towards Data Science
8 min read · Jun 26, 2018

Today, I wanted to practice my data exploration skills again, this time on the Heart Disease Data Set.

Please note that this post is for my future self to look back on and review the basic techniques of data exploration.

Data Set

Image from this website

This data set contains data for 302 patients, each with 75 attributes, but we are only going to use 14 of them, which can be seen below.

If anyone is interested in what each of the attributes means exactly, please have a look at the screenshot below.

Image from this website
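As a rough sketch of how the 14 attributes could be loaded with pandas: the column names below follow the UCI documentation, while the in-memory `SAMPLE` text is only an illustrative stand-in for the downloaded, header-less file (it is not the full data set, and the loading code in the original notebook may differ).

```python
import io

import pandas as pd

# The 14 commonly used attribute names, as listed in the UCI documentation.
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

# Illustrative rows in the same comma-separated, header-less layout.
# This is a stand-in for the real file, used so the sketch runs offline.
SAMPLE = """63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,?,3.0,0
"""

# header=None stops pandas from treating the first record as column names;
# names=COLUMNS attaches the header row ourselves.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS)

print(df.shape)
print(df.dtypes)
```

With a real file, `io.StringIO(SAMPLE)` would simply be replaced by the file path.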

Overview of the Data Set / Cleaning / View

Red Box → Data Type Object

As always, let's start off simple by taking a look at the mean, standard deviation, and other summary statistics. Right off the bat, we can see that some of our columns have the data type object.

However, we can also observe that there are no null values.
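The overview step might look roughly like the sketch below. The tiny frame is made up for illustration, and the "?" placeholder is an assumption about why a column ends up non-numeric. It also shows why the null count can be zero even when values are effectively missing: a placeholder string is not a null.

```python
import pandas as pd

# Made-up frame; "ca" holds a "?" so pandas cannot store it as numbers.
df = pd.DataFrame({
    "age": [63, 67, 37, 41],
    "chol": [233, 286, 250, 204],
    "ca": ["0.0", "3.0", "?", "0.0"],
})

print(df.describe())      # count, mean, std, min, quartiles, max (numeric columns only)
print(df.dtypes)          # "ca" shows a non-numeric dtype such as object
print(df.isnull().sum())  # nulls per column; "?" is a string, so it is not counted

# Columns that would need cleaning before any numeric analysis.
non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
print(non_numeric)
```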

After a simple clean-up, changing non-numerical values to NaN and then replacing NaN with 0, we can say our data is somewhat clean.
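A minimal sketch of that clean-up, assuming the non-numerical entries are placeholder strings such as "?" (the frame below is made up for illustration):

```python
import pandas as pd

# Made-up frame with "?" standing in for non-numerical entries.
df = pd.DataFrame({
    "age": [63, 67, 37],
    "ca": ["0.0", "3.0", "?"],
    "thal": ["6.0", "?", "3.0"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
cleaned = df.apply(pd.to_numeric, errors="coerce")
n_missing = int(cleaned.isnull().sum().sum())

# Replace NaN with 0, as described in the text.
cleaned = cleaned.fillna(0)

print(n_missing)
print(cleaned)
```

One thing to keep in mind with this choice is that a filled-in 0 can no longer be told apart from a genuine 0 in the same column.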

First / Last 10 Rows

Again, just to get a better feel for the data, let's examine the first and last few rows. As seen above, there is nothing too out of the ordinary.

Histogram of Data

With a simple histogram of our data, we can easily observe the distribution of the different attributes. One thing to note here is that it is extremely easy to see which attributes are categorical and which are not.

To inspect a little more closely, let's take a look at the distributions of age and fbs (fasting blood sugar). We can see that the age distribution closely resembles a Gaussian distribution, while fbs is a categorical value.
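The same distinction can be checked numerically. The sketch below uses synthetic stand-ins for age and fbs (not the real data), computes the bin counts a histogram would draw, and flags columns with only a few distinct values as categorical. The `MAX_LEVELS` threshold is a hypothetical heuristic, not something taken from the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: a roughly bell-shaped "age" and a 0/1 "fbs" flag.
df = pd.DataFrame({
    "age": rng.normal(54, 9, size=300).round(),
    "fbs": rng.integers(0, 2, size=300),
})

# Bin counts and bin edges: the numbers behind a 10-bin histogram of age.
counts, edges = np.histogram(df["age"], bins=10)

# Heuristic (an assumption): few distinct values suggests a categorical column.
MAX_LEVELS = 5
categorical = [c for c in df.columns if df[c].nunique() <= MAX_LEVELS]

print(counts)
print(categorical)
```

For the plots themselves, `df.hist()` draws one histogram per numeric column.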

Variance-Covariance Matrix

All of the images seen above show the variance-covariance matrix; the difference is in how each was computed. For the leftmost one I calculated it manually with NumPy, for the middle one I used TensorFlow, and for the rightmost one I used the built-in DataFrame function. We can observe that most attributes do not covary strongly.
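A minimal sketch of the manual NumPy calculation next to the built-in DataFrame function, on synthetic data. It assumes the sample covariance (dividing by n - 1), which is what pandas uses by default; the manual version in the original notebook may have divided by n instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic data standing in for three numeric attributes.
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["age", "chol", "thalach"])

# Manual calculation: center each column, then X_c^T X_c / (n - 1).
X = df.to_numpy()
centered = X - X.mean(axis=0)
cov_manual = centered.T @ centered / (X.shape[0] - 1)

# Built-in equivalents.
cov_pandas = df.cov().to_numpy()
cov_numpy = np.cov(X, rowvar=False)  # rowvar=False: columns are the variables

print(np.allclose(cov_manual, cov_pandas))
```

Note that covariance depends on the units of each attribute, so a small entry does not by itself mean a weak relationship; the correlation matrix removes that scale dependence.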

Correlation matrix

Again, the left image was created by manual NumPy calculation. We can observe that some of the attributes are in fact strongly correlated with one another (especially heart disease and thal).
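The correlation matrix can be obtained from the covariance matrix by dividing each entry by the product of the two standard deviations. Here is a sketch on synthetic data, where the thalach column is deliberately constructed to fall with age so that a negative correlation shows up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
age = rng.normal(54, 9, size=100)

# Synthetic data; "thalach" is built to decrease with age plus noise.
df = pd.DataFrame({
    "age": age,
    "thalach": 200 - 0.8 * age + rng.normal(0, 10, size=100),
    "chol": rng.normal(245, 50, size=100),
})

# corr[i, j] = cov[i, j] / (std[i] * std[j])
cov = df.cov().to_numpy()
std = np.sqrt(np.diag(cov))
corr_manual = cov / np.outer(std, std)

# Built-in Pearson correlation for comparison.
corr_pandas = df.corr().to_numpy()

print(np.round(corr_manual, 2))
```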

Interactive Histogram

Now, I know this is redundant, but I wanted to include it since there is an interactive part to it. 👌 👌

Bar Plot / Box Plot / Pair Plot

Let's first take a look at the average age of people who have heart disease versus those who do not. We can observe that people who are slightly older have a higher chance of having heart disease (in this data set only).
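The numbers behind such a bar plot could be computed as in the sketch below. The rows are made up, and it assumes the target column `num` is collapsed into a binary flag (0 for no disease, anything above 0 for disease); the column name `disease` is hypothetical.

```python
import pandas as pd

# Made-up rows; "num" is the diagnosis column (0 = no disease, 1-4 = disease).
df = pd.DataFrame({
    "age": [40, 50, 60, 70],
    "num": [0, 0, 1, 3],
})

# Collapse the diagnosis into a binary presence flag (an assumption).
df["disease"] = (df["num"] > 0).astype(int)

# Average age per group: the bar heights of the bar plot.
mean_age = df.groupby("disease")["age"].mean()
print(mean_age)
```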

Again, when we create a box plot of age for people who have / do not have heart disease, we can observe that younger people are less likely to have heart disease.
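A box plot summarizes each group by its quartiles, so the same comparison can be read off a per-group summary table. This is a sketch on made-up ages, with a hypothetical binary `disease` column:

```python
import pandas as pd

# Made-up ages for two groups (0 = no disease, 1 = disease).
df = pd.DataFrame({
    "age": [40, 44, 48, 52, 56, 50, 55, 60, 65, 70],
    "disease": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Quartiles per group: the box edges (25%, 75%) and the median line (50%).
stats = df.groupby("disease")["age"].describe()[["25%", "50%", "75%"]]

# Interquartile range: the height of each box.
stats["iqr"] = stats["75%"] - stats["25%"]
print(stats)
```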

And finally, I wanted to show the pair plot for a few of the attributes, such as age, thal, ca (number of major vessels colored by fluoroscopy), thalach (maximum heart rate achieved), and presence of heart disease. As seen in the correlation matrix, we can observe a strong negative correlation between age and thalach.

Uniform Manifold Approximation and Projection embedding (UMAP)
t-distributed Stochastic Neighbor Embedding (t-SNE)

Again, following the tradition from the previous blog post, I wanted to apply simple dimensionality reduction techniques to see whether the data can be clustered into two groups. As seen above, UMAP does a fairly okay job at clustering each class.

Finally, above is the resulting graph of dimensionality reduction with t-SNE.
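A rough sketch of the t-SNE step using scikit-learn on synthetic data follows. The two made-up groups, the standardization step, and the `perplexity` value are all assumptions for illustration, not the settings used for the plots above.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two groups of 40 "patients" with 13 features each.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(40, 13)),
    rng.normal(2.0, 1.0, size=(40, 13)),
])
y = np.array([0] * 40 + [1] * 40)  # group labels, used only for coloring a plot

# Put every feature on a comparable scale before measuring distances.
X_scaled = StandardScaler().fit_transform(X)

# Reduce to 2 dimensions; perplexity must be smaller than the sample count.
embedding = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X_scaled)

print(embedding.shape)
```

The resulting two columns can be passed to a scatter plot, colored by `y`. UMAP follows the same `fit_transform` pattern through the separate umap-learn package.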

GitHub Code

The code for this post is available on GitHub: https://github.com/JaeDukSeo/Daily-Neural-Network-Practice-2/blob/master/Medical_EXP/heart/heart.ipynb

Final Words

This was another good session of plotting and simple data exploration. I hope to create more advanced plots in the near future.

If any errors are found, please email me at jae.duk.seo@gmail.com. If you wish to see the list of all of my writing, please view my website here.

Meanwhile, follow me on my Twitter here, and visit my website or my YouTube channel for more content. I also implemented Wide Residual Networks; please click here to view the blog post.

