data_dashboard: Python package for EDA and baseline ML Model creation

HTML-based Dashboard for wrapping your head around your data

Maciej Dowgird
Towards Data Science

--

Scatter Plot Grid from Features Page (Iris Dataset)

Have you ever been handed an Excel spreadsheet, a CSV file or some other data without any useful context, but with the task of “making something useful out of it”? Or perhaps you have just joined another Kaggle competition, downloaded the data and begun to wonder where to start? I remember being there, using Jupyter Notebook to create a myriad of tables in some cells, visualizations in others and my own comments somewhere else. In the end, I was left with a mess of a Notebook and was nowhere near understanding what was going on underneath. Irritated and annoyed, I ventured on a journey to create code that would not only do all of it, but also present the data in a way that actually helps you understand it. And so, after a lot of thinking and even more coding, I managed to finalize a version that should work with almost any dataset and is also available to everyone else (not just me). Presenting to you…

data_dashboard

One of the Pages in the HTML Dashboard

data_dashboard is a Python package that tries to help you in those initial moments when you get your hands on the data and don’t know where to start. The package was designed to:

  • Provide an overview of the data with descriptive statistics and visualizations that help a Data Scientist understand it;
  • Help with Feature Engineering/Transformations/Data Cleanup by offering an intuitive interface;
  • Create the best possible ‘baseline’ Machine Learning model (sklearn) by performing GridSearch on a set of predefined models (which can be customized to suit the Data Scientist’s needs);
  • Create an HTML Dashboard that is not only user-friendly and easy to navigate, but also focused on design and styling (for both aesthetic pleasure and “business” presentation readiness);
  • Provide educational material for beginner Data Scientists, who might lack the ability to extensively manipulate the data and create appropriate visualizations.

You can download the package via pip:

pip install data-dashboard

To start using it, you need 3 things:

  • X: your data (already loaded into memory)
  • y: your target variable (already loaded into memory)
  • output_directory: path to the folder where the Dashboard will be created

Keep in mind that there is no need to transform your data beforehand — the Dashboard will take care of it. In case you don’t have any datasets handy, you can use the built-in examples (toy datasets from sklearn).

With that out of the way, the Dashboard creation is quite simple:

from data_dashboard import Dashboard
from data_dashboard.examples import iris

output_directory = "your_path/dashboard_output"
X, y, descriptions = iris()
# descriptions is an optional argument, more in documentation
dsh = Dashboard(X, y, output_directory, descriptions)
dsh.create_dashboard()

The HTML Dashboard will be created in the provided output directory and opened in your default browser for you to investigate. If you’re interested in customizing the creation of the Dashboard, please visit the documentation here.

If you’re not able to create your own dashboard, you can also check the deployed example (with Titanic dataset): https://example-data-dashboard.herokuapp.com/

Dashboard

The Dashboard is nothing more than a set of “static” HTML files — there is no need to put them on a server or serve them from localhost to view them. You can freely move the files around, or scrap them and quickly create new ones (without any additional layer of complexity). However, keep in mind that every piece of data used to create the visualizations and tables is embedded directly into the HTML/JS code — it would be unwise to share the Dashboard widely unless your data is publicly available.

The created Dashboard is divided into 3 sections:

  • Overview
  • Features
  • Models

Overview

Overview Page of Dashboard

The first page in the created Dashboard is Overview — a simple but important first step toward understanding your data. Every feature present in your X gets its respective statistics here: mean, median, min, max, etc. This is also the case for Categorical variables, which are internally transformed to numbers to enable it. If a feature won’t be used in the process (e.g. it is a date-type feature), its name is included in the Unused Columns section. Keep in mind that all feature names in tables (e.g. AGE as seen in the picture) are “hoverable” — upon going over them with your mouse pointer, a description box with additional information appears. Last but not least, the bottom element of the page is a good old-fashioned seaborn pairplot — all the useful information condensed into a single plot.
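The encode-then-describe step can be sketched with pandas. This is a minimal illustration of the idea, not the package’s internal code — the toy frame and column names are made up:

```python
import pandas as pd

# Hypothetical mixed-type frame standing in for X.
df = pd.DataFrame({
    "age": [22, 35, 58, 41],
    "sex": ["male", "female", "female", "male"],
})

# Categorical columns are mapped to integer codes so that numeric
# statistics (mean, median, min, max) can be computed for them too.
encoded = df.copy()
for col in encoded.select_dtypes(include="object").columns:
    encoded[col] = encoded[col].astype("category").cat.codes

# Per-feature descriptive statistics, one row per feature.
stats = encoded.describe().T[["mean", "50%", "min", "max"]]
print(stats)
```

The integer codes are only a convenience for computing statistics; they carry no ordinal meaning, which is worth remembering when reading the numbers for categorical features.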

Features

Features Page of Dashboard

The Features page gives you insight into a single feature and its relationships with the others. The most important element here is the “burger” button in the upper left corner — a menu with which you can change the active feature. The page is also divided into smaller subsections to prevent unnecessary clutter.

The first two sections are focused on your Active Feature — descriptive statistics alongside a distribution histogram in the first, and Transformers with the actual transformed feature in the second. As already mentioned, the Dashboard transforms all provided data so it is ready to be used in Model training. However, the default Transformers are quite simple — if you feel they are inadequate, you can always customize them further (or even transform your feature beforehand and skip the Dashboard’s transformations altogether). Again, please refer to the documentation if you’re interested. One additional thing to note is the extra visualization row for Numerical features — 3 different Transformers are used to “normalize” the values, and the results are plotted as histograms — this way you can decide which Normalizer does the best job in that regard.
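Comparing several normalizers side by side is easy to reproduce with sklearn. The three transformers below are an assumption for illustration — the article does not name the ones the Dashboard actually uses:

```python
import numpy as np
from sklearn.preprocessing import (PowerTransformer, QuantileTransformer,
                                   StandardScaler)

# A skewed toy feature to normalize.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200).reshape(-1, 1)

# Candidate normalizers; in the Dashboard, each result would be
# plotted as a histogram so you can compare the shapes visually.
candidates = {
    "StandardScaler": StandardScaler(),
    "QuantileTransformer": QuantileTransformer(
        output_distribution="normal", n_quantiles=100),
    "PowerTransformer": PowerTransformer(method="yeo-johnson"),
}

for name, transformer in candidates.items():
    transformed = transformer.fit_transform(x)
    print(name, round(float(np.mean(transformed)), 3))
```

StandardScaler only recenters and rescales, while the quantile and power transforms also reshape the distribution — exactly the kind of difference the histogram row makes visible.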

The third section is a heatmap of the Pearson correlations between features. This is the least interactive visualization on the page, as it doesn’t respond to a change of the Active Feature. The visualization also does not “care” about the direction of the correlation — 0.8 and (-0.8) are treated the same in terms of color intensity. Correlations are calculated on both “raw” and normalized values. Take the results with a grain of salt, as categorical variables are treated the same way as numerical ones.
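The direction-agnostic correlation matrix behind such a heatmap boils down to taking the absolute value of the Pearson matrix. A minimal sketch with a made-up frame:

```python
import pandas as pd

# Toy frame; "b" is perfectly anti-correlated with "a".
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [5.0, 4.0, 3.0, 2.0, 1.0],
    "c": [2.0, 1.0, 4.0, 3.0, 5.0],
})

# Absolute Pearson correlations: -0.8 and 0.8 map to the same intensity.
abs_corr = df.corr(method="pearson").abs()
print(abs_corr)
```

Note that a perfect negative correlation shows up with the same intensity as a perfect positive one, which is exactly the caveat to keep in mind when reading the heatmap.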

Scatter Plot Grid in Features Page

The last section of the page is a Scatter Plot Grid. Remember seaborn’s pairplot from the Overview page and how it plotted every feature against each other? This is the same concept, but with an additional twist — the chosen feature now acts as a coloring factor in every plot. You might find it useful for identifying patterns in your data.

On a side note, sometimes both seaborn’s pairplot from the Overview and the Scatter Plot Grid are not created. There is a good reason for that: those elements are based on the features in your X — the more you have, the more subplots need to be created (the count grows quadratically with the number of features). Therefore, if an arbitrary limit is reached, those two elements get disabled for both runtime and readability reasons (there is no fun in eyeballing ~200 plots or more for patterns). The documentation shows how to enable them in case you wish to do so.

Models

The Dashboard’s work does not end at Exploratory Data Analysis with tables and visualizations — a Machine Learning model (sklearn) will also be searched for, chosen and trained on your data. The Dashboard automatically assesses what’s inside your target variable and decides what kind of problem it’s facing (regression, binary classification or multiclass classification). Next, a set of default models is GridSearched and the best-performing Model is chosen (based on a scoring function picked by you). This step is customizable — you can control whether the GridSearch is done with all Models, or whether only the best-performing “default” models are put under scrutiny (similar to the LazyPredict package). What is more, you can also provide your own Models and their parameters to be searched and compared. Everything is written down in the documentation should you be interested.
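The problem-type assessment can be approximated with a few lines of NumPy. This is a hypothetical re-implementation of the kind of check involved — the function name, the unique-value threshold and the exact rules are my assumptions, not the package’s actual logic:

```python
import numpy as np

def infer_problem_type(y):
    """Guess the ML problem type from the target variable (illustrative only)."""
    y = np.asarray(y)
    # A continuous-looking float target suggests regression.
    if np.issubdtype(y.dtype, np.floating) and len(np.unique(y)) > 20:
        return "regression"
    # Otherwise treat the target as class labels.
    n_classes = len(np.unique(y))
    return ("binary classification" if n_classes == 2
            else "multiclass classification")

print(infer_problem_type([0, 1, 1, 0]))            # prints "binary classification"
print(infer_problem_type(["a", "b", "c"]))          # prints "multiclass classification"
print(infer_problem_type(np.linspace(0.0, 1.0, 50)))  # prints "regression"
```

The chosen problem type then determines which models, scoring functions and visualizations make sense downstream.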

Just to make it clear, the Dashboard tries to incorporate good practices for model training — the data is split into train and test sets, the training of all Models and Transformers is done on the train split, and the results you see in the Dashboard are calculated on the test split.
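That train/test discipline can be sketched with a standard sklearn pipeline. The model and parameter grid below are placeholders for illustration, not the Dashboard’s actual defaults:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Split once up front; transformers and models only ever see the train part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Putting the scaler inside the pipeline keeps it fitted on train data only,
# even within each cross-validation fold of the grid search.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Reported scores come from the untouched test split.
test_score = search.score(X_test, y_test)
print(round(test_score, 3))
```

Keeping the preprocessing inside the pipeline is what prevents test data from leaking into the fitted Transformers — the same principle the Dashboard follows.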

Last but not least, appropriate Visualizations are included in the Models page — you can compare top performing Models, see where they lack in terms of problem-specific assessments, etc. Depending on the problem type, different types of visualizations will be created (e.g. ROC/Precision-Recall/DET curves for binary classification, Prediction Error and Residuals for regression and Confusion Matrices for multiclass classification). Plots are built using bokeh library, so they are highly interactive — you can move the plot around, zoom in or out, click on the legend to mute some of the elements, hover to get additional information, etc.

A table with scores for the top-performing Models is always included — you can see how the models performed with respect to different scoring functions, not only the one selected at the beginning. Additionally, every Model name is “hoverable” (similar to feature names in tables), but instead of a description it shows you the parameters used to create the Model.

Similarly, a Predictions Table is present for all problem types — a simple table comparing the predictions of different Models for every row of data. You can use it to try to identify any “problematic” rows.

Conclusion

The data_dashboard package was created not only to automate all the initial steps a Data Scientist needs to address on first contact with new data, but also to do it in a user-friendly, design-focused way. Understanding the data is one of the keys to successful Analysis and Prediction — nurture that “bond” and you will become a master in no time. Hopefully data_dashboard will help you in that regard!

Documentation

https://data-dashboard.readthedocs.io/en/latest/

Github

Disclaimer

Please keep in mind that it was my first attempt at a Python package (or even at open-source code to be used by anyone else). If you encounter any bug or feel that something could be improved, feel free to reach out to me (any social portal or email: dowgird.maciej@gmail.com). I’d be happy to get feedback!


Data Analyst disliking default visualizations. Pharmacist by education.