
3 Powerful Python Libraries to (Partially) Automate EDA And Get You Started With Your Data Project

All machine learning problems are data problems. So, it makes sense that you should spend time understanding and cleaning your data

All machine learning problems are data problems.

To avoid falling into the old trap of "garbage in, garbage out," it makes sense to spend considerable time understanding and cleaning your data. I recently read "The Kaggle Book" by Konrad Banachewicz and Luca Massaron [4], where they interview many Kaggle grandmasters. Interestingly, rushing or skipping the EDA is the most common mistake that both they and beginners make.

Photo by Choong Deng Xiang on Unsplash

We all know how important EDA is, and yet we still skip this step. It might be because it is hard to know where to start or what questions to ask, or maybe we are simply too eager to jump into modeling.

Here are 3 Python libraries you can use to partially automate your Exploratory Data Analysis and get started with your data project.

The data for the analysis below comes from Kaggle's House Prices – Advanced Regression Techniques competition [5].
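The examples that follow assume the competition's train.csv and test.csv files have already been loaded into Pandas DataFrames called train and test, for example along these lines (file names and locations match the standard competition download and may differ in your setup):

import pandas as pd

# Load the competition data (assumes train.csv and test.csv are in the working directory)
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")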

YData Profiling

YData Profiling is the new version of Pandas Profiling: it now supports Spark and goes beyond just Pandas DataFrames.

The goal, however, remains the same: provide a one-line Exploratory Data Analysis (EDA) experience. This package highlights the importance of having an easy-to-implement data quality evaluation framework. This framework shouldn’t be limited to the initial phase of your project but rather implemented throughout the data project.

YData Profiling can be run in two lines.

!pip install ydata-profiling
from ydata_profiling import ProfileReport

# Generate the data profile report
profile = ProfileReport(train, title='EDA')

# Show the report in the notebook
profile.to_notebook_iframe()
Alerts indicating high correlation, class imbalances, missing data, etc. Image by author
Variables distribution. Image by author

The output shows the distribution of the variables and provides a set of alerts regarding correlated variables, class imbalances, and high proportions of missing values, all things you need to explore before modeling your data. On top of that, the report is highly interactive, allowing you to keep exploring the data directly from it.
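If you want to keep or share the report outside the notebook, you can also export it to a standalone HTML file. A minimal sketch; the file name below is arbitrary:

# Export the profile report to a standalone HTML file
profile.to_file("eda_report.html")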

AutoViz

AutoViz aims to provide a more efficient, user-friendly, and automated approach to exploratory data analysis (EDA).

AutoViz puts a lot of emphasis on being beginner-friendly. You can easily gain insights into the data with a few simple lines of code.

!pip install autoviz
from autoviz.AutoViz_Class import AutoViz_Class

#Instantiate AutoViz_Class
AV = AutoViz_Class()

# Run the EDA (filename='' tells AutoViz to use the DataFrame passed via dfte)
df = AV.AutoViz(filename='',
                dfte=train,
                verbose=1)
Output showing data quality fixes. Image by author
Graphical output. Image by author

The output is less interactive. However, the great thing about it is that it gives you recommendations on how to fix the data quality issues. You can even go as far as implementing the fixes with the code below.

from autoviz import FixDQ

#Instantiate FixDQ 
fixdq = FixDQ()

# Implement the outlier cap recommendation
fixdq.cap_outliers(train)

#Implement all fixes - use with caution
fixdq.fit_transform(train)
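If you would rather keep the generated charts, recent versions of AutoViz can also write them to disk instead of only rendering them inline. A minimal sketch, assuming the chart_format and save_plot_dir parameters are available in your installed version:

# Save the generated charts as HTML files in a local folder
# (chart_format / save_plot_dir availability depends on your AutoViz version)
df = AV.AutoViz(filename='',
                dfte=train,
                verbose=2,
                chart_format='html',
                save_plot_dir='autoviz_plots')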

Sweetviz

Sweetviz’s goal is to help you quickly analyze target characteristics and easily compare training vs. test sets.

This package emphasizes the importance of making sure your training and test sets are comparable, so that your model can generalize to unseen data. As with the previous tools, this can be implemented in a few lines of code.

!pip install sweetviz
import sweetviz as sv

# Generate a comparison report between the training and test sets
compare = sv.compare(source=[train, 'Training'],
                     compare=[test, 'Test'],
                     target_feat="SalePrice")

# Show the report in the notebook
compare.show_notebook(w=None,
                      h=None,
                      scale=None,
                      layout='widescreen',
                      filepath=None)
High-level comparison of training/test data. Image by author
Variable distribution comparison. Image by author

The output is also interactive and allows you to keep exploring your data.
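As with YData Profiling, you can also export the Sweetviz report to a standalone HTML file to share it outside the notebook (the file name below is arbitrary):

# Export the comparison report to a standalone HTML file
compare.show_html('sweetviz_report.html')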

These three tools are similar; however, they emphasize different aspects of the EDA.

No single one is better than the others. You should use them for what they are: tools in your toolbox to help guide your EDA.

But remember, these tools are only the beginning of the EDA. They free up your time so you can spend it asking better questions about your data.

References

[1] https://github.com/ydataai/ydata-profiling

[2] https://github.com/AutoViML/AutoViz

[3] https://pypi.org/project/sweetviz/

[4] Banachewicz, K. & Massaron, L. (2022). The Kaggle Book: Data Analysis and Machine Learning for Competitive Data Science. Packt Publishing.

[5] Anna Montoya, DataCanary. (2016). House Prices – Advanced Regression Techniques. Kaggle. https://kaggle.com/competitions/house-prices-advanced-regression-techniques

