Introduction
EDA, or exploratory data analysis, is the first stage of almost every data science project. It is also the first phase of data mining, helping you gain insights from the data without making assumptions up front. Analysts use EDA to review descriptive summaries of the data, understand the relationships between variables, and evaluate data quality by verifying data sources, discovering missing values, and identifying outliers. In short, EDA plays an essential role in producing precise data reports and accurate data models.
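To make those checks concrete, here is a minimal pandas sketch of doing them by hand. The tiny data frame and its values are made up for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "carat": [0.2, 0.3, 0.3, 0.4, 0.5, 5.0],
    "price": [300.0, 400.0, np.nan, 600.0, 700.0, 18000.0],
})

# Data quality: count missing values per column
missing_counts = toy.isna().sum()

# Outliers: flag carat values more than 1.5 * IQR beyond the quartiles
q1, q3 = toy["carat"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["carat"] < q1 - 1.5 * iqr) | (toy["carat"] > q3 + 1.5 * iqr)
carat_outliers = toy.loc[is_outlier, "carat"].tolist()

# Relationship between variables: pairwise correlation of the numeric columns
correlation = toy.corr()

print(missing_counts)
print(carat_outliers)
print(correlation)
```

Each check is only a line or two, but on a real data set you would repeat them for every column, which is where the time goes.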
For beginners, this initial screening can take time because the syntax and procedures of data mining are unfamiliar. I was in the same situation when I first got to know Python. At the time, I wished there were a tool that could do all of these jobs for me at once, automatically. Even today, as I get more familiar with data mining and cleaning procedures, I still want to save time on some specific steps and move on swiftly.
That’s why today I’d like to introduce you to some powerful EDA tools that may help you gain a better overall perspective of the data you’re diving into.
Dataset
The Python package seaborn ships with many free data sets to try. I will go with the one named "diamonds." Here is how I load it:
import seaborn as sns
df = sns.load_dataset('diamonds')
Data source reference: Waskom, M. et al., 2017. mwaskom/seaborn: v0.8.1 (September 2017), Zenodo. Available at: https://doi.org/10.5281/zenodo.883859.
SweetViz
Personally, this is one of my favorite automated EDA tools. Why? Because of its super cool interface. To back that up, Figure 1 shows the overview of the report produced by SweetViz. Cool, right?
So, what is SweetViz?
Well, to put it simply, SweetViz is an open-source Python library that can automatically run EDA and create stunning visuals with just a few lines of code. The output is an entirely self-contained HTML application, as shown in Figure 1. SweetViz helps you quickly view different dataset characteristics and, above all, provides detailed information about the associations between variables. One more outstanding feature of SweetViz is its target analysis, which shows how a target value relates to the other variables.
And how do we run it?
Basically, there are three functions to create reports: analyze(), compare(), and compare_intra().
First, let’s try to get a summary of the whole data set. analyze() is the function you should use in this case.
#Installing the library (in a notebook cell; drop the "!" when running in a terminal)
!pip install sweetviz
#Importing the library
import sweetviz as sv
report = sv.analyze(df)
report.show_html()
And the result is what you can see in Figure 1 above.
How about when you want to compare two separate data frames? In my case, for example, I want to compare the training and testing sets. It is very simple with compare(), and here is what I got:
#Splitting the data set into training and testing sets
training_data = df.sample(frac=0.8, random_state=25)
testing_data = df.drop(training_data.index)
#Applying compare function
report2 = sv.compare([training_data,"TRAINING SET"], [testing_data, "TESTING SET"])
report2.show_html()
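Before trusting a train/test comparison, it can be worth confirming that the split itself behaves as expected. The sketch below repeats the same sample-then-drop recipe on a small made-up data frame (the `toy` name and its values are hypothetical) and checks that the two parts do not overlap and together cover every row. It assumes the index labels are unique, as they are with a default index; with duplicated labels, drop() would remove every row sharing a sampled label.

```python
import pandas as pd

# Hypothetical stand-in for the diamonds data frame (10 rows, default unique index)
toy = pd.DataFrame({"carat": [0.1 * i for i in range(1, 11)]})

# Same splitting recipe as above: sample 80% of the rows, keep the remainder for testing
training_data = toy.sample(frac=0.8, random_state=25)
testing_data = toy.drop(training_data.index)

# The two parts should not share any row label ...
overlap = training_data.index.intersection(testing_data.index)
# ... and together they should account for every original row
total_rows = len(training_data) + len(testing_data)

print(len(training_data), len(testing_data), len(overlap), total_rows)
```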
If you want to compare subsets of your data, compare_intra() is the function to pick. For instance, Figure 3 shows the comparison between the subset with color D and the rest.
report3 = sv.compare_intra(df, df["color"] == "D", ["D", "The rest"])
report3.show_html()
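The second argument in that call is an ordinary pandas boolean Series, with one True or False per row. As a rough pandas-only illustration of which rows end up in each group (this is not SweetViz's internal code, and the toy data is made up), the same condition can be used to filter a data frame directly:

```python
import pandas as pd

# Hypothetical toy rows; only the "color" column matters here
toy = pd.DataFrame({
    "color": ["D", "E", "D", "J", "G"],
    "price": [500, 450, 700, 300, 400],
})

# The condition is a boolean Series with one True/False value per row
is_d = toy["color"] == "D"

d_subset = toy[is_d]    # rows where the condition is True
the_rest = toy[~is_d]   # rows where the condition is False

print(is_d.tolist())
print(len(d_subset), len(the_rest))
```

Any condition that produces such a Series should work the same way, for example a price threshold instead of a color.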
DataPrep
If I had to describe this package in one sentence, I would say, "it already did all the work." In other words, we can find almost all the information we need in the generated report. One bonus of this package is that the output is interactive, which makes the report more convenient to follow.
DataPrep is definitely my favorite automated EDA tool of all. Similar to SweetViz, this library helps you explore data with just a single line of code. This is all you have to do:
#Installing the library
!pip install dataprep
#Importing
from dataprep.eda import create_report
#Creating report
create_report(df)
Skimpy
Skimpy is a small Python package that provides an extended summary of a data set. As you can see in Figure 5, the report is quite simple but includes almost all the necessary information. It is not as complete as the previous two reports; however, I think this summary is sometimes all you need. It also runs quicker than the other two libraries.
from skimpy import skim
skim(df)
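For a sense of the baseline that a summary tool like this builds on, here is what pandas alone offers through describe(). This is only a point of comparison on made-up data, not a reproduction of Skimpy's output.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one text column
toy = pd.DataFrame({
    "carat": [0.2, 0.3, 0.4, 0.5],
    "cut": ["Ideal", "Good", "Ideal", "Premium"],
})

# By default, describe() on a mixed data frame summarizes only the numeric columns
numeric_summary = toy.describe()

# include="all" adds count, unique, top and freq for the non-numeric columns
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```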
What’s next?
There are many exciting libraries for automated EDA that I will definitely explore further, such as Bamboolib, Autoviz, or Dora. That’s all I have for now.
If you guys have any recommendations, please share them with me 😀 I’m happy to know more.