Data Tools Review

Awesome Data Science Tools to Master in 2023: Data Profiling Edition

5 Open Source Python Packages for EDA and Visualization

Miriam Santos
Towards Data Science
15 min read · Feb 22, 2023


This is a column series focusing on open-source tools for data science: each article focuses on a specific topic and introduces the reader to a set of different tools, showcasing their features with a real-world dataset. This piece focuses on data profiling and reviews ydata-profiling, dataprep, sweetviz, autoviz, and lux. Readers are encouraged to follow along with the tutorial: I’ll be referring to all projects on their individual GitHub repositories, but a curated list of tools, as well as the Google Colab notebooks used throughout this article, are available in the awesome-data-centric-ai repository.

In a world of Data Imperfection, a carefully-designed tool for data understanding is the philosopher’s stone.

A “multidimensional” philosopher’s stone: data cubes. Analyzing data often involves twisting and twirling our datasets around to find the most insightful perspectives. Photo by aaron boris on Unsplash.

Data Quality plays a vital role in the outcome of our machine learning models. Some data imperfections may severely handicap the internal workings of models themselves, rendering them inapplicable (e.g., missing data). Others may pass unnoticed during the model development stage and come back with nefarious consequences when models are deployed to production (e.g., class imbalance, underrepresented data, dataset shift).

After struggling long and hard with producing more robust machine learning models, both academics and engineers are now moving towards a Data-Centric AI paradigm. We’ve collectively realized that data can make or break our models, and that oftentimes, a problem may be solved with a “simpler” model, if data is smart.

But how can we move from imperfect to smart data?

Data Profiling is the essence of Data Understanding

Since models are fed by data and data is curated by people, people need to understand the peculiarities of the data they’re asking models to digest.

Data Profiling is deeply linked to the concept of Exploratory Data Analysis. However, when talking about profiling data, we tend to associate it with automated reporting — i.e., “taking a profile” — of data characteristics, ideally alerting us to potential issues that may occur immediately or further down the line.

More importantly, data profiling is an essential skill for all roles in data teams to master, from data scientists and machine learning engineers to data engineers and data analysts.

“Our imputation method is returning high deviation values, did the input change? The model is outputting invalid predictions, we need to adjust it quickly, what happened? This data flow is completely breaking today, what went wrong? I need these results in a dashboard fast to discuss them in the next meeting, can someone get me a snapshot of the current state of data?”

These are just some of the many “data rants” that data teams face in the wild.

Fortunately, they can be minimized with the help of data profiling tools.

Let’s see how by exploring a real-world use case. Get ready to open your Google Colabs!

The HCC dataset… with slight modifications!

Throughout this article I will be using the HCC dataset, which I personally curated during my MSc thesis. You can find it on Kaggle and in the UCI Repository. Feel free to use it: it is licensed, and we simply appreciate proper referencing.

For the purpose of this review, I’ve taken a subset of features from the original dataset and further modified the data by artificially introducing some issues. The idea is to see how the different data profiling tools are able to identify and characterize them.

Here are the details regarding the data we’ll be using:

  • The following subset of original features is considered: Gender, Age, Alcohol, Hallmark, PS, Encephalopathy, Hemoglobin, MCV, Total_Bil, Dir_Bil, Ferritin, HBeAg, and Outcome. Essentially, I’ve chosen a set of numeric and categorical (nominal and binary) features, with and without missing values, some category underrepresentation, and some high correlations;
  • The original MCV contained some missing values, which were replaced with other values in the feature (just for the sake of having another complete numerical feature besides Age);
  • The O2 feature was artificially introduced: it contains only “999” values and represents an error in the data acquisition or collection. Imagine that a sensor has blown off and started outputting absurd values, or that the person collecting the data decided to code their absence as “999”;
  • The Hallmark was modified: I concatenated its values with an “A, B, C, D, (…)” coding, which could represent the concatenation of the patient ID with the actual hallmark results. This represents an error during data processing or storage (an “ETL hitch”, if you will);
  • The HBeAg was also modified. In this case, I simply deleted all “Yes” values. This could mimic a Missing Not At Random mechanism, where all the missing values would be “Yes”, had they been observed.

Since the data contains some modifications, I added it to the repository as well. Recall that this is for purely academic discussion: you can access the complete and untouched data on Kaggle or in the UCI Repository, as mentioned.
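
For illustration purposes, here’s a minimal, hypothetical sketch of how such modifications could be introduced on the untouched data (this is not the exact script I used; the input filename and the letter coding are assumptions):

import string
import numpy as np
import pandas as pd

df = pd.read_csv("hcc_original.csv")  # hypothetical filename for the untouched data

# O2: a constant error code, as if a sensor or data collector always wrote "999"
df["O2"] = 999

# Hallmark: prefix each value with a letter code, mimicking a patient ID concatenation
codes = [string.ascii_uppercase[i % 26] for i in range(len(df))]  # A, B, C, D, (...)
df["Hallmark"] = [f"{c}{v}" for c, v in zip(codes, df["Hallmark"])]

# HBeAg: delete all "Yes" values, mimicking a Missing Not At Random mechanism
df.loc[df["HBeAg"] == "Yes", "HBeAg"] = np.nan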

So now, without further ado, let’s get cracking on the actual code!

1. ydata-profiling

You may know it as pandas-profiling, since the new name is still very recent: in short, version 4.0.0 now also supports Spark DataFrames beyond only Pandas DataFrames, and for this reason it changed its name to YData Profiling.
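
If you’re curious about the Spark side, here’s a minimal sketch of what that could look like, assuming a local Spark session (the Spark workflow isn’t covered in this tutorial, so treat it as an illustration of the new capability rather than a verified recipe):

from pyspark.sql import SparkSession
from ydata_profiling import ProfileReport

# Spin up a local Spark session and load the data as a Spark DataFrame
spark = SparkSession.builder.appName("hcc-profiling").getOrCreate()
spark_df = spark.read.csv("hcc.csv", header=True, inferSchema=True)

# As of 4.0.0, ProfileReport also accepts Spark DataFrames
profile = ProfileReport(spark_df, title="HCC Profile Report (Spark)")
profile.to_file("spark_report.html")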

The package is currently a crowd favorite for exploratory data analysis. Its efficiency and simplicity seem to win the hearts of technical and non-technical audiences alike, as it enables a very fast and straightforward visual understanding of the data.

Let me show you how to get it up to speed! First, install the package:

pip install ydata-profiling==4.0.0

Then, the generation of a data profiling report is straightforward:

# Import libraries
import pandas as pd
from ydata_profiling import ProfileReport

# Load the data
df = pd.read_csv("hcc.csv")

# Produce and save the profiling report
profile = ProfileReport(df, title="HCC Profile Report")
profile.to_file("report.html")

The report is quickly generated as follows, containing a general overview of the dataset’s properties, summary statistics for each feature, interaction and correlation plots, and insightful visualizations for missing values and duplicate records:

YData Profiling Report. Screencast by author.

Yet, the most praised feature of ydata-profiling is perhaps the automatic detection of potential data quality issues.

This is exceptional when working with a new dataset for which we do not have any prior insights. It saves us a lot of time as it immediately highlights data inconsistencies and other complex data characteristics that we might want to analyze prior to model development.

In what concerns our use case, note how the following alerts are generated:

  • CONSTANT: for HBeAg and O2;
  • HIGH CORRELATION: between Total_Bil and Dir_Bil, and Encephalopathy and Total_Bil;
  • IMBALANCE: for Encephalopathy;
  • MISSING: for Hemoglobin, HBeAg, Total_Bil, Dir_Bil, and Ferritin (the latter with the highest missing data percentage, nearly 50%);
  • UNIFORM, UNIQUE, and HIGH CARDINALITY: for Hallmark.

With such comprehensive support for data quality issues, ydata-profiling is able to detect all of the introduced inconsistencies, and it is especially informative as it generates several different alerts for the same feature.

For instance, how could Hallmark simultaneously raise high cardinality, unique, and uniform alerts?

This immediately raises a red flag that would make a data scientist suspect a possible association with an ID of some sort. When further inspecting the feature, this “data smell” would become rather obvious:

YData Profiling Report: Inspecting Hallmark. Image by Author.
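
Outside the report, a quick pandas check (illustrative, not part of ydata-profiling’s output) confirms the suspicion:

# Every row carries its own Hallmark value, a telltale sign of an ID column
print(df["Hallmark"].nunique(), "unique values out of", len(df), "rows")
print(df["Hallmark"].head())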

The same is true for HBeAg: it raises both CONSTANT and MISSING alerts, which could ring a bell in the mind of an experienced data scientist: more complex missing mechanisms can be at play.
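
Again, a one-line pandas check (illustrative) tells the story:

# All observed values are "No"; the deleted "Yes" values now surface as NaN
print(df["HBeAg"].value_counts(dropna=False))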

ydata-profiling has several other useful features such as the support for time-series data and the comparison of datasets side-by-side, which we could use to enhance some data quality transformations on our dataset, such as missing data imputation.

To further investigate this possibility, I’ve performed a simple imputation (e.g., mean imputation) on Ferritin and produced a comparison report between the original and imputed data.
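
Here’s a minimal sketch of how that could look; the mean-imputation step is my own simplification, and the compare call reflects my understanding of recent versions of the package:

import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("hcc.csv")

# Mean-impute Ferritin on a copy of the data
df_imputed = df.copy()
df_imputed["Ferritin"] = df_imputed["Ferritin"].fillna(df_imputed["Ferritin"].mean())

# Profile both versions and render them side by side
original_report = ProfileReport(df, title="Original Data")
imputed_report = ProfileReport(df_imputed, title="Imputed Data")
comparison_report = original_report.compare(imputed_report)
comparison_report.to_file("comparison_report.html")

Here’s the result: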

YData Profiling: Comparison Report. Screencast by Author.

The source code, documentation, and several examples for ydata-profiling are available on this GitHub repository. You can replicate the example above using this Google Colab notebook, and further investigate additional features of the package here.

2. DataPrep

DataPrep also creates interactive data profiling reports with one line of code. The installation is rather simple, and after the modules are imported, the report generation is performed by calling create_report:

pip install -U dataprep

import pandas as pd
from dataprep.eda import create_report

df = pd.read_csv("hcc.csv")
create_report(df)

In terms of overall look and feel, dataprep seems to build extensively over the previous package, including similar summary statistics and visuals (the report similarities are uncanny!):

DataPrep Profiling Report. Screencast by Author.

However, a useful feature of this package is the ability to explore feature interactions more deeply.

We may explore insightful interactions between features (both numerical and categorical), using several types of plots.

Here’s how to explore the relationship between several types of features:

from dataprep.eda import plot

plot(df, "Age", "Gender") # Numeric - Categorical
plot(df, "Age", "MCV") # Numeric - Numeric
plot(df, "Gender", "Outcome") # Categorical - Categorical

DataPrep Profiling Report: Exploring Feature Interactions. Screencast by Author.

dataprep also allows us to investigate the impact of missing values (comparing the data before and after dropping missing values) across all features.
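
If I recall the dataprep API correctly, this is exposed through plot_missing; a minimal sketch, assuming df is already loaded:

from dataprep.eda import plot_missing

# Overview of missing values across the whole dataset
plot_missing(df)

# Impact of Ferritin's missing values: distributions before vs. after dropping them
plot_missing(df, "Ferritin")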

Here’s the visualization of the impact of dropping missing values on Ferritin:

DataPrep Profiling Report: Missing Impact of Ferritin. Screencast by Author.

You can explore all the generated visualizations in the respective Colab notebook. dataprep has great documentation on GitHub and also a dedicated website, in case you’d like to check additional resources and details.

3. SweetViz

Similarly to the previous tools, SweetViz also boosts EDA by generating simple visualizations and summarizing important feature statistics. You may get started with sweetviz by following these steps:

pip install sweetviz

# Import libraries
import pandas as pd
import sweetviz as sv

# Load the data
df = pd.read_csv("hcc.csv")

# Produce and save the profiling report
report = sv.analyze(df)
report.show_html('report.html')

sweetviz has a more “retro” look, and the summary statistics are not as neatly displayed as in the other packages we’ve reviewed so far.

Regarding data quality, it alerts to the presence of missing values and their impact on the data (e.g., using green, yellow, and red highlighting, depending on the percentage of missing data). Other inconsistencies are not directly highlighted and may pass unnoticed (low correlation values are, however, signaled inside each feature’s details).

Beyond the “Associations” matrix, it does not focus on extensively exploring feature interactions or missing data behavior.

Here’s the look and feel of the report:

Sweetviz Profiling Report. Screencast by Author.

However, its greatest feature is that it is built around visualizing target values and comparing datasets.

In this regard, an interesting use case for sweetviz would be simultaneously performing target analysis (investigating how the target values relate to other features in the data) while comparing either different sets (e.g., training versus test sets) or intra-set feature characteristics (e.g., comparing sub-populations such as “Male” versus “Female” groups).
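
For the train-versus-test scenario, sweetviz offers sv.compare; here’s a minimal sketch in which the split itself is hypothetical (this article works with a single dataset):

import pandas as pd
import sweetviz as sv
from sklearn.model_selection import train_test_split

df = pd.read_csv("hcc.csv")

# Hypothetical split, just to illustrate the comparison API
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

compare_report = sv.compare([train_df, "Train"], [test_df, "Test"])
compare_report.show_html("compare_report.html")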

Here’s how to encode the “Outcome” feature as a numeric target (“Survival”) and compare the insights across the “Male” and “Female” subgroups:

# Create a 'Survival' feature based on 'Outcome'
df.Outcome = pd.Categorical(df.Outcome)
df['Survival'] = df.Outcome.cat.codes

# Create a comparison report
comparison_report = sv.compare_intra(df, df["Gender"] == 'Male', ["Male", "Female"], 'Survival')
comparison_report.show_notebook()
Sweetviz Profiling Report: Comparing “Male” and “Female” subgroups. Screencast by Author.

All of the documentation of sweetviz can be found in this GitHub repository. Please feel free to use this notebook to perform further transformations on the data, and even explore other use cases with the help of this dedicated post.

4. AutoViz

AutoViz is another simple and straightforward EDA package that offers extensive support for visualizations and customization.

Similarly to previous packages, it is as simple to install as:

pip install autoviz

You can then import the package, but autoviz does not display plots automatically, so you need to run %matplotlib inline before trying it on your data:

from autoviz.AutoViz_Class import AutoViz_Class
%matplotlib inline

Now you are good to go! Running the following line will generate a plethora of visualizations that you can later customize and refine:

AutoViz_Class().AutoViz('hcc.csv')

AutoViz Profiling Report. Screencast by Author.

Interesting features of autoviz are the automatic data cleaning recommendations and the ability to perform “supervised” visualizations — i.e., categorizing the results by a given target feature, similarly to what we have done with sweetviz.

The data cleaning recommendations are given during the report generation, as shown below:

AutoViz Profiling Report: Cleaning Recommendations. Image by Author.

As far as cleaning recommendations go, autoviz does a really great job:

  • It associates Hallmark with a “possible ID column”, suggesting that it should be dropped;
  • It identifies several missing or skewed features, such as Ferritin, Hemoglobin, Total_Bil, Dir_Bil, Encephalopathy, and HBeAg;
  • It recognizes that HBeAg and O2 have invariant values, and suggests dropping O2.

The generated alerts are not as comprehensive as those generated by ydata-profiling, and they are not as intuitive to spot (e.g., autoviz shows a heat map of the number of unique values, but alerts such as HIGH CARDINALITY, UNIQUE, or CONSTANT would be more insightful to pin down the specific issues we’re facing).

Yet, I truly appreciated the explainability it introduces in its cleaning recommendations, which would surely be a helpful guide for a newcomer to the data science field.

If you’d like to try out the “supervised” analysis I mentioned before, you first need to define the depVar parameter as the target feature. In this case I set it to “Outcome”, but I could have set it to “Gender” to get plots similar to those returned by sweetviz:

import pandas as pd
df = pd.read_csv("hcc.csv")

AV = AutoViz_Class()
dft = AV.AutoViz(
    filename="",
    sep="",
    depVar="Outcome",
    dfte=df,
    header=0,
    verbose=1,
    lowess=False,
    chart_format="png",
    max_rows_analyzed=200,
    max_cols_analyzed=30,
)

Here’s the generated report taking the “Outcome” into account:

AutoViz Profiling Report: Comparing “Outcome” categories. Screencast by Author.

Play around with this Google Colab notebook to explore additional features of autoviz. You can also take a look at this post, where they are beautifully detailed, or refer back to the documentation on GitHub.

5. Lux

Lux enables fast and easy EDA in perhaps the most beginner-friendly way possible, since it can be used by simply creating a Pandas DataFrame: once you install it, you just need to print out a DataFrame, and lux will automatically recommend a set of visualizations that best suit the discovery of interesting trends and patterns in your data.

Start by installing the package and reading the data:

pip install lux

import lux
import pandas as pd
df = pd.read_csv("hcc.csv")
df

Then, you can browse through an interactive widget and choose the visualization that best fits your needs. Here’s how it looks:

Lux Profiling Report. Screencast by Author.

The package does not analyze any potential data quality issues and the analysis is not very comprehensive, but it still covers the basics.

The widget toggles between Pandas and Lux and focuses on 3 main visualizations: Correlation and Distribution for numeric features, and Occurrence for categorical features.

A perhaps underrated feature, however, is the possibility to export or delete visualizations directly from the widget by selecting the desired plot. When producing a quick report, this simplicity is actually really valuable.
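
A minimal sketch of that workflow, based on my reading of the lux docs (the exported attribute and the save_as_html call are the parts I’m assuming):

# After ticking plots in the widget, the selection is exposed to code
selected = df.exported     # the collection of visualizations checked in the widget
first_chart = selected[0]  # an individual visualization object

# The current widget view can also be saved as a standalone HTML page
df.save_as_html("lux_widget.html")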

lux also tries to support us through the EDA process itself. Imagine that we’re interested in further exploring specific features, say, “PS” and “Encephalopathy”. We can specify our “intent”, and lux will guide us towards potential next steps:

df.intent = ["PS", "Encephalopathy"]

The newly generated recommendations are divided into Enhance, Filter, and Generalize tabs.

The Enhance tab adds an extra feature to the visualization of the “intent features”, essentially exploring what type of information additional dimensions could introduce.

The Filter tab is self-explanatory: it keeps intent features fixed and introduces filters such as Gender = Male or Outcome = Alive in the visualizations.

Finally, the Generalize tab shows the intent features on their own to determine whether more general trends can be derived:

Lux Profiling Report: Exploring the “intent” functionality. Screencast by Author.
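
These tabs are also reachable programmatically; as far as I understand the lux API, each tab maps to a key in the recommendation dictionary (a hedged sketch):

# Each tab in the widget corresponds to a key in df.recommendation
enhance_charts = df.recommendation["Enhance"]
filter_charts = df.recommendation["Filter"]
generalize_charts = df.recommendation["Generalize"]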

Feel free to play around with my Colab notebook and try examples of your own. Both the documentation page of Lux and the GitHub repository are quite comprehensive, but take a look at this post to fill in any additional gaps.

Final Thoughts: Is there a best way to go?

In all fairness, no. The listed tools all share similarities, while each introduces its own flavor.

With the exception of lux, which seems a bit limited, all of the remaining packages would serve us just fine for a basic exploratory data analysis.

Now, depending on your level of expertise, your particular use case, or the goal you aim to achieve with your data, you might want to try out different packages, or even combine some of their functionalities.

All in all, here’s my take on it:

  • ydata-profiling would be the best way to go for a professional data scientist hoping to get a handle on a new dataset, due to its extensive support for automatic data quality alerts. With the new support for Spark DataFrames, the package is also extremely useful for data engineers who need to troubleshoot their data flows, and for machine learning engineers trying to understand why their models are behaving erratically. An excellent bet for experienced data professionals who need to understand the behavior of their data, especially in what concerns data quality. The provided visualizations are also a good asset, but the package could improve in terms of interactivity (e.g., allowing us to “mouse over” plots and check particular feature or correlation values).
  • dataprep seems to build over ydata-profiling, which means that there could be a mismatch between both projects’ roadmaps (e.g., the latter gaining functionality “faster” than the former), especially in what concerns support for data quality alerts. Or vice-versa, depending on which components we’re interested in following up on! However, I did enjoy the additional features enabled for interactions, though I would say that in some cases “less is more”. If you’re an experienced data scientist, you’ll know what to look into more deeply and what to discard, but the same is not true when you’re entering the field. You can spend quite some time trying to understand why a certain plot is available if it is not insightful (e.g., Line Charts for Numeric - Categorical features, Box Plots for Numeric - Numeric features). Yet, the individual plots for missing values (before and after dropping them) are genuinely useful. I can see myself using them in the future to diagnose some missing mechanisms, especially Missing At Random, where the missing values in one feature are related to the observed values of another.
  • autoviz is a very interesting tool for an entry-level data scientist. It gives a well-rounded characterization of the data and reports truly useful cleaning recommendations. For an experienced professional they may not be necessary, but for someone starting out they could be fundamental for following up on and guiding the data preparation process. The profiling report is also very comprehensive, automatically listing all possible combinations between features. Although verbose, this is helpful for someone without a lot of experience, since the next step is simply skimming through the generated plots to see if something “stands out”. The code is not super friendly, but it is still easy to grasp for a junior, and I particularly liked the “supervised” analysis: the visuals are clean and beautiful, and I would definitely use them for my own research.
  • sweetviz has a really insightful feature, although it could be better designed. Overall, the report looks quite retro, and the quality alerts are not very intuitive (e.g., why highlight the lowest correlations rather than the highest?). Yet, the idea of simultaneously checking up on target features and subgroups is something I’d love to see integrated into other packages! For more complex datasets, we need to start analyzing more than 2, 3, or 4 dimensions of the data at once, and keeping track of the target value is extremely useful (in most cases, this is what we’re trying to map out, right?). I can tell you that this is the case for the HCC dataset: it is a heterogeneous disease where patients in similar stages can map onto different survival outcomes. I had to jump through hoops to get a quick and proper visualization of some sub-clusters back then, and this tool would have been instrumental in finding that type of insight and communicating it to the medical team taking care of the patients.
  • Finally, lux could be used as a learning tool, for a basic exploration of the data, and to teach students or newcomers to the field the foundations of statistics and data handling: types of features, types of plots, data distributions, positive and negative correlations, and the representation of missing values (e.g., null, NaN). It is super low-code and the perfect tool to “test the waters”.

So there you have it: it seems there are no free lunches. I sincerely hope you’ve enjoyed this review! Feedback and suggestions are always appreciated: you can leave me a comment, star and contribute to the repo, or even contact me at the Data-Centric AI Community to discuss other data-related topics. See you soon, and happy science-ing!

About me

Ph.D., Machine Learning Researcher, Educator, Data Advocate, and overall “jack-of-all-trades”. Here on Medium, I write about Data-Centric AI and Data Quality, educating the Data Science & Machine Learning communities on how to move from imperfect to intelligent data.

Data-Centric AI Community | GitHub | Google Scholar | LinkedIn

References

  1. M. Santos, P. Abreu, P. J. García-Laencina, A. Simão, A. Carvalho, A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients (2015), Journal of Biomedical Informatics 58, 49–59.

