Clean, Connect and Visualize Interactively with DataPrep

An all-in-one package for your data analysis process

Photo by UX Indonesia on Unsplash

Data preparation is the first step for any data professional. Whether you want to analyze the data or preprocess it for a machine learning model, you need to prepare it first.

Preparing data means collecting, cleaning, and exploring it. A Python package called DataPrep was developed to cover all of these activities. How does this package help us? Let’s explore it together.


DataPrep

DataPrep is a Python package developed to prepare your data. It contains three main APIs:

  • Data Exploration ( dataprep.eda )
  • Data Cleaning ( dataprep.clean )
  • Data Collection ( dataprep.connector )

DataPrep is designed for fast data exploration and works well with Pandas and Dask DataFrame objects. To explore its capabilities, we need to install the package first.

pip install -U dataprep

After we finish installing the package, let’s use the APIs to prepare our data.


DataPrep Exploration

DataPrep lets us create an interactive profile report with one line of code. The report is an HTML object, separate from our notebook, with many options for exploration. Let’s try the API with sample data.

from dataprep.datasets import load_dataset
from dataprep.eda import create_report
df = load_dataset("titanic")
df.head()
Image by Author

We use the Titanic sample dataset as our data. Once the data is loaded, we use the create_report function to produce the interactive report.

create_report(df).show_browser()
GIF by Author

As we can see in the GIF above, the API creates a nice interactive report for us to explore. Let’s dissect the information one tab at a time.

Overview Tab (Image by Author)

The overview tab summarizes the whole dataset. It includes the number and percentage of missing values, the number of duplicate rows, the variable data types, and detailed information for each variable.
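To make the overview numbers concrete, here is a minimal plain-Python sketch of how missing-cell and duplicate-row counts can be computed. It is not DataPrep's implementation. The rows are made up for illustration, and the overview function is a hypothetical helper.

```python
# Toy rows, made up for illustration (not the real Titanic data).
rows = [
    {"Name": "A", "Age": 22.0, "Cabin": None},
    {"Name": "B", "Age": None, "Cabin": "C85"},
    {"Name": "C", "Age": 26.0, "Cabin": None},
    {"Name": "A", "Age": 22.0, "Cabin": None},  # exact copy of the first row
]


def overview(rows):
    """Return missing-cell and duplicate-row counts for a list of dict rows."""
    columns = list(rows[0].keys())
    total_cells = len(rows) * len(columns)
    missing_cells = sum(1 for row in rows for col in columns if row[col] is None)

    # A row counts as a duplicate if an identical row appeared earlier.
    seen = set()
    duplicate_rows = 0
    for row in rows:
        key = tuple(row[col] for col in columns)
        if key in seen:
            duplicate_rows += 1
        else:
            seen.add(key)

    return {
        "rows": len(rows),
        "columns": len(columns),
        "missing_cells": missing_cells,
        "missing_cells_pct": 100 * missing_cells / total_cells,
        "duplicate_rows": duplicate_rows,
    }


summary = overview(rows)
print(summary)
```

The report shows the same kind of figures for the real dataset, so a sketch like this is only useful for understanding what the numbers mean.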

Variable Tab (Image by Author)

The variables tab gives detailed information for each variable in the dataset. Almost everything you need is available: for example, unique and missing value counts, quantile and descriptive statistics, the distribution, and normality.
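As a rough illustration of the per-variable statistics listed above, the sketch below computes a few of them with Python's standard statistics module. The ages are made-up values, not the real Titanic column, and the report's own calculations may differ in detail (for example, in the quantile method).

```python
import statistics

# Made-up ages with one missing value (None); not the real Titanic column.
ages = [22.0, 38.0, None, 26.0, 35.0, 35.0, 54.0, 2.0]

# Statistics are computed on the values that are present.
present = [age for age in ages if age is not None]

age_summary = {
    "unique": len(set(present)),
    "missing": len(ages) - len(present),
    "mean": statistics.mean(present),
    "median": statistics.median(present),
    "stdev": statistics.stdev(present),
    # Three cut points that split the data into four groups.
    "quartiles": statistics.quantiles(present, n=4),
}
print(age_summary)
```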

Interactions tab (Image by Author)

The interactions tab creates a scatter plot from two numerical variables. We can set the X-axis and Y-axis ourselves, which gives us control over the visualization.

Correlations tab (Image by Author)

The correlations tab shows heatmaps of the statistical correlation between numerical variables. Currently, three methods are available: Pearson, Spearman, and KendallTau.
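To see how the three coefficients differ, here is a small plain-Python sketch. It is an illustration under simplifying assumptions (no tied values, made-up data), not the code DataPrep runs. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Tau-a: (concordant - discordant) / number of pairs, assuming no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs


# Made-up values: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson(x, y), spearman(x, y), kendall_tau(x, y))
```

Because the relationship is monotonic but curved, the two rank-based measures reach 1 while Pearson stays slightly below it. Comparing the three heatmaps in the report can reveal the same kind of difference.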

Missing Values tab (Image by Author)

The missing values tab gives detailed information about the missing values in our dataset. We can choose between the Bar Chart, Spectrum, Heat Map, and Dendrogram views to explore the missing values fully.


DataPrep Cleaning

The DataPrep Cleaning API collection offers more than 140 APIs to clean and validate our DataFrame. There are far too many functions for this article to cover. If you are interested, you can check out the documentation.

Let’s try the column header cleaning function, clean_headers, with our Titanic dataset example.

from dataprep.clean import clean_headers
clean_headers(df, case = 'const').head()
Image by Author

With the ‘const’ case, every column name ends up fully capitalized. Now let’s switch the case to ‘camel’:

clean_headers(df, case = 'camel').head()
Image by Author

The resulting column names are all lower case, except for columns such as ‘sibSp’ whose names contain two words.
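To show what these two case styles do to a header, here is a simplified plain-Python sketch. It is not DataPrep's implementation, and split_words, to_const, and to_camel are hypothetical helper names. It assumes words are separated by non-alphanumeric characters or by a lower-to-upper case boundary.

```python
import re


def split_words(header):
    """Split a header such as 'SibSp' or 'passenger id' into lower-case words."""
    # Insert a space at each lower-to-upper boundary, then split on
    # anything that is not a letter or digit.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", header)
    return [word.lower() for word in re.split(r"[^A-Za-z0-9]+", spaced) if word]


def to_const(header):
    """Upper-case words joined with underscores, e.g. CONSTANT_CASE."""
    return "_".join(word.upper() for word in split_words(header))


def to_camel(header):
    """First word lower-case, following words capitalized, e.g. camelCase."""
    words = split_words(header)
    if not words:
        return ""
    return words[0] + "".join(word.capitalize() for word in words[1:])


headers = ["Survived", "SibSp", "Fare"]
print([to_const(h) for h in headers])
print([to_camel(h) for h in headers])
```

A single-word header only changes its letter case, while a two-word header such as SibSp keeps a visible word boundary in both styles.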

If you want a fully cleaned DataFrame, you can use the clean_df API from DataPrep. This API returns two outputs: the inferred data types and the cleaned DataFrame.

from dataprep.clean import clean_df
inferred_dtypes, cleaned_df = clean_df(df)
Image by Author

There are many parameters you can experiment with in this API. I suggest reading the documentation to see which parameters suit your data preparation purposes.


DataPrep Collection

The DataPrep Collection APIs are used for collecting data from a database or a web API. If you have access to a database such as MySQL or PostgreSQL, you can connect to it with the DataPrep API, and it is also possible to access public web APIs using the DataPrep connect API.

Collecting data from the web still requires an API code, but everything else is simplified. If you want to read more about the collection API, see the DataPrep documentation.


Conclusion

DataPrep is a Python package that lets you clean, connect to, and explore your dataset, often with a single line of code. Its capabilities include:

  • Data Exploration ( dataprep.eda )
  • Data Cleaning ( dataprep.clean )
  • Data Collection ( dataprep.connector )

I hope it helps!

