Data preparation is the first step in any data professional's workflow. Whether you want to analyze the data or preprocess it for a machine learning model, you need to prepare it first.
Preparing data means collecting, cleaning, and exploring it. To support all of these activities, there is a Python package called DataPrep. How would this package help us? Let's explore it together.
DataPrep
DataPrep is a Python package developed to prepare your data. It contains three main APIs for us to use:
- Data Exploration (dataprep.eda)
- Data Cleaning (dataprep.clean)
- Data Collection (dataprep.connector)
The DataPrep package is designed for fast data exploration and works well with Pandas and Dask DataFrame objects. To explore DataPrep's capabilities, we need to install the package first.
pip install -U dataprep
After we finish installing the package, let’s use the APIs to prepare our data.
DataPrep Exploration
DataPrep lets us create an interactive profile report with a single line of code. The report is an HTML object, separate from our notebook, with many exploration options. Let's try the API with sample data.
from dataprep.datasets import load_dataset
from dataprep.eda import create_report
df = load_dataset("titanic")
df.head()

We will use the Titanic sample dataset for our data. After loading the data, we use the create_report function to produce the interactive report.
create_report(df).show_browser()

As we can see, the API creates a nice interactive report for us to explore. Let's dissect the information tab by tab.

The Overview tab shows summary information for the whole dataset, including the number and percentage of missing values, the number of duplicate rows, the variable data types, and detailed information for each variable.

The Variables tab gives us detailed information for each variable in the dataset. Almost all the information you need is available, for example, unique values, missing data, quantile and descriptive statistics, distribution, and normality.
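If you only want to explore a single variable without generating the full report, dataprep.eda also exposes a plot function. A minimal sketch, using the Age column from the Titanic dataset:
from dataprep.eda import plot
# Distribution plots and summary statistics for a single column
plot(df, "Age")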

The Interactions tab creates a scatter plot from two numerical variables. We can set the X-axis and Y-axis ourselves, which gives us control over how we want to visualize the relationship.
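The same bivariate view can be produced outside the report with the plot function shown above, by passing two columns:
# Bivariate plots (e.g., scatter plot) for two numerical columns
plot(df, "Age", "Fare")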

The Correlations tab gives us a heatmap of the statistical correlations between numerical variables. Currently, there are three calculations we can choose from: Pearson, Spearman, and Kendall Tau.
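The same heatmaps can also be generated directly with the plot_correlation function from dataprep.eda:
from dataprep.eda import plot_correlation
# Correlation heatmaps (Pearson, Spearman, Kendall Tau) for the numerical columns
plot_correlation(df)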

The Missing Values tab gives us detailed information about the missing values in our dataset. We can choose between a Bar Chart, Spectrum, Heat Map, and Dendrogram to fully explore the missing-value information.
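Outside the report, the plot_missing function offers the same views:
from dataprep.eda import plot_missing
# Bar chart, spectrum, heat map, and dendrogram views of missing values
plot_missing(df)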
DataPrep Cleaning
The DataPrep cleaning API collection offers more than 140 functions to clean and validate our DataFrame, for example, functions for cleaning column headers, country names, dates, email addresses, and phone numbers (a small sketch follows below).
There are many more functions than this article can cover. If you are interested, you can check out the documentation here.
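As a small illustration of two of these cleaning functions, clean_country standardizes country names and clean_email validates email addresses. The sample values below are made up for demonstration:
import pandas as pd
from dataprep.clean import clean_country, clean_email

# Hypothetical sample data for illustration
sample = pd.DataFrame({
    "country": ["USA", "U.K.", "Republic of Korea"],
    "email": ["alice@example.com", "bob@example", "carol@example.org"],
})

# Standardize country names to a consistent format
clean_country(sample, "country")

# Validate and clean the email addresses
clean_email(sample, "email")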
Let’s try the Columns Headers cleaning function with our Titanic dataset example.
from dataprep.clean import clean_headers
clean_headers(df, case = 'const').head()

Using the ‘const’ case, we end up with column names in all-uppercase constant case. Let’s switch the case to ‘camel’:
clean_headers(df, case = 'camel').head()

The result is camelCase column names: single-word names become all lowercase, while multi-word names such as ‘SibSp’ become ‘sibSp’.
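If no case is passed, clean_headers uses snake_case by default (to the best of my knowledge of the library's defaults):
# e.g., SibSp becomes sib_sp
clean_headers(df).head()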
If you want a completely cleaned DataFrame, you can use the clean_df API from DataPrep. This API has two outputs: the inferred data types and the cleaned DataFrame.
from dataprep.clean import clean_df
inferred_dtypes, cleaned_df = clean_df(df)
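We can then inspect both outputs, for example:
# Inspect the data types DataPrep inferred for each column
print(inferred_dtypes)

# Inspect the cleaned DataFrame
cleaned_df.head()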

There are many parameters you can play around with in this API. I suggest you read the documentation to see which parameters suit your data preparation purposes.
DataPrep Collection
The DataPrep collection APIs are used to collect data from a database or a web API. If you have access to a database such as MySQL or PostgreSQL, you can connect to it with the DataPrep API, and it is also possible to access public web APIs using DataPrep's connect API.
If you want to collect data from the web, you still need the provider's API credentials, but everything else is simplified. If you want to read more about the collection API, you can read it all here.
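As a rough sketch of how the connector is typically used: the connector name, endpoint, and query parameters below are placeholders, and the access token is a hypothetical credential you would obtain from the provider.
from dataprep.connector import connect

# Connect to one of the built-in connector configurations; the auth token is a placeholder
conn = connect("youtube", _auth={"access_token": "<your_api_key>"})

# query() is asynchronous, so await it (e.g., inside a Jupyter notebook)
df_videos = await conn.query("videos", q="dataprep", _count=10)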
Conclusion
DataPrep is a Python package for cleaning, connecting to, and exploring your dataset, often with a single line of code. Its capabilities include:
- Data Exploration (dataprep.eda)
- Data Cleaning (dataprep.clean)
- Data Collection (dataprep.connector)
I hope it helps!
Visit me on my LinkedIn or Twitter.
If you enjoy my content and want to get more in-depth knowledge regarding data or just daily life as a Data Scientist, please consider subscribing to my newsletter here.
If you are not subscribed as a Medium Member, please consider subscribing through my referral.