DataPrep.Clean: Accelerate Your Data Cleaning

A better way to clean data


Authors: Brandon Lockhart and Alice Lin

DataPrep is a library that aims to provide the easiest way to prepare data in Python. To address the onerous data cleaning step of data preparation, DataPrep has developed a new component: DataPrep.Clean.

DataPrep.Clean contains simple and efficient functions for cleaning, standardizing, and validating data in a DataFrame. The functions use a unified interface to perform common data cleaning operations required for various types of data.

This article demonstrates how to use DataPrep.Clean to simplify and speed up data cleaning tasks.

Incorrect or inconsistent data can lead to false conclusions. Therefore, it is imperative for data scientists to clean their data to ensure the results are accurate. However, data cleaning consumes a significant portion of a data scientist’s working time — 26% according to a recent survey — and is normally considered to be tedious and mundane work.

Data cleaning tasks usually involve writing regular expressions to ensure values follow an allowed pattern and writing scripts to transform and standardize values, which can be difficult and error-prone. Furthermore, these tasks are often data or domain-specific, and need to be performed anew for each dataset.

Why DataPrep.Clean?

There are three reasons DataPrep.Clean is the ideal tool for data cleaning in Python:

  1. A Unified API: each function follows a simple naming convention, clean_type() and validate_type(), where type is the semantic data type of interest.
  2. Efficiency: the data is processed with the parallel computing library Dask to achieve fast performance.
  3. Transparency: after executing a cleaning function, a report is generated that describes the alterations that were made to the data.

Let’s get started with DataPrep.Clean.

Installation

You can install DataPrep with pip using the command:

pip install -U dataprep==0.3.0a0

This is an alpha version, and DataPrep version 0.3 will be released soon.

Loading the Dataset and Getting Started

We will use the dataset waste_hauler from DataPrep’s internal dataset repository. Let’s start by loading the DataPrep.Clean functions and the dataset into a pandas DataFrame:

from dataprep.clean import *
from dataprep.datasets import load_dataset
df = load_dataset('waste_hauler')

Let’s take a look at the dataset:

df.head()

Notice that the words in the column headers are uppercase and separated by spaces. However, it is easier to work with headers in snake case: you don’t need to hold down the shift key when typing a header, and you can access each column directly as an attribute of the DataFrame (e.g., df.local_address). To transform the headers, we can use the function clean_headers(), which takes a DataFrame as input and returns the DataFrame with the headers in a desired style:

df = clean_headers(df)
df.columns

clean_headers() converts the headers to snake case by default, but many other styles are supported. Check out the user guide for more information.
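For example, the user guide describes a case parameter for selecting the header style. A minimal sketch, assuming a 'camel' option as documented there (verify the supported styles against your installed version):

# hypothetical style switch: convert headers to camelCase, then back
df = clean_headers(df, case='camel')  # e.g. local_address -> localAddress
df = clean_headers(df)                # restore the default snake case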

Data Standardization

DataPrep.Clean provides simple functions that parse, reformat and standardize values in a column. These functions follow the naming convention clean_type(), where type is the data type (such as phone numbers or email addresses) of the column that is to be cleaned. A DataFrame and column name are passed as input, and the DataFrame is returned with a new column containing the cleaned values of the specified column. The data can also be cleaned in place by specifying the parameter inplace=True.
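The examples below keep the original column, so here is a quick sketch of the in-place option. Per our reading of the documentation, inplace=True deletes the original column rather than overwriting its values (verify against your version):

# default: adds a new column 'phone_clean' alongside 'phone'
df = clean_phone(df, 'phone')

# inplace=True: drops the original 'phone' column, keeping only the
# cleaned column (behavior as we understand the documentation)
df = clean_phone(df, 'phone', inplace=True)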

Let’s see how to standardize phone numbers and street addresses using DataPrep.Clean.

Phone Numbers

Let’s take a look at the column phone:

df.phone.head()

As we can see, the phone numbers have different formats. To standardize their formats, we can use the function clean_phone(), which will by default convert all phone numbers into the format NPA-NXX-XXXX. clean_phone() takes a DataFrame and the name of the column that is to be cleaned as input, and it returns the original DataFrame augmented with a new column phone_clean containing the standardized phone numbers.

df = clean_phone(df, 'phone')
df[['phone', 'phone_clean']].head()

Also, after calling clean_phone(), a summary report is printed that describes the alterations made to the column and the quality of the data in the resulting column.

To output the phone numbers in the format (NPA) NXX-XXXX, we can set the output_format parameter to national:

df = clean_phone(df, 'phone', output_format='national')
df[['phone', 'phone_clean']].head()
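Other output formats are also supported; for example, the documentation lists an e164 option for the international E.164 format (the exact set of formats may vary by version):

# hypothetical: output numbers in E.164 format, e.g. +15551234567
df = clean_phone(df, 'phone', output_format='e164')
df[['phone', 'phone_clean']].head()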

For more information about clean_phone(), check out the user guide.

Street Addresses

Next, let’s take a look at the local_address column:

df.local_address.head()

There are several inconsistencies in these address representations. For example, using “AVE.” (index 0) and “AVENUE” (index 1) to represent “avenue”, or “EAST” (index 3) and “E” (index 4) to represent “east”. To standardize these inconsistencies, we can use the function clean_address():

df = clean_address(df, 'local_address')
df[['local_address', 'local_address_clean']].head()

Now each address in the column local_address_clean has the same, consistent format.

We may also want to extract components from the address, such as the city and state. We can accomplish this by setting the parameter split to True, which will create a new column in the DataFrame for each address component. Let’s call clean_address() with split=True and take a look at the new columns:

df = clean_address(df, 'local_address', split=True)

These individual components make it easy for the user to format an address as desired or perform aggregate analysis on cities and states.
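For instance, a city-level breakdown becomes a one-liner in pandas. A sketch, with an assumed component column name (check df.columns for the names your version actually produces):

# inspect the component columns that split=True created
print(df.columns)

# hypothetical aggregation: count haulers per city
# ('city' is an assumed column name; substitute the actual one)
df.groupby('city').size().sort_values(ascending=False).head()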

For more information about clean_address(), check out the user guide.

Data Validation

Data validation refers to the process of ensuring that data is correct and correctly represented. DataPrep.Clean provides functions for validating various data types. The data validation functions follow the naming convention validate_type(), where type is the name of the semantic data type that is to be validated.

Each validation function takes a data Series as input and returns a boolean Series that indicates whether each value is of the specified data type or not. For example, let’s use the function validate_email() on the Series df['email']:

df['valid_email'] = validate_email(df['email'])
df[['email', 'valid_email']].head()

As we can see, some values in the email column are valid email addresses and others are not, and validate_email() returns True or False accordingly. Moreover, we can filter the dataset to identify records that have incorrect email addresses:

df[~validate_email(df['email'])].dropna()

As we can see, these rows of the email column erroneously contain text and dates. Thus, validate_email() enables quick identification of these values so the data scientist can remove or correct the invalid email addresses as needed.

Data Type Detection

Next, we will see how DataPrep.Clean can be used for semantic data type detection. A dataset often comes with informative column names that indicate the type of data in each column, but sometimes it does not, in which case it is useful to infer each column’s data type. You may also have a large number of columns or multiple datasets and want to identify the columns of a specific data type, which can be difficult and time-consuming to do by hand.

Below is a function that identifies the data type of each column, specifically checking for phone numbers, email addresses, street addresses, and dates.

import pandas as pd

def detect_data_types(df, thresh):
    # label each column whose non-null values predominantly
    # validate as one of the four types
    ls = []
    for col in df.columns:
        srs = df[col].dropna()
        n = len(srs)
        if validate_phone(srs).sum() / n > thresh:
            ls.append((col, "phone"))
        elif validate_email(srs).sum() / n > thresh:
            ls.append((col, "email"))
        elif validate_address(srs).sum() / n > thresh:
            ls.append((col, "address"))
        elif validate_date(srs).sum() / n > thresh:
            ls.append((col, "date"))
    return pd.DataFrame(ls, columns=["Column Name", "Data Type"])

detect_data_types() takes a DataFrame df and threshold thresh as input. For each column in df, the validate_type() functions are used to determine how many values in the column satisfy each data type. If the proportion of values that satisfy the type is greater than thresh, the column is determined to be of that type.

Let’s call detect_data_types() with a threshold of 0.8:

df_types = detect_data_types(df, thresh=0.8)
df_types

As we can see, the address, phone and email data types are correctly identified. Moreover, the column created is identified as containing dates, which is non-trivial to infer from the column name.

Conclusion

With the exponential increase in data collection across many fields, users of all skill levels need to derive insights from data. To avoid “garbage in, garbage out,” data cleaning must be performed, and it can be difficult, time-consuming, and tedious.

DataPrep.Clean, through its simple APIs, has the potential to turn data cleaning from the bane of the data scientist’s existence into a fast and easy process.

DataPrep.Clean currently contains functions for:

  • Column Headers
  • Country Names
  • Dates and Times
  • Email Addresses
  • Geographic Coordinates
  • IP Addresses
  • Phone Numbers
  • URLs
  • US Street Addresses

and more functions are currently being developed.
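These all follow the same unified API. As a quick, hedged sketch with hypothetical column names (clean_country(), clean_ip(), clean_url(), and clean_date() appear in the documentation, though parameters may differ by version):

# hypothetical columns, shown only to illustrate the naming convention
df = clean_country(df, 'country')  # e.g. 'USA' -> 'United States'
df = clean_ip(df, 'ip_address')    # standardize IP address strings
df = clean_url(df, 'website')      # parse and standardize URLs
df = clean_date(df, 'created')     # standardize dates and times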

To learn more about DataPrep.Clean, check out the user guides and API references.
