Data Cleaning Using Python Pandas

A comprehensive guide to using built-in Pandas functions to clean data prior to analysis

Photo by Stan Y on Unsplash

Introduction

Over time, companies produce and collect massive amounts of data which, depending on the company, can come in many different forms such as user-generated content, job applicant data, blog posts, sensor data and payroll transactions. Because of the immense number of source systems that can generate data, and the number of people who contribute to data generation, we can never guarantee that the data we receive is a clean record. Records may be incomplete due to missing attributes, they may contain misspelled user-entered text, or they may hold invalid values such as a date of birth in the future.

As a data scientist, it’s important to recognise these data quality issues early during the exploration phase and to cleanse the data prior to any analysis. By allowing uncleaned data through our analysis tools, we run the risk of misrepresenting a company’s or its users’ data by delivering poor-quality findings based on incorrect records. Today we will use Python and Pandas to explore a number of built-in functions that can be used to clean a dataset.

Getting Started

For today’s article, we are using PyCharm, an integrated development environment built for Python. For beginners it’s an excellent tool because it streamlines the creation of virtual environments and the installation of specific Python packages such as Pandas. Using virtual environments allows you to manage your project dependencies without impacting your operating system’s default Python installation.

Begin by opening PyCharm and selecting File > New Project from the navigation bar. Here you can name your project via the Location field and either create a new virtual environment or reference an existing one.

PyCharm’s ‘New Project’ dialogue box showing where to name a project and virtual environment creation paths.

To get you started, we have created a base Python script and a CSV file, both of which will be referenced throughout the article. Both files can be downloaded and saved to your root project directory. The final script can be found here.
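
A minimal sketch of such a base script is shown below. The file name and column names (employee_data.csv, gender, date_of_birth, salary, emp_no) are assumptions for illustration rather than the article’s actual values; the sketch is laid out so the line numbers referenced in the next paragraph correspond.

```python
import pandas as pd

# Override default display options so larger files
# do not clip the console output when printing
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Read the CSV file into memory
df = pd.read_csv('employee_data.csv')

# Set explicit data types for three columns (column names are assumptions)
df['gender'] = df['gender'].astype('category')
df['date_of_birth'] = pd.to_datetime(df['date_of_birth'])
df['salary'] = df['salary'].astype('float64')

# Rename a generic column to something more meaningful
df.rename(columns={'emp_no': 'employee_number'}, inplace=True)
```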

The above script demonstrates a number of DataFrame manipulations after reading a file into memory. On lines 5–7, we override a number of Pandas’ default configurations which, when dealing with larger files, can clip the console output when printing. To read the data into memory we use Pandas’ built-in function read_csv() on line 10, which takes a file name as a parameter. On lines 13–15, we set the data type of three columns, which has a number of benefits. Firstly, setting the data type improves performance when processing DataFrame rows by reducing the memory footprint. Secondly, it enriches the descriptive statistics output we get when running Pandas’ built-in .describe() function. On line 18, we perform a column rename, a common practice to convert a generic column name into something more meaningful.

Cleaning a DataFrame

To understand which aspects of the dataset need cleaning, we first need to see what data we are dealing with. The best way to do this is by using Pandas’ built-in functions .info() and .describe().
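
For example, assuming the df DataFrame from the base script:

```python
# print a concise summary: column dtypes, non-null counts and memory usage
df.info()

# print descriptive statistics across all columns
print(df.describe(include='all'))
```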

Console output after calling Pandas built-in function .info().

The result of .info() above clearly shows the number of non-null elements in each column, so the number of null elements can be deduced using the RangeIndex from the first line. A quick way of summarising this is to call df.isnull().sum(), which sums the number of null elements in each column.

Console output after calling Pandas built-in .isnull().sum() function.

If we need to know the number of null elements as a percentage of the whole dataset, we can do the following.
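
As a sketch, dividing the per-column null counts by the total row count:

```python
# null counts per column as a percentage of the total number of rows
null_percentage = df.isnull().sum() / len(df) * 100
print(null_percentage)
```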

Removing Columns

One element that jumps out after calling .info() and .isnull().sum() is the tax_file_no column, which across 1,000 records has 1,000 null values. The easiest way to remove these types of columns is by using Pandas’ .dropna(). The .dropna() function takes the form .dropna(axis=0, how='any', thresh=None, subset=None, inplace=False). The axis parameter determines whether the function is applied to rows (axis='index') or columns (axis='columns'). The how parameter can be how='any' or how='all', meaning the column or row will be dropped if any or all of its elements have null values. The thresh parameter lets you set the minimum number of non-null elements required within the column or row, otherwise it will be dropped. The subset parameter is optional and can receive a list of column names; if the list is empty, the .dropna() function will be applied across all columns or rows. inplace can be either True or False; if True, None will be returned and the DataFrame is modified in place. For our example, we use df.dropna(axis='columns', how='all', inplace=True). After executing .dropna() and printing the DataFrame info to the console, you will notice that the tax_file_no column is no longer included.
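
A sketch of that call:

```python
# drop any column where every value is null, modifying df in place
df.dropna(axis='columns', how='all', inplace=True)
df.info()
```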

Console output from .info() after executing .dropna() showing the column tax_file_no has been dropped.

Removing Rows

After checking for completely null columns, it’s worth checking whether there are any rows that do not contain enough usable elements. We can achieve this by making use of .dropna(thresh=2) to remove any rows that have fewer than two non-null elements.
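
A sketch of this approach follows; the variable names are assumptions, and the line references in the next paragraph refer to this snippet.

```python
# keep rows with at least two non-null values, then isolate what was dropped
under_threshold_removed = df.dropna(axis='index', thresh=2)
under_threshold_rows = df[~df.index.isin(under_threshold_removed.index)]
print(under_threshold_rows)
```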

On line 2, we drop any rows that have fewer than two non-null elements. Line 3 then returns any rows from the original DataFrame whose index does not exist in under_threshold_removed, i.e. the dropped rows. It accomplishes this by negating (~) Pandas’ built-in .index.isin() function. In our example, one record is returned that contains only the employee number.

Filling Missing Values

In certain circumstances, we may want to retain rows that contain missing values and instead give them a default value. For this example, we are going to map the null values within the gender column to U for Unknown. Pandas provides a built-in function that can achieve this: .fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None).

Pandas’ .fillna() is an incredibly powerful function when cleaning data or manipulating a DataFrame. The value parameter can accept a dictionary, which allows you to specify the values used to fill nulls in specific columns; this is useful when you don’t wish to apply the same fill value to every null element in the DataFrame. The method parameter allows you to fill null values based on other values within the series: setting it to ffill fills a null with the last valid value that precedes it in the series, while bfill performs a similar function except the filling value is taken from the next valid value that appears in the series. Both the axis and inplace parameters achieve the same outcome as in the .dropna() function discussed above. If you have set the method parameter, you can use the limit parameter to tell .fillna() the maximum number of consecutive null values to fill forwards or backwards before a valid value is required. If the method parameter is None, the limit parameter sets the maximum number of null values in the series that will be filled by the passed value.
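
A couple of hedged examples of these parameters; the column names first_name and salary are hypothetical:

```python
# fill nulls with per-column defaults supplied via a dictionary
# (first_name and salary are hypothetical column names)
df_defaults = df.fillna(value={'first_name': 'Unknown', 'salary': 0})

# forward-fill nulls, filling at most one consecutive null per gap
df_ffilled = df.fillna(method='ffill', limit=1)
```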

As the gender column has a categorical data type, before we can map missing values to U, we first need to create a new category in the series.

To add a new category, we need to isolate the gender column as a series and call .add_categories(). This allows you to pass in new categories using new_categories=[], which accepts a list. Once the new category has been created, you can use .fillna() and reference it. To view the changes, you can call df['gender'].value_counts(), which isolates the gender column and provides a value count for each category within the series.
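
A sketch of these steps, assuming the gender column already has a categorical data type:

```python
# add 'U' as a valid category, then fill null values with it
df['gender'] = df['gender'].cat.add_categories(new_categories=['U'])
df['gender'] = df['gender'].fillna('U')

# confirm the mapping with a per-category value count
print(df['gender'].value_counts())
```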

Console output showing 206 missing values now mapped to the new category U in the gender column.

Improving Readability

In this context we understand the meaning behind the categories M, F and U; however, the stakeholders receiving the data may not. To improve the readability of the analysis, we can rename the categories to something more meaningful.
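
A sketch of the renaming, with the exact labels as assumptions:

```python
# map the single-letter categories to more descriptive labels
df['gender'] = df['gender'].cat.rename_categories(
    {'M': 'Male', 'F': 'Female', 'U': 'Unknown'}
)
print(df['gender'].value_counts())
```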

The above snippet uses the built-in function .rename_categories(), which accepts a dictionary mapping the current categories to new ones. The console output below shows the results of the renaming.

Console output showing the value counts of the newly renamed categories from the gender column.

Summary

As a data scientist, you are going to encounter incomplete and poor-quality datasets throughout your career. To generate high-quality analysis, we need to ensure we are cleaning data so that it accurately represents the underlying dataset. Pandas offers a diverse range of built-in functions that can be used to clean and manipulate datasets prior to analysis. It allows you to drop incomplete rows and columns, fill missing values, and improve the readability of the dataset through category renaming.

Thank you for taking the time to read our story; we hope you found it valuable.

