Introduction
At some point in our data science careers, we are all going to come across poor-quality data, whether that means partially completed records or incorrectly formatted attributes. Being able to manage poor-quality data has become a crucial skill for a successful data scientist.
Thankfully, numerous libraries such as Pandas have been developed that we can use to manipulate datasets efficiently. Today we are going to look at how to use Pandas' built-in .apply() function to manipulate a dataset.
Pandas Apply Function
The Pandas apply function does as its name suggests: it allows you to apply a function along an axis of a DataFrame or to the values of a Series. It is incredibly powerful when you need to manipulate a column or row, for example when reformatting or cleaning data.
The first parameter, func, allows you to pass in a function. This can be a function from an imported package, a function you've declared in your script, or even a lambda function. The below Python snippet demonstrates how to pass in the three different types of functions between lines 21–23. Here we are creating three new columns with Pandas' .apply() function by passing three different functions that yield the same result.
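The embedded snippet is not reproduced in this text, so the following is a minimal sketch of the idea rather than the original code, and its line numbers will not match. The sample data, the column names and the square() helper are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column name "value" is an assumption.
df = pd.DataFrame({"value": [1, 2, 3, 4]})


def square(x):
    """Return the square of a single value (hypothetical helper)."""
    return x ** 2


# 1. A function from an imported package
df["sq_numpy"] = df["value"].apply(np.square)
# 2. A function declared in the script
df["sq_custom"] = df["value"].apply(square)
# 3. A lambda function
df["sq_lambda"] = df["value"].apply(lambda x: x * x)

print(df)
```

All three new columns should hold the same values, since each function computes the square of the value column.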

The second parameter, axis, allows you to define which axis the function is going to be applied along. With 0 or 'index', the function is applied to each column; with 1 or 'columns', it is applied to each row of the DataFrame. Personally, I prefer to use 'index' and 'columns', as it improves the readability of your code for those who don't understand the meaning behind 0 and 1.
The third parameter, raw, determines whether each row or column is passed to the function as a Series or as an n-dimensional array (ndarray) object. If raw=False (the default) is passed, the row or column is passed to the function as a Series; if raw=True is passed, ndarray objects are passed to the function instead.
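To make the axis and raw parameters more concrete, here is a small sketch. The DataFrame and its column names are made up for illustration and are not taken from the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis="index" (or 0): the function receives each column in turn
col_totals = df.apply(lambda col: col.sum(), axis="index")

# axis="columns" (or 1): the function receives each row in turn
row_totals = df.apply(lambda row: row.sum(), axis="columns")

# raw=False (the default): each row arrives as a pandas Series
got_series = df.apply(lambda row: isinstance(row, pd.Series), axis="columns")

# raw=True: each row arrives as a NumPy ndarray
got_ndarray = df.apply(
    lambda row: isinstance(row, np.ndarray), axis="columns", raw=True
)

print(col_totals)
print(row_totals)
```

Here col_totals has one entry per column, while row_totals has one entry per row. Passing raw=True can be worthwhile when the function only needs the underlying values, such as a NumPy reducing function, and not the Series labels.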
The parameter result_type alters how the results are returned when the function is applied along the columns axis (axis=1) of a DataFrame. By default result_type is set to None; however, it can also accept 'expand', 'broadcast' and 'reduce'. If 'expand' is passed, any list-like results that are returned will be expanded across columns. If the length of the returned lists is greater than the number of columns in the DataFrame, then additional columns will be created to expand across. During the expansion of the lists, the original column names of the DataFrame will be overridden by a range index. If the original shape of the DataFrame needs to be retained, then 'broadcast' can be passed; this will also ensure the retention of the original column names. If you need to ensure that list-like results aren't expanded, then you can pass 'reduce' to result_type.
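The three result_type options can be compared side by side with a short sketch. The DataFrame and the lambda functions below are assumptions chosen purely to illustrate the behaviour.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# "expand": each returned list is spread across columns; the list has three
# items, so the result has three columns labelled 0, 1 and 2
expanded = df.apply(
    lambda row: [row["a"], row["b"], row["a"] + row["b"]],
    axis="columns",
    result_type="expand",
)

# "broadcast": the result keeps the original shape and column names, so each
# returned list must have one item per original column
broadcast = df.apply(
    lambda row: [row["a"] * 10, row["b"] * 10],
    axis="columns",
    result_type="broadcast",
)

# "reduce": list results are not expanded; a Series of lists is returned
reduced = df.apply(
    lambda row: [row["a"], row["b"]],
    axis="columns",
    result_type="reduce",
)

print(expanded)
print(broadcast)
print(reduced)
```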
The last parameter covered here, args=(), allows you to pass additional positional arguments to the function that you are applying. In the example below, we are passing 25 and 75 as the lower and higher parameters of the function between_range(). As the function is being applied to the column, each value in the result series will be assessed as to whether it is inside or outside the value range, and a boolean will be returned within the in_range column.
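The original snippet is not reproduced in this text, so the following is a minimal sketch of what it might look like. The body of between_range(), the choice of inclusive bounds and the sample values are all assumptions; only the names between_range, lower, higher, result and in_range come from the description above.

```python
import pandas as pd


def between_range(value, lower, higher):
    """Return True when value lies between lower and higher.

    Inclusive bounds are an assumption; the article does not say.
    """
    return lower <= value <= higher


# Hypothetical sample data for the "result" column.
df = pd.DataFrame({"result": [10, 25, 50, 75, 90]})

# Each value is passed as the first argument; args supplies lower and higher
df["in_range"] = df["result"].apply(between_range, args=(25, 75))

print(df)
```

With these sample values, only 10 and 90 fall outside the range. Keyword arguments can also be passed straight through apply(), for example .apply(between_range, lower=25, higher=75), which some readers may find clearer than a positional tuple.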

Summary
Pandas' built-in apply() function is an incredibly powerful tool to know and understand when it comes to dealing with poor-quality data. It provides an efficient way to apply a function along an axis of a DataFrame to clean or manipulate your data. With the flexibility of the parameters you can pass, the apply() function will allow you to tackle almost any data quality issue.
Thank you for taking the time to read our story. We hope you have found it valuable 🙂