Pandas Transform — More Than Meets the Eye

Daphna Regev
Towards Data Science
4 min readOct 28, 2018

--

Lately I’ve been working with Pandas. While working on a project I encountered a nifty function I hadn’t known about, and after asking around it seems I’m not the only one missing out, so let’s remedy that.

First, let’s review the basics.

Split — Apply — Combine

This is a common methodology. We want to split our data into groups based on some criteria, then we apply our logic to each group and finally we combine the data back together into a single data frame. Let’s look at our data. We have the yearly salaries for several employees in two different firms:

Well researched data, aka I made it up.

A well known function for the “apply” stage is aggregate (or agg which is the same). Why have all those pesky cells when you can just squish them down to one? With aggregate you can sum up your data, find the average or any other such calculation, like this:

mean_salary = df.groupby('Company')['Yearly Salary'].\
aggregate('mean').rename("Mean Salary").reset_index()

You can even specify different functions for each column!

In any case, the result is that each column in the group has been reduced to a single cell of data. Now in order to combine it back to our dataframe we need to use merge.

df1 = df.merge(mean_salary)

Another function we can use in the “apply” stage is (surprise!) the apply function. It lets us apply any function we want to a column (or row) in the data frame. It can be a lambda function or a function we defined elsewhere. This time the data isn’t reduced, instead we operate on each cell individually:

df['Yearly Salary K'] = df['Yearly Salary'].apply(lambda x: x*1000)

I had been happily using these functions for a while when I encountered a problem. I had data that I wanted to split, run some logic on and recombine — but I didn’t want to reduce the data or run on each row individually. Instead I wanted some calculation to run on each cell of my column in each group and return a result that took into account the whole group. The size of the data needed to remain unchanged.

Long story short —after wrestling with this for a while, a teammate told me to look up aggregate’s lesser known cousin, transform. Bingo!

Transform

Lets take a look at how transform works:

df.groupby('Company').transform('mean')

Instead of reducing the results we get a result of the same size as the original data. This makes combining the data back super simple. Lets add another line to show what we can do with this:

df['avg_company_salary'] = df.groupby('Company').transform('mean')
df['is_above_avg_salary'] = \
df['avg_company_salary'] < df['Yearly Salary']

As we showed earlier you can accomplish the same results with aggregate and merge in this specific example, but the cool thing about transform is that you do it in a single step.

I hope you too will find the transform function useful, and that you’ll get a chance to use it soon!

©Hasbro

Interested in git as well? Read my post Git: Level Up to step up your git game.

--

--

Data Scientist @ Trax. I like clean code, machine learning and sushi, not necessarily in that order.