Turbocharge your data manipulation skills

Unlock the power of pandas groupby, apply and transform

Photo by Kier in Sight on Unsplash

In a competitive and data-rich world, understanding segmental behaviour is key to providing tailored insights and product offerings.

Whether that’s achieved by exploring segmental trends with descriptive statistics, or via more nuanced approaches like including segmental features in machine learning models, some data manipulation is required.

Luckily for us, pandas provides highly versatile functionality which allows us to cut through the majority of the heavy manipulation required for segmenting data in various ways. Using some examples, we’ll demonstrate:

  1. The groupby operation – what it is, how it works, and what it returns.
  2. How to use apply with groupby in order to apply more complex and exotic transformations.
  3. Using groupby and transform to map the magic of groupby and apply back to the original data shape.
  4. Some tips and tricks that I’ve picked up over time.

Let’s get cracking – first up, getting some data to play around with.

The data

This time round we’ll be using information collected from a portfolio of consumer credit cards¹.

Apart from column name changes and some format changes I’ve applied, the data is almost intact – a sneak peek:

Image by author

We have various categorical features (e.g. education level) mixed in with numeric features (e.g. customer age). The data set is wider than the snapshot above implies, so don’t be surprised if you see new and interesting features being used below.

Groupby

First things first – the groupby.

From the documentation²:

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

That’s quite a good explanation. I also find it helpful to think of a groupby as a sort of for loop:

  1. Segment the data into various groups. Or, group the data by referring to the given segmentation "rules".
  2. For each of the groups, apply a specified function. This could be a simple calculation like an average, or more exotic and complex calculations (more on that later).
  3. Collate the results.
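In rough pseudocode, that mental model might look something like this – just a sketch, where segments is a list of grouping columns, feature is a column of interest, and the usual import pandas as pd is assumed; pandas does all of this far more efficiently under the hood:

# the loop analogy: split the data into groups...
results = {}
for group_key, group_df in df.groupby(segments):
    # ... apply a function to each group...
    results[group_key] = group_df[feature].mean()

# ... and collate the results
pd.Series(results)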

If that still feels a little abstract, let’s work through a couple of examples.

A starter – determining the average credit limit for each type of customer, across the gender groups:

# average credit limit for customer type x gender
segments = ['attrition_flag','gender']
df.groupby(segments)['credit_limit'].mean().round().to_frame()
Image by author

Easy peasy, lemon groupby.

Let’s ignore the massive gender discrepancy in credit limits for now and focus on the underlying code. What pandas has done for us is divide the DataFrame into attrition_flag x gender groups, calculate the average credit_limit for each group, and collate the results.

What if we wanted to extend the segmentation to include education level, and also calculate the average credit utilisation rate? That’s not too complex:

# extended segmentation and feature selection
segments = ['attrition_flag','education_level','gender']
features = ['credit_limit','avg_utilization_ratio']
df.groupby(segments)[features].mean()
Image by author

Changing the calculation being applied to each group is also straightforward. For instance, if we wanted to count the number of observations in each segment, we could use size:

# group counts
segments = ['attrition_flag','gender']
feature = 'customer_age'
df.groupby(segments)[feature].size().to_frame(name = 'group_size')
Image by author

This leads us to a more general recipe for groupby statements:

df.groupby(segments)[features].function()

… where we can tailor segments, features, and function to individual use cases. I’ve been purposefully vague on what function actually is – we’ll see why in a minute – but the general idea holds true in most cases.

Before we move on, a few things worth noting:

  • So far, we’ve performed very basic calculations using two built-in pandas functions – mean and size. There are of course many other useful built-in functions, like median and standard deviation. In a minute, we’ll see how we can use more complex functions, custom functions, and functions from other packages.
  • Using the groupby statement the way we have returns an object which has the same column names as the original DataFrame. It’s quite plausible that you end up with data sets which look similar but consist of very different information – don’t be caught out. Sensible naming conventions and code structure can help with this, as can renaming columns appropriately.
  • So far, we’ve used the same function on multiple columns. If you’d like to apply different functions to different columns, or different functions to the same column, in the same groupby statement, I’d check out the agg functionality³. It also has a neat way of letting the user define the names of calculation results through tuples – a quick sketch follows this list.
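As a taster, a named-aggregation sketch could look something like this (column names as per our data set):

# different calculations on different columns, with friendly output names
segments = ['attrition_flag','gender']
df.groupby(segments).agg(
    mean_credit_limit = ('credit_limit', 'mean'),
    median_age = ('customer_age', 'median'),
    group_size = ('customer_age', 'size')
)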

Now that we have a basic grasp of groupby, we can apply ourselves as we move on to see how we can start using more exotic functions.

See what I did there? So funny.

Pro tip: using a single feature in a groupby statement will return a Series, which Jupyter will present like a jumble of numbers. to_frame converts the Series into a DataFrame, which Jupyter outputs in a more aesthetically pleasing way.

Apply

The "standard" functions that we’ve seen above will only take us so far – that’s really the point of standard functions really.

If we need to do anything more advanced or bespoke, or rely on functionality from other packages, we need to combine groupby with apply.

The apply statement is quite self-explanatory, and described succinctly in the documentation⁴:

Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1).

Let’s do some examples of using bespoke functions on groupings.

We’ll start with a slightly different calculation: scaling the group median by the ratio of the group maximum to the group minimum. We’ll do this for each type of customer, split by gender.

# define the bespoke function
def scaled_median(s):
    # calculates Series median x Series maximum / Series minimum

    return s.median() * s.max() / s.min()

# apply it to data
segments = ['attrition_flag','gender']
feature = 'customer_age'
df.groupby(segments)[feature].apply(scaled_median).to_frame(name='result')
Image by author

If you’re feeling especially Pythonic, you can use lambda functions in the apply statement. That would look something like:

apply(lambda u: u.median() * u.max() / u.min())

Now, what about applying a custom function to multiple features (columns)? It takes a bit more thought – let’s use an example to demonstrate.

The organisation has a heuristic which indicates the likely receptiveness of a customer to a credit offer. This heuristic – let’s call it the "appetite score" – is based on marital status, gender, number of dependents, and the average credit utilisation. We would like to understand the average appetite score for each customer type and gender.

Code-wise, it could look something like this:

# maps to convert categorical to numeric
status_mapper = {'Married':1.1, 'Single':1, 'Unknown':0.8, 'Divorced':0.9}
gender_mapper = {'F':1.1, 'M':1}

# define the appetite score function
def cohort_score(dataframe):
    score = (
        dataframe['marital_status'].map(status_mapper).astype(float)   # float, not int – casting to int would truncate the decimal weights
        * dataframe['gender'].map(gender_mapper).astype(float) 
        * (1 + dataframe['dependent_count']) 
        * dataframe['avg_utilization_ratio']
    ).mean()

    return score

# apply
segments = ['attrition_flag','gender']
df.groupby(segments).apply(cohort_score).to_frame(name = 'score')

… which gives:

Image by author

A few things to notice here.

We use dictionaries and map to convert marital status and gender to numeric values for calculation in the heuristic, remembering to convert the data type appropriately.

We don’t explicitly select the features we need for the heuristic. This is actually a nifty trick which I haven’t really touched on – we can feed both Series and DataFrames into apply. In this case, since we don’t specify which features to use, we effectively pass each grouped DataFrame to the cohort_score function, which in turn selects the columns it needs.
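If we’d rather be explicit, we can select the features up front – a sketch which should be equivalent in spirit to the statement above:

# explicitly selecting the features the heuristic needs
features = ['marital_status','gender','dependent_count','avg_utilization_ratio']
df.groupby(segments)[features].apply(cohort_score).to_frame(name = 'score')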

The calculation of the mean is baked into the function. Now arguably, this is cheating a bit – in real life, we may want to build a heuristic which calculates the appetite score on an individual level, and then create group statistics from the result. This actually ties in well with two other concepts: using functions from other packages and using a lambda function.

Let’s do exactly that now – define a heuristic which doesn’t return an average, use numpy to calculate the group average, and wrap it in a lambda function.

# get numpy
import numpy as np

# define the appetite score function for an individual
def cohort_score(dataframe):
    score = (
        dataframe['marital_status'].map(status_mapper).astype(float) 
        * dataframe['gender'].map(gender_mapper).astype(float) 
        * (1 + dataframe['dependent_count']) 
        * dataframe['avg_utilization_ratio']
    )

    return score

# apply
segments = ['attrition_flag','gender']
d = df.groupby(segments).apply(lambda u: np.mean(cohort_score(u)))
d.to_frame(name = 'score')

which gives:

Image by author

Look familiar? It should do, as it’s exactly the same result as the calculation above.

One thing to notice is how easy it is to use a function from another package. In this case, we’ve used the mean calculation from numpy, but the concept is the same – feed the function into apply, making reference to the package it comes from (remembering, of course, to import the package!).

Transform

You’ll by now have noticed that groupby returns group statistics, presented at a group level. Perfect if that’s what you’re after – for instance, for visualisation or summation – but there are cases where we need to "map" the group result back to individual observations.

I’m sure there are many other applications for this, but the approach can be extremely useful in feature engineering. Imagine for a moment that you’re predicting house prices – wouldn’t it be good to know how big an individual house is compared to the "average" house in the same neighbourhood?
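To make that concrete, here’s a quick sketch – houses, neighbourhood, and floor_area are entirely hypothetical, and we’ll meet transform properly in a moment:

# how big is each house relative to its neighbourhood average?
houses['relative_size'] = (
    houses['floor_area']
    / houses.groupby('neighbourhood')['floor_area'].transform('mean')
)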

Luckily for us, pandas provides easy functionality which allows us to do just that, without the need for any joining or merging. This is the magic of transform, though the documentation undersells it a bit⁵:

Call func on self producing a DataFrame with the same axis shape as self.

Maybe an example would be useful! Let’s see how we could apply the scaled median function again, but this time using transform to create a new column in the data with the group results mapped back to each individual observation:

# get example data
df_example = df[['attrition_flag','gender','customer_age']].copy()

# map the group results back to each observation
segments = ['attrition_flag','gender']
feature = 'customer_age'
df_example['scaled_median_age'] = (
    df_example
    .groupby(segments)[feature]
    .transform(scaled_median)
)

df_example

which gives:

Image by author

Here we see how the same group values are produced, but this time mapped back to the shape of the original data and included as a new column in the DataFrame.

Tips and tricks

Now that we have a grounding in groupby, apply, and transform, let’s take a look at some tips and tricks that might be useful.

The versatility of apply

The apply function is very versatile. We’ve seen how to combine it with a groupby operation, but that’s not always necessary – you can use apply on a DataFrame directly.
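For instance, a quick sketch of a column-wise apply on the raw DataFrame – no grouping in sight:

# the range of each numeric feature, across the whole DataFrame
features = ['credit_limit','customer_age']
df[features].apply(lambda col: col.max() - col.min())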

Axes and apply

When I started using apply in anger, I encountered (probably more than) my fair share of error messages. More often than not, it was down to me not completely understanding the axis along which the function was being applied – i.e. incorrectly specifying whether the function should be applied to rows or to columns.

So, if you’re certain that your function works but pandas is still being disagreeable, I’d recommend taking a look at the axis argument in apply.
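A quick sketch of the difference, reusing the features list from the previous snippet:

# axis=0 (the default): the function receives one column at a time
df[features].apply(np.mean, axis=0)

# axis=1: the function receives one row at a time
df[features].apply(np.mean, axis=1)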

Using built-in functions with apply

As we’ve seen above, we can use built-in pandas functions directly on grouped DataFrames.

We can also use the same built-in functions in an apply, though we have to refer to the function by its name. For instance, using the built-in standard deviation function would look something like this:

segments = ['attrition_flag','gender']
feature = 'customer_age'
df.groupby(segments)[feature].apply('std')

So, as with most things Python, there’s more than one way of doing things. However, using built-in functions in an apply statement like this can actually be slower than using them directly.
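For comparison, the direct – and usually faster – form:

# calling the built-in directly on the grouped data
df.groupby(segments)[feature].std()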

Missing data imputation

We earlier touched on how apply and transform are useful techniques when doing feature engineering. Well, they’re also great for data cleansing, particularly if you want to impute missing values using segmental statistics.

Here’s an example where we use the median customer age for each customer type and marital status combination to impute missing customer age values.

# example data frame
df_example = df[['attrition_flag','marital_status','customer_age']].copy()

# randomly create missing values
df_example['customer_age'] = np.where(
    np.random.random(size = len(df)) <= 0.5,
    df['customer_age'],
    np.nan
)

# impute using median
df_example['customer_age_imputed'] = (
    df_example
    .groupby(['attrition_flag','marital_status'])['customer_age']
    .transform('median')
)

df_example
Image by author

This is obviously quite a contrived example but the logic transfers well to other use cases, especially if the groups are fairly homogeneous.

We’ll leave it here for now, and do a quick recap.

Wrapping up

We’ve covered a fair bit of ground here.

Firstly, we had a look at the groupby operation – what it is, how it works, and what it returns.

Once we found our feet using basic functions with groupby, we pushed the boat out a bit and moved on to using bespoke functions by combining apply with groupby.

While having statistics at the group level is useful, we discussed why we might want to map those group statistics back to each individual observation. We saw how we can use groupby and transform to do exactly that.

I also shared a few "tips and tricks" that I’ve learned over time – hopefully they make your journey with groupby, apply, and transform a little smoother.

Now that you’ve seen a few examples, I’d encourage you to have a go yourself – practice makes perfect and all that.

Of course, there are extra bits we can do to take our analysis and presentation to the next level. The first is to appropriately format DataFrames:

Make Your Tables Look Glorious

… and the second is to produce presentation-ready charts:

Make your charts look glorious

As always, I hope you’ve enjoyed reading this as much as I enjoyed writing it. I’d love to learn about new tips and tricks, so please feel free to share in the comments!


References and resources

  1. zhyli. (2020). Prediction of Churning Credit Card Customers [Data set]. Zenodo, provided under the Creative Commons Attribution 4.0 International licence.
  2. pandas.DataFrame.groupby – pandas 1.5.3 documentation (pydata.org)
  3. pandas.DataFrame.agg – pandas 1.5.3 documentation (pydata.org)
  4. pandas.DataFrame.apply – pandas 1.5.3 documentation (pydata.org)
  5. pandas.DataFrame.transform – pandas 1.5.3 documentation (pydata.org)
