
In a competitive and data-rich world, understanding segmental behaviour is key to providing tailored insights and product offerings.
Whether that’s achieved by examining segmental trends with descriptive statistics or via more nuanced approaches like including segmental features in machine learning models, some data manipulation is required.
Luckily for us, Pandas provides highly versatile functionality which allows us to cut through the majority of heavy manipulation required for segmenting data in various ways. Using some examples, we’ll demonstrate:
- The `groupby` operation – what it is, how it works, and what it returns.
- How to use `apply` with `groupby` in order to apply more complex and exotic transformations.
- Using `groupby` and `transform` to map the magic of `groupby` and `apply` back to the original data shape.
- Some tips and tricks that I’ve picked up over time.
Let’s get cracking – first up, getting some data to play around with.
The data
This time round we’ll be using information collected from a portfolio of consumer credit cards¹.
Apart from column name changes and some format changes I’ve applied, the data is almost intact – a sneak peek:
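If you’d like to follow along, something like this will load the data – the file name here is purely an assumption, so point it at wherever you’ve saved your copy:
import pandas as pd
# hypothetical file name – adjust to your local copy of the data
df = pd.read_csv('credit_card_customers.csv')
df.head()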

We have various categorical features (e.g. education level) mixed in with numeric features (e.g. customer age). The data set is wider than the snapshot above implies, so don’t be surprised if you see new and interesting features being used below.
Groupby
First things first – the `groupby`.
From the documentation²:
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
That’s quite a good explanation. I also find it helpful to think of a `groupby` as a sort of for loop:
- Segment the data into various groups. Or, group the data by referring to the given segmentation "rules".
- For each of the groups, apply a specified function. This could be a simple calculation like an average, or more exotic and complex calculations (more on that later).
- Collate the results.
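To make the loop analogy concrete, here’s a rough conceptual sketch – iterating over a `groupby` object yields each group’s key and its sub-DataFrame:
# conceptually, a groupby works a bit like this loop
results = {}
for group_key, group_df in df.groupby(['attrition_flag', 'gender']):
    # apply a function to each group...
    results[group_key] = group_df['credit_limit'].mean()
# ...and collate the results
pd.Series(results)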
If this still feels a little vague, let’s work through a couple of examples.
A starter – determining the average credit limit for each type of customer, across the gender groups:
# average credit limit for customer type x gender
segments = ['attrition_flag','gender']
df.groupby(segments)['credit_limit'].mean().round().to_frame()

Easy peasy, lemon `groupby`.
Let’s ignore the massive gender discrepancy in credit limits for now and focus on the underlying code. What `pandas` has done for us is divide the DataFrame into `attrition_flag` x `gender` groups, calculate the average `credit_limit` for each group, and collate the results.
What if we wanted to extend the segmentation to include education level, and also calculate the average credit utilisation rate? That’s not too complex:
# extended segmentation and feature selection
segments = ['attrition_flag','education_level','gender']
features = ['credit_limit','avg_utilization_ratio']
df.groupby(segments)[features].mean()

Changing the calculation being applied to each group is also straightforward. For instance, if we wanted to count the number of observations in each segment, we could use `size`:
# group counts
segments = ['attrition_flag','gender']
feature = 'customer_age'
df.groupby(segments)[feature].size().to_frame(name = 'group_size')

This leads us to a more general recipe for `groupby` statements:

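In code, the general pattern looks something like this sketch (with `median` standing in as the function):
# the general recipe: group by segments, select features, apply a function
segments = ['attrition_flag','gender']
features = ['credit_limit','avg_utilization_ratio']
df.groupby(segments)[features].median()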
… where we can tailor `segments`, `features`, and `function` to individual use cases. I’ve been purposefully vague on what `function` actually is – we’ll see why in a minute – but the general idea holds true in most cases.
Before we move on, let’s talk about a few things.
- So far, we’ve performed very basic calculations using two built-in `pandas` functions – `mean` and `size`. There are of course many other useful functions available, like the median and standard deviation. In a minute, we’ll see how we can use more complex functions, custom functions, and functions from other packages.
- Using the `groupby` statement the way we have returns an object which has the same column names as the original DataFrame. It’s quite plausible that you end up with data sets which look similar but consist of very different information – don’t be caught out. Sensible naming conventions or code structure can help with this, as can renaming columns appropriately.
- So far, we’ve used the same function on multiple columns. If you’d like to apply different functions to different columns, or different functions to the same column in the same `groupby` statement, I’d check out the `agg` functionality³. It also has a neat way of letting the user define the names of calculation results through tuples – see the sketch just after this list.
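As a taster, here’s a minimal sketch of that tuple-based naming in `agg` – the result column names on the left-hand side are ours to choose:
# named aggregation: different functions, custom result names
segments = ['attrition_flag','gender']
df.groupby(segments).agg(
    avg_credit_limit = ('credit_limit', 'mean'),
    median_age = ('customer_age', 'median'),
    group_size = ('customer_age', 'size'),
)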
Now that we have a basic grasp of `groupby`, we can apply ourselves as we move on to see how we can start using more exotic functions.
See what I did there? So funny.
Pro tip: using a single feature in a `groupby` statement will return a Series, which Jupyter will present like a jumble of numbers. `to_frame` converts the Series into a DataFrame, which Jupyter outputs in a more aesthetically pleasing way.
Apply
The "standard" functions that we’ve seen above will only take us so far – that’s rather the point of standard functions. If we need to do anything advanced or bespoke, or rely on functionality in other packages, we need to combine `groupby` with `apply`.
The `apply` statement is quite self-explanatory, and described succinctly in the documentation⁴:
Apply a function along an axis of the DataFrame. Objects passed to the function are Series objects whose index is either the DataFrame’s index (`axis=0`) or the DataFrame’s columns (`axis=1`).
Let’s do some examples of using bespoke functions on groupings.
We’ll start with a slightly different calculation: scaling the group median by the ratio of the group maximum to the group minimum; we’ll do this for each type of customer, split by gender.
# define the bespoke function
def scaled_median(s):
    # calculate the Series median x Series maximum / Series minimum
    return s.median() * s.max() / s.min()

# apply it to data
segments = ['attrition_flag','gender']
feature = 'customer_age'
df.groupby(segments)[feature].apply(scaled_median).to_frame(name='result')

If you’re feeling especially Pythonic, you can use lambda functions in the `apply` statement. That would look something like:
apply(lambda u: u.median() * u.max() / u.min())
Now, what about applying a custom function to multiple features (columns)? It takes a bit more thought – let’s use an example to demonstrate.
The organisation has a heuristic which indicates the likely receptiveness of a customer to a credit offer. This heuristic – let’s call it the "appetite score" – is based on marital status, gender, number of dependents, and the average credit utilisation. We would like to understand the average appetite score for each customer type and gender.
Code-wise, this could look something like this:
# maps to convert categorical to numeric
status_mapper = {'Married':1.1, 'Single':1, 'Unknown':0.8, 'Divorced':0.9}
gender_mapper = {'F':1.1, 'M':1}

# define the appetite score function
def cohort_score(dataframe):
    # weight utilisation by marital status, gender, and dependents,
    # then take the group average
    score = (
        dataframe['marital_status'].map(status_mapper).astype(float)
        * dataframe['gender'].map(gender_mapper).astype(float)
        * (1 + dataframe['dependent_count'])
        * dataframe['avg_utilization_ratio']
    ).mean()
    return score

# apply
segments = ['attrition_flag','gender']
df.groupby(segments).apply(cohort_score).to_frame(name = 'score')
… which gives:

A few things to notice here.
We use dictionaries and `map` to convert marital status and gender to numeric values for calculation in the heuristic, remembering to convert the data type appropriately.
We don’t explicitly select the features we need for the heuristic. This is actually a nifty trick which I haven’t really touched on – we can feed both Series and DataFrames into `apply`. In this case, since we don’t specify which features to use, we are effectively passing each grouped DataFrame to the `cohort_score` function, which in turn picks out the columns it needs.
The calculation of the mean is baked into the function. Now arguably, this is cheating a bit – in real life, we may want to build a heuristic which calculates the appetite score on an individual level, and then create group statistics from the result. This actually ties in well with two other concepts: using functions from other packages and using a lambda function.
Let’s do exactly that now – define a heuristic which doesn’t return an average, use `numpy` to calculate the group average, and wrap it in a lambda function.
# get numpy
import numpy as np

# define the appetite score function for an individual
def cohort_score(dataframe):
    # one score per row this time – no averaging inside the function
    score = (
        dataframe['marital_status'].map(status_mapper).astype(float)
        * dataframe['gender'].map(gender_mapper).astype(float)
        * (1 + dataframe['dependent_count'])
        * dataframe['avg_utilization_ratio']
    )
    return score

# apply
segments = ['attrition_flag','gender']
d = df.groupby(segments).apply(lambda u: np.mean(cohort_score(u)))
d.to_frame(name = 'score')
which gives:

Look familiar? It should do, as it’s exactly the same result as the calculation above.
One thing to notice is how easy it is to use a function from another package. In this case, we’ve used the mean calculation from `numpy`, but the concept is the same – feed the function into `apply`, making reference to the package it comes from (remembering of course to import the package!).
Transform
You’ll by now have noticed that `groupby` returns group statistics, presented at a group level. Perfect if that’s what you’re after – for instance, for visualisation or summation – but there are cases where we need to "map" the group result back to individual observations.
I’m sure there are many other applications for this, but this approach can be extremely useful in feature engineering. Imagine for a moment that you’re predicting house prices – wouldn’t it be good to know how big an individual house is compared to the "average" house in the same neighbourhood?
Luckily for us, pandas provides easy functionality which allows us to do just that, without the need for any joining or merging. This is the magic of `transform`, though the documentation undersells it a bit⁵:
Call `func` on self producing a DataFrame with the same axis shape as self.
Maybe an example would be useful! Let’s see how we could apply the scaled median function again, but this time using `transform` to create a new column in the data with the group results mapped back to each individual observation:
# get example data
df_example = df[['attrition_flag','gender','customer_age']].copy()

# apply it to data
segments = ['attrition_flag','gender']
feature = 'customer_age'
df_example['scaled_median_age'] = (
    df_example
    .groupby(segments)[feature]
    .transform(scaled_median)
)
df_example
which gives:

Here we see how the same group values are produced, but this time broadcast back to the shape of the original data and included in the DataFrame as a new column.
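One way to see the difference is to compare output shapes – a quick sketch, reusing the function and segments from above:
# apply: one row per group
df.groupby(segments)[feature].apply(scaled_median).shape
# transform: one row per original observation
df.groupby(segments)[feature].transform(scaled_median).shape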
Tips and tricks
Now that we have a grounding in `groupby`, `apply`, and `transform`, let’s take a look at some tips and tricks that might be useful.
The versatility of apply
The `apply` function is very versatile. We’ve seen how we can use it with a `groupby` operation, but that’s not always necessary – i.e. you can use `apply` on a DataFrame directly.
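For instance, a quick sketch of `apply` with no `groupby` in sight – here used on a single column:
# rescale credit limits to thousands, row by row
df['credit_limit'].apply(lambda x: x / 1000)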
Axes and apply
When I started using `apply` in anger, I encountered (probably more than) my fair share of error messages. More often than not, it was down to me not completely understanding the axis that the function was being applied along – i.e. I was incorrectly specifying whether the function should be applied to rows or columns.
So, if you’re certain that your function works but `pandas` is still being disagreeable, I’d recommend taking a look at the `axis` argument in `apply`.
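To illustrate, a small sketch of the same function applied along each axis:
features = ['credit_limit','avg_utilization_ratio']
df[features].apply(np.mean)           # axis=0 (default): one result per column
df[features].apply(np.mean, axis=1)   # axis=1: one result per row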
Using built-in functions with apply
As we’ve seen above, we can use built-in `pandas` functions directly on grouped DataFrames. We can also use the same built-in functions in an `apply`, though we have to refer to the function by its name. For instance, using the built-in standard deviation function would look something like this:
segments = ['attrition_flag','gender']
feature = 'customer_age'
df.groupby(segments)[feature].apply('std')
So as it is with most things Python, there’s more than one way of doing things. However, using built-in functions in an `apply` statement like this can actually be slower than using them directly.
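In other words, these two lines give the same result, with the first typically being the quicker route:
df.groupby(segments)[feature].std()          # built-in, used directly
df.groupby(segments)[feature].apply('std')   # same result via apply, often slower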
Missing data imputation
We earlier touched on how `apply` and `transform` are useful techniques when doing feature engineering. Well, they’re also great for data cleansing, particularly if you want to impute missing values using segmental statistics.
Here’s an example where we use the median customer age for each customer type and marital status combination to impute missing customer age values.
# example data frame
df_example = df[['attrition_flag','marital_status','customer_age']].copy()

# randomly create missing values
df_example['customer_age'] = np.where(
    np.random.random(size = len(df)) <= 0.5,
    df['customer_age'],
    np.nan
)

# impute missing ages using the group median, keeping observed ages as-is
df_example['customer_age_imputed'] = (
    df_example['customer_age']
    .fillna(
        df_example
        .groupby(['attrition_flag','marital_status'])['customer_age']
        .transform('median')
    )
)
df_example

This is obviously quite a contrived example but the logic transfers well to other use cases, especially if the groups are fairly homogeneous.
We’ll leave it here for now, and do a quick recap.
Wrapping up
We’ve covered a fair bit of ground here.
Firstly, we had a look at the `groupby` operation – what it is, how it works, and what it returns.
Once we found our feet using basic functions with `groupby`, we pushed the boat out a bit and moved on to using bespoke functions by combining `apply` with `groupby`.
While having statistics at the group level is useful, we discussed why we might want to map those group statistics back to each individual observation. We saw how we can use `groupby` and `transform` to do exactly that.
I also shared a few "tips and tricks" that I’ve learned over time – hopefully they make your journey with `groupby`, `apply`, and `transform` a little smoother.
Now that you’ve seen a few examples, I’d encourage you to have a go yourself – practice makes perfect and all that.
Of course, there are extra bits we can do to take our analysis and presentation to the next level – for instance, appropriately formatting DataFrames and producing presentation-ready charts.
I hope you’ve enjoyed reading this as much as I have enjoyed writing it. As always, I’d love to learn about new tips and tricks, so please feel free to share in the comments!
References and resources
- zhyli. (2020). Prediction of Churning Credit Card Customers [Data set]. Zenodo. Provided under the Creative Commons Attribution 4.0 International licence.
- pandas.DataFrame.groupby – pandas 1.5.3 documentation (pydata.org)
- pandas.DataFrame.agg – pandas 1.5.3 documentation (pydata.org)
- pandas.DataFrame.apply – pandas 1.5.3 documentation (pydata.org)
- pandas.DataFrame.transform – pandas 1.5.3 documentation (pydata.org)