
As one of the most popular libraries in Python, Pandas is used very commonly, especially in EDA (Exploratory Data Analysis) jobs. Typically, it is used for filtering and transforming datasets, much like what we usually do with SQL queries. The two share a lot of similar concepts, such as joining tables. However, some features have the same name but a different meaning in each. "Group By" is one of them.
In this article, I’ll introduce some tricks for the Pandas group by function that can improve our productivity in EDA jobs. Hopefully at least one of them is something you weren’t familiar with, so that it can help you.
I’m sure you know how to import Pandas in Python, but let me put it here anyway. All the code in the rest of this article assumes Pandas has been imported as follows.
import pandas as pd
Sample Data for Demonstration

Let’s not waste too much time getting a sample dataset. The iris dataset is quite commonly used in Data Science as the "hello world" dataset. We can easily get it from the Scikit-learn library. Let’s import it and load the iris dataset.
from sklearn.datasets import load_iris
iris = load_iris()
Once loaded, the iris dataset is a dictionary. The value of the key "data" is a 2-D array containing all the features of the 3 types of iris. The "feature_names" key contains the column names in order. The "target" key contains the encoded classes of iris, which are 0, 1 or 2. Finally, the "target_names" key contains the actual names of the three types of iris, corresponding to the encoding numbers 0, 1 and 2.
You’ll understand what it looks like if you print it out. The following code loads the iris dataset into the Pandas data frame we expect.
df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
df['class'] = iris['target']
df['class'] = df['class'].apply(lambda i: iris['target_names'][i])
The first line of the code creates the data frame from the 2-D array with the column names. The second line adds the encoded iris class, and the third line translates the numbers into iris class names.

Now, we are on the same page. Let’s use this sample data frame for the rest of this article.
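If you want a quick sanity check at this point (not a required step), peeking at a few rows confirms the frame has the four feature columns plus the class column:
# show the first few rows of the assembled data frame
df.head()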
The Basics

The basic idea of the Pandas group by function is not just to group categorical values together, but to calculate some aggregated values afterwards. That is, the aggregation is performed for each group.
For example, we can group the data frame based on the iris classification, and calculate the average value for each feature (column).
df.groupby('class').mean()

Other than mean(), there are lots of aggregation functions such as min(), max() and count().
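Several of these built-in aggregations can also be requested in one go by passing a list of function names to agg(). A minimal sketch (the choice of column here is just for illustration):
# apply several built-in aggregations to one column at once
df.groupby('class')['petal length (cm)'].agg(['min', 'max', 'mean', 'count'])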
OK, this section was only for review purposes. Now let’s begin.
Trick 1. Aggregation on Size or Count

It is quite common to use the count() function to aggregate the groups and get the number of rows for each group. However, this is sometimes not what you want. That is, when there are NULL or NaN values in the data frame, they will NOT be counted by the count() function.
Let’s manually assign a NaN value to the data frame.
df.iloc[0,0] = None
The above code will create a NaN value for the first row of the sepal length column.
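If you want to double-check that the NaN landed where we expect (this check isn’t part of the original walkthrough), a one-liner does it:
# should report exactly one missing value in the sepal length column
df['sepal length (cm)'].isna().sum()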

Then, let’s use the count() function over it.
df.groupby('class')['sepal length (cm)'].count()

The setosa iris is counted as 49, but there are actually 50 rows. This is because the count() function doesn’t count NaN values.
There is another aggregation function that is rarely used – size(). This function picks up all the rows, even those with NaN values.
df.groupby('class')['sepal length (cm)'].size()
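To see the two behaviours side by side, both aggregations can be requested in a single agg() call, for example:
# size counts NaN rows, count skips them
df.groupby('class')['sepal length (cm)'].agg(['size', 'count'])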

Trick 2. Customised Aggregation Column Names

Sometimes, we may want to rename the aggregated column rather than just having "max" or "mean" as the name, which doesn’t indicate which column it was aggregated from.
We can actually specify the names as keyword arguments in the agg() function.
df.groupby('class')['sepal length (cm)'].agg(
    sepal_average_length='mean',
    sepal_standard_deviation='std'
)
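When aggregating several columns at once, the same idea works at the data frame level with pd.NamedAgg, which pairs a source column with an aggregation function. A minimal sketch (the output names below are just examples I made up):
# named aggregation across multiple columns
df.groupby('class').agg(
    sepal_len_mean=pd.NamedAgg(column='sepal length (cm)', aggfunc='mean'),
    petal_len_max=pd.NamedAgg(column='petal length (cm)', aggfunc='max')
)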

Trick 3. Customised Aggregation Function

Pandas provides many aggregation functions such as mean() and count(). However, it would still be quite limiting if we could only use these functions.
In fact, we can define our own aggregation function and pass it into the agg() function. For example, if we want to get the mean of each column and convert it from centimeters into millimeters, we can define the customised function as follows.
def transformed_mean(value):
    # convert from centimeters to millimeters (1 cm = 10 mm), then average
    value_mm = value * 10
    return round(value_mm.mean(), 2)
Then, simply pass this function name into the agg() function as an argument.
df_mm = df.groupby('class').agg(transformed_mean)
df_mm.columns = ['sepal length (mm)', 'sepal width (mm)', 'petal length (mm)', 'petal width (mm)']
Don’t forget to rename the column names by changing the unit to make sure they are consistent 🙂

Of course, for such a relatively simple function, we can also define an anonymous function using a lambda as follows.
df_mm = df.groupby('class').agg(lambda x: round((x * 10).mean(), 2))

So, we don’t have to define a separate function. However, it is recommended to define a proper function if the logic is relatively complex, for better readability.
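As a side note, a custom function can also be mixed with built-in aggregations in the same agg() call by passing a list. A quick sketch reusing the transformed_mean function defined above (the choice of column is arbitrary):
# mix a custom aggregation with a built-in one on a single column
df.groupby('class')['petal width (cm)'].agg([transformed_mean, 'std'])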
Trick 4. Group By Bins

Do we have to use Group By on categorical variables? The answer is NO.
If a variable is continuous, all we need to do is create bins so that the values are converted into categorical ones. In Pandas, we can easily create bins with equal ranges using the pd.cut() function.
sepal_len_groups = pd.cut(df['sepal length (cm)'], bins=3)

The code above creates 3 bins with equal spans. Equal here means that the width of each of the 3 bins is exactly the same.
Then, we can directly use these three bins in group by.
df.groupby(sepal_len_groups)['sepal length (cm)'].agg(count='count')
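If equal-width bins don’t suit the data, pd.cut() also accepts explicit bin edges and labels. The edges and labels below are arbitrary illustrative values, not ones taken from the article:
# custom bin edges with human-readable labels (illustrative values)
custom_bins = pd.cut(
    df['sepal length (cm)'],
    bins=[4.0, 5.5, 6.5, 8.0],
    labels=['short', 'medium', 'long']
)
df.groupby(custom_bins)['sepal length (cm)'].agg(count='count')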

Summary

In this article, I’ve introduced some tricks for using the group by function along with the aggregation functions in Pandas. Using the right methods can sometimes improve our productivity to a large extent.