Introduction
When working with pandas DataFrames, we often need to compute metrics for specific groups of rows. The usual way to achieve this is a group-by expression followed by the aggregation we want to compute.
In today’s short tutorial we will show how to perform group-by operations over pandas DataFrames in order to compute the mean (i.e. the average) and median values per group.
First, let’s create an example DataFrame that we will use throughout this article to demonstrate the steps needed to derive the target result.
import pandas as pd
df = pd.DataFrame(
    [
        (1, 'B', 121, 10.1, True),
        (2, 'C', 145, 5.5, False),
        (3, 'A', 345, 4.5, False),
        (4, 'A', 112, 3.0, True),
        (5, 'C', 105, 2.1, False),
        (6, 'A', 435, 7.8, True),
        (7, 'B', 521, 9.1, True),
        (8, 'B', 322, 8.7, True),
        (9, 'C', 213, 5.8, True),
        (10, 'B', 718, 9.1, False),
    ],
    columns=['colA', 'colB', 'colC', 'colD', 'colE']
)
print(df)
   colA colB  colC  colD   colE
0     1    B   121  10.1   True
1     2    C   145   5.5  False
2     3    A   345   4.5  False
3     4    A   112   3.0   True
4     5    C   105   2.1  False
5     6    A   435   7.8   True
6     7    B   521   9.1   True
7     8    B   322   8.7   True
8     9    C   213   5.8   True
9    10    B   718   9.1  False
Using the mean() method
The first option is to perform the groupby operation over the column of interest, then select from the result the column on which we want to perform the calculation, and finally call the mean() method.
Now let’s suppose that for each value appearing in column colB, we want to compute the mean value of column colC. The following expression will do the trick.
>>> df.groupby('colB')['colC'].mean()
colB
A 297.333333
B 420.500000
C 154.333333
Name: colC, dtype: float64
The result is a pandas Series containing the mean of colC for each of the values appearing in column colB.
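As a small extension of the expression above, the sketch below shows two variations that often come up: averaging more than one column at once, and turning the group labels back into a regular column. The data is the example DataFrame from this article reduced to three columns; the variable names means and means_flat are only illustrative.

```python
import pandas as pd

# Same example data as in this article, reduced to the columns used here
df = pd.DataFrame(
    {
        'colB': ['B', 'C', 'A', 'A', 'C', 'A', 'B', 'B', 'C', 'B'],
        'colC': [121, 145, 345, 112, 105, 435, 521, 322, 213, 718],
        'colD': [10.1, 5.5, 4.5, 3.0, 2.1, 7.8, 9.1, 8.7, 5.8, 9.1],
    }
)

# Selecting a list of columns returns a DataFrame with one mean per column
means = df.groupby('colB')[['colC', 'colD']].mean()
print(means)

# reset_index() moves the group labels from the index into a normal column,
# so the result is a DataFrame with the columns colB and colC
means_flat = df.groupby('colB')['colC'].mean().reset_index()
print(means_flat)
```

The second form can be convenient when the result needs to be merged with another DataFrame or written to a file.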
Using the agg() method
Alternatively, we can use the agg() method, which applies the aggregation we specify to each group; in our case this is the mean.
>>> df.groupby('colB')['colC'].agg('mean')
colB
A 297.333333
B 420.500000
C 154.333333
Name: colC, dtype: float64
The result is exactly the same as with the previous approach.
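One reason to reach for agg() is that it can compute several metrics in a single call. The following is a minimal sketch using the same example data, reduced to two columns; the output column names colC_mean and colC_median are made up for illustration.

```python
import pandas as pd

# Same example data as in this article, reduced to the columns used here
df = pd.DataFrame(
    {
        'colB': ['B', 'C', 'A', 'A', 'C', 'A', 'B', 'B', 'C', 'B'],
        'colC': [121, 145, 345, 112, 105, 435, 521, 322, 213, 718],
    }
)

# Passing a list of function names returns a DataFrame
# with one column per aggregation ('mean' and 'median')
summary = df.groupby('colB')['colC'].agg(['mean', 'median'])
print(summary)

# Named aggregation: each keyword is an output column name,
# each value is a (source column, function) pair
named = df.groupby('colB').agg(
    colC_mean=('colC', 'mean'),
    colC_median=('colC', 'median'),
)
print(named)
```

Both calls produce one row per value of colB; the named form just gives you control over the resulting column labels.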
Computing the median
You can use the same strategy to compute other metrics, such as the median, count or sum of each group.
In the following example, we use the approach from the first part of this tutorial to compute the median of colC for each of the values appearing in column colB.
>>> df.groupby('colB')['colC'].median()
colB
A 345.0
B 421.5
C 145.0
Name: colC, dtype: float64
In the same way, you can use other methods such as count() and sum() to compute the corresponding metrics.
The same applies to the second approach, which involves the agg() method:
>>> df.groupby('colB')['colC'].agg('median')
colB
A 345.0
B 421.5
C 145.0
Name: colC, dtype: float64
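The count() and sum() variants mentioned above follow the same pattern. Here is a short sketch on the same example data, reduced to two columns, using the direct method for the count and agg() for the sum; the variable names counts and totals are only illustrative.

```python
import pandas as pd

# Same example data as in this article, reduced to the columns used here
df = pd.DataFrame(
    {
        'colB': ['B', 'C', 'A', 'A', 'C', 'A', 'B', 'B', 'C', 'B'],
        'colC': [121, 145, 345, 112, 105, 435, 521, 322, 213, 718],
    }
)

# Number of non-missing colC values in each group: A -> 3, B -> 4, C -> 3
counts = df.groupby('colB')['colC'].count()
print(counts)

# Sum of colC in each group, this time via agg(): A -> 892, B -> 1682, C -> 463
totals = df.groupby('colB')['colC'].agg('sum')
print(totals)
```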
Final Thoughts
In today’s article we discussed one of the most commonly performed operations in pandas: group-by operations over a DataFrame of interest.
Additionally, we showed how to compute useful metrics such as the mean and median values for the groups of interest. These are of course just examples of the metrics you can compute; the same approach can be used to compute counts, sums and so on.