
Pandas is arguably the most popular Python library in the Data Science ecosystem. It provides numerous functions to perform data analysis and manipulation tasks efficiently.
One of the most frequently used Pandas functions for data analysis is the groupby function. It allows for grouping data points (i.e. rows) based on the distinct values in a column or a set of columns.
After the groups are generated, you can easily apply aggregation functions to a numerical column. Consider a dataset that contains the gender information and salaries of employees in a company.
In order to calculate the average salary of different genders, we can group the rows based on the gender column and then apply the mean function on the salary column.
In this article, we will go over 10 steps to learn the details of the groupby function. The consecutive steps contain an example built up on the previous step.
Let’s start with importing Pandas and creating a data frame with made up data.
import pandas as pd
df = pd.DataFrame({
"names": ["John", "Jane", "Ashley", "Allen", "Betty"],
"gender": ["Male", "Female", "Female", "Male", "Female"],
"height": [182, 176, 174, 178, 172],
"education": ["BS", "BS", "BA", "BA", "BS"],
"salary": [65000, 72000, 74000, 68000, 80000]
})
df

Step 1
The first step is a very basic example. We create groups based on gender and apply the mean function.
df.groupby("gender").mean()

Since we do not specify a numerical column, Pandas calculates the average value for each numerical column.
Step 2
In some cases, we only want to apply the aggregate function to a specific column. One way of doing this is to filter columns before applying the groupby function.
df[["gender","salary"]].groupby("gender").mean()

Step 3
What we did in the second step is not the optimal way. A typical real-life dataset contains several columns and we may need to calculate aggregations on not all but many columns.
Filtering the columns in such cases is kind of a tedious task. A better method is to use the NamedAgg function of Pandas.
df.groupby("gender").agg(
avg_salary = pd.NamedAgg("salary","mean")
)

We just need to specify the column name and the aggregate function. Another advantage of using the NamedAgg function is that we can assign a name to the aggregated values. It is definitely more clear and informative than using the original column name.
Step 4
There is even a simpler method than the one in the third step. The syntax is as follows:
df.groupby("gender").agg(avg_salary=("salary","mean"))

I always use this method because it requires the least amount of typing 😊 .
Step 5
Once the groups are generated, we can calculate as many aggregations as needed. For instance, the following code finds the median salary and average height for each gender.
df.groupby("gender").agg(
median_salary=("salary","median"),
avg_height=("height","mean")
)

Step 6
Up to this point, the groups are displayed as the index of a data frame. In some cases, it is better to have them as a column in the data frame. We can achieve this by using the as_index parameter.
df.groupby("gender", as_index=False).agg(
median_salary=("salary","median"),
avg_height=("height","mean")
)

This way is better especially when we have several groups.
Step 7
Just like we can calculate aggregations on multiple columns, we can create groups based on multiple columns. If we pass two columns to the groupby function, it creates groups based on the combination of distinct values in each column.
df.groupby(["gender","education"], as_index=False).agg(
median_salary=("salary","median"),
avg_height=("height","mean")
)

Make sure to pass the column names in a list. Otherwise, you will get an error.
Step 8
It is not necessary in our case but when we have several groups, we might want to sort the groups based on the aggregated values. Consider we have 100 columns and need to find the top 3 groups in terms of the highest average salary. In such cases, sorting the results becomes a necessary operation.
The sort_values function can be used along with the groupby function as follows:
df.groupby(["gender","education"], as_index=False).agg(
median_salary=("salary","median"),
avg_height=("height","mean")
).sort_values(by="median_salary", ascending=False)

The rows are sorted in ascending order by default. We can change it using the ascending parameter.
Step 9
Our data frame does not contain any missing values. However, real-life datasets are likely to have some. If there are missing values in the column used for grouping, those rows are dropped by default.
Let’s first update a value in our data frame as missing.
df.iloc[4,1] = None
df

df.groupby("gender").agg(avg_salary=("salary","mean"))

As you can see, the missing value (None) in the gender column is ignored. However, it is important to take the missing values into account in many cases. We can set the dropna parameter as False to include the missing values in the results.
df.groupby("gender", dropna=False).agg(avg_salary=("salary","mean"))

Step 10
We have seen how to calculate multiple aggregations in step 5. Another method for this operation is to pass a list of aggregations to the agg function.
df.groupby("gender", dropna=False).agg(["mean","count"])

Conclusion
The groupby is a widely-used function for data analysis. We have covered almost everything you need to know about it. There are some further details about the use of the groupby function but you would rarely need to use them.
Thank you for reading. Please let me know if you have any feedback.