
Breaking Down the Powerful Magic Behind the Pandas GroupBy Function

A detailed explanation of how groupby works under the hood to help you understand it better.

Photo by Tobias Fischer on Unsplash

In recent years, some of the most popular data sets and polls have been those surrounding governmental elections. Election season has become a time of countless charts, maps, polls, and predictions making their way through the popular media.

I want you to imagine you wake up one hectic morning and start scrolling through your subscription of The New York Times to peruse some of the data (if you would never do this, just humor me for a second). You’re tired, your eyes are barely open, and your mental processing capacity is still warming up. You just want a quick, easy-to-interpret insight into the happenings of the moment.

And then, what the Times gives you is one giant data set, where every row is a single voter and the columns contain various data about age, location, ethnicity, sex, and so on – finally ending with the candidate the voter selected.

That wouldn’t make much sense now, would it? Even if you scrolled and scrolled for hours, it’s unlikely that data formatted this way will provide you any meaningful information about the underlying data set. It’s simply too scattered. Too raw.

As data scientists, one of our primary tasks is to discern key insights from data and provide them to the public in a simple, understandable way. There are various ways to achieve this task – the one I want to focus on today is grouping together various attributes of the data in an effort to discover patterns.

Depending on your tool of choice, methods for this can vary. In this article, I’ll talk about a common method for grouping and aggregating data in Python’s Pandas library: the groupby method of a DataFrame.

The function itself has been covered in various articles, but an often-overlooked topic is the "magic" that happens behind the scenes. In this article, while I will review the function briefly for context, I will primarily delve into the actual GroupBy object that Pandas defines under the hood. By studying its structure, my hope is that you can gain a better understanding of how groupby actually works, so that you can use it more effectively in your future data science tasks.

Let’s get into it.

A Quick Review of Groupby

The best way to understand groupby is via example. Say that we have the following small data set called people, which contains information about people’s names, sexes, ages, heights, and weights:

Image by Author

The primary use case of the groupby function is to group data by a specific column and aggregate the values of other columns for each unique group using some specified function. For example, if we wanted to get the average ages, heights, and weights for each sex, we could do it as follows:

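people.groupby('Sex').mean(numeric_only=True)
Image by Author

You’ll notice that the "Name" column is excluded from the output, for the simple reason that it does not make logical sense to compute a mean over a list of strings. Older versions of Pandas dropped such columns automatically; since Pandas 2.0, you have to pass numeric_only=True as above, otherwise the call raises a TypeError.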
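Since the table above only appears as an image, here is a minimal sketch of how a comparable data set could be built and grouped. Only the four male ages (44, 30, 24, 18) come from this article; the names, the female rows, and all heights and weights are made-up placeholders, so the printed numbers will not match the screenshots exactly.

```python
import pandas as pd

# Hypothetical stand-in for the article's `people` table.
# Only the male ages are taken from the text; everything else is invented.
people = pd.DataFrame({
    "Name": ["Adam", "Brian", "Carl", "David", "Eva", "Fiona", "Grace"],
    "Sex": ["Male", "Male", "Male", "Male", "Female", "Female", "Female"],
    "Age": [44, 30, 24, 18, 35, 27, 52],
    "Height": [70, 68, 72, 66, 64, 62, 65],
    "Weight": [180, 165, 190, 150, 130, 120, 140],
})

# numeric_only=True skips the string column "Name" when averaging
means_by_sex = people.groupby("Sex").mean(numeric_only=True)
print(means_by_sex)
```

The result has one row per unique label in 'Sex' and one column per numeric column of the original frame.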

It’s also possible to 1) focus on the values from one column at a time and 2) apply a custom aggregation function. For instance, maybe for some odd reason, we want the sum of the squares of ages, split up by sex. Here’s the code to accomplish this task:

def sum_of_squares(arr):
    return sum([item * item for item in arr])
people.groupby('Sex')[['Age']].agg(sum_of_squares)
Image by Author

Some key points from the above example:

  • In our user-defined function sum_of_squares, we make use of a list comprehension [1, 2] in order to iteratively square all items before summing them up.
  • You’ll notice that the function takes in an array. This is because when we group by 'Sex' and select 'Age', Pandas collects all the ages for each group ('Male' and 'Female', in this case) into an array (technically, a Series object [3]). The aggregation function is called once per group with that Series and reduces its values to the single value per group displayed in the output DataFrame. We will gain a better insight into this in the next part of the article.
  • Using double brackets to select the 'Age' column is a little syntactic trick which makes the output a DataFrame instead of a Series.
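The points above can be checked with a short sketch. The data is a hypothetical stand-in (only the male ages come from this article), and the received_types list is an illustrative helper added here to record what Pandas passes into the function.

```python
import pandas as pd

# Hypothetical data; only the male ages are taken from the article
people = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "Age": [44, 30, 24, 18, 35, 27],
})

received_types = []  # illustrative helper, not part of the original example

def sum_of_squares(arr):
    # Record the type of the object Pandas passes in for each group
    received_types.append(type(arr).__name__)
    return sum([item * item for item in arr])

# Double brackets: the result is a DataFrame with an 'Age' column
as_frame = people.groupby("Sex")[["Age"]].agg(sum_of_squares)

# Single brackets: the result is a Series named 'Age'
as_series = people.groupby("Sex")["Age"].agg(sum_of_squares)

# A vectorized alternative: square first, then group and sum
vectorized = people["Age"].pow(2).groupby(people["Sex"]).sum()

print(as_frame)
print(set(received_types))
```

On larger data the vectorized form is usually the faster choice, because the squaring and summing happen inside Pandas rather than in a Python-level loop per group.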

And with that, we’ve reviewed everything we need about groupby to properly understand what happens below the layer of abstraction. We’re now ready to go deeper.

The "Magic" Behind Groupby

For simplicity, let’s stick with our first example from above: getting the mean of all the columns after grouping by the 'Sex' variable.

Code: people.groupby('Sex').mean(numeric_only=True)

Before:

Image by Author

After:

Image by Author

This is all good and well – but a little incomplete. How? Well, if we break down the data transformation into its component parts, we get three main stages:

  1. The original, unchanged DataFrame (the "Before" picture).
  2. A transformed DataFrame which groups all unique labels in the column of interest together, along with the related values in other columns.
  3. A final DataFrame in which the values of each group have been aggregated into a single row (the "After" picture).

What happened to the middle stage? This is arguably the most important part of the process to grasp in order to deeply understand groupby, so let’s see if there’s a way to display the data for this intermediate step.

A first attempt might involve trying to display the data after calling groupby but before calling the aggregation function (in our case, mean):

people.groupby('Sex')
Image by Author

Hmm, okay – so that didn’t quite go as planned. We’re just being given a string representation of the literal GroupBy object, as implemented in Pandas. It turns out that to see the actual data separated by groups, we need to use the object’s get_group method:

people_grouped_by_sex = people.groupby('Sex')
for group in people['Sex'].unique():
    display(people_grouped_by_sex.get_group(group))
Image by Author

Let’s break this down:

  • First, we store the GroupBy object in the variable people_grouped_by_sex.
  • Then, we use a loop to iterate through all the unique labels of the 'Sex' column, which we know form the unique groups of our GroupBy object. Note that it would also have worked to just loop through a hard-coded list such as ['Male', 'Female'], but I intentionally wrote the code as above to demonstrate how you might generalize this technique to a larger data set – particularly one in which you may not know all the unique group labels beforehand.
  • Finally, we use the get_group method of the GroupBy object to access the DataFrame for each respective group – I like to call them "sub-frames" or "mini-DataFrames," though I will mention these are not standard terms by any means. The display function is used within Jupyter Notebooks [4] to output whatever object you pass to it in a pretty, human-readable format.
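As an alternative to calling get_group for each label, a GroupBy object can also be iterated over directly, which yields each label together with its sub-frame. The sketch below uses a hypothetical stand-in for people (only the male ages come from this article) and print instead of display so that it also runs outside a notebook.

```python
import pandas as pd

# Hypothetical data; only the male ages are taken from the article
people = pd.DataFrame({
    "Name": ["Adam", "Brian", "Carl", "David", "Eva", "Fiona"],
    "Sex": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "Age": [44, 30, 24, 18, 35, 27],
})

people_grouped_by_sex = people.groupby("Sex")

# Iterating over a GroupBy object yields (label, sub-frame) pairs
sub_frames = {}
for label, frame in people_grouped_by_sex:
    sub_frames[label] = frame
    print(label)
    print(frame)

# .groups maps each label to the index labels of the rows in that group
print(people_grouped_by_sex.groups)
```

Each sub-frame obtained this way is the same DataFrame that get_group returns for that label.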

So now, we can see what happens in the middle stage: Pandas takes the DataFrame and splits it into a bunch of smaller ones, each of which contains the data for one of the group labels in the column we are grouping by. The values from these sub-frames are then aggregated to give us our final DataFrame.

For example, in the first sub-frame above (for the group 'Male'), the values in the 'Age' column are 44, 30, 24, and 18. The mean of these numbers is 29.00, precisely the value we see in our final output DataFrame after the mean method is called on our GroupBy object. The other values are calculated in exactly the same way.
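To make the three stages concrete, here is a rough sketch that performs the split, the aggregation, and the recombination by hand, then compares the result with the built-in call. It illustrates the idea only and is not how Pandas implements groupby internally; the data is hypothetical apart from the male ages.

```python
import pandas as pd

# Hypothetical data; only the male ages are taken from the article
people = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Male", "Female", "Female", "Female"],
    "Age": [44, 30, 24, 18, 35, 27, 52],
    "Height": [70, 68, 72, 66, 64, 62, 65],
    "Weight": [180, 165, 190, 150, 130, 120, 140],
})
numeric_columns = ["Age", "Height", "Weight"]

# Stage 2 (split): one sub-frame per label.
# Stage 3 (aggregate): reduce each sub-frame's columns to their means.
rows = {
    label: frame[numeric_columns].mean()
    for label, frame in people.groupby("Sex")
}

# Combine the per-group results into one DataFrame, one row per label
manual = pd.DataFrame(rows).T
manual.index.name = "Sex"

# The built-in one-liner for comparison
builtin = people.groupby("Sex")[numeric_columns].mean()

print(manual)
print(builtin)
```

Both frames hold the same values, including the mean male age of 29.0 worked out above.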

And there you have it – the mystery of groupby is a mystery no more.

Some Final Tips and Thoughts

I’ll end with a few general tips to remember the next time you’re dealing with groupby, whether for yourself or to explain it to someone else:

  • Simplicity: The goal of using groupby should be to simplify your data, not complicate it even more.
  • Focus: Although it is possible to conduct a grouping of multiple columns, it’s usually a good idea to start slow and remain centered on drawing focused insights. There is a lot you can do even with just one column at a time.
  • Adaptability: Don’t get hung up on this singular solution, as there may be better options depending on your situation. There are other ways to aggregate data in Pandas.
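As a small illustration of the last two tips, the sketch below groups by two columns at once and then produces the same numbers with pivot_table, one of the other aggregation tools in Pandas. The 'City' column and all of its values are hypothetical additions made for this example; they are not part of the people data set used in this article.

```python
import pandas as pd

# Hypothetical data: 'City' is invented for this example
people = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Male", "Female", "Female", "Female"],
    "City": ["Seattle", "Boston", "Seattle", "Boston",
             "Seattle", "Boston", "Seattle"],
    "Age": [44, 30, 24, 18, 35, 27, 52],
})

# Grouping by two columns gives one value per (Sex, City) combination
by_two = people.groupby(["Sex", "City"])["Age"].mean()

# pivot_table computes the same means, laid out as a Sex-by-City grid
pivoted = people.pivot_table(
    values="Age", index="Sex", columns="City", aggfunc="mean"
)

print(by_two)
print(pivoted)
```

Calling by_two.unstack() reshapes the grouped result into the same grid, so which form to use mostly depends on which layout is easier to read.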

As a final note, you might be wondering why Pandas doesn’t just show us these mini-DataFrames directly, and instead requires a roundabout approach to view them. From a programming standpoint, it makes sense: a user doesn’t really need to know what’s going on under the hood to use groupby. Hiding its workings serves as a layer of abstraction that prevents newer users from becoming confused or overwhelmed early on.

That said, I think delving into these nuts and bolts can be an extremely useful teaching tool to gain a better, deeper understanding of how groupby actually does its job. Gaining this understanding has helped me better parse and write more complex groupby queries, and I hope it can do the same for you.

And with that, I’ll bid you farewell until next time. Happy grouping!



My name is Murtaza Ali, and I am a PhD Student at the University of Washington studying human-computer interaction. I enjoy writing about education, programming, life, and the occasional random musing.

References

[1] https://towardsdatascience.com/whats-in-a-list-comprehension-c5d36b62f5
[2] https://levelup.gitconnected.com/whats-in-a-list-comprehension-part-2-49d34bada3f5
[3] https://pandas.pydata.org/docs/reference/api/pandas.Series.html
[4] https://levelup.gitconnected.com/whats-in-a-jupyter-notebook-windows-edition-1f69c290a280

