I am about to write a very practical post with a few code snippets.
This morning, while working on some EDA and data wrangling, I caught myself testing many functions to get the data the way I wanted for plotting. That’s what I want to share with you.
For these examples, I’m using the taxis dataset from the seaborn library.
import seaborn as sns

df = sns.load_dataset('taxis')
df.head(2)

Group, Count and Rename
Let’s group the data by payment type and see what’s the preferred type by pickup_borough.
# Preferred payment method by pickup borough
df.groupby(['payment']).pickup_borough.value_counts()

Or we can transform that into percentages using the normalize=True parameter.
# Preferred payment method by pickup borough
df.groupby(['payment']).pickup_borough.value_counts(normalize=True)
However, the resulting object from Picture 1 is a Pandas Series. If I want to manipulate that table, I will have trouble: notice that there is no column name for the counts. Thus, I need to reset the index to turn it into a data frame with proper column names. But see what happens when I try it.

That happens because Pandas tries to turn the resulting Series into a DataFrame and reuse the same name (pickup_borough) for the counts column. As it is not possible to have two columns with the same name, we get the error. To work around it, transform the Series to a frame and rename the column before resetting the index. See below the first useful snippet.
# Group the data (the parentheses let us chain the calls across lines)
(df.groupby(['payment']).pickup_borough.value_counts()
   # transform to dataframe
   .to_frame()
   # rename the column
   .rename(columns={'pickup_borough': 'freq_count'})
   # Reset index
   .reset_index())
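As a side note, pandas 2.0 renamed the counted Series to 'count', so the name collision may not occur on newer versions. Renaming the Series itself before resetting the index is a variant that works either way (freq_count is my own column name, and the data below is made up):

```python
import pandas as pd

# Tiny made-up frame with the same column names as the taxis data
toy = pd.DataFrame({
    'payment': ['cash', 'cash', 'card'],
    'pickup_borough': ['Bronx', 'Queens', 'Queens'],
})

# Renaming the Series first means reset_index never collides
# with the 'pickup_borough' index level
result = (toy.groupby(['payment']).pickup_borough.value_counts()
             .rename('freq_count')
             .reset_index())

print(result.columns.tolist())  # ['payment', 'pickup_borough', 'freq_count']
```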

Group the Grouped data
Did you know you can group already grouped data?
Let’s see that step-by-step. First, let’s create a dataset specifically for this example.
import numpy as np
import pandas as pd

# Create a dataset
df2 = pd.DataFrame({'id': [1,1,2,2,3,3,3,3,3,3,4,4,4,6,7,7,8,8,5,5],
                    'register': [2,2,2,2,4,4,4,4,4,4,1,1,1,1,1,1,2,2,2,2],
                    'amount': np.random.randint(2, 20, 20)})

# Sort by id
df2 = df2.sort_values(by='id').reset_index(drop=True)
[OUT]:
   id  register  amount
0   1         2      17
1   1         2      13
2   2         2      10
3   2         2      18
4   3         4      17
Ok. So imagine you want to know the average expense amount by register. You may be thinking about this:
# Mean by register
df2.groupby('register').amount.mean()
[OUT]:
register
1    12.333333
2    11.125000
4    10.000000
But this is wrong for one reason. Each transaction ID is not unique. It means that within each transaction, there’s more than one item. So, before you extract the mean amount by register, you must make each ID entry unique.
See below what’s the result when we slice only the register 1.
# Slice on register 1
df2.query('register == 1')
[OUT]:
    id  register  amount
10   4         1      17
11   4         1      19
12   4         1       8
15   6         1       8
16   7         1      10
17   7         1      12
If I just extract the mean from this, I will be summing up the amounts ($74) and dividing by 6, the number of rows, which gives me the $12.33 result.
What I should be doing instead is making each transaction unique, in order to know the real expense for each transaction, not for each item.
# Slice register 1 then Group By ID
df2.query('register == 1').groupby('id').amount.sum()
[OUT]:
id
4    44
6     8
7    22
Nice. Now that I know how much each transaction really cost, I can calculate the average over the 3 unique transactions, which is (44 + 8 + 22) / 3 ≈ $24.67.
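That arithmetic can be double-checked in code, using the register-1 amounts copied from the slice above (a sketch with those fixed numbers, not the live random data):

```python
import pandas as pd

# Register 1 rows, amounts copied from the slice above
reg1 = pd.DataFrame({
    'id':     [4, 4, 4, 6, 7, 7],
    'amount': [17, 19, 8, 8, 10, 12],
})

# Per-item mean: 74 / 6
naive = reg1.amount.mean()

# Per-transaction mean: (44 + 8 + 22) / 3
per_txn = reg1.groupby('id').amount.sum().mean()

print(naive, per_txn)
```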
To do that in a single line of code, use two levels of grouping. Here is the second useful code snippet.
# Step 1: Group by register and ID and sum the amount.
# Step 2: Group the result by register and calculate the mean.
(df2.groupby(['register', 'id']).amount.sum()
    .groupby('register').mean())
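Because df2 was built with random amounts, here is the same two-level grouping on fixed numbers of my own, so you can follow the result by hand:

```python
import pandas as pd

# Fixed stand-in data (my own numbers, not the random df2)
df2_fixed = pd.DataFrame({
    'id':       [1, 1, 2, 2, 3, 3],
    'register': [2, 2, 2, 2, 4, 4],
    'amount':   [17, 13, 10, 18, 17, 9],
})

# Step 1: one total per (register, id) transaction
# Step 2: average transaction total per register
avg = (df2_fixed.groupby(['register', 'id']).amount.sum()
                .groupby('register').mean())

print(avg)  # register 2 -> (30 + 28) / 2 = 29.0, register 4 -> 26.0
```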

Before You Go
There is so much more you can do with groupby…
My suggestion is to write the code, look at the output, then play with the output and see if you can chain the functions with dot notation, if that interests you.
Just don’t write overly long lines of code if you’re presenting the work to someone. They can be quite difficult to read.
If this content is useful to you, feel free to follow my blog for more.