
2 Useful Code Snippets for Aggregated Data

After you 'groupby' in Pandas, you will want to keep these snippets close at hand

Photo by Peter Burdon on Unsplash

I am about to write a very practical post with a few code snippets.

This morning, while doing some EDA and data wrangling, I caught myself testing many functions to get the data into the shape I wanted for plotting. That’s what I want to share with you.

For these examples, I’m using the taxis dataset from the seaborn library.

import seaborn as sns

# Load the example dataset
df = sns.load_dataset('taxis')
df.head(2)
Dataset ‘Taxis’. Image by the author.

Group, Count and Rename

Let’s group the data by payment type and see which type is preferred by pickup_borough.

# Preferred payment method by pickup borough
df.groupby(['payment']).pickup_borough.value_counts()
Picture 1 – Payment type by pickup place. Image by the author.

Or we can turn those counts into proportions using the normalize=True parameter.

# Preferred payment method by pickup borough
df.groupby(['payment']).pickup_borough.value_counts(normalize=True)
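If you would rather read those proportions as percentages, a small follow-up I am adding here (not part of the original snippet) multiplies by 100 and rounds the result:

# Same proportions shown as rounded percentages
df.groupby(['payment']).pickup_borough.value_counts(normalize=True).mul(100).round(1)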

However, the resulting object from Picture 1 is a Pandas Series. If I want to manipulate that table, I will have trouble: notice that there is no column name for the counts. Thus, I need to reset the index to make it a DataFrame with proper column names. But see what happens when I try it.
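For reference, this is the call that fails here. Depending on your pandas version the behavior may differ (newer releases name the counts column differently), but on the version used in this post it raises a ValueError saying the column already exists:

# On the pandas version used here, this raises a ValueError:
# the counts Series is named 'pickup_borough', and the index
# already has a level with that same name
df.groupby(['payment']).pickup_borough.value_counts().reset_index()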

Picture 2 – Error trying to reset_index. Image by the author.

That happens because Pandas tries to turn the resulting Series into a DataFrame and use the same name (pickup_borough) for the counts column. As it is not possible to have two columns with the same name, we get the error. To work around that, we need to convert the Series to a DataFrame and rename the column before resetting the index. See below the first useful code snippet.

# Group the data
(df.groupby(['payment'])
   .pickup_borough
   .value_counts()
   # transform to DataFrame
   .to_frame()
   # rename the counts column
   .rename(columns={'pickup_borough': 'freq_count'})
   # reset the index
   .reset_index())
Picture 3 – Resulting DataFrame. Image by the author.
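As a side note, a shorter route that should work as well, assuming the name argument of Series.reset_index is available in your pandas version, is to name the counts column during the reset itself:

# Name the counts column while resetting the index
df.groupby(['payment']).pickup_borough.value_counts().reset_index(name='freq_count')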

Group the Grouped data

Did you know you can group data that is already grouped?

Let’s see that step-by-step. First, let’s create a dataset specifically for this example.

import numpy as np
import pandas as pd

# Create a dataset
df2 = pd.DataFrame({'id': [1,1,2,2,3,3,3,3,3,3,4,4,4,6,7,7,8,8,5,5],
                    'register': [2,2,2,2,4,4,4,4,4,4,1,1,1,1,1,1,2,2,2,2],
                    'amount': np.random.randint(2,20,20)})
# Sort
df2= df2.sort_values(by='id').reset_index(drop=True)
[OUT]:

   id  register  amount
0   1         2      17
1   1         2      13
2   2         2      10
3   2         2      18
4   3         4      17
Ok. So imagine you want to know the average expense amount by register. You may be thinking of something like this:

# Mean by register
df2.groupby('register').amount.mean()
[OUT]:
register 
1    12.333333 
2    11.125000 
4    10.000000

But this is wrong, for one reason: each transaction ID is not unique, meaning that within each transaction there is more than one item. So, before you extract the mean amount by register, you must make each ID entry unique.

See below the result when we slice only register 1.

# Slice on register 1
df2.query('register == 1')
[OUT]:
    id  register  amount
10  4     1       17
11  4     1       19
12  4     1       8
15  6     1       8
16  7     1       10
17  7     1       12

If I just take the mean of this, I will be summing up the amount ($74) and dividing by 6, the number of rows, which gives me the $12.33 result.
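As a quick sanity check (the exact figures depend on the random amounts in your run), that naive mean is literally the column sum divided by the row count:

# Naive mean for register 1: total amount divided by number of rows
reg1 = df2.query('register == 1')
reg1.amount.sum() / len(reg1)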

What I should do instead is make each transaction unique, so I know the real expense per transaction, not per item.

# Slice register 1 then Group By ID
df2.query('register == 1').groupby('id').amount.sum()
id 
4    44 
6     8 
7    22

Nice. Now that I know how much each transaction really cost, I can calculate the average over the 3 unique transactions, which is (44 + 8 + 22) / 3 ≈ $24.67.
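In code, that is just the previous result with a mean on top, still only for register 1:

# Average expense per unique transaction at register 1
df2.query('register == 1').groupby('id').amount.sum().mean()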

To do that in a single line of code, use two levels of grouping. Here is the second useful code snippet.

# Step 1: group by register and ID and sum the amount
# Step 2: group the result by register and calculate the mean
(df2.groupby(['register', 'id'])
    .amount.sum()
    .groupby('register')
    .mean())
Result of the nested groupby. Image by the author.
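If you find the nested groupby hard to read, an equivalent version with an explicit intermediate step (my own variation, same logic) looks like this:

# Step 1: total amount per (register, id) pair, as a regular DataFrame
per_transaction = df2.groupby(['register', 'id']).amount.sum().reset_index()
# Step 2: average transaction amount per register
per_transaction.groupby('register').amount.mean()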

Before You Go

There is so much more you can do with groupby…

My suggestion is to write the code, look at the output, then play with it and see if you can nest the functions with dot notation, if that interests you.

Just don’t write overly long lines of code if you’re presenting the work to someone. They can be quite difficult to read.

If this content is useful to you, feel free to follow my blog for more.

gustavorsantos
