The world’s leading publication for data science, AI, and ML professionals.

Mastering Data Aggregation with Pandas

Aggregating Data with Pandas

Source
Source

Data aggregation is the process of gathering data and expressing it in a summary form. This typically corresponds to summary statistics for numerical and categorical variables in a data set. In this post we will discuss how to aggregate data using pandas and generate insightful summary statistics.

Let’s get started!

For our purposes, we will be working with The Wines Reviews data set, which can be found here.

To start, let’s read our data into a Pandas data frame:

import pandas as pd
df = pd.read_csv("winemag-data-130k-v2.csv")

Next, let’s print the first five rows of data:

print(df.head())

USING THE DESCRIBE() METHOD

The ‘describe()’ method is a basic method that will allow us to pull summary statistics for columns in our data. Let’s use the ‘describe()’ method on the prices of wines:

print(df['price'].describe())

We see that the ‘count’, number of non-null values, of wine prices is 120,975. The mean price of wines is $35 with a standard deviation of $41. The minimum value of the price of wine is $4 and the maximum is $3300. The ‘describe()’ method also provides percentiles. Here, 25% of wines prices are below $17, 50% are below $25, and 75% are below $42.

Let’s look at the summary statistics using ‘describe()’ on the ‘points’ column:

print(df['points'].describe())

We see that the number of non-null values of points is 129,971, which happens to be the length of the data frame. The mean points is 88 with a standard deviation of 3. The minimum value of the points of wine is 80 and the maximum is 100. For the percentiles, 25% of wines points are below 86, 50% are below 88, and 75% are below 91.

USING THE GROUPBY() METHOD

You can also use the ‘groupby()’ to aggregate data. For example, if we wanted to look at the average price of wine for each variety of wine, we can do the following:

print(df['price'].groupby(df['variety']).mean().head())

We see that the ‘Abouriou’ wine variety has a mean of $35, ‘Agiorgitiko’ has a mean of $23 and so forth. We can also display the sorted values:

print(df['price'].groupby(df['variety']).mean().sort_values(ascending = False).head())

Let’s look at the sorted mean prices for each ‘province’:

print(df['price'].groupby(df['province']).mean().sort_values(ascending = False).head())

We can also look at more than one column. Let’s look at the mean prices and points across ‘provinces’:

print(df[['price', 'points']].groupby(df.province).mean().head())

I’ll stop here but I encourage you to play around with the data and code yourself.

CONCLUSION

To summarize, in this post we discussed how to aggregate data using pandas. First, we went over how to use the ‘describe()’ method to generate summary statistics such as mean, standard deviation, minimum, maximum and percentiles for data columns. We then went over how to use the ‘groupby()’ method to generate statistics for specific categorical variables, such as the mean price in each province and the mean price for each variety. I hope you found this post useful/interesting. The code from this post is available on GitHub. Thank you for reading!


Related Articles