The world’s leading publication for data science, AI, and ML professionals.

Working With SQL Versus Pandas (Part 2) Plus Practice Problems

Aggregating and grouping data

This is part two of a series. You can find part one here:

Working With SQL Versus Pandas (Part 1) Plus Practice Problems

Introduction

In this series, I set out to compare Python Pandas to SQL. These are both popular tools for manipulating data and there is a lot of overlap in functionality. In this article, we will cover aggregating and grouping data. We once again will use the Titanic dataset that can be found on Kaggle for the examples in this article. I encourage you to download it and follow along for maximum retention. Also, don’t forget to try your hand at the practice problems at the end. Okay, let’s dive in!

Aggregate Functions

Let’s say that we are interested in finding out some stats of the data like the maximum and minimum fare paid, the average age of passengers, and how many people survived.

Sql

To do aggregate functions in SQL, you can use AVG(), COUNT(), MIN(), MAX(), and SUM() with the column you would like to apply the function to in the parenthesis. Something to be aware of is that these aggregate functions ignore null values. Here’s what it would look like for the problem we set up.

SELECT MAX(fare), MIN(fare), AVG(age), SUM(survived)
FROM titanic

Remember that the survived column is coded with 1 for those that survived and 0 for those that didn’t. This is why the SUM() function works to give us the total for those that survived.

For these aggregate functions, you may find it useful to use ROUND() for results that are prettier. If we wanted to round the result of the average age column to one decimal place for example, that would look like so: ROUND(AVG(age), 1). If we wanted to round to the nearest whole number then it would look like this: ROUND(AVG(age)).

A disadvantage with SQL is that the aggregate functions listed previously are the only standard ones. There aren’t such simple, straight forward ways to calculate things like the median or standard deviation of a column. I believe that some specific versions of SQL may have more aggregate function options, so I recommend doing some research on the version you are using.

Pandas

There are several options in pandas for aggregate functions. Pandas has the default of ignoring null values but there are parameters to include them if you really wanted to. To get specific aggregate functions done on certain columns, we could calculate each one individually. Here’s what that would look like.

titanic.fare.agg(['max','min'])
titanic.age.mean()
titanic.survived.sum()

The agg() here let’s us put in a list. Since we want to find the maximum and minimum value of the fare column, we can combine that into one line of code using the agg() method.

As mentioned in the SQL example, you may want to round for better looking results. You can chain on a round() to do so. The first parameter is the number of decimal places with 0 being the default. As an example, rounding to one place after the decimal for the average age would be titanic.age.mean().round(1).

Something really great about Pandas is the describe() method which returns a summary table on the numerical columns. The code titanic.describe() returns the table below.

This is more information than was technically asked for, but is still very useful and only needs one very simple line of code. You can also get some information on the "object" data type columns with titanic.describe(include="O"). Here’s what would be returned.

Pandas has a lot more aggregate functions than SQL like median (median()), standard deviation (std()), and variance (var()). The agg() method also allows for even more flexibility where you can aggregate based on a custom written function if you so desired.

Group By

Let’s say that we want to find some aggregated data for males and females of each class. Let’s find the count, the average age, and the number that survived of males and females in each class.

SQL

In SQL the GROUP BY clause is used for this purpose. Here is what it would look like.

SELECT sex, pclass, COUNT(*), SUM(survived), AVG(age)
FROM titanic
GROUP BY sex, pclass

In this example since we are returning columns that are not aggregated values (sex, pclass), the code will not run without the GROUP BY clause as SQL doesn’t know how to handle these columns without it. In grouping by sex and class, the aggregated columns will be calculated individually for each combination of sex and class (females in first class, males in first class, females in second class, etc.)

Pandas

In Pandas we have the groupby() method for getting the results that we want. In the same way as before, we can run code to get each individual piece that we want or we can run code that has more information than we need. Here’s what it would look like to get the information in pieces.

titanic.groupby(['sex', 'pclass']).survived.agg(['count', 'sum'])
titanic.groupby(['sex', 'pclass']).age.mean()

We could also generate one table with the information we want, but it comes with more information than was asked for. Here’s what that would look like.

new_df = titanic.groupby(['sex', 'pclass'])[['survived', 'age']]
new_df.agg(['count', 'sum', 'mean'])

In this example I created a new variable to split the line of code so it was easier to look at, but you could do this all in one line. First we apply the groupby, then slice the columns that we need, and then apply the aggregation functions to those columns. Here’s a look at the resulting table.

In some cases, this extra information may be an added bonus of interest, in other cases, it may only cause more confusion. For example, the counts for the survived and age columns are different. This is because there are null values in the age column, so the counts in the survived column are the true counts we care about as there are no nulls. But this can cause unnecessary confusion. On the other hand, the mean in the survived column tells us the percentage of those who survived from each of the groups, which is interesting information that adds value in an analysis.

Practice Problems

  1. Find the average fare price and count of passengers grouped by the embarked port and class.
  2. Find the min, max, and average age of passengers grouped by class and survival.
  3. Find the average fare price, minimum age, and maximum age of passengers grouped by survival.

STOP – Answers Below

Did you give each problem a try before peeking at the answer? Remember that it is expected that your answers will be a little different from mine as there are often multiple ways of solving.

  1. Find the average fare price and count of passengers grouped by the embarked port and class.

SQL:

SELECT embarked, pclass, COUNT(*), AVG(fare)
FROM titanic
GROUP BY embarked, pclass

Pandas:

titanic.groupby(['embarked', 'pclass']).fare.mean()
titanic.groupby(['embarked', 'pclass']).embarked.count()

2. Find the min, max, and average age of passengers grouped by class and survival.

SQL:

SELECT pclass, survived, MIN(age), MAX(age), AVG(age)
FROM titanic
GROUP BY pclass, survived

Pandas:

titanic.groupby(['pclass','survived']).age.agg(['min','max','mean'])

3. Find the average fare price, minimum age, and maximum age of passengers grouped by survival.

SQL:

SELECT survived, AVG(fare), MIN(age), MAX(age)
FROM titanic
GROUP BY survived

Pandas:

titanic.groupby('survived').age.agg(['min', 'max'])
titanic.groupby('survived').fare.mean()

Conclusion

I hope you enjoyed this discussion of aggregate functions and grouping. I want to keep these articles short enough to digest easily, which is why I decided to stop here. I look forward to continuing on the journey of this article series with you, so make sure to join me in the next part!


Related Articles