The world’s leading publication for data science, AI, and ML professionals.

Group by in Pandas, SQL, and NoSQL

A ubiquitous operation in data analysis

Photo by Markus Spiske on Unsplash
Photo by Markus Spiske on Unsplash

The group by operations are commonly used in exploratory Data Analysis. Almost every data analysis tool and framework provides an implementation of group by operation.

The group by operation involves grouping numerical values based on the values of a discrete or categorical variable and then apply some form of transformation or aggregation.

For instance, finding the average house prices in different neighborhoods of a city is a group by operation. The prices are grouped based on neighborhoods and the mean function is applied.

The following figure illustrates how a typical group by operation is executed.

Group by operation (image by author)
Group by operation (image by author)

In this article, we will see how group by operations are done in Pandas, SQL, and MongoDB (NoSQL database). There will be a separate section for each tool and we will do the same examples in each section.


Pandas

Pandas is a highly popular data analysis and manipulation library for Python. The group by operations are done using the "groupby" function.

The dataset is stores in a dataframe called marketing.

(image by author)
(image by author)

Let’s first calculate the average amount spent per age group. We select both the column to be aggregates and the column to be used for grouping. The grouping column is passed to the groupby function. Then, we apply the desired aggregate function which is the mean function in this case.

marketing[['Age','AmountSpent']].groupby('Age').mean()
          AmountSpent                 
Age     
---------------------                         
Middle    1501.690945                 
Old       1432.126829                 
Young      558.623693

The middle aged customers tend to spend most on average.


We can group by multiple columns. The group by function accepts multiple columns in a list. Let’s also add the gender column in the groups to make the previous example more detailed.

marketing[['Age','Gender','AmountSpent']]
.groupby(['Age','Gender']).mean()
                          AmountSpent                 
Age          Gender
-------------------------------------                             
Middle       Female       1301.339806                 
               Male       1638.354305                 
Old          Female       1279.310078                 
               Male       1691.513158                 
Young        Female        501.257310                 
               Male        643.189655

It seems like males tend to spend more than females in all age groups.


We have seen multiple columns to be used grouping. Another common case is to aggregate multiple columns. For instance, we can calculate the average salary and total spent amount for categories in the own home column.

There are multiple ways of doing this task. The simplest one is as follows:

marketing[['OwnHome','Salary','AmountSpent']]
.groupby('OwnHome').agg({'Salary':'mean', 'AmountSpent':'sum'})
                 Salary     AmountSpent                 
OwnHome 
---------------------------------------                                   
Own        69758.720930          796258                 
Rent       41546.280992          420512

We pass a dictionary to the agg function that indicates the aggregations applied to each column. However, we can’t really tell from the results which aggregation function is applied the columns because it only shows the column names.

One solution overcome this issue is to use the NamedAgg method.

marketing[['OwnHome','Salary','AmountSpent']]
.groupby('OwnHome').agg(
     avgSalary = pd.NamedAgg('Salary','mean'),
     totalSpent = pd.NamedAgg('AmountSpent','sum')
)
               avgSalary     totalSpent                 
OwnHome 
---------------------------------------                                   
Own        69758.720930          796258                 
Rent       41546.280992          420512

SQL

SQL is a programming language used for managing data in a relational database. It allows for efficient and versatile ways to filter, transform, and analyze data stored in databases.

There are many relational database managements systems that use SQL. Although most of the SQL syntax is the same among them, there might be small differences. I will be using MySQL.

The dataset is stored in a table called marketing.

(image by author)
(image by author)

The first example includes finding the average spent amount for categories in the age column. We use the select statement with the group by clause. The columns are selected first. The aggregate function is specified while selecting the column. Then, we use the group by with the age column.

mysql> select Age, avg(AmountSpent)
    -> from marketing
    -> group by Age;
+--------+------------------+
| Age    | avg(AmountSpent) |
+--------+------------------+
| Middle |        1501.6909 |
| Old    |        1432.1268 |
| Young  |         558.6237 |
+--------+------------------+

In the second example, we add another column to be used for grouping. For SQL, we add multiple columns to the group by clause separated by comma.

mysql> select Age, Gender, avg(AmountSpent)
    -> from marketing
    -> group by Age, Gender;
+--------+--------+------------------+
| Age    | Gender | avg(AmountSpent) |
+--------+--------+------------------+
| Middle | Female |        1301.3398 |
| Middle | Male   |        1638.3543 |
| Old    | Female |        1279.3101 |
| Old    | Male   |        1691.5132 |
| Young  | Female |         501.2573 |
| Young  | Male   |         643.1897 |
+--------+--------+------------------+

The third example includes multiple aggregations. We calculate the average salary and total spent amount for categories in the own home column.

mysql> select OwnHome, avg(Salary) as avgSalary,
    -> sum(AmountSpent) as totalSpent
    -> from marketing
    -> group by OwnHome;
+---------+------------+------------+
| OwnHome | avgSalary  | totalSpent |
+---------+------------+------------+
| Own     | 69758.7209 |     796258 |
| Rent    | 41546.2810 |     420512 |
+---------+------------+------------+

For Pandas, we have used the NamedAgg method to rename the aggregated columns. For SQL, we can change the names using the "as" keyword.


MongoDB (NoSQL database)

NoSQL refers to non-SQL or non-relational database design. NoSQL also provides an organized way of storing data but not in tabular form.

There are several NoSQL databases used in the Data Science ecosystem. In this article, we will be using MongoDB which stores data as documents. A document in MongoDB consists of field-value pairs. Documents are organized in a structure called "collection". As an analogy, we can think of documents as rows in a table and collections as tables.

The dataset is stored in a collection called marketing. Here is a document in the marketing collection that represents an observation (i.e. a row in a table).

> db.marketing.find().limit(1).pretty()
{
 "_id" : ObjectId("6014dc988c628fa57a508088"),
 "Age" : "Middle",
 "Gender" : "Male",
 "OwnHome" : "Rent",
 "Married" : "Single",
 "Location" : "Close",
 "Salary" : 63600,
 "Children" : 0,
 "History" : "High",
 "Catalogs" : 6,
 "AmountSpent" : 1318
}

The db refers to the current database. We need to specify the collection name after the dot.

MongoDB provides the aggregate pipeline for data analysis operations such as filtering, transforming, filtering, and so on. For group by operations, we use the "$group" stage in the aggregate pipeline.

The first example is to calculate average spent amount for each age group.

> db.marketing.aggregate([
... { $group: { _id: "$Age", avgSpent: { $avg: "$AmountSpent" }}}
... ])
{ "_id" : "Old", "avgSpent" : 1432.1268292682928 }
{ "_id" : "Middle", "avgSpent" : 1501.6909448818897 }
{ "_id" : "Young", "avgSpent" : 558.6236933797909 }

The fields (i.e. column in table) used for grouping are passed to the group stage with the "_id" keyword. We assign a name for each aggregation that contains the field to be aggregated and the aggregation function.


The next example contains two fields used for grouping. We specify both fields with the "_id" keyword as follows:

> db.marketing.aggregate([
... {
...   $group:
...    {
...      _id: {OwnHome: "$OwnHome", Gender: "$Gender"},
...      avgSpent: { $avg: "$AmountSpent" }
...    }
... }
... ]).pretty()
{
 "_id" : {
  "ownhome" : "Rent",
  "gender" : "Male"
 },
 "avg_spent" : 996.1238532110092
}
{
 "_id" : {
  "ownhome" : "Own",
  "gender" : "Male"
 },
 "avg_spent" : 1742.0036231884058
}
{
 "_id" : {
  "ownhome" : "Own",
  "gender" : "Female"
 },
 "avg_spent" : 1314.4375
}
{
 "_id" : {
  "ownhome" : "Rent",
  "gender" : "Female"
 },
 "avg_spent" : 764.5
}

The final example contains two fields to be aggregated. We specify them separately in the group stage as follows.

> db.marketing.aggregate([
... { 
...   $group:
...     {
...       _id: "$OwnHome",
...       avgSalary: { $avg: "$Salary" },
...       totalSpent: { $sum: "$AmountSpent" }
...     }
... }
... ])
{ "_id" : "Own", "avgSalary" : 69758.72093023256, "totalSpent" : 796258 }
{ "_id" : "Rent", "avgSalary" : 41546.28099173554, "totalSpent" : 420512 }

Conclusion

We have done three examples that demonstrate how grouping is done in Pandas, SQL, and MongoDB. The goal of this article is to introduce the idea behind the group by operation and cover the basic syntax for each tool.

For each tool, we can performed more complicated group by operations. For instance, the results can be sorted based on an aggregated column. All of the tools provide a decent way to sort the results of the group by operations.

Thank you for reading. Please let me know if you have any feedback.


Related Articles