The group by operations are commonly used in exploratory Data Analysis. Almost every data analysis tool and framework provides an implementation of group by operation.
The group by operation involves grouping numerical values based on the values of a discrete or categorical variable and then apply some form of transformation or aggregation.
For instance, finding the average house prices in different neighborhoods of a city is a group by operation. The prices are grouped based on neighborhoods and the mean function is applied.
The following figure illustrates how a typical group by operation is executed.

In this article, we will see how group by operations are done in Pandas, SQL, and MongoDB (NoSQL database). There will be a separate section for each tool and we will do the same examples in each section.
Pandas
Pandas is a highly popular data analysis and manipulation library for Python. The group by operations are done using the "groupby" function.
The dataset is stores in a dataframe called marketing.

Let’s first calculate the average amount spent per age group. We select both the column to be aggregates and the column to be used for grouping. The grouping column is passed to the groupby function. Then, we apply the desired aggregate function which is the mean function in this case.
marketing[['Age','AmountSpent']].groupby('Age').mean()
AmountSpent
Age
---------------------
Middle 1501.690945
Old 1432.126829
Young 558.623693
The middle aged customers tend to spend most on average.
We can group by multiple columns. The group by function accepts multiple columns in a list. Let’s also add the gender column in the groups to make the previous example more detailed.
marketing[['Age','Gender','AmountSpent']]
.groupby(['Age','Gender']).mean()
AmountSpent
Age Gender
-------------------------------------
Middle Female 1301.339806
Male 1638.354305
Old Female 1279.310078
Male 1691.513158
Young Female 501.257310
Male 643.189655
It seems like males tend to spend more than females in all age groups.
We have seen multiple columns to be used grouping. Another common case is to aggregate multiple columns. For instance, we can calculate the average salary and total spent amount for categories in the own home column.
There are multiple ways of doing this task. The simplest one is as follows:
marketing[['OwnHome','Salary','AmountSpent']]
.groupby('OwnHome').agg({'Salary':'mean', 'AmountSpent':'sum'})
Salary AmountSpent
OwnHome
---------------------------------------
Own 69758.720930 796258
Rent 41546.280992 420512
We pass a dictionary to the agg function that indicates the aggregations applied to each column. However, we can’t really tell from the results which aggregation function is applied the columns because it only shows the column names.
One solution overcome this issue is to use the NamedAgg method.
marketing[['OwnHome','Salary','AmountSpent']]
.groupby('OwnHome').agg(
avgSalary = pd.NamedAgg('Salary','mean'),
totalSpent = pd.NamedAgg('AmountSpent','sum')
)
avgSalary totalSpent
OwnHome
---------------------------------------
Own 69758.720930 796258
Rent 41546.280992 420512
SQL
SQL is a programming language used for managing data in a relational database. It allows for efficient and versatile ways to filter, transform, and analyze data stored in databases.
There are many relational database managements systems that use SQL. Although most of the SQL syntax is the same among them, there might be small differences. I will be using MySQL.
The dataset is stored in a table called marketing.

The first example includes finding the average spent amount for categories in the age column. We use the select statement with the group by clause. The columns are selected first. The aggregate function is specified while selecting the column. Then, we use the group by with the age column.
mysql> select Age, avg(AmountSpent)
-> from marketing
-> group by Age;
+--------+------------------+
| Age | avg(AmountSpent) |
+--------+------------------+
| Middle | 1501.6909 |
| Old | 1432.1268 |
| Young | 558.6237 |
+--------+------------------+
In the second example, we add another column to be used for grouping. For SQL, we add multiple columns to the group by clause separated by comma.
mysql> select Age, Gender, avg(AmountSpent)
-> from marketing
-> group by Age, Gender;
+--------+--------+------------------+
| Age | Gender | avg(AmountSpent) |
+--------+--------+------------------+
| Middle | Female | 1301.3398 |
| Middle | Male | 1638.3543 |
| Old | Female | 1279.3101 |
| Old | Male | 1691.5132 |
| Young | Female | 501.2573 |
| Young | Male | 643.1897 |
+--------+--------+------------------+
The third example includes multiple aggregations. We calculate the average salary and total spent amount for categories in the own home column.
mysql> select OwnHome, avg(Salary) as avgSalary,
-> sum(AmountSpent) as totalSpent
-> from marketing
-> group by OwnHome;
+---------+------------+------------+
| OwnHome | avgSalary | totalSpent |
+---------+------------+------------+
| Own | 69758.7209 | 796258 |
| Rent | 41546.2810 | 420512 |
+---------+------------+------------+
For Pandas, we have used the NamedAgg method to rename the aggregated columns. For SQL, we can change the names using the "as" keyword.
MongoDB (NoSQL database)
NoSQL refers to non-SQL or non-relational database design. NoSQL also provides an organized way of storing data but not in tabular form.
There are several NoSQL databases used in the Data Science ecosystem. In this article, we will be using MongoDB which stores data as documents. A document in MongoDB consists of field-value pairs. Documents are organized in a structure called "collection". As an analogy, we can think of documents as rows in a table and collections as tables.
The dataset is stored in a collection called marketing. Here is a document in the marketing collection that represents an observation (i.e. a row in a table).
> db.marketing.find().limit(1).pretty()
{
"_id" : ObjectId("6014dc988c628fa57a508088"),
"Age" : "Middle",
"Gender" : "Male",
"OwnHome" : "Rent",
"Married" : "Single",
"Location" : "Close",
"Salary" : 63600,
"Children" : 0,
"History" : "High",
"Catalogs" : 6,
"AmountSpent" : 1318
}
The db refers to the current database. We need to specify the collection name after the dot.
MongoDB provides the aggregate pipeline for data analysis operations such as filtering, transforming, filtering, and so on. For group by operations, we use the "$group" stage in the aggregate pipeline.
The first example is to calculate average spent amount for each age group.
> db.marketing.aggregate([
... { $group: { _id: "$Age", avgSpent: { $avg: "$AmountSpent" }}}
... ])
{ "_id" : "Old", "avgSpent" : 1432.1268292682928 }
{ "_id" : "Middle", "avgSpent" : 1501.6909448818897 }
{ "_id" : "Young", "avgSpent" : 558.6236933797909 }
The fields (i.e. column in table) used for grouping are passed to the group stage with the "_id" keyword. We assign a name for each aggregation that contains the field to be aggregated and the aggregation function.
The next example contains two fields used for grouping. We specify both fields with the "_id" keyword as follows:
> db.marketing.aggregate([
... {
... $group:
... {
... _id: {OwnHome: "$OwnHome", Gender: "$Gender"},
... avgSpent: { $avg: "$AmountSpent" }
... }
... }
... ]).pretty()
{
"_id" : {
"ownhome" : "Rent",
"gender" : "Male"
},
"avg_spent" : 996.1238532110092
}
{
"_id" : {
"ownhome" : "Own",
"gender" : "Male"
},
"avg_spent" : 1742.0036231884058
}
{
"_id" : {
"ownhome" : "Own",
"gender" : "Female"
},
"avg_spent" : 1314.4375
}
{
"_id" : {
"ownhome" : "Rent",
"gender" : "Female"
},
"avg_spent" : 764.5
}
The final example contains two fields to be aggregated. We specify them separately in the group stage as follows.
> db.marketing.aggregate([
... {
... $group:
... {
... _id: "$OwnHome",
... avgSalary: { $avg: "$Salary" },
... totalSpent: { $sum: "$AmountSpent" }
... }
... }
... ])
{ "_id" : "Own", "avgSalary" : 69758.72093023256, "totalSpent" : 796258 }
{ "_id" : "Rent", "avgSalary" : 41546.28099173554, "totalSpent" : 420512 }
Conclusion
We have done three examples that demonstrate how grouping is done in Pandas, SQL, and MongoDB. The goal of this article is to introduce the idea behind the group by operation and cover the basic syntax for each tool.
For each tool, we can performed more complicated group by operations. For instance, the results can be sorted based on an aggregated column. All of the tools provide a decent way to sort the results of the group by operations.
Thank you for reading. Please let me know if you have any feedback.