SQL (Structured Query Language) is used to manage databases that store data in tabular form with labelled rows and columns. NoSQL refers to non-SQL or non-relational Database design. It still provides an organized way of storing data but not in tabular form.
The common structures adapted by NoSQL databases to store data are key-value pairs, wide column, graph, or document. One of the popular NoSQL databases is MongoDB which stores data as documents.
A document in MongoDB consists of field-value pairs. Documents are organized in a structure called collection. To make an analogy, document can be considered as a row in a table and collection can be considered as an entire table.
In this article, we will compare a SQL database (MySQL) and a NoSQL database (MongoDB) in terms of the join operations. I also wrote an article that demonstrate how to perform basic operations to query SQL and NoSQL databases.
When we are to retrieve data from a relational database, the desired data is typically spread out to multiple tables. In such cases, we use SQL joins which are used to handle tasks that include selecting rows from two or more related tables.
In case of NoSQL, data for an item (or data point) is mostly stored in one collection. However, there might be cases where we need to span multiple collections to obtain all the data we need.
Thus, join queries are of crucial importance for both type of databases.
I have prepared two tables and collections with the same made up data. The first one contains information about the customers of a retail business. The second one contains information about the orders made by these customers.
Here is an entry (i.e. row) in the customer and orders tables:
+---------+------+----------+--------+
| cust_id | age | location | gender |
+---------+------+----------+--------+
| 1000 | 42 | Austin | female |
+---------+------+----------+--------+
+----------+------------+--------+---------+
| order_id | date | amount | cust_id |
+----------+------------+--------+---------+
| 1 | 2020-10-01 | 27.40 | 1001 |
+----------+------------+--------+---------+
Here is a document in the customer and orders collections:
{
"_id" : ObjectId("600e120b44284c416405dd7e"),
"cust_id" : "1000",
"age" : 42,
"location" : "Austin",
"gender" : "Female"
}
{
"_id" : ObjectId("600e141d44e046eb7c92c4fe"),
"order_id" : "1",
"date" : "2020-10-01",
"amount" : 27.4,
"cust_id" : "1001"
}
These two tables relate to each other by the cust_id column. The following examples include queries that require to use join operations. I will complete the same tasks in both databases so that we can see the differences and similarities.
Example 1
We want to see the orders made by customers who are older than 40.
NoSQL (MongoDB):
We can perform join operations using the "$lookup" keyword in aggregate pipelines. The aggregate pipelines are highly useful in MongoDB as they allow for performing many different kinds of operations in a single pipeline such as filtering, sorting, grouping, applying data aggregations and so on.
In this example, we first filter the documents based on customer age using the "$match" keyword and then select the documents from the orders table that fit the filtered condition.
> db.customer.aggregate([
... { $match: { age: {$gt:40} }},
... { $lookup: { from: "orders",
... localField: "cust_id",
... foreignField: "cust_id",
... as: "orders_docs" }}
... ]).pretty()
{
"_id" : ObjectId("600e120b44284c416405dd7e"),
"cust_id" : "1000",
"age" : 42,
"location" : "Austin",
"gender" : "Female",
"orders_docs" : [
{
"_id" : ObjectId("600e141d44e046eb7c92c4ff"),
"order_id" : "2",
"date" : "2020-10-01",
"amount" : 36.2,
"cust_id" : "1000"
},
{
"_id" : ObjectId("600e157c44e046eb7c92c50a"),
"order_id" : "13",
"date" : "2020-10-03",
"amount" : 46.1,
"cust_id" : "1000"
}
]
}
The local and foreign fields indicate the field names to be used to join the values. The output contains the documents from customer collection that fit the specified condition and the orders of those customers. There happens to be only one customer older than 40 and she has two orders.
SQL (MySQL)
We can join two tables in a select query as below.
mysql> select orders.*
-> from customer
-> join orders
-> on customer.cust_id = orders.cust_id
-> where customer.age > 40;
+----------+------------+--------+---------+
| order_id | date | amount | cust_id |
+----------+------------+--------+---------+
| 2 | 2020-10-01 | 36.20 | 1000 |
| 13 | 2020-10-03 | 46.10 | 1000 |
+----------+------------+--------+---------+
In a normal select statement, we only write the name of the columns to be selected. When we join tables, the columns are specified with the name of the table so that SQL knows where a column comes from.
Then we write the names of the tables with join keyword (e.g. customer join orders). The "on" keyword is used to indicate how these tables are related. The where statement filters the rows based on the given condition.
Example 2
We want to see the average order amount of customers from each location.
NoSQL (MongoDB):
This task requires to join two collections and then apply a data aggregation. Both can be achieved in the aggregate pipeline using the "$lookup" and "$group" stages.
> db.customer.aggregate([
... { $lookup: { from: "orders",
... localField: "cust_id",
... foreignField: "cust_id",
... as: "orders_docs" }},
... { $group: { _id: "$location",
... avg_amount: { $avg: "$amount" }}}
... ])
{ "_id" : "Houston", "avg_amount" : 44.450000 }
{ "_id" : "Dallas", "avg_amount" : 34.591667 }
{ "_id" : "Austin", "avg_amount" : 33.333333 }
After the join operation in the "$lookup" stage, we group the documents based on location by selecting "$location" as id. The next part specifies both the aggregation function which is "$avg" in our case and the field to be aggregated.
SQL (MySQL)
We apply the aggregation function while selecting the column. The results are grouped based on the location by using the group by clause.
mysql> select customer.location, avg(orders.amount) as avg_amount
-> from customer
-> join orders
-> on customer.cust_id = orders.cust_id
-> group by customer.location;
+----------+------------+
| location | avg_amount |
+----------+------------+
| Austin | 33.333333 |
| Dallas | 34.591667 |
| Houston | 44.450000 |
+----------+------------+
Example 3
In this example, we will add a filtering criterion to the previous example. For each location, let’s calculate the average order amount of customers who are younger than 30.
NoSQL (MongoDB):
We just need to add a "$match" stage at the beginning of the pipeline to apply the filtering criterion.
> db.customer.aggregate([
... { $match: { age: {$lt: 30} }},
... { $lookup: { from: "orders",
... localField: "cust_id",
... foreignField: "cust_id",
... as: "orders_docs" }},
... { $group: { _id: "$location",
... avg_amount: { $avg: "$amount" }}}
... ])
{ "_id" : "Houston", "avg_amount" : 35.625000 }
{ "_id" : "Dallas", "avg_amount" : 34.591667 }
{ "_id" : "Austin", "avg_amount" : 36.000000 }
In the "$match" stage, we specify the condition along with the field name to filtered.
SQL (MySQL)
The filtering criterion is added by using the where clause in the select statement.
mysql> select customer.location, avg(orders.amount) as avg_amount
-> from customer
-> join orders
-> on customer.cust_id = orders.cust_id
-> where customer.age < 30
-> group by customer.location;
+----------+------------+
| location | avg_amount |
+----------+------------+
| Austin | 36.000000 |
| Dallas | 34.591667 |
| Houston | 35.625000 |
+----------+------------+
Since the rows are filtered before the group by clause, we can use the where clause. If we need to apply a filter based on an aggregated values (e.g. avg_amount > 35), the having clause should be used.
Conclusion
The data we need to retrieve from a relational (SQL) or non-relational (NoSQL) database might be spread out to multiple tables or collections. Thus, it is very important to have a comprehensive understanding of join operations.
We have done three basic examples to demonstrate the idea and implementation of join operations in both SQL and NoSQL databases.
Thank you for reading. Please let me know if you have any feedback.