[Disclaimer: This post contains some affiliate links to my Udemy Course]
Subqueries are a cool concept that we can use when programming in Structured Query Language (SQL). Starting with the problem: sometimes we want to access Data outside the context of our query to filter rows or perform some special filter based on an aggregation metric. When we want to do that, we may fall into the trap of creating too many checkpoints that depend on each other with several temporary tables. These temporary objects will not only take precious space from our server, but will also make the code harder to debug and a bit more confusing.
Subqueries are a simple SQL concept that will help us crunch several query statements into a single one. There are three major use cases (at least that I personally use a lot, but there are more) that rely on subqueries:
- Accessing aggregations with a subquery.
- Filtering rows from a table with the context of another.
- Performing double-layer aggregations such as average of averages or an average of sums.
In this post, we’ll look into some examples of how to use subqueries in Sql and how they may bring flexibility to your data pipelines!
If you are looking to up your SQL game, using subqueries may make a massive difference. By adding them to your toolbelt, you will write more organized queries that don’t depend on a high number of sequential steps, making it easier to reduce the amount of code you write. On the downside, they may be a bit slow than their join alternatives -so, let’s investigate them next!
Creating the Data
For this example, we’ll use data from two hypothetical stores. Let’s imagine that we have two tables, one for store_a
and other for store_b
. Each table records purchases from customers and it’s built like a typically facts table – meaning that each purchase is associated with the following:
- The employee_id that recorded it.
- The customer_id that bought items in the transaction.
- Every purchase row is multiplied for each product bought.
- Every product was brought at a specific price.
Let’s create and fill our dummy data for both stores:
create table sandbox.store_a (
transaction_id int,
employee_id int not null,
customer_id int not null,
product_id int not null,
price_paid numeric(19,4)
);
create table sandbox.store_b (
transaction_id int,
employee_id int not null,
customer_id int not null,
product_id int not null,
price_paid numeric(19,4)
);
insert into sandbox.store_a (
transaction_id, employee_id, customer_id, product_id, price_paid
) values
(1, 1, 1, 1, 9.99),
(1, 1, 1, 2, 7.99),
(1, 1, 1, 3, 3.99),
(1, 1, 1, 4, 2.99),
(2, 1, 2, 1, 9.99),
(2, 1, 2, 3, 3.99),
(3, 2, 1, 1, 9.99),
(4, 2, 1, 2, 7.99),
(4, 2, 1, 3, 3.99),
(5, 2, 2, 1, 9.99),
(6, 1, 3, 1, 3.99);
insert into sandbox.store_b (
transaction_id, employee_id, customer_id, product_id, price_paid
) values
(1, 5, 10, 1, 9.99),
(1, 5, 10, 3, 3.99),
(1, 5, 10, 4, 2.99),
(2, 1, 11, 1, 9.99),
(3, 1, 11, 3, 3.99),
(4, 6, 11, 1, 9.99),
(4, 6, 11, 2, 7.99),
(5, 7, 10, 3, 3.99);
Our tables look like the following:
Every store contains different purchases from different customers. To learn about subqueries in this context, we will use two use cases:
- How can we filter the products that have been bought at least three times in
store_b
? - How can we compute the average of the transactions in
store_a
? - On
store_a
, how we can filter the transactions that have a value higher than the average?
Using a SubQuery as a Filter
Let’s start with our first example – how can we filter the rows with products that have been bought at least three times in store_b
?
If we had to draw a mental plan on how to do this:
- Count the number of times each product was bought in
store_b
; - Store the
product_ids
where that number is higher or equal to three; - Filter those products from
store_b
;
In "normal" pipeline fashion, one could do this in two steps, using a temporary table. For instance:
create temporary table sandbox.top_products as (
select product_id
from sandbox.store_b
group by product_id
having count(product_id) >= 3
);
select a.* from
sandbox.store_b as a
inner join
sandbox.top_products as b
on a.product_id = b.product_id;
Creating our top_products
temporary table, and using its domain to restrict store_b
rows using an inner join is a possibility. As an alternative, we can pass our top_products
with a subquery in the where clause:
select * from
sandbox.store_b
where product_id IN (
select product_id from sandbox.top_products
);
Notice that we can embed the product_ids
in the where
clause of the query, by passing a select statement to it. But, if we can directly pass queries to where
clauses, we can now ditch our temporary table! Let’s see:
select * from
sandbox.store_b
where product_id IN (
select product_id
from sandbox.store_b
group by product_id
having count(product_id) >= 3
);
Done! We’ve crunched everything into the same query using our first subquery example. A tip: always use indentation to guide the reader of how your query behaves. With subqueries, leaving it in an indented block, makes it easier to read and understand.
In the end, the result of all the queries above is the same – we output the product_ids
1 and 3 from the store_b
:
Multilayer Aggregations
In the case, where we have a denormalized table, we may want to perform some multi-level aggregations on the data. For instance, if we want to compute the average of all transactions, we can’t apply an average directly, as our table is oriented to product_ids
and not to transaction_ids
.
Again, if we would apply "normal" data pipeline rules, we would probably break this into two queries:
create temporary table sandbox.average_price as (
select transaction_id, sum(price_paid) as total_value
from sandbox.store_a
group by transaction_id
);
select avg(total_value) from sandbox.average_price;
With the last select
statement, we know that the average transaction on store_a
is, approximately, 12.48 €. Again, this is another excellent use case for a subquery! Can you build the subquery syntax on your own?
Let’s see:
select avg(average_price.total_value) as average_transaction from (
select transaction_id, sum(price_paid) as total_value
from sandbox.store_a
group by transaction_id
) as average_price
;
This time, our subquery goes inside the from
clause! Subqueries are extremely flexible and enable us to fit them in the most famous SQL clauses
.
The result from the query above is exactly the same as the two-table approach:
Filtering the table based on an Aggregation
Another use case where subqueries may fit happens when we use SQL variables in some scenarios. For instance, imagine that we would like to filter transactions that have a value higher than the average (the output must retain the original product-oriented row) – can we do this in a couple of queries?
First, let’s do the multi-table approach:
- We start with computing the sum of each transaction.
- We insert the average value of all transactions into a variable.
- We filter the group of transactions where the total value is higher then the mean of all transactions.
Without using subqueries, this could be implemented in the following way:
create temporary table sandbox.average_price as (
select transaction_id, sum(price_paid) as total_value
from sandbox.store_a
group by transaction_id
);
select @avg_transaction:= avg(total_value)
from sandbox.average_price;
create temporary table sandbox.transaction_over_avg as (
select distinct transaction_id
from sandbox.average_price
where total_value > @avg_transaction
);
select a.*
from sandbox.store_a as a
inner join
sandbox.transaction_over_avg as b
on a.transaction_id = b.transaction_id;
The final select
would yield the information about transactions 1 and 2, the only purchases that have a value higher than the average:
If we want to implement this using some version of subqueries, we can combine them with variables:
select @avg_transaction:= avg(agg_table.total_value)
from (
select transaction_id, sum(price_paid) as total_value
from sandbox.store_a
group by transaction_id
) as agg_table;
select *
from sandbox.store_a
where transaction_id in (
select transaction_id
from sandbox.store_a
group by transaction_id
having sum(price_paid) > @avg_transaction
)
As you can see, this reduced our code a bit and we used subqueries both when creating the variable @avg_transaction
, and when filtering our original table. One could argue that three-level subqueries are possible, but may make the code a bit complicated to read so there’s no silver bullet here and that’s why a two-query solution fits.
Sometimes, subqueries may not even be the most clear way to solve a problem! Ideally, we would like to strike the balance between complex and long code -in general, shorter code is better but you may be doing it with too many subquery layers, making it hard to debug and understand (and also making your query slower).
Before we leave, two arguments against subqueries:
- temporary tables can also be extremely relevant when you want to save the state of your data for debugging purposes – with subqueries, you lose that layer of extra validation that may be required.
- subqueries are typically slower than join alternatives, specially in MySQL. Although more readable, they may have some performance issues on some tasks.
Thank you for taking the time to read this post! Hope you’ve enjoyed learning about subqueries and that you are able to apply them on your day-to-day. For data scientists, mastering SQL is a very important skill as I’ve detailed in my last post and using it to build more efficient data pipelines is turning into a must-have.
Subqueries are an often overlooked skill – they are extremely easy to understand if you already have some basic knowledge of SQL and can transform the way you do queries in a couple of minutes. Nevertheless, be wary that they are typically slower than join alternatives, so use them with caution!
I’ve set up a course on learning SQL from Scratch on Udemy where I’ve included a deep dive into these concepts and more – the course is suitable for absolute beginners and I would love to have you around!