The world’s leading publication for data science, AI, and ML professionals.

How to Use SubQueries in SQL

Learn how to make your SQL queries more flexible using subqueries and reduce code clutter

[Disclaimer: This post contains some affiliate links to my Udemy Course]

Subqueries are a cool concept that we can use when programming in Structured Query Language (SQL). Starting with the problem: sometimes we want to access Data outside the context of our query to filter rows or perform some special filter based on an aggregation metric. When we want to do that, we may fall into the trap of creating too many checkpoints that depend on each other with several temporary tables. These temporary objects will not only take precious space from our server, but will also make the code harder to debug and a bit more confusing.

Subqueries are a simple SQL concept that will help us crunch several query statements into a single one. There are three major use cases (at least that I personally use a lot, but there are more) that rely on subqueries:

  • Accessing aggregations with a subquery.
  • Filtering rows from a table with the context of another.
  • Performing double-layer aggregations such as average of averages or an average of sums.

In this post, we’ll look into some examples of how to use subqueries in Sql and how they may bring flexibility to your data pipelines!

If you are looking to up your SQL game, using subqueries may make a massive difference. By adding them to your toolbelt, you will write more organized queries that don’t depend on a high number of sequential steps, making it easier to reduce the amount of code you write. On the downside, they may be a bit slow than their join alternatives -so, let’s investigate them next!


Creating the Data

For this example, we’ll use data from two hypothetical stores. Let’s imagine that we have two tables, one for store_aand other for store_b. Each table records purchases from customers and it’s built like a typically facts table – meaning that each purchase is associated with the following:

  • The employee_id that recorded it.
  • The customer_id that bought items in the transaction.
  • Every purchase row is multiplied for each product bought.
  • Every product was brought at a specific price.

Let’s create and fill our dummy data for both stores:

create table sandbox.store_a (
 transaction_id int,
    employee_id int not null,
    customer_id int not null,
    product_id int not null,
    price_paid numeric(19,4)
    );
create table sandbox.store_b (
 transaction_id int,
    employee_id int not null,
    customer_id int not null,
    product_id int not null,
    price_paid numeric(19,4)
    );
insert into sandbox.store_a (
 transaction_id, employee_id, customer_id, product_id, price_paid
    ) values 
    (1, 1, 1, 1, 9.99),
    (1, 1, 1, 2, 7.99),
    (1, 1, 1, 3, 3.99),
    (1, 1, 1, 4, 2.99),
    (2, 1, 2, 1, 9.99),
    (2, 1, 2, 3, 3.99),
    (3, 2, 1, 1, 9.99),
    (4, 2, 1, 2, 7.99),
    (4, 2, 1, 3, 3.99),
    (5, 2, 2, 1, 9.99),
    (6, 1, 3, 1, 3.99);

insert into sandbox.store_b (
 transaction_id, employee_id, customer_id, product_id, price_paid
    ) values 
    (1, 5, 10, 1, 9.99),
    (1, 5, 10, 3, 3.99),
    (1, 5, 10, 4, 2.99),
    (2, 1, 11, 1, 9.99),
    (3, 1, 11, 3, 3.99),
    (4, 6, 11, 1, 9.99),
    (4, 6, 11, 2, 7.99),
    (5, 7, 10, 3, 3.99);

Our tables look like the following:

Every store contains different purchases from different customers. To learn about subqueries in this context, we will use two use cases:

  • How can we filter the products that have been bought at least three times in store_b?
  • How can we compute the average of the transactions in store_a ?
  • On store_a , how we can filter the transactions that have a value higher than the average?

Using a SubQuery as a Filter

Let’s start with our first example – how can we filter the rows with products that have been bought at least three times in store_b?

If we had to draw a mental plan on how to do this:

  • Count the number of times each product was bought in store_b ;
  • Store the product_ids where that number is higher or equal to three;
  • Filter those products from store_b ;

In "normal" pipeline fashion, one could do this in two steps, using a temporary table. For instance:

create temporary table sandbox.top_products as (
 select product_id
    from sandbox.store_b
    group by product_id 
    having count(product_id) >= 3
);
select a.* from
sandbox.store_b as a
inner join 
sandbox.top_products as b
on a.product_id = b.product_id;

Creating our top_products temporary table, and using its domain to restrict store_b rows using an inner join is a possibility. As an alternative, we can pass our top_products with a subquery in the where clause:

select * from 
sandbox.store_b
where product_id IN (
 select product_id from sandbox.top_products
);

Notice that we can embed the product_ids in the where clause of the query, by passing a select statement to it. But, if we can directly pass queries to where clauses, we can now ditch our temporary table! Let’s see:

select * from 
sandbox.store_b
where product_id IN (
    select product_id
    from sandbox.store_b
    group by product_id 
    having count(product_id) >= 3
);

Done! We’ve crunched everything into the same query using our first subquery example. A tip: always use indentation to guide the reader of how your query behaves. With subqueries, leaving it in an indented block, makes it easier to read and understand.

In the end, the result of all the queries above is the same – we output the product_ids 1 and 3 from the store_b :


Multilayer Aggregations

In the case, where we have a denormalized table, we may want to perform some multi-level aggregations on the data. For instance, if we want to compute the average of all transactions, we can’t apply an average directly, as our table is oriented to product_idsand not to transaction_ids .

Again, if we would apply "normal" data pipeline rules, we would probably break this into two queries:

create temporary table sandbox.average_price as (
  select transaction_id, sum(price_paid) as total_value
  from sandbox.store_a
  group by transaction_id
);
select avg(total_value) from sandbox.average_price;

With the last select statement, we know that the average transaction on store_a is, approximately, 12.48 €. Again, this is another excellent use case for a subquery! Can you build the subquery syntax on your own?

Let’s see:

select avg(average_price.total_value) as average_transaction from (
   select transaction_id, sum(price_paid) as total_value
   from sandbox.store_a
   group by transaction_id
   ) as average_price
;

This time, our subquery goes inside the from clause! Subqueries are extremely flexible and enable us to fit them in the most famous SQL clauses .

The result from the query above is exactly the same as the two-table approach:


Filtering the table based on an Aggregation

Another use case where subqueries may fit happens when we use SQL variables in some scenarios. For instance, imagine that we would like to filter transactions that have a value higher than the average (the output must retain the original product-oriented row) – can we do this in a couple of queries?

First, let’s do the multi-table approach:

  • We start with computing the sum of each transaction.
  • We insert the average value of all transactions into a variable.
  • We filter the group of transactions where the total value is higher then the mean of all transactions.

Without using subqueries, this could be implemented in the following way:

create temporary table sandbox.average_price as (
  select transaction_id, sum(price_paid) as total_value
  from sandbox.store_a
  group by transaction_id
);
select @avg_transaction:= avg(total_value)
from sandbox.average_price;
create temporary table sandbox.transaction_over_avg as (
 select distinct transaction_id
    from sandbox.average_price
    where total_value > @avg_transaction
    );

select a.*
from sandbox.store_a as a
inner join
sandbox.transaction_over_avg as b
on a.transaction_id = b.transaction_id;

The final select would yield the information about transactions 1 and 2, the only purchases that have a value higher than the average:

If we want to implement this using some version of subqueries, we can combine them with variables:

select @avg_transaction:= avg(agg_table.total_value)
from (
 select transaction_id, sum(price_paid) as total_value
 from sandbox.store_a
 group by transaction_id
) as agg_table;
select * 
from sandbox.store_a
where transaction_id in (
 select transaction_id
 from sandbox.store_a
 group by transaction_id
    having sum(price_paid) > @avg_transaction
)

As you can see, this reduced our code a bit and we used subqueries both when creating the variable @avg_transaction , and when filtering our original table. One could argue that three-level subqueries are possible, but may make the code a bit complicated to read so there’s no silver bullet here and that’s why a two-query solution fits.

Sometimes, subqueries may not even be the most clear way to solve a problem! Ideally, we would like to strike the balance between complex and long code -in general, shorter code is better but you may be doing it with too many subquery layers, making it hard to debug and understand (and also making your query slower).

Before we leave, two arguments against subqueries:

  • temporary tables can also be extremely relevant when you want to save the state of your data for debugging purposes – with subqueries, you lose that layer of extra validation that may be required.
  • subqueries are typically slower than join alternatives, specially in MySQL. Although more readable, they may have some performance issues on some tasks.

Thank you for taking the time to read this post! Hope you’ve enjoyed learning about subqueries and that you are able to apply them on your day-to-day. For data scientists, mastering SQL is a very important skill as I’ve detailed in my last post and using it to build more efficient data pipelines is turning into a must-have.

Subqueries are an often overlooked skill – they are extremely easy to understand if you already have some basic knowledge of SQL and can transform the way you do queries in a couple of minutes. Nevertheless, be wary that they are typically slower than join alternatives, so use them with caution!

I’ve set up a course on learning SQL from Scratch on Udemy where I’ve included a deep dive into these concepts and more – the course is suitable for absolute beginners and I would love to have you around!

Join Medium with my referral link – Ivo Bernardo


Related Articles