In this article, I’m going to introduce you to the SQL QUALIFY
clause: a trick that will help you write shorter and more readable SQL queries.
If you work in Data Science/Analytics or Machine Learning, this is a must-have SQL tool for four reasons:
- First, because it can help you write much shorter queries, simplifying your data collection pipelines. In my experience,
QUALIFY
reduces the length of many queries by up to 50%. - Second, because it will help you write much more succinct queries. This matters because, as Paul Graham famously said, succinctness is power in that it makes your code more readable and maintainable.
- Third,
QUALIFY
helps you write more efficient queries which reduce costs and speed up development time. - Fourth, it’s an easy way to show off. I’m not saying that you’re going to win any innovation prizes for using
QUALIFY
, but because it’s a relatively new clause, chances are that your boss might not have heard of it. At any rate, it’ll be fun to show your team a new trick.
What does the QUALIFY clause do?
Simply put, the QUALIFY
clause enables you to filter the results of window functions without needing to use nested subqueries or the WITH
clause. It allows you to replace these bulky code blocks with a single line of code. It’s sort of like how the HAVING
clause enables you to filter the results of GROUP BY
aggregations.
the
QUALIFY
clause enables you to filter the results of window functions
If that sounds like gobbledygook, give me 2 minutes of your time and I’ll try my best to explain.
It all starts with window functions
If you’re not familiar with window functions, you might want to check out this article, which introduces SQL window functions through 4 real-world use cases. But, in a nutshell, window functions can be used to perform calculations on a set of rows in a table without aggregating or collapsing the rows, which is very useful when you want to preserve the original structure of the table.
For example, let’s say we have a table named orders
which lists all the product orders a company receives. Each row represents an order and records the date
the order was placed, the customer
who placed the order, and the order_amount
(in dollars).
Using a window function, we could add in a new column total_order_amount
which represents the total order amount per customer. We could simply write:
SELECT
date,
customer,
order_amount,
SUM(order_amount) OVER(PARTITION BY customer) AS total_order_amount
FROM orders
ORDER BY date
which would return:
As you can see, the window function SUM(order_amount) OVER(PARTITION BY customer)
effectively partitioned our table into different "windows" (one for each customer) and then calculated the total_order_amount
for each of these windows. All of this was achieved without using a GROUP BY
aggregation, allowing us to retain the same number of rows.
Enter the QUALIFY clause
Now that we’ve got our window function set up, let’s imagine that we want to send a Thank You email to our highest-spending customers. Before we can do this, we need to filter this table to see the orders of customers whose total_order_amount
exceeded $5,000. In other words, we want to produce something like this:
A typical way of producing this would be to use a WITH
clause:
WITH my_order_table AS (
SELECT
date,
customer,
order_amount,
SUM(order_amount) OVER(PARTITION BY customer) AS total_order_amount
FROM orders
)
SELECT
*
FROM my_order_table
WHERE total_order_amount > 5000
ORDER BY date
… or a nested subquery:
SELECT
*
FROM
(SELECT
date,
customer,
order_amount,
SUM(order_amount) OVER(PARTITION BY customer) AS total_order_amount
FROM orders) AS A
WHERE A.total_order_amount > 5000
ORDER BY A.date
However, while both of these are perfectly acceptable solutions, the QUALIFY
clause enables us to simplify and shorten this code significantly. We could simply write:
SELECT
date,
customer,
order_amount,
SUM(order_amount) OVER(PARTITION BY customer) AS total_order_amount
FROM orders
QUALIFY total_order_amount > 5000
which would return the exact same result.
The QUALIFY statement is pure syntactic sugar
The great thing about the QUALIFY
clause is that it is evaluated after window functions are computed. This means that you can reference the window function column using the alias you assigned to it. In the example above, I demonstrated this by writing QUALIFY total_order_amount..
, referencing the alias total_order_amount
which I assigned at the point of creating the original window function.
It’s also not necessary to write the window function in the SELECT
clause. For example, if I didn’t want to actually show the total order amount as a separate column, but still needed to filter on this window function, I could write:
SELECT
date,
customer,
order_amount,
FROM orders
QUALIFY SUM(order_amount) OVER(PARTITION BY customer) > 5000
which would return the same result, but without the total_order_amount
column:
On top of all this, the QUALIFY
clause can also be slightly more efficient than using subqueries or WITH
. Not only will this speed up your development time; it also helps a ton with cost optimisation.
Another example: Select the top-N-per-group
A common use case of the QUALIFY
clause is to select the top-N-per-group. For example, imagine that we had the following table sales
which records every sale that a shop makes on a given day. Each row represents a single transaction and contains the timestamp of the sale, the product purchased, and the total amount of money spent in that sale.
Using the window function:
SELECT
Timestamp,
Product,
Sale_value,
ROW_NUMBER() OVER(PARTITION BY Product ORDER BY Timestamp ASC) as row_num
FROM sales
ORDER BY Product, row_num
we could assign a row number to each sale, within each product category:
and then, using a QUALIFY
statement, we could filter the result to only show the first row in each category:
SELECT
Timestamp,
Product,
Sale_value,
ROW_NUMBER() OVER(PARTITION BY Product ORDER BY Timestamp ASC) as row_num
FROM sales
QUALIFY row_num = 1
ORDER BY Product
And there you have it – the magic of QUALIFY
.
One final thing
The QUALIFY
clause is not yet available in all dialects of SQL. At the time of writing, it’s supported by many of the big platforms including Bigquery, Databricks, Snowflake, and H2. If your platform/dialect is not compatible, you’ll have to rely on subqueries/WITH
for now, but you can keep an eye on Modern SQL to see when it becomes available.