The world’s leading publication for data science, AI, and ML professionals.

SQL users: Halve the length of (some of) your queries with this one trick

The QUALIFY clause is pure syntactic sugar

In this article, I’m going to introduce you to the SQL QUALIFY clause: a trick that will help you write shorter and more readable SQL queries.

If you work in Data Science/Analytics or Machine Learning, this is a must-have SQL tool for four reasons:

  1. First, because it can help you write much shorter queries, simplifying your data collection pipelines. In my experience, QUALIFY reduces the length of many queries by up to 50%.
  2. Second, because it will help you write much more succinct queries. This matters because, as Paul Graham famously said, succinctness is power in that it makes your code more readable and maintainable.
  3. Third, QUALIFY helps you write more efficient queries which reduce costs and speed up development time.
  4. Fourth, it’s an easy way to show off. I’m not saying that you’re going to win any innovation prizes for using QUALIFY, but because it’s a relatively new clause, chances are that your boss might not have heard of it. At any rate, it’ll be fun to show your team a new trick.

What does the QUALIFY clause do?

Simply put, the QUALIFY clause enables you to filter the results of window functions without needing to use nested subqueries or the WITH clause. It allows you to replace these bulky code blocks with a single line of code. It’s sort of like how the HAVING clause enables you to filter the results of GROUP BY aggregations.

the QUALIFY clause enables you to filter the results of window functions

If that sounds like gobbledygook, give me 2 minutes of your time and I’ll try my best to explain.

It all starts with window functions

If you’re not familiar with window functions, you might want to check out this article, which introduces SQL window functions through 4 real-world use cases. But, in a nutshell, window functions can be used to perform calculations on a set of rows in a table without aggregating or collapsing the rows, which is very useful when you want to preserve the original structure of the table.

For example, let’s say we have a table named orders which lists all the product orders a company receives. Each row represents an order and records the date the order was placed, the customer who placed the order, and the order_amount (in dollars).

Using a window function, we could add in a new column total_order_amount which represents the total order amount per customer. We could simply write:

SELECT
  date,
  customer,
  order_amount, 
  SUM(order_amount) OVER(PARTITION BY customer) AS total_order_amount
FROM orders
ORDER BY date

which would return:

As you can see, the window function SUM(order_amount) OVER(PARTITION BY customer) effectively partitioned our table into different "windows" (one for each customer) and then calculated the total_order_amount for each of these windows. All of this was achieved without using a GROUP BY aggregation, allowing us to retain the same number of rows.

Enter the QUALIFY clause

Now that we’ve got our window function set up, let’s imagine that we want to send a Thank You email to our highest-spending customers. Before we can do this, we need to filter this table to see the orders of customers whose total_order_amount exceeded $5,000. In other words, we want to produce something like this:

A typical way of producing this would be to use a WITH clause:

WITH my_order_table AS (
  SELECT
    date,
    customer,
    order_amount, 
    SUM(order_amount) OVER(PARTITION BY customer) AS total_order_amount
  FROM orders
)

SELECT
  *
FROM my_order_table
WHERE total_order_amount > 5000
ORDER BY date

… or a nested subquery:

SELECT
  *
FROM 
    (SELECT
      date,
      customer,
      order_amount, 
      SUM(order_amount) OVER(PARTITION BY customer) AS total_order_amount
    FROM orders) AS A

WHERE A.total_order_amount > 5000
ORDER BY A.date

However, while both of these are perfectly acceptable solutions, the QUALIFY clause enables us to simplify and shorten this code significantly. We could simply write:

SELECT
  date,
  customer,
  order_amount, 
  SUM(order_amount) OVER(PARTITION BY customer) AS total_order_amount
FROM orders
QUALIFY total_order_amount > 5000

which would return the exact same result.


The QUALIFY statement is pure syntactic sugar

The great thing about the QUALIFY clause is that it is evaluated after window functions are computed. This means that you can reference the window function column using the alias you assigned to it. In the example above, I demonstrated this by writing QUALIFY total_order_amount.., referencing the alias total_order_amount which I assigned at the point of creating the original window function.

It’s also not necessary to write the window function in the SELECT clause. For example, if I didn’t want to actually show the total order amount as a separate column, but still needed to filter on this window function, I could write:

SELECT
  date,
  customer,
  order_amount, 
FROM orders
QUALIFY SUM(order_amount) OVER(PARTITION BY customer) > 5000

which would return the same result, but without the total_order_amount column:

On top of all this, the QUALIFY clause can also be slightly more efficient than using subqueries or WITH. Not only will this speed up your development time; it also helps a ton with cost optimisation.

Another example: Select the top-N-per-group

A common use case of the QUALIFY clause is to select the top-N-per-group. For example, imagine that we had the following table sales which records every sale that a shop makes on a given day. Each row represents a single transaction and contains the timestamp of the sale, the product purchased, and the total amount of money spent in that sale.

Using the window function:

SELECT
  Timestamp,
  Product,
  Sale_value,
  ROW_NUMBER() OVER(PARTITION BY Product ORDER BY Timestamp ASC) as row_num
FROM sales
ORDER BY Product, row_num

we could assign a row number to each sale, within each product category:

and then, using a QUALIFY statement, we could filter the result to only show the first row in each category:

SELECT
  Timestamp,
  Product,
  Sale_value,
  ROW_NUMBER() OVER(PARTITION BY Product ORDER BY Timestamp ASC) as row_num
FROM sales
QUALIFY row_num = 1
ORDER BY Product

And there you have it – the magic of QUALIFY.

One final thing

The QUALIFY clause is not yet available in all dialects of SQL. At the time of writing, it’s supported by many of the big platforms including Bigquery, Databricks, Snowflake, and H2. If your platform/dialect is not compatible, you’ll have to rely on subqueries/WITH for now, but you can keep an eye on Modern SQL to see when it becomes available.


Related Articles