SQL Window Functions: The Intuitive Guide

Intuitively learn the different components of window functions using Postgres and apply them to your data workflow

Do Lee
Towards Data Science

--

Introduction

The main goal is to understand the fundamental concepts of window functions and apply them to your SQL workflow. Window functions are nothing more than FOPO (Function Over PartitionBy OrderBy). Here’s a quick outline of what will be covered in this article.

  • The GROUP BY
  • The Window Function — 4 Major Components (The FOPO)
  • The func() OVER ()
  • The PARTITION BY
  • The ORDER BY
  • The PARTITION BY & ORDER BY
  • Summary
  • Appendix

The GROUP BY

If you have a good understanding of the GROUP BY clause, you already have a head start on understanding window functions. In a nutshell, the GROUP BY clause condenses a table into fewer rows or outputs unique rows. When grouping by a column, the SQL query outputs a single row for every distinct value found under that column. When grouping by multiple columns, the output consists of the unique combinations of the specified columns, with or without aggregate functions. Here’s an example.
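
A query along these lines illustrates the idea; this is a minimal sketch that assumes the product_orders columns described below (order_id, product_name, amount):

-- Collapse product_orders down to one row per order_id,
-- summing the line-item amounts into a single total.
SELECT
    order_id,
    SUM(amount) AS total_amount
FROM product_orders
GROUP BY 1;  -- 1 refers to the first column in the SELECT list (order_id)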

The 1 in GROUP BY 1 represents the position of the grouping column under the SELECT. Columns listed under the SELECT are like an array with starting index position 1.

In this product_orders table, every order_id is attached to all product items (product_name) purchased by a customer.

  • First, the order_id is designated as the GROUP BY column.
  • Second, the amounts corresponding to the order_id are summed into one value (total_amount) using the SUM aggregate function.

Now, let’s take a deeper look into the mechanics of window functions.

The Window Function — 4 Major Components (The FOPO)

  • First, the function component determines the treatment of the data. For example, we can rank based on a column (e.g., DENSE_RANK) or create equal-sized bins (e.g., NTILE). Check out the available window functions in Postgres here. Aggregate functions (e.g., SUM, COUNT, AVG) can be used as window functions as well — listed here.
  • Second, the OVER clause is the glue that holds everything together. The functions from above are invoked or enabled with the use of the OVER clause to create window functions.
  • Third, similar to GROUP BY, the PARTITION BY is like a subgroup-by clause. Rather than collapsing the dataset, the PARTITION BY creates subgroups/blocks/partitions based on the specified column or columns.
  • Lastly, the ORDER BY clause orders the data, again, based on the specified column or columns.

Depending on how much transformation the data needs, not every component has to be used; only the OVER clause is always required. (Remember: The OVER clause invokes the window functions and enables aggregate functions to become window functions.) Here are the use cases.

  • func() OVER ()
  • func(<column, integer value>) OVER ()
  • func() OVER (PARTITION BY <column(s)>)
  • func(<column, integer value>) OVER (PARTITION BY <column(s)>)
  • func() OVER (PARTITION BY <column(s)> ORDER BY <column(s)>)
  • func(<column, integer value>) OVER (PARTITION BY <column(s)> ORDER BY <column(s)>)

The func() OVER ()

Using OVER () without a PARTITION BY and/or ORDER BY applies the function to the entire dataset. This approach pairs well with aggregate functions. The integrity of the table or dataset is maintained while a function is applied to calculate a value using all the rows.

In the orders table, every row is a unique order_id (primary key). I added a few more orders to this table for demonstration purposes.
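
A minimal sketch of this kind of query, assuming the orders table has an amount column, could look like this:

-- Each aggregate paired with OVER () is computed across every row in orders,
-- and the single result is repeated on every row of the output.
SELECT
    order_id,
    amount,
    SUM(amount) OVER ()                             AS total_amount,
    COUNT(*)    OVER ()                             AS total_count,
    AVG(amount) OVER ()                             AS total_mean,
    -- Example of a derived metric: each order's share of the overall total.
    ROUND(amount::numeric / SUM(amount) OVER (), 2) AS pct_of_total
FROM orders;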

In this example, three aggregate functions are paired with the OVER clause to create three window functions represented by total_amount, total_count, and total_mean. The total_amount, for example, sums all the values under the amount column, and the new total appears in every row. The same logic applies to total_count and total_mean. This setup makes it easy to calculate a percent of total or to create ratios against a total quantity or a statistic like the mean.

As a result, values from the window functions are useful in creating standardized metrics for easy comparison across the dataset. The first grey highlighted block shows the aggregate values in the provided example below, while the second block contains metrics describing the data.

The PARTITION BY

As I mentioned earlier, the PARTITION BY is similar to GROUP BY. The best way to think of PARTITION BY is subgrouping (a group within a group). Like the GROUP BY, start by defining the PARTITION BY column or columns. Once the subgroups are defined, the OVER clause will invoke or enable the function to execute its logic onto each subgroup. Unlike the GROUP BY, the PARTITION BY clause does not condense the table; instead, it maintains the table's integrity while adding the output as a column.

For instance, using the product_orders table, the PARTITION BY column is the order_id. Moreover, the number of distinct order_id values determines the number of subgroups. In this case, that number is three.
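
A sketch of the kind of query being described, assuming the product_orders columns order_id, product_name, and amount, might look like the following (each window function is explained below):

-- Every window function below operates within its own order_id partition;
-- the table keeps one row per product item, and each result is added as a column.
SELECT
    order_id,
    product_name,
    amount,
    NTILE(3)                  OVER (PARTITION BY order_id) AS ntile_bin,
    FIRST_VALUE(product_name) OVER (PARTITION BY order_id) AS first_product,
    COUNT(*)                  OVER (PARTITION BY order_id) AS items_in_order,
    SUM(amount)               OVER (PARTITION BY order_id) AS order_total,
    AVG(amount)               OVER (PARTITION BY order_id) AS order_mean
FROM product_orders;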

  • NTILE: The NTILE(3) sets the number of bins, and the PARTITION BY determines the subgroups (or partitions); each subgroup is then divided into three bins.
  • FIRST_VALUE: The function outputs the first value or row in the defined subgroup.
  • COUNT: Counts the number of rows in each defined subgroup.
  • SUM: Sums the values (e.g., amount) in each subgroup.
  • AVG: Calculates the mean of each defined subgroup.

In the output, the highlighted grey blocks below represent the defined subgroups by the PARTITION BY clause and the creation of new columns by the window functions.

The ORDER BY

The ORDER BY clause is used when ranking or ordering the data. It can be a standalone clause within the window function or paired with the PARTITION BY clause. At times, just using the ORDER BY clause will be more appropriate, which allows the ranking or ordering of the entire dataset. Here’s an example.
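
A minimal sketch of such a query, assuming the orders table carries num_product_items and amount columns, could look like this:

-- No PARTITION BY here: each ranking function orders the entire table.
SELECT
    order_id,
    num_product_items,
    amount,
    ROW_NUMBER() OVER (ORDER BY num_product_items) AS row_number_by_items,
    RANK()       OVER (ORDER BY num_product_items) AS rank_by_items,
    DENSE_RANK() OVER (ORDER BY num_product_items) AS dense_rank_by_items,
    RANK()       OVER (ORDER BY amount)            AS rank_by_amount
FROM orders;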

Using the ORDER BY clause within the ranking functions generates an ordered view based on the values found in the num_product_items and amount columns. The difference between the ranking functions is shown in the orange highlighted area below. Just a reminder that the ORDER BY clause defaults to ascending (ASC) order. For descending, it must be denoted with DESC (e.g., ORDER BY amount DESC).

  • ROW_NUMBER(): The function assigns a sequential number to every row, in ascending or descending order, regardless of duplicate values in the column.
  • RANK(): The function ranks sequentially, either ascending or descending, but there will be gaps when there are duplicate values. For example, under num_product_items, there are two occurrences of the number 1; thus, the function ranks both as 1 (under rank_by_items). Next, there are two occurrences of 2 under num_product_items, and the function ranks these as 3. The function skipped rank 2 because the second occurrence of 1 already captured the 2nd rank position. Likewise, because the two rows ranked 3 occupy the 3rd and 4th positions, the next rank starts at 5.
  • DENSE_RANK(): This function is like the RANK() function but does not skip or create gaps. Unlike ROW_NUMBER(), where every row is designated with a unique rank regardless of duplicate values, DENSE_RANK() will apply the same rank for duplicate values without the gaps.

The PARTITION BY & ORDER BY

After examining the mechanics of PARTITION BY and ORDER BY separately, pairing these two components is the next step. The PARTITION BY creates subgroups (or partitions) based on the chosen column or set of columns. The ORDER BY organizes the data in ascending or descending order. I’ll be illustrating the pairing of these components using the product_orders table.

SELECT * FROM product_orders; This table consists of three order_ids with a list of purchased product items.

In this example, I’m using nine window functions to demonstrate PARTITION BY and ORDER BY pairing. At this stage, if you have a solid understanding of each component, this should be straightforward. The description of each window function for Postgres is available here.

In this SQL query, all the window functions are using order_id in the PARTITION BY clause. The number of unique order_id’s equals the number of subgroups, and the ORDER BY column orders the data for each subgroup.
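
A sketch with nine representative window functions, all partitioned by order_id and ordered by amount (the specific functions and column aliases here are illustrative), could look like this:

-- Each function runs inside its own order_id partition, with rows ordered by amount.
-- FIRST_VALUE, LAST_VALUE, and NTH_VALUE use an explicit frame (see the note below).
SELECT
    order_id,
    product_name,
    amount,
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY amount) AS row_num,
    RANK()       OVER (PARTITION BY order_id ORDER BY amount) AS amount_rank,
    DENSE_RANK() OVER (PARTITION BY order_id ORDER BY amount) AS amount_dense_rank,
    NTILE(3)     OVER (PARTITION BY order_id ORDER BY amount) AS amount_ntile,
    LAG(amount)  OVER (PARTITION BY order_id ORDER BY amount) AS prev_amount,
    LEAD(amount) OVER (PARTITION BY order_id ORDER BY amount) AS next_amount,
    FIRST_VALUE(amount)  OVER (PARTITION BY order_id ORDER BY amount
        RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_amount,
    LAST_VALUE(amount)   OVER (PARTITION BY order_id ORDER BY amount
        RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_amount,
    NTH_VALUE(amount, 2) OVER (PARTITION BY order_id ORDER BY amount
        RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS second_amount
FROM product_orders;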

One item not discussed earlier is the use of the “RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING” clause in the FIRST_VALUE(), LAST_VALUE(), and NTH_VALUE() functions. This clause defines the window frame as the entire subgroup, from its first row to its last. It matters because, when an ORDER BY is present, the default frame only extends from the start of the partition to the current row, so LAST_VALUE() and NTH_VALUE() would otherwise not see the rows that come after the current one.

Summary

The key takeaway is that window functions are an amalgamation of existing SQL concepts bundled together to create a different way to slice and dice the data. There are clear benefits to using window functions: they let you quickly output transformed data instead of taking the time to create your own custom functions.

The way that I like to think of window functions is FOPO (Function Over PartitionBy OrderBy).

  • First, choose a function that is most applicable to your use case.
  • Second, remember that the OVER clause invokes the window function (e.g., DENSE_RANK(), LAG(), etc.) and enables the aggregate function to become a window function.
  • Third, the PARTITION BY clause defines the subgroups (or partitions) — a group within a group — in which the chosen function runs through each partition.
  • Lastly, the ORDER BY clause orders the data in each partition based on a column or set of columns.

In the Appendix, I shared the SQL statements used to create my tables and the data. The best way to learn is to practice, so feel free to dive deeper on your own. You already have the basics under your belt, and next is just putting together the pieces based on your use case. Thanks for reading and have fun querying!

Appendix

If you already have Postgres installed on your computer, please run these CREATE TABLE and INSERT statements, then run the SQL queries that I shared earlier. For downloads, go to this page.

CREATE TABLE statements for orders and product_orders with INSERT statements.
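
A minimal sketch of what such statements might look like, using only the columns referenced throughout the article (the types, constraints, and sample row below are assumptions, not the original definitions):

-- Hypothetical schema for the orders table: one row per order.
CREATE TABLE orders (
    order_id          INTEGER PRIMARY KEY,
    num_product_items INTEGER,
    amount            NUMERIC(10, 2)
);

-- Hypothetical schema for the product_orders table: one row per product item in an order.
CREATE TABLE product_orders (
    order_id     INTEGER,
    product_name TEXT,
    amount       NUMERIC(10, 2)
);

-- Add your own INSERT statements with sample rows, for example:
-- INSERT INTO product_orders (order_id, product_name, amount)
-- VALUES (1, 'notebook', 9.99);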
