The world’s leading publication for data science, AI, and ML professionals.

SQL Window Functions – Part 2

Different Types Of SQL Windows Functions.

Photo by: Franki | Unsplash.com
Photo by: Franki | Unsplash.com

Introduction:

A Window Function performs a Data Analysis calculation across a set of table rows that are somehow related to the current row. Address the comparable type of calculation can be done with an aggregate function that gives a single row or grouped by condition (refer to Figure 1).

Window function does not cause rows to become grouped into a single output row. Rows retain their separate identities also able to access more than just the current row of the query result. (refer to Figure 1).

Window Function Syntax:
Window_Function([All] expression) 
OVER( [PARTITION BY expression_list] [ORDER BY order_list Row_or_ Range clause] )
Figure 1 - Difference between - Aggregated and Windows function
Figure 1 – Difference between – Aggregated and Windows function

For more information please visit my last blog, I Introduced the SQL Window Function-Part I and gave an overview of different types of SQL Window Functions.

In this blog, I am explaining all three major types of SQL Window Function along with examples.

The database used to explain the below concepts: Postgres database.

Dataset: Available at Github: Retails.csv

Retails Table Screenshot.
Retails Table Screenshot.

Types of SQL Window Functions:

  1. Window Aggregated Function
  2. Window Ranking Aggregated Function
  3. Window Analytical Function

1. Window Aggregated Function

Consist one of the supporting aggregated function i.e. AVG(), COUNT(), MIN(), MAX(), SUM().

a. AVG():

  • Returns average value for the input expression values.
  • The function works with Numeric Values and ignores NULL values.

Example:

Write a query to calculate the average order price.

Query:
SELECT order_id, product_id, price, AVG(price) OVER (PARTITION BY order_id) AS Average_Order_Price 
FROM retails
Output - Query - AVG()
Output – Query – AVG()

Explanation:

  • To calculate average order price, window function used AVG() on price columns and partition by on order_id
  • Consider order id: 1112 – consist 3 products (i.e. 1,2,5). The average value of those 3 products are (866 + 163 +173) / 3 = 400.667

b. COUNT():

  • Calculates the number of rows with NULL values too if available in column or expression.
  • This window function is helpful while creating a new feature in the dataset. Like count number of entries belong to each customer.

Example:

Query 1: Calculate the number of products purchased by a customer in the order.

SELECT order_id, name, product_id, COUNT(*) OVER (Partition BY order_id) AS Number_of_Products 
FROM retails

Query 2: Calculate the number of product sales purchased (running total) by the customer.

SELECT order_id, name, product_id, COUNT(*) OVER (Order BY order_id) AS Number_of_Products 
FROM retails
Output - Query 1 & 2 - COUNT()
Output – Query 1 & 2 – COUNT()

Explanation:

Query 1:

  • Partition by order_id counts number of records belongs to particular order_id.
  • In the output, we can see several products for each order displayed.

Query 2:

  • Order by order_id counts a number of records and particular order_id and then add number records of consecutive order.
  • Output: We can see the count is increased by the number of records related to particular order_id.

c. Min() or Max():

  • Min() or Max() return Minimum or Maximum value of the expression across the input values respectively.
  • Both window function works with Numeric values and ignores NULL values.

Example:

Below query add a new feature into the result set, Minumum and Maximum price of the product purchased in respective order.

Query:
SELECT order_id, name, product_id, price, 
MIN(price) OVER (Partition BY order_id) AS Minimum_Price_Product,
MAX(price) OVER (Partition BY order_id) AS Maximum_Price_Product 
FROM retails
Output- Query- MIN() or MAX()
Output- Query- MIN() or MAX()

Explanation:

  • For each order_id record respective minimum and maximum price of the product has been added.
  • We can use each function separately too.

d. Sum():

  • Returns sum/total expression across all input value.
  • The function works with Numeric Values and ignores NULL values.

Example:

Below query return Total Price of each order_id.

Query:
SELECT order_id, name, product_id, price, 
SUM(price) OVER (PARTITION BY order_id) AS Average_Order_Price 
FROM retails
Output - Query - SUM()
Output – Query – SUM()

Explanation:

  • New column added with total_order_price for each order_id.
  • Helpful to analyse data, where we have a number of records, belongs to each order_id.

2. Window Ranking Aggregated Function:

Consist one of the supporting ranking function i.e. RANK(), DENSE_RANK(), ROW_NUMBER().

a. RANK():

  • The Rank of a value in a group of values based on the ORDER BY expression in the OVER clause (refer Query 1).
  • Each value is ranked within its PARTITION BY expression (refer Query 2).
  • Rows with equal values for the ranking criteria receive the same rank.
  • Tie or same rank skip the consecutive rank eg. Rank (): 1,1,3,4,5.

Example:

Query 1: Rank the product based on their prices.

SELECT order_id, name, product_id, price, 
RANK() OVER (ORDER BY price) AS Rank_Product_Price 
FROM retails

Query 2: Rank the product based on their prices in each order (i.e. partition by order_id).

SELECT order_id, name, product_id, price, 
RANK() OVER (PARTITION BY order_id ORDER BY price) AS Rank_Product_Price 
FROM retails
Output - Query 1 & 2 - RANK()
Output – Query 1 & 2 – RANK()

Explanation:

As we can see in both query, ORDER BY states the expression used to rank the values.

Query 1:

  • The ranking is done based on product_price.
  • Also note, 9 rows with the same value has tie rank 1.
  • So next Rank value starts with 10.

Query 2:

  • Rank has been done by ORDER BY expression i.e. price column.
  • Check order_id 114, we can see the rank for first 2 product prices are same. Hence rank assign to it is 1.
  • Next product price rank with 3 within that order_id.

b. DENSE_RANK():

  • Similarly to Rank() function, Rank of a value in a group of values based on the ORDER BY expression and the OVER clause and each value is ranked within its PARTITION BY expression.
  • The difference is, Rows with equal values receive the same rank and Tie or the same rank not skip the consecutive rank.
  • Example: Dense_Rank(): 1,1,2,3,4

EXAMPLE:

Query: Dense_Rank the product based on their prices in each order (i.e. partition by order_id).

SELECT order_id, name, product_id, price, 
DENSE_RANK() OVER (PARTITION BY order_id ORDER BY price) AS Dense_Rank_Product_Price 
FROM retails
Output - Query - Dense_Rank()
Output – Query – Dense_Rank()

Explanation:

  • As we can see ranking to each row done based on ORDER BY expression i.e. price values also within each order_id i.e. (PARTITION BY order_id).
  • order_id 1114, have 5 products out of which 2 products having the same price hence rank tie i.e. 1.
  • The next rank starts with 2 (this is the major difference between Rank() and Desne_Rank() function).
  • Dense_Rank() not skip the consecutive rank number.

c. CUME_DIST():

  • Calculates ** Relative Rank of the current row within a window partition based on below Formula**:

EXAMPLE:

Query: CUME_DIST i.e. Relative rank the product based on their prices in each order (i.e. partition by order_id).

SELECT order_id, name, product_id, price, 
CUME_DIST() OVER (PARTITION BY order_id ORDER BY price) AS Dense_Rank_Product_Price 
FROM retails
Output - Query - CUME_DIST() i.e. Relative Rank
Output – Query – CUME_DIST() i.e. Relative Rank

Explanation:

Let’s consider order_id 1112 having 3 products, relative rank calculated as discussed below formula used:

  • Row no. 3— First product: 1/3 = 0.3333
  • Row no. 4— Second product: 2/3 = 0.666
  • Row no. 5— Third product: 3/3 = 1

Similarly, if product having same value or price then relative rank also same check order_id 1114 in output screenshot.

d. ROW_NUMBER():

  • An ordinal number of the current row within its partition based on ORDER BY expression in the OVER clause.
  • Each value is ordered within its PARTITION BY expression.
  • Rows with equal values for the ORDER BY expressions receive different row numbers non deterministically.

EXAMPLE:

Query: Assign Row_Number to the product based on their prices in each order (i.e. partition by order_id).

SELECT order_id, name, product_id, price, 
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY price) AS Row_Number_Product_Price 
FROM retails
Output - Query - Row_Number()
Output – Query – Row_Number()

Explanation:

  • As we can see in output screenshot, row number assign based on price (ORDER BY expression) within each order (PARTITION BY order_id).
  • Not consider value same or not, just assign row_number to each row in the expression.

e. NTILE():

  • Divides the rows for each window partition, as equally as possible, into a specified number of ranked groups.
  • Requires ORDER BY clause in the OVER clause.
  • The column or expression specified in ORDER BY clause, first all values has been sorted in ascending order and then equally assign group number.

Example:

Query: Assign **** Group/cluster/bucket number to all row into 10 different groups based on the product price.

SELECT order_id, name, product_id, price, 
NTILE(10) OVER (ORDER BY price) AS NTile_Product_Price 
FROM retails
Output - Query - NTILE()
Output – Query – NTILE()

Explanation:

  • In this dataset, we have a total of 50 records.
  • Hence, each clustered consist of 5 rows as shown in the output screenshot.
  • First, all rows have been sorted with the price and then assign a group number to each row.

f. PERCENT_RANK()

  • Percentage rank of the current row using the following formula:

Example:

Query: Calculate or assign percentage rank to all row based on the product price.

SELECT order_id, name, product_id, price, 
PERCENT_RANK() OVER (PARTITION BY order_id ORDER BY price) AS Row_Number_Product_Price 
FROM retails
Output - Query - Percent_Rank()
Output – Query – Percent_Rank()

Explanation:

Let’s consider order_id 1114 having 5 products, relative rank calculated as discussed below formula used:

  • Row no. 9— First product: (1–1)/(5–1) = 0
  • Row no. 10— Second product: (1–1)/(5–1) = 0
  • Row no. 11— Third product: (3–1)/(5–1) = 0.5
  • Row no. 12 – Forth product: (4–1)/(5–1) = 0.75
  • Row no. 13 – Fifth product: (5–1)/(5–1) = 1

3. Window Analytic Functions:

Consist one of the supporting ranking function i.e. LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE().

a. LAG() or LEAD():

Syntax:

LAG | LEAD (expression)
    OVER ([ PARTITION BY expression_list] [ORDER BY order_list] )
  • LAG or LEAD returns, value for the row value before or after the current row in a partition respectively.
  • If no row exists, null is returned.

Example:

Query: Add new feature 1 step LAG or LEAD product price within each order (i.e. PARTITION BY order_id)

SELECT order_id, name, product_id, price, 
LAG(price,1) OVER (PARTITION BY order_id ORDER BY price) AS LAG_Product_Price,
LEAD(price,1) OVER (PARTITION BY order_id ORDER BY price) AS LEAD_Product_Price
FROM retails
Output - Query - LAG() or LEAD()
Output – Query – LAG() or LEAD()

Explanation:

  • As we can see both lag and lead column give 1 step value respectively within the partition of order_id.
  • When its first row in LAG() gets imputed with NULL.
  • Similarly, in LEAD() the last row gets imputed with NULL.

b. FIRST_VALUE() or LAST_VALUE():

Syntax:

FIRST_VALUE | LAST_VALUE ( expression ) 
      OVER ( [PARTITION BY expression_list ] [ ORDER BY order_list ][ row_or_range_clause ] )
  • FIRST_VALUE or LAST_VALUE return First-row or Last-row value of the specified expression with respect to the window frame (partition) respectively.

Example:

Query: Add new feature FIRST_VALUE or LAST_VALUE product price within each order (i.e. PARTITION BY order_id).

SELECT order_id, name, product_id, price, 
FIRST_VALUE(price) OVER (PARTITION BY order_id) AS FIRST_VALUE_Product_Price,
LAST_VALUE(price) OVER (PARTITION BY order_id) AS LAST_VALUE_Product_Price
FROM retails
Output - Query - First_Value() or Last_Value()
Output – Query – First_Value() or Last_Value()

Explanation:

  • As we can see both First_Value() and Last_Value() column give first-row value and last-row value respectively.

c. NTH_VALUE():

Syntax:

NTH_VALUE (expression, nth_value ) 
      OVER ( [PARTITION BY expression_list ] [ ORDER BY order_list ][ row_or_range_clause ] )
  • NTH_VALUE return nth value assigned with expression with respect to the window frame (partition) respectively.
  • If nth_value is not available then its get imputed with NULL.

Example:

Query: Add new feature NTH_VALUE product price within each order (i.e. PARTITION BY order_id).

SELECT order_id, name, product_id, price, 
NTH_VALUE(price,3) OVER (PARTITION BY order_id) AS NTH_VALUE_Product_Price
FROM retails
Output - Query - NTH_VALUE()
Output – Query – NTH_VALUE()

Explanation:

  • As in query we assigned nth_value as 3, it returns the 3rd-row value is partitioned by order_id.
  • For order_id 1111 there are only 2 products, hence column gets imputed with the NULL value.

Conclusion:

In this blog, I tried to explain the major types of Window Function in SQL.

Window Functions are very useful while doing Data Analysis using SQL and easy to use.

Window Functions revolves around OVER, PARTITION BY and ROW or RANGE clauses.

I hope you like this blog, feel free to join me on LinkedIn. I love to know your thoughts on this article and feedback.

Thank you for reading.

Resources:

Dataset and Sql scripts were available at GITHUB.

References:

  1. postgresql.org (2019). PostgreSQL: Documentation: 9.1: Window Functions. [online] Postgresql.org. Available at: https://www.postgresql.org/docs/9.1/tutorial-window.html.
  2. drill.apache.org (n.d.). SQL Window Functions Introduction – Apache Drill. [online] drill.apache.org. Available at: https://drill.apache.org/docs/sql-window-functions-introduction/.

Related Articles