Intro to Window Functions

Note: For a better view of the exercise and solution panel, please close the left sidebar.

Introduction

Window functions in SQL are a type of function that allow you to perform calculations across a set of rows that are related to the current row in a query result set. These functions are commonly used for tasks such as calculating running totals, finding ranking or percentiles, and computing moving averages.

A window function typically involves the use of the OVER() clause, which defines the window or subset of rows over which the function is calculated. The window can be defined using various criteria, such as the current row and a specified number of rows before and/or after it, or a range of rows based on a specific ordering.

Window functions are available in many popular SQL database systems, such as PostgreSQL, MySQL, and SQL Server. They are a powerful tool for performing complex calculations and data analysis tasks within SQL queries.

Why Use Window Functions?

Window functions are particularly useful when you need to:

Calculate running totals or cumulative sums
Find rankings and percentiles
Compute moving averages
Compare values between consecutive rows
Analyze data distribution patterns

Unlike aggregate functions with GROUP BY, window functions allow you to compute values that are related to the current row without collapsing the result set into groups.

Basic Example

For example, this query calculates the total revenue generated by each seller:

SELECT 
    seller_id,
    price * freight_value AS order_revenue,
    SUM(price * freight_value) OVER (PARTITION BY seller_id) AS total_revenue
FROM order_items;

In this query, we're using the SUM() window function to calculate the total revenue generated by each seller, partitioned by seller_id.

The PARTITION BY clause is used in conjunction with window functions to divide a result set into partitions or groups based on one or more columns. The window function is then applied to each partition separately, and the results are combined into a single result set.

Types of Window Functions

There are several types of window functions in SQL. Here are the most commonly used ones:

Ranking Functions

These functions assign a rank to each row within a result set, based on a specified ordering. Examples include:

ROW_NUMBER(): Assigns a unique sequential number to each row
RANK(): Assigns a rank to each row, with ties receiving the same rank and gaps after ties
DENSE_RANK(): Assigns a rank to each row, with ties receiving the same rank but no gaps

Aggregate Functions

These functions calculate summary statistics over a specified window of rows, such as:

SUM(): Calculate the sum of values in the window
AVG(): Calculate the average of values in the window
MIN(): Find the minimum value in the window
MAX(): Find the maximum value in the window
COUNT(): Count the number of rows in the window

They are similar to their non-windowed counterparts, but they operate over a subset of rows rather than the entire result set.

Navigation Functions

These functions allow you to access data from a previous or subsequent row within a window:

LAG(): Access data from a previous row
LEAD(): Access data from a subsequent row

Percentile Functions

These functions calculate the percentile rank of each row within a result set:

PERCENT_RANK(): Calculate the relative rank of each row (0 to 1)
CUME_DIST(): Calculate the cumulative distribution of each row

Analytic Functions

These functions allow you to perform complex calculations over a specified window of rows:

NTILE(): Divide rows into buckets
FIRST_VALUE(): Get the first value in the window
LAST_VALUE(): Get the last value in the window

Syntax

The syntax of window functions in SQL generally follows this basic structure:

<function>() OVER (
    [PARTITION BY <partition column(s)>]
    [ORDER BY <order column(s)> [ASC|DESC]]
    [ROWS|RANGE <window specification>]
)

Here is a breakdown of the different parts of the syntax:

<function>(): This is the window function that you want to use, such as SUM(), AVG(), RANK(), or LAG().

OVER: This keyword specifies that you are using a window function.

PARTITION BY: This optional clause allows you to partition the result set into subsets based on one or more columns. The window function will then be applied to each subset separately.

ORDER BY: This optional clause specifies the order in which the rows are evaluated within each partition. You can order by one or more columns, and specify whether the order is ascending or descending.

ROWS/RANGE: This optional clause specifies the type of window to use. ROWS specifies a fixed number of rows before and/or after the current row, while RANGE specifies a range of values. Note that not all window functions support both ROWS and RANGE.

<window specification>: This specifies the size and shape of the window, based on the ROWS or RANGE clause. For example, you might use:

ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING

to specify a window of three rows (the current row plus one row before and after it).

ROWS vs RANGE

Both ROWS and RANGE are used to define the window frame, but they work differently:

ROWS: Physical Offset

ROWS defines the window based on physical row offsets from the current row. It considers a fixed number of rows before and/or after the current row.

Suppose we have the following dataset:

sales_date	sales_amount
2021-01-01	100
2021-01-02	200
2021-01-03	150
2021-01-04	300
2021-01-05	250
2021-01-06	175
2021-01-07	400

If we want to calculate the rolling average sales amount over a window of three days for each sales transaction, we can use the ROWS clause:

SELECT 
    sales_date, 
    sales_amount, 
    AVG(sales_amount) OVER (
        ORDER BY sales_date 
        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ) AS rolling_avg
FROM sales_transactions;

This will produce the following result:

sales_date	sales_amount	rolling_avg
2021-01-01	100	150
2021-01-02	200	150
2021-01-03	150	216.67
2021-01-04	300	233.33
2021-01-05	250	241.67
2021-01-06	175	275
2021-01-07	400	287.5

Notice that the rolling average for each row is computed based on the sales amounts of the current row and the two adjacent rows, as defined by the ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING clause.

RANGE: Logical Offset

RANGE defines the window based on logical value ranges from the current row's value. It considers all rows whose values fall within a specified range.

If we want to calculate the rolling average sales amount over a window of transactions whose sales amounts are within 50 units of the current transaction's sales amount, we can use the RANGE clause:

SELECT 
    sales_date, 
    sales_amount, 
    AVG(sales_amount) OVER (
        ORDER BY sales_amount 
        RANGE BETWEEN 50 PRECEDING AND 50 FOLLOWING
    ) AS rolling_avg
FROM sales_transactions;

This will produce the following result:

sales_date	sales_amount	rolling_avg
2021-01-01	100	125
2021-01-03	150	156.25
2021-01-06	175	175
2021-01-02	200	193.75
2021-01-05	250	250
2021-01-04	300	275
2021-01-07	400	400

Notice that the rolling average for each row is computed based on the sales amounts of all transactions whose sales amounts are within 50 units of the current transaction's sales amount, as defined by the RANGE BETWEEN 50 PRECEDING AND 50 FOLLOWING clause. The resulting rolling averages are smoother than the ones obtained using the ROWS clause, because they take into account all transactions with similar sales amounts, rather than just the adjacent ones.

Common Functions

ROW_NUMBER()

The ROW_NUMBER() function always produces a unique sequential number for each row, and it does not skip any numbers or repeat any numbers.

Suppose we have a table called scores with the following data:

id	student_name	score
1	Alice	90
2	Bob	85
3	Charlie	90
4	David	80

SELECT 
    id, 
    student_name, 
    score, 
    ROW_NUMBER() OVER (ORDER BY score DESC) AS row_number
FROM scores;

This query produces the following result:

id	student_name	score	row_number
1	Alice	90	1
3	Charlie	90	2
2	Bob	85	3
4	David	80	4

In this result set, each row has a unique rank, assigned by the ROW_NUMBER() function, based on the score column in descending order. Note that Alice and Charlie still have the same score, but they have different row numbers.

RANK() vs DENSE_RANK()

RANK() and DENSE_RANK() are both ranking functions in SQL that assign a rank to each row based on the values in one or more columns. The main difference between the two is how they handle ties.

RANK() function assigns the same rank to rows with the same value, and then skips the next rank. For example, if two rows have the same value and are assigned a rank of 1, the next row will be assigned a rank of 3.

DENSE_RANK() function assigns the same rank to rows with the same value, but does not skip any ranks. For example, if two rows have the same value and are assigned a dense rank of 1, the next row will be assigned a dense rank of 2.

Here's an example to illustrate the difference between the two functions. Using the same scores table:

SELECT 
    id, 
    student_name, 
    score, 
    RANK() OVER (ORDER BY score DESC) AS rank,
    DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank
FROM scores;

RANK() result:

id	student_name	score	rank
1	Alice	90	1
3	Charlie	90	1
2	Bob	85	3
4	David	80	4

DENSE_RANK() result:

id	student_name	score	dense_rank
1	Alice	90	1
3	Charlie	90	1
2	Bob	85	2
4	David	80	3

Notice that with RANK(), Alice and Charlie are assigned a rank of 1, and Bob is assigned a rank of 3 because there is no rank 2. With DENSE_RANK(), Alice and Charlie are still assigned a dense rank of 1, but Bob is now assigned a dense rank of 2, since there is no rank 2 in the result set. Therefore, DENSE_RANK() assigns a contiguous ranking to each row, without any gaps, even if there are ties.

LAG() vs LEAD()

LAG() and LEAD() are navigation functions that allow you to access data from a previous or subsequent row within a window.

LAG() function retrieves the value from a previous row in the result set.

LEAD() function retrieves the value from a subsequent row in the result set.

Suppose we have the following dataset:

sales_date	sales_amount
2021-01-01	100
2021-01-02	200
2021-01-03	150
2021-01-04	300
2021-01-05	250
2021-01-06	175
2021-01-07	400

If we want to calculate the percentage change in sales amount from one transaction to the next, we can use the LAG() function:

SELECT 
    sales_date, 
    sales_amount, 
    LAG(sales_amount) OVER (ORDER BY sales_date) AS prev_sales_amount, 
    (sales_amount - LAG(sales_amount) OVER (ORDER BY sales_date)) * 100.0 
        / LAG(sales_amount) OVER (ORDER BY sales_date) AS pct_change
FROM sales_transactions;

This will produce the following result:

sales_date	sales_amount	prev_sales_amount	pct_change
2021-01-01	100	NULL	NULL
2021-01-02	200	100	100
2021-01-03	150	200	-25
2021-01-04	300	150	100
2021-01-05	250	300	-16.67
2021-01-06	175	250	-30
2021-01-07	400	175	128.57

Notice that the LAG() function is used to retrieve the sales amount from the previous transaction, based on the ORDER BY clause. The resulting column prev_sales_amount is NULL for the first row, since there is no previous transaction to retrieve.

If we want to calculate the difference in sales amount between the current transaction and the transaction two days ahead, we can use the LEAD() function:

SELECT 
    sales_date, 
    sales_amount, 
    LEAD(sales_amount, 2) OVER (ORDER BY sales_date) AS next_sales_amount, 
    LEAD(sales_amount, 2) OVER (ORDER BY sales_date) - sales_amount AS sales_diff
FROM sales_transactions;

This will produce the following result:

sales_date	sales_amount	next_sales_amount	sales_diff
2021-01-01	100	150	50
2021-01-02	200	300	100
2021-01-03	150	250	100
2021-01-04	300	175	-125
2021-01-05	250	400	150
2021-01-06	175	NULL	NULL
2021-01-07	400	NULL	NULL

Notice that the LEAD() function is used to retrieve the sales amount from the transaction two days ahead, based on the ORDER BY clause and the offset value of 2. The resulting column next_sales_amount is NULL for the last two rows, since there are no transactions two days ahead to retrieve.

PERCENT_RANK() vs CUME_DIST()

Both PERCENT_RANK() and CUME_DIST() are percentile functions, but they calculate different measures:

PERCENT_RANK() calculates the relative rank of each row within the result set, returning a value between 0 and 1. It shows what percentage of rows have a value less than the current row's value.

CUME_DIST() calculates the cumulative distribution of each row within the result set, returning a value between 0 and 1. It shows what percentage of rows have a value less than or equal to the current row's value.

Using the same sales_transactions table:

SELECT   
    sales_date,   
    sales_amount,   
    PERCENT_RANK() OVER (ORDER BY sales_amount) AS percent_rank,   
    CUME_DIST() OVER (ORDER BY sales_amount) AS cume_dist 
FROM sales_transactions;

This will produce the following result:

sales_date	sales_amount	percent_rank	cume_dist
2021-01-01	100	0.00	0.14
2021-01-03	150	0.17	0.29
2021-01-06	175	0.33	0.43
2021-01-02	200	0.50	0.57
2021-01-05	250	0.67	0.71
2021-01-04	300	0.83	0.86
2021-01-07	400	1.00	1.00

The PERCENT_RANK() function calculates the relative rank of each row within the result set based on the sales_amount column. In this case, the sales amounts range from 100 to 400, and the PERCENT_RANK() function assigns a value between 0 and 1 to each row, indicating the percentage of rows with a sales amount less than the current row's sales amount. For example, on January 1st, the sales amount is 100, which is the lowest sales amount in the dataset, so its percent_rank is 0.00. On January 7th, the sales amount is 400, which is the highest sales amount in the dataset, so its percent_rank is 1.00.

The CUME_DIST() function calculates the cumulative distribution of each row within the result set based on the sales_amount column. It assigns a value between 0 and 1 to each row, indicating the cumulative percentage of rows with a sales amount less than or equal to the current row's sales amount. For example, on January 1st, the sales amount is 100, which is the lowest sales amount in the dataset, so its cume_dist is 0.14, indicating that 14% of the rows (1 out of 7) have a sales amount less than or equal to 100. On January 7th, the sales amount is 400, which is the highest sales amount in the dataset, so its cume_dist is 1.00, indicating that 100% of the rows have a sales amount less than or equal to 400.

Summary

Window functions are a powerful feature in SQL that enable you to perform calculations across related rows without collapsing the result set. They are essential for:

Ranking and ordering data within partitions
Calculating running totals and moving averages
Comparing values between consecutive rows
Analyzing data distribution patterns

Understanding how to use window functions with PARTITION BY, ORDER BY, and window frame specifications (ROWS vs RANGE) will significantly enhance your SQL analytical capabilities and allow you to solve complex data analysis problems efficiently.

Practice

Which products had the highest and lowest number of orders?

For each product, what is the price difference between the current order and the previous order?

What is the most expensive and cheapest product sold in each product category in the Olist dataset?

Please wait while we prepare your content

Intro to Window Functions

Introduction

Why Use Window Functions?

Basic Example

Types of Window Functions

Ranking Functions

Aggregate Functions

Navigation Functions

Percentile Functions

Analytic Functions

Syntax

ROWS vs RANGE

ROWS: Physical Offset

RANGE: Logical Offset

Common Functions

ROW_NUMBER()

RANK() vs DENSE_RANK()

LAG() vs LEAD()

PERCENT_RANK() vs CUME_DIST()

Summary

Practice

Which products had the highest and lowest number of orders?

Sample solution

For each product, what is the price difference between the current order and the previous order?

Sample solution

What is the most expensive and cheapest product sold in each product category in the Olist dataset?

Sample solution

Intro to Window Functions

Introduction

Why Use Window Functions?

Basic Example

Types of Window Functions

Ranking Functions

Aggregate Functions

Navigation Functions

Percentile Functions

Analytic Functions

Syntax

ROWS vs RANGE

ROWS: Physical Offset

RANGE: Logical Offset

Common Functions

ROW_NUMBER()

RANK() vs DENSE_RANK()

LAG() vs LEAD()

PERCENT_RANK() vs CUME_DIST()

Summary

Practice

Which products had the highest and lowest number of orders?

Sample solution

For each product, what is the price difference between the current order and the previous order?

Sample solution

What is the most expensive and cheapest product sold in each product category in the Olist dataset?

Sample solution