Note: For a better view of the exercise and solution panel, please close the left sidebar.
Window functions in SQL are a type of function that allow you to perform calculations across a set of rows that are related to the current row in a query result set. These functions are commonly used for tasks such as calculating running totals, finding ranking or percentiles, and computing moving averages.
A window function typically involves the use of the OVER() clause, which defines the window or subset of rows over which the function is calculated. The window can be defined using various criteria, such as the current row and a specified number of rows before and/or after it, or a range of rows based on a specific ordering.
Window functions are available in many popular SQL database systems, such as PostgreSQL, MySQL, and SQL Server. They are a powerful tool for performing complex calculations and data analysis tasks within SQL queries.
Window functions are particularly useful when you need to:
Unlike aggregate functions with GROUP BY, window functions allow you to compute values that are related to the current row without collapsing the result set into groups.
For example, this query calculates the total revenue generated by each seller:
SELECT
seller_id,
price * freight_value AS order_revenue,
SUM(price * freight_value) OVER (PARTITION BY seller_id) AS total_revenue
FROM order_items;In this query, we're using the SUM() window function to calculate the total revenue generated by each seller, partitioned by seller_id.
The PARTITION BY clause is used in conjunction with window functions to divide a result set into partitions or groups based on one or more columns. The window function is then applied to each partition separately, and the results are combined into a single result set.
There are several types of window functions in SQL. Here are the most commonly used ones:
These functions assign a rank to each row within a result set, based on a specified ordering. Examples include:
ROW_NUMBER(): Assigns a unique sequential number to each rowRANK(): Assigns a rank to each row, with ties receiving the same rank and gaps after tiesDENSE_RANK(): Assigns a rank to each row, with ties receiving the same rank but no gapsThese functions calculate summary statistics over a specified window of rows, such as:
SUM(): Calculate the sum of values in the windowAVG(): Calculate the average of values in the windowMIN(): Find the minimum value in the windowMAX(): Find the maximum value in the windowCOUNT(): Count the number of rows in the windowThey are similar to their non-windowed counterparts, but they operate over a subset of rows rather than the entire result set.
These functions allow you to access data from a previous or subsequent row within a window:
LAG(): Access data from a previous rowLEAD(): Access data from a subsequent rowThese functions calculate the percentile rank of each row within a result set:
PERCENT_RANK(): Calculate the relative rank of each row (0 to 1)CUME_DIST(): Calculate the cumulative distribution of each rowThese functions allow you to perform complex calculations over a specified window of rows:
NTILE(): Divide rows into bucketsFIRST_VALUE(): Get the first value in the windowLAST_VALUE(): Get the last value in the windowThe syntax of window functions in SQL generally follows this basic structure:
<function>() OVER (
[PARTITION BY <partition column(s)>]
[ORDER BY <order column(s)> [ASC|DESC]]
[ROWS|RANGE <window specification>]
)Here is a breakdown of the different parts of the syntax:
<function>(): This is the window function that you want to use, such as SUM(), AVG(), RANK(), or LAG().
OVER: This keyword specifies that you are using a window function.
PARTITION BY: This optional clause allows you to partition the result set into subsets based on one or more columns. The window function will then be applied to each subset separately.
ORDER BY: This optional clause specifies the order in which the rows are evaluated within each partition. You can order by one or more columns, and specify whether the order is ascending or descending.
ROWS/RANGE: This optional clause specifies the type of window to use. ROWS specifies a fixed number of rows before and/or after the current row, while RANGE specifies a range of values. Note that not all window functions support both ROWS and RANGE.
<window specification>: This specifies the size and shape of the window, based on the ROWS or RANGE clause. For example, you might use:
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWINGto specify a window of three rows (the current row plus one row before and after it).
Both ROWS and RANGE are used to define the window frame, but they work differently:
ROWS defines the window based on physical row offsets from the current row. It considers a fixed number of rows before and/or after the current row.
Suppose we have the following dataset:
| sales_date | sales_amount |
|---|---|
| 2021-01-01 | 100 |
| 2021-01-02 | 200 |
| 2021-01-03 | 150 |
| 2021-01-04 | 300 |
| 2021-01-05 | 250 |
| 2021-01-06 | 175 |
| 2021-01-07 | 400 |
If we want to calculate the rolling average sales amount over a window of three days for each sales transaction, we can use the ROWS clause:
SELECT
sales_date,
sales_amount,
AVG(sales_amount) OVER (
ORDER BY sales_date
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) AS rolling_avg
FROM sales_transactions;This will produce the following result:
| sales_date | sales_amount | rolling_avg |
|---|---|---|
| 2021-01-01 | 100 | 150 |
| 2021-01-02 | 200 | 150 |
| 2021-01-03 | 150 | 216.67 |
| 2021-01-04 | 300 | 233.33 |
| 2021-01-05 | 250 | 241.67 |
| 2021-01-06 | 175 | 275 |
| 2021-01-07 | 400 | 287.5 |
Notice that the rolling average for each row is computed based on the sales amounts of the current row and the two adjacent rows, as defined by the ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING clause.
RANGE defines the window based on logical value ranges from the current row's value. It considers all rows whose values fall within a specified range.
If we want to calculate the rolling average sales amount over a window of transactions whose sales amounts are within 50 units of the current transaction's sales amount, we can use the RANGE clause:
SELECT
sales_date,
sales_amount,
AVG(sales_amount) OVER (
ORDER BY sales_amount
RANGE BETWEEN 50 PRECEDING AND 50 FOLLOWING
) AS rolling_avg
FROM sales_transactions;This will produce the following result:
| sales_date | sales_amount | rolling_avg |
|---|---|---|
| 2021-01-01 | 100 | 125 |
| 2021-01-03 | 150 | 156.25 |
| 2021-01-06 | 175 | 175 |
| 2021-01-02 | 200 | 193.75 |
| 2021-01-05 | 250 | 250 |
| 2021-01-04 | 300 | 275 |
| 2021-01-07 | 400 | 400 |
Notice that the rolling average for each row is computed based on the sales amounts of all transactions whose sales amounts are within 50 units of the current transaction's sales amount, as defined by the RANGE BETWEEN 50 PRECEDING AND 50 FOLLOWING clause. The resulting rolling averages are smoother than the ones obtained using the ROWS clause, because they take into account all transactions with similar sales amounts, rather than just the adjacent ones.
The ROW_NUMBER() function always produces a unique sequential number for each row, and it does not skip any numbers or repeat any numbers.
Suppose we have a table called scores with the following data:
| id | student_name | score |
|---|---|---|
| 1 | Alice | 90 |
| 2 | Bob | 85 |
| 3 | Charlie | 90 |
| 4 | David | 80 |
SELECT
id,
student_name,
score,
ROW_NUMBER() OVER (ORDER BY score DESC) AS row_number
FROM scores;This query produces the following result:
| id | student_name | score | row_number |
|---|---|---|---|
| 1 | Alice | 90 | 1 |
| 3 | Charlie | 90 | 2 |
| 2 | Bob | 85 | 3 |
| 4 | David | 80 | 4 |
In this result set, each row has a unique rank, assigned by the ROW_NUMBER() function, based on the score column in descending order. Note that Alice and Charlie still have the same score, but they have different row numbers.
RANK() and DENSE_RANK() are both ranking functions in SQL that assign a rank to each row based on the values in one or more columns. The main difference between the two is how they handle ties.
RANK() function assigns the same rank to rows with the same value, and then skips the next rank. For example, if two rows have the same value and are assigned a rank of 1, the next row will be assigned a rank of 3.
DENSE_RANK() function assigns the same rank to rows with the same value, but does not skip any ranks. For example, if two rows have the same value and are assigned a dense rank of 1, the next row will be assigned a dense rank of 2.
Here's an example to illustrate the difference between the two functions. Using the same scores table:
SELECT
id,
student_name,
score,
RANK() OVER (ORDER BY score DESC) AS rank,
DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank
FROM scores;RANK() result:
| id | student_name | score | rank |
|---|---|---|---|
| 1 | Alice | 90 | 1 |
| 3 | Charlie | 90 | 1 |
| 2 | Bob | 85 | 3 |
| 4 | David | 80 | 4 |
DENSE_RANK() result:
| id | student_name | score | dense_rank |
|---|---|---|---|
| 1 | Alice | 90 | 1 |
| 3 | Charlie | 90 | 1 |
| 2 | Bob | 85 | 2 |
| 4 | David | 80 | 3 |
Notice that with RANK(), Alice and Charlie are assigned a rank of 1, and Bob is assigned a rank of 3 because there is no rank 2. With DENSE_RANK(), Alice and Charlie are still assigned a dense rank of 1, but Bob is now assigned a dense rank of 2, since there is no rank 2 in the result set. Therefore, DENSE_RANK() assigns a contiguous ranking to each row, without any gaps, even if there are ties.
LAG() and LEAD() are navigation functions that allow you to access data from a previous or subsequent row within a window.
LAG() function retrieves the value from a previous row in the result set.
LEAD() function retrieves the value from a subsequent row in the result set.
Suppose we have the following dataset:
| sales_date | sales_amount |
|---|---|
| 2021-01-01 | 100 |
| 2021-01-02 | 200 |
| 2021-01-03 | 150 |
| 2021-01-04 | 300 |
| 2021-01-05 | 250 |
| 2021-01-06 | 175 |
| 2021-01-07 | 400 |
If we want to calculate the percentage change in sales amount from one transaction to the next, we can use the LAG() function:
SELECT
sales_date,
sales_amount,
LAG(sales_amount) OVER (ORDER BY sales_date) AS prev_sales_amount,
(sales_amount - LAG(sales_amount) OVER (ORDER BY sales_date)) * 100.0
/ LAG(sales_amount) OVER (ORDER BY sales_date) AS pct_change
FROM sales_transactions;This will produce the following result:
| sales_date | sales_amount | prev_sales_amount | pct_change |
|---|---|---|---|
| 2021-01-01 | 100 | NULL | NULL |
| 2021-01-02 | 200 | 100 | 100 |
| 2021-01-03 | 150 | 200 | -25 |
| 2021-01-04 | 300 | 150 | 100 |
| 2021-01-05 | 250 | 300 | -16.67 |
| 2021-01-06 | 175 | 250 | -30 |
| 2021-01-07 | 400 | 175 | 128.57 |
Notice that the LAG() function is used to retrieve the sales amount from the previous transaction, based on the ORDER BY clause. The resulting column prev_sales_amount is NULL for the first row, since there is no previous transaction to retrieve.
If we want to calculate the difference in sales amount between the current transaction and the transaction two days ahead, we can use the LEAD() function:
SELECT
sales_date,
sales_amount,
LEAD(sales_amount, 2) OVER (ORDER BY sales_date) AS next_sales_amount,
LEAD(sales_amount, 2) OVER (ORDER BY sales_date) - sales_amount AS sales_diff
FROM sales_transactions;This will produce the following result:
| sales_date | sales_amount | next_sales_amount | sales_diff |
|---|---|---|---|
| 2021-01-01 | 100 | 150 | 50 |
| 2021-01-02 | 200 | 300 | 100 |
| 2021-01-03 | 150 | 250 | 100 |
| 2021-01-04 | 300 | 175 | -125 |
| 2021-01-05 | 250 | 400 | 150 |
| 2021-01-06 | 175 | NULL | NULL |
| 2021-01-07 | 400 | NULL | NULL |
Notice that the LEAD() function is used to retrieve the sales amount from the transaction two days ahead, based on the ORDER BY clause and the offset value of 2. The resulting column next_sales_amount is NULL for the last two rows, since there are no transactions two days ahead to retrieve.
Both PERCENT_RANK() and CUME_DIST() are percentile functions, but they calculate different measures:
PERCENT_RANK() calculates the relative rank of each row within the result set, returning a value between 0 and 1. It shows what percentage of rows have a value less than the current row's value.
CUME_DIST() calculates the cumulative distribution of each row within the result set, returning a value between 0 and 1. It shows what percentage of rows have a value less than or equal to the current row's value.
Using the same sales_transactions table:
SELECT
sales_date,
sales_amount,
PERCENT_RANK() OVER (ORDER BY sales_amount) AS percent_rank,
CUME_DIST() OVER (ORDER BY sales_amount) AS cume_dist
FROM sales_transactions;This will produce the following result:
| sales_date | sales_amount | percent_rank | cume_dist |
|---|---|---|---|
| 2021-01-01 | 100 | 0.00 | 0.14 |
| 2021-01-03 | 150 | 0.17 | 0.29 |
| 2021-01-06 | 175 | 0.33 | 0.43 |
| 2021-01-02 | 200 | 0.50 | 0.57 |
| 2021-01-05 | 250 | 0.67 | 0.71 |
| 2021-01-04 | 300 | 0.83 | 0.86 |
| 2021-01-07 | 400 | 1.00 | 1.00 |
The PERCENT_RANK() function calculates the relative rank of each row within the result set based on the sales_amount column. In this case, the sales amounts range from 100 to 400, and the PERCENT_RANK() function assigns a value between 0 and 1 to each row, indicating the percentage of rows with a sales amount less than the current row's sales amount. For example, on January 1st, the sales amount is 100, which is the lowest sales amount in the dataset, so its percent_rank is 0.00. On January 7th, the sales amount is 400, which is the highest sales amount in the dataset, so its percent_rank is 1.00.
The CUME_DIST() function calculates the cumulative distribution of each row within the result set based on the sales_amount column. It assigns a value between 0 and 1 to each row, indicating the cumulative percentage of rows with a sales amount less than or equal to the current row's sales amount. For example, on January 1st, the sales amount is 100, which is the lowest sales amount in the dataset, so its cume_dist is 0.14, indicating that 14% of the rows (1 out of 7) have a sales amount less than or equal to 100. On January 7th, the sales amount is 400, which is the highest sales amount in the dataset, so its cume_dist is 1.00, indicating that 100% of the rows have a sales amount less than or equal to 400.
Window functions are a powerful feature in SQL that enable you to perform calculations across related rows without collapsing the result set. They are essential for:
Understanding how to use window functions with PARTITION BY, ORDER BY, and window frame specifications (ROWS vs RANGE) will significantly enhance your SQL analytical capabilities and allow you to solve complex data analysis problems efficiently.