Understand SQL Window Functions Once and For All | by Mateus Trentz | May, 2024


A step-by-step guide to understanding window functions

Photo by Yasmina H on Unsplash

Window functions are key to writing SQL code that is both efficient and easy to understand. Knowing how they work and when to use them will unlock new ways of solving your reporting problems.

The objective of this article is to explain window functions in SQL step by step in an understandable way so that you don’t need to rely on only memorizing the syntax.

Here is what we will cover:

  • An explanation on how you should view window functions
  • Go over many examples in increasing difficulty
  • Look at one specific real-case scenario to put our learnings into practice
  • Review what we’ve learned

Our dataset is simple, six rows of revenue data for two regions in the year 2023.

If we took this dataset and ran a GROUP BY sum on the revenue of each region, it would be clear what happens, right? It would result in only two remaining rows, one for each region, and then the sum of the revenues:

The way I want you to view window functions is very similar to this but, instead of reducing the number of rows, the aggregation will run “in the background” and the values will be added to our existing rows.

First, an example:

SELECT
id,
date,
region,
revenue,
SUM(revenue) OVER () as total_revenue
FROM
sales

Notice that we don’t have any GROUP BY and our dataset is left intact. And yet we were able to get the sum of all revenues. Before we go more in depth in how this worked let’s just quickly talk about the full syntax before we start building up our knowledge.

The syntax goes like this:

SUM([some_column]) OVER (PARTITION BY [some_columns] ORDER BY [some_columns])

Picking apart each section, this is what we have:

  • An aggregation or window function: SUM, AVG, MAX, RANK, FIRST_VALUE
  • The OVER keyword which says this is a window function
  • The PARTITION BY section, which defines the groups
  • The ORDER BY section which defines if it’s a running function (we will cover this later on)

Don’t stress over what each of these means yet, as it will become clear when we go over the examples. For now just know that to define a window function we will use the OVER keyword. And as we saw in the first example, that’s the only requirement.

Moving to something actually useful, we will now apply a group in our function. The initial calculation will be kept to show you that we can run more than one window function at once, which means we can do different aggregations at once in the same query, without requiring sub-queries.

SELECT
id,
date,
region,
revenue,
SUM(revenue) OVER (PARTITION BY region) as region_total,
SUM(revenue) OVER () as total_revenue
FROM sales

As said, we use the PARTITION BY to define our groups (windows) that are used by our aggregation function! So, keeping our dataset intact we’ve got:

  • The total revenue for each region
  • The total revenue for the whole dataset

We’re also not restrained to a single group. Similar to GROUP BY we can partition our data on Region and Quarter, for example:

SELECT
id,
date,
region,
revenue,
SUM(revenue) OVER (PARTITION BY
region,
date_trunc('quarter', date)
) AS region_quarterly_revenue
FROM sales

In the image we see that the only two data points for the same region and quarter got grouped together!

At this point I hope it’s clear how we can view this as doing a GROUP BY but in-place, without reducing the number of rows in our dataset. Of course, we don’t always want that, but it’s not that uncommon to see queries where someone groups data and then joins it back in the original dataset, complicating what could be a single window function.

Moving on to the ORDER BY keyword. This one defines a running window function. You’ve probably heard of a Running Sum once in your life, but if not, we should start with an example to make everything clear.

SELECT
id,
date,
region,
revenue,
SUM(revenue) OVER (ORDER BY id) as running_total
FROM sales

What happens here is that we’ve went, row by row, summing the revenue with all previous values. This was done following the order of the id column, but it could’ve been any other column.

This specific example is not particularly useful because we’re summing across random months and two regions, but using what we’ve learned we can now find the cumulative revenue per region. We do that by applying the running sum within each group.

SELECT
id,
date,
region,
revenue,
SUM(revenue) OVER (PARTITION BY region ORDER BY date) as running_total
FROM sales

Take the time to make sure you understand what happened here:

  • For each region we’re walking up month by month and summing the revenue
  • Once it’s done for that region we move to the next one, starting from scratch and again moving up the months!

It’s quite interesting to notice here that when we’re writing these running functions we have the “context” of other rows. What I mean is that to get the running sum at one point, we must know the previous values for the previous rows. This becomes more obvious when we learn that we can manually chose how many rows before/after we want to aggregate on.

SELECT
id,
date,
region,
revenue,
SUM(revenue) OVER (ORDER BY id ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING)
AS useless_sum
FROM
sales

For this query we specified that for each row we wanted to look at one row behind and two rows ahead, so that means we get the sum of that range! Depending on the problem you’re solving this can be extremely powerful as it gives you complete control on how you’re grouping your data.

Finally, one last function I want to mention before we move into a harder example is the RANK function. This gets asked a lot in interviews and the logic behind it is the same as everything we’ve learned so far.

SELECT
*,
RANK() OVER (PARTITION BY region ORDER BY revenue DESC) as rank,
RANK() OVER (ORDER BY revenue DESC) as overall_rank
FROM
sales
ORDER BY region, revenue DESC

Just as before, we used ORDER BY to specify the order which we will walk, row by row, and PARTITION BY to specify our sub-groups.

The first column ranks each row within each region, meaning that we will have multiple “rank one’s” in the dataset. The second calculation is the rank across all rows in the dataset.

This is a problem that shows up every now and then and to solve it on SQL it takes heavy usage of window functions. To explain this concept we will use a different dataset containing timestamps and temperature measurements. Our goal is to fill in the rows missing temperature measurements with the last measured value.

Here is what we expect to have at the end:

Before we start I just want to mention that if you’re using Pandas you can solve this problem simply by running df.ffill() but if you’re on SQL the problem gets a bit more tricky.

The first step to solve this is to, somehow, group the NULLs with the previous non-null value. It might not be clear how we do this but I hope it’s clear that this will require a running function. Meaning that it’s a function that will “walk row by row”, knowing when we hit a null value and when we hit a non-null value.

The solution is to use COUNT and, more specifically, count the values of temperature measurements. In the following query I run both a normal running count and also a count over the temperature values.

SELECT
*,
COUNT() OVER (ORDER BY timestamp) as normal_count,
COUNT(temperature) OVER (ORDER BY timestamp) as group_count
from sensor
  • In the first calculation we simply counted up each row increasingly
  • On the second one we counted every value of temperature we saw, not counting when it was NULL

The normal_count column is useless for us, I just wanted to show what a running COUNT looked like. Our second calculation though, the group_count moves us closer to solving our problem!

Notice that this way of counting makes sure that the first value, just before the NULLs start, is counted and then, every time the function sees a null, nothing happens. This makes sure that we’re “tagging” every subsequent null with the same count we had when we stopped having measurements.

Moving on, we now need to copy over the first value that got tagged into all the other rows within that same group. Meaning that for the group 2 needs to all be filled with the value 15.0.

Can you think of a function now that we can use here? There is more than one answer for this, but, again, I hope that at least it’s clear that now we’re looking at a simple window aggregation with PARTITION BY .

SELECT
*,
FIRST_VALUE(temperature) OVER (PARTITION BY group_count) as filled_v1,
MAX(temperature) OVER (PARTITION BY group_count) as filled_v2
FROM (
SELECT
*,
COUNT(temperature) OVER (ORDER BY timestamp) as group_count
from sensor
) as grouped
ORDER BY timestamp ASC

We can use both FIRST_VALUE or MAX to achieve what we want. The only goal is that we get the first non-null value. Since we know that each group contains one non-null value and a bunch of null values, both of these functions work!

This example is a great way to practice window functions. If you want a similar challenge try to add two sensors and then forward fill the values with the previous reading of that sensor. Something similar to this:

Could you do it? It doesn’t use anything that we haven’t learned here so far.

By now we know everything that we need about how window functions work in SQL, so let’s just do a quick recap!

This is what we’ve learned:

  • We use the OVER keyword to write window functions
  • We use PARTITION BY to specify our sub-groups (windows)
  • If we provide only the OVER() keyword our window is the whole dataset
  • We use ORDER BY when we want to have a running function, meaning that our calculation walks row by row
  • Window functions are useful when we want to group data to run an aggregation but we want to keep our dataset as is



Source link

Be the first to comment

Leave a Reply

Your email address will not be published.


*