The WITHIN GROUP
clause is useful in PostgreSQL and TimescaleDB when performing aggregations on ordered subsets of data. It essentially allows you to pivot multiple rows of data onto a single row. This clause simplifies many operations that were once only possible through the use of window functions in combination with a common table expression or CTE.
The expressions preceding WITHIN GROUP
are called direct arguments, which are only evaluated once per aggregate call and not for each row in the input. Direct arguments are commonly used for percentile fractions and other aggregate calculations that make sense when you use a single value per calculation. The records in a WITHIN GROUP
clause require a group of rows organized by the ORDER BY
clause.
Here's the WITHIN GROUP
syntax:
SELECT <aggregate_function> WITHIN GROUP (ORDER BY <sort_expression>) FROM <table>;
Finding the most frequent value in a series of data
Using WITHIN GROUP with GROUP BY
Let’s say you have this data in a database table called city_data
:
id | day | city | temperature | precipitation |
17 | 2021-09-04 | Miami | 68.36 | 0.00 |
19 | 2021-09-05 | Miami | 72.50 | 0.00 |
11 | 2021-09-01 | Miami | 65.30 | 0.28 |
13 | 2021-09-02 | Miami | 64.40 | 0.79 |
15 | 2021-09-03 | Miami | 68.18 | 0.47 |
18 | 2021-09-04 | Atlanta | 67.28 | 0.00 |
20 | 2021-09-05 | Atlanta | 68.72 | 0.00 |
12 | 2021-09-01 | Atlanta | 63.14 | 0.20 |
14 | 2021-09-02 | Atlanta | 62.60 | 0.59 |
16 | 2021-09-03 | Atlanta | 62.60 | 0.39 |
Here are some examples of using WITHIN GROUP
in SQL queries.
PostgreSQL and TimescaleDB have two statistical window functions that are often used with WITHIN GROUP
. Both percentile_cont
and percentile_disc
calculate a percentile, but they do so in different ways.
Here is an example of using percentile_disc
to aggregate:
SELECT
percentile_disc(0.5) WITHIN GROUP (
ORDER BY temperature)
FROM city_data;
The result is:
percentile_disc |
65.30 |
Notice that this result is a value that exists in the data. The WITHIN GROUP
clause specifies the numeric values that are used by the percentile function. In this case, percentile_disc
returns a discrete value from the data set. On the other hand, percentile_cont
interpolates the values. Here is the query using percentile_cont
:
SELECT
percentile_cont(0.5) WITHIN GROUP (
ORDER BY temperature)
FROM city_data;
The result is:
percentile_cont |
66.28999999999999 |
The mode() function returns the most commonly occurring value specific in the data set specified by WITHIN GROUP
. Here is an example query using mode()
on the city_data
data set.
SELECT
mode() WITHIN GROUP (
ORDER BY temperature)
FROM city_data;
And here is the result, because 62.60 is the only value that occurs twice in the temperature column:
mode |
62.60 |
If you want to know how a specific temperature in the city_data
table ranks within the group that you have designated in the WITHIN GROUP
clause, you can use the rank function. Here is an example:
SELECT
rank(72.5) WITHIN GROUP (
ORDER BY temperature)
FROM city_data;
And here is the result:
rank |
10 |
The temperature 72.5 is the highest value in the table because it is sorted by default in ascending order; that value ends up at the end of the list of values. The following query gives a different result:
SELECT
rank(72.5) WITHIN GROUP (
ORDER BY temperature DESC)
FROM city_data;
rank |
1 |
Using WITHIN GROUP across a complete table is not very useful. In this query, the cities in the data set are grouped, so that the results have the 50-percentile of each city in a separate row:
SELECT
city,
percentile_cont(0.5) WITHIN GROUP (
ORDER BY
temperature)
FROM
city_data
GROUP BY
city;
Results:
city | percentile_cont |
Atlanta | 63.14 |
Miami | 68.18 |
To learn more about WITHIN GROUP
and how to use it in PostgreSQL, see the aggregate expressions, ordered set aggregate functions, and hypothetical set aggregate functions in PostgreSQL documentation.
For more examples of how to use WITHIN GROUP
in your own queries, see these Timescale resources: