Start supercharging your PostgreSQL today.
PostgreSQL has two functions to calculate the percentile for a list of values at any percentage: percentile_cont()
and percentile_disc()
. These two functions work similarly, but they differ in how they produce the final result. Both are used with ordered-set aggregates returned by the WITHIN GROUP
clause.
The percentile_disc()
function returns a value from the input set that is the closest to the percentile requested. The value returned will actually exist in the set.
Here's the percentile_disc()
syntax:
SELECT
percentile_disc(<fraction double precision>) WITHIN GROUP (<sort_expression>)
FROM <table>;
The percentile_cont()
function returns an interpolated value between multiple values based on the distribution. It is more accurate, but it may return a fractional value between two values in the input set.
percentile_cont()
syntax:
SELECT
percentile_cont(<fraction double precision>) WITHIN GROUP (<sort_expression>)
FROM <table>;
Calculating the median
Calculating multiple percentiles
Calculating a series of percentiles
For the following examples, we will use this set of weather data stored in a table called city_data:
day | city | temperature | precipitation |
2021-09-04 | Miami | 68.36 | 0.00 |
2021-09-05 | Miami | 72.50 | 0.00 |
2021-09-01 | Miami | 65.30 | 0.28 |
2021-09-02 | Miami | 64.40 | 0.79 |
2021-09-03 | Miami | 68.18 | 0.47 |
2021-09-04 | Atlanta | 67.28 | 0.00 |
2021-09-05 | Atlanta | 68.72 | 0.00 |
2021-09-01 | Atlanta | 63.14 | 0.20 |
2021-09-02 | Atlanta | 62.60 | 0.59 |
2021-09-03 | Atlanta | 62.60 | 0.39 |
The median is also known as the 50th percentile. You can calculate it from the dataset with the following query:
SELECT
percentile_disc(0.5) WITHIN GROUP (
ORDER BY temperature)
FROM city_data;
The result is:
percentile_disc |
65.30 |
Because the query used percentile_disc()
, the result is a value that exists in the dataset. If you want to find the true median, it is not a value in this data, and you have to use percentile_cont()
. Here is the query:
SELECT
percentile_cont(0.5) WITHIN GROUP (
ORDER BY temperature)
FROM city_data;
And the result:
percentile_cont |
66.28999999999999 |
But since there are two cities, you might want to calculate the median temperature of each by adding a GROUP BY
clause. Here is that query:
SELECT
city,
percentile_cont(0.5) WITHIN GROUP (
ORDER BY
temperature)
FROM
city_data
GROUP BY
city;
The results is:
city | percentile_cont |
Atlanta | 63.14 |
Miami | 68.18 |
For this example, we are going to use a database table called conditions that contains these values:
time | device_id | temperature | humidity |
2016-11-15 07:00:00 | weather-pro-000001 | 32.4 | 49.8 |
2016-11-15 07:00:00 | weather-pro-000002 | 39.800000000000004 | 50.2 |
2016-11-15 07:00:00 | weather-pro-000003 | 36.800000000000004 | 49.8 |
2016-11-15 07:00:00 | weather-pro-000004 | 71.8 | 50.1 |
2016-11-15 07:00:00 | weather-pro-000005 | 71.8 | 49.9 |
2016-11-15 07:00:00 | weather-pro-000006 | 37 | 49.8 |
Let’s say that we want to calculate various percentiles for the humidity for each device. Here is an example query:
SELECT
device_id,
percentile_cont(0.25) WITHIN GROUP(
ORDER BY
humidity) AS percentile_25,
percentile_cont(0.50) WITHIN GROUP(
ORDER BY
humidity) AS percentile_50,
percentile_cont(0.75) WITHIN GROUP(
ORDER BY
humidity) AS percentile_75,
percentile_cont(0.95) WITHIN GROUP(
ORDER BY
humidity) AS percentile_95
FROM
conditions
GROUP BY
device_id ;
Here is a part of the result:
device_id | percentile_25 | percentile_50 | percentile_75 | percentile_95 |
weather-pro-000000 | 49.29999999999999 | 50.500000000000036 | 53.10000000000007 | 54.9000000000001 |
weather-pro-000001 | 49.09999999999999 | 50.00000000000003 | 51.60000000000005 | 55.6 |
weather-pro-000002 | 52.500000000000036 | 53.60000000000005 | 54.00000000000006 | 54.500000000000064 |
weather-pro-000003 | 51.100000000000016 | 51.90000000000003 | 52.90000000000004 | 53.800000000000054 |
weather-pro-000004 | 48.60000000000001 | 49.20000000000002 | 49.60000000000002 | 50.400000000000034 |
For this example, we are going back to our original city_data
dataset because this query can take a long time to run on a big dataset. We are going to use the generate_series()
to create every single whole percentage and then use those values in percentile_cont. Here is the query:
SELECT
city,
percentile,
percentile_cont(p) WITHIN GROUP (
ORDER BY
temperature)
FROM
city_data,
generate_series(0.01, 1, 0.01) AS percentile
GROUP BY
city, percentile;
Here is a selection of the results since the query generates 200 rows of them:
city | percentile | percentile_cont |
Atlanta | 0.25 | 62.6 |
Atlanta | 0.26 | 62.6216 |
Atlanta | 0.27 | 62.6432 |
Atlanta | 0.28 | 62.6648 |
Atlanta | 0.29 | 62.6864 |
Atlanta | 0.30 | 62.708 |
Calculating the percentile over large datasets, like time-series data in a Timesscale database, can involve a lot of expensive calculations. It can increase the memory footprint of the database, result in higher network costs, and make streaming data unfeasible. The aggregates are also not partializable or parallelizable.
Many times you don’t need this type of accuracy, and approximate percentile calculations will be close enough. This is why Timescale introduced the approx_percentile()
hyperfunction. The approx_percentile()
function implements the UDDSketch algorithm that uses a modified histogram to approximate the shape of a distribution. This allows for calculating a “good enough” percentile without needing to use all the data or ordering it before it returns the result.
approx_percentile()
syntax:
approx_percentile(
percentile DOUBLE PRECISION,
sketch uddsketch
) RETURNS DOUBLE PRECISION
The second parameter is the sketch to perform the approx_percentile
on and is usually returned from a percentile_agg()
call. Here is an example query:
SELECT
approx_percentile(0.01, percentile_agg(data))
FROM generate_series(0, 100) data;
Result:
approx_percentile |
0.999 |
To learn more about how to use percentile_cont()
and percentile_disc()
in PostgreSQL, you can see the PostgreSQL documentation.
To find out more about Timescale’s approx_percentile()
function, you can read more about it in the Timescale documentation.
For examples of how to use these functions in your queries, see these sections of the Timescale documentation: