Understanding percentile_cont() and percentile_disc() in PostgreSQL

Try for free

Start supercharging your PostgreSQL today.

Abstract shapes over a dark background.

PostgreSQL has two functions to calculate the percentile for a list of values at any percentage: percentile_cont() and percentile_disc(). These two functions work similarly, but they differ in how they produce the final result. Both are used with ordered-set aggregates returned by the WITHIN GROUP clause.

The percentile_disc() function returns a value from the input set that is the closest to the percentile requested. The value returned will actually exist in the set. 

Here's the percentile_disc() syntax:

SELECT percentile_disc(<fraction double precision>) WITHIN GROUP (<sort_expression>) FROM <table>;

The percentile_cont() function returns an interpolated value between multiple values based on the distribution. It is more accurate, but it may return a fractional value between two values in the input set.

percentile_cont() syntax:

SELECT percentile_cont(<fraction double precision>) WITHIN GROUP (<sort_expression>) FROM <table>;

Examples

  • Calculating the median

  • Calculating multiple percentiles

  • Calculating a series of percentiles

For the following examples, we will use this set of weather data stored in a table called city_data:

day

city

temperature

precipitation

2021-09-04

Miami

68.36

0.00

2021-09-05

Miami

72.50

0.00

2021-09-01

Miami

65.30

0.28

2021-09-02

Miami

64.40

0.79

2021-09-03

Miami

68.18

0.47

2021-09-04

Atlanta

67.28

0.00

2021-09-05

Atlanta

68.72

0.00

2021-09-01

Atlanta

63.14

0.20

2021-09-02

Atlanta

62.60

0.59

2021-09-03

Atlanta

62.60

0.39

Calculating the Median

The median is also known as the 50th percentile. You can calculate it from the dataset with the following query:

SELECT percentile_disc(0.5) WITHIN GROUP ( ORDER BY temperature) FROM city_data;

The result is:

percentile_disc

65.30

Because the query used percentile_disc(), the result is a value that exists in the dataset. If you want to find the true median, it is not a value in this data, and you have to use percentile_cont(). Here is the query:

SELECT percentile_cont(0.5) WITHIN GROUP ( ORDER BY temperature) FROM city_data;

And the result:

percentile_cont

66.28999999999999

But since there are two cities, you might want to calculate the median temperature of each by adding a GROUP BY clause. Here is that query:

SELECT city, percentile_cont(0.5) WITHIN GROUP ( ORDER BY temperature) FROM city_data GROUP BY city;

The results is:

city

percentile_cont

Atlanta

63.14

Miami

68.18

Calculating Multiple Percentiles

For this example, we are going to use a database table called conditions that contains these values:

time

device_id

temperature

humidity

2016-11-15 07:00:00

weather-pro-000001

32.4

49.8

2016-11-15 07:00:00

weather-pro-000002

39.800000000000004

50.2

2016-11-15 07:00:00

weather-pro-000003

36.800000000000004

49.8

2016-11-15 07:00:00

weather-pro-000004

71.8

50.1

2016-11-15 07:00:00

weather-pro-000005

71.8

49.9

2016-11-15 07:00:00

weather-pro-000006

37

49.8

Let’s say that we want to calculate various percentiles for the humidity for each device. Here is an example query:

SELECT device_id, percentile_cont(0.25) WITHIN GROUP( ORDER BY humidity) AS percentile_25, percentile_cont(0.50) WITHIN GROUP( ORDER BY humidity) AS percentile_50, percentile_cont(0.75) WITHIN GROUP( ORDER BY humidity) AS percentile_75, percentile_cont(0.95) WITHIN GROUP( ORDER BY humidity) AS percentile_95 FROM conditions GROUP BY device_id ;

Here is a part of the result:

device_id

percentile_25

percentile_50

percentile_75

percentile_95

weather-pro-000000

49.29999999999999

50.500000000000036

53.10000000000007

54.9000000000001

weather-pro-000001

49.09999999999999

50.00000000000003

51.60000000000005

55.6

weather-pro-000002

52.500000000000036

53.60000000000005

54.00000000000006

54.500000000000064

weather-pro-000003

51.100000000000016

51.90000000000003

52.90000000000004

53.800000000000054

weather-pro-000004

48.60000000000001

49.20000000000002

49.60000000000002

50.400000000000034

Calculating a Series of Percentiles

For this example, we are going back to our original city_data dataset because this query can take a long time to run on a big dataset. We are going to use the generate_series() to create every single whole percentage and then use those values in percentile_cont. Here is the query:

SELECT city, percentile, percentile_cont(p) WITHIN GROUP ( ORDER BY temperature) FROM city_data, generate_series(0.01, 1, 0.01) AS percentile GROUP BY city, percentile;

Here is a selection of the results since the query generates 200 rows of them:

city

percentile

percentile_cont

Atlanta

0.25

62.6

Atlanta

0.26

62.6216

Atlanta

0.27

62.6432

Atlanta

0.28

62.6648

Atlanta

0.29

62.6864

Atlanta

0.30

62.708

Why Use the Timescale approx_percentile() Function Instead of PostgreSQL Percentile Functions?

Calculating the percentile over large datasets, like time-series data in a Timesscale database, can involve a lot of expensive calculations. It can increase the memory footprint of the database, result in higher network costs, and make streaming data unfeasible. The aggregates are also not partializable or parallelizable.

Many times you don’t need this type of accuracy, and approximate percentile calculations will be close enough. This is why Timescale introduced the approx_percentile() hyperfunction. The approx_percentile() function implements the UDDSketch algorithm that uses a modified histogram to approximate the shape of a distribution. This allows for calculating a “good enough” percentile without needing to use all the data or ordering it before it returns the result.

approx_percentile() syntax:

approx_percentile(     percentile DOUBLE PRECISION,     sketch  uddsketch ) RETURNS DOUBLE PRECISION

The second parameter is the sketch to perform the approx_percentile on and is usually returned from a percentile_agg() call. Here is an example query:

SELECT     approx_percentile(0.01, percentile_agg(data)) FROM generate_series(0, 100) data;

Result:

approx_percentile

0.999

Next Steps

To learn more about how to use percentile_cont() and percentile_disc() in PostgreSQL, you can see the PostgreSQL documentation

  • To find out more about Timescale’s approx_percentile() function, you can read more about it in the Timescale documentation

For examples of how to use these functions in your queries, see these sections of the Timescale documentation: