TimescaleDB Toolkit 1.6.0
We are pleased to announce the release of our newest version of the Timescale Toolkit! This new 1.6.0 version is now live on Timescale Cloud, available as part of our HA docker image, and, of course, available as source from our github repo.
We’ve introduced a couple of new experimental features as part of this release, which we will present in more detail below. First, we’ve introduced a new TopN Aggregate to provide an estimate of the top values of a dataset. We’ve also introduced a Gauge Aggregate which will start to provide our Counter Aggregate functionality for use with non-resetting metrics.
Beyond that, 1.6 includes refinements to the State Aggregate and Frequency Aggregate experimental features that we introduced in Toolkit 1.5, enhancing the usability of these aggregates and bringing them closer to stabilization. These aggregates are also detailed more fully below.
The Toolkit team strives to be as responsive to our community as possible. This is one of the main reasons we like to release experimental functionality that our users can try out and help us refine. Please take these new features out for a test drive and let us know about your experience and any improvements you’d like to see. You can drop a response to this post (or feel free to create a new post), chime in on our slack channel, or drop an issue directly into our github repo. Thank you for your help!
Frequency Aggregate
In 1.6 we’ve refined and improved the frequency aggregates we first introduced in our 1.5 release. The frequency aggregate is used to track the approximate frequency of all values in a column that occur at least as often as a specified minimum frequency.
Finding most frequent rounded square roots of integers 1-100:
SELECT value, min_freq, max_freq
FROM toolkit_experimental.into_values(
    (SELECT toolkit_experimental.freq_agg(0.15, ceiling(sqrt(v))::int)
     FROM generate_series(1,100) v),
    0::int
);
value | min_freq | max_freq
-------+----------+----------
10 | 0.19 | 0.24
9 | 0.17 | 0.2
8 | 0.15 | 0.16
7 | 0.13 | 0.13
6 | 0.11 | 0.11
5 | 0.09 | 0.09
4 | 0.07 | 0.07
This is implemented under the hood using the Space-Saving algorithm by Metwally, Agrawal, and El Abbadi (https://cs.ucsb.edu/sites/default/files/documents/2005-23.pdf). At a high level, the algorithm works by tracking counts for the most frequent values it sees with a limit on the number of values it can track. When it encounters an untracked element, it assumes the value had the same count as the least frequent element in our set (the most it could have and not be included). It then replaces this least frequent element and increments the count of the new element.
Tracking the count an element had when it was first tracked, along with the number of occurrences since then, gives a min and max range for its possible frequency, and allows us to size our data structure such that any element which occurs with at least a given target frequency is guaranteed to be included in our aggregate. The tradeoff is that we will likely overestimate the frequency of data that doesn’t occur often enough to stay in the tracked subset, and a value that is overrepresented at the end of a sequence may be rated too highly by the aggregate. That being said, min_freq or max_freq can be used depending on whether you’re more concerned with false positives or false negatives.
This will find all values which might have occurred at least 1% of the time, possibly including some false positives:
SELECT value
FROM toolkit_experimental.into_values(aggregate, type)
WHERE max_freq > 0.01;
While this will find all values which must have occurred at least 1% of the time, possibly missing some values that were too close to tell:
SELECT value
FROM toolkit_experimental.into_values(aggregate, type)
WHERE min_freq > 0.01;
Frequency aggregates are created by giving them a column and a minimum frequency that you’re interested in tracking. Once that’s done, in addition to simply dumping the tracked data with into_values, you can also query for the most common values (based on maximum frequency) with topn. You can also query the aggregate for the minimum and maximum frequencies for a particular value. Beyond the following examples, you can find more details in our online documentation.
Track all values occurring more than 2% of the time in a column:
SELECT toolkit_experimental.freq_agg(0.02, clmn) FROM tbl;
Get top 5 values from a frequency aggregate of integers (we’re hoping to remove the required type parameter in the next iteration of this aggregate):
SELECT toolkit_experimental.topn(5, agg, 0::int);
Find the maximum frequency of ‘Portland, OR’ in a frequency aggregate:
SELECT toolkit_experimental.max_frequency(agg, 'Portland, OR');
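Correspondingly, you can query the lower bound for a value with min_frequency; a quick sketch, assuming it takes the same arguments as max_frequency above:
SELECT toolkit_experimental.min_frequency(agg, 'Portland, OR');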
TopN Aggregate
TopN aggregates are an exciting new feature that we’re introducing in Toolkit 1.6.0. These aggregates are used to give an estimate for the top values present in a column.
Finding top 5 most common rounded square roots from 100,000 random numbers in the range (1,1000):
SELECT toolkit_experimental.topn(
    toolkit_experimental.topn_agg(5, ceiling(sqrt(random() * 1000))::int),
    0::int)
FROM generate_series(1,100000);
topn
------
31
30
29
28
27
Under the hood, the TopN aggregate is actually implemented using the same Space-Saving algorithm and data structures that we use for our frequency aggregate. This gives us the frequency agg API, including into_values, topn, min_frequency, and max_frequency. We’ve also enhanced the topn function to automatically default to the N used to create the aggregate, if called on a topn_agg result.
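As a sketch of that shared API, here is how you might ask for the maximum frequency of a particular value from a TopN aggregate (this reuses the hypothetical table tbl with an integer column clmn from the earlier examples, and 42 is just an arbitrary value to look up):
SELECT toolkit_experimental.max_frequency(
    (SELECT toolkit_experimental.topn_agg(5, clmn) FROM tbl),
    42);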
The way we’ve managed to reuse most of the implementation for frequency agg was to make certain assumptions about how skewed the data being aggregated would be and then size the underlying structures appropriately. More specifically, we’re assuming that the skew of the data we’re looking at is going to be bounded by some zeta distribution. By default, we use a zeta distribution with an s-value of 1.1, which looks like this:
 N most common values | Percentage of column with these values
----------------------+----------------------------------------
                    5 | 20%
                   10 | 25%
                   20 | 30%
                   50 | 36%
                  100 | 40%
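For example, creating an aggregate that tracks the top 10 values of a column under this default s-value of 1.1 (again reusing the hypothetical tbl and clmn from the frequency aggregate examples) looks like this:
SELECT toolkit_experimental.topn_agg(10, clmn) FROM tbl;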
However, you can also construct a TopN aggregate using a custom s-value.
Using an s-value of 1.05 to track the top 10 values for data that is less skewed (more uniform) than the default assumption:
SELECT toolkit_experimental.topn_agg(10, 1.05, clmn) FROM tbl;
If this functionality sounds useful to you, please try it out and let us know about your experience. In particular, if you’re finding the default behavior isn’t working for you, please let us know so we can further tune it.
State Aggregate
The new Toolkit version also includes some improvements to the State Aggregate we introduced in 1.5. The state aggregate is used to accumulate the total time spent in different states, given a time and state column:
WITH example (time, state) AS (
    SELECT * FROM (VALUES
        ('2020-01-01 00:00'::timestamptz, 'STARTUP'),
        ('2020-01-01 00:05'::timestamptz, 'RUNNING'),
        ('2020-01-01 12:00'::timestamptz, 'RESTARTING'),
        ('2020-01-01 12:03:30'::timestamptz, 'RUNNING'),
        ('2020-01-02 06:00'::timestamptz, 'SHUTDOWN')
    ) AS v
)
SELECT state, duration / 1000000 AS seconds
FROM toolkit_experimental.into_values(
    (SELECT toolkit_experimental.state_agg(time, state) FROM example)
);
state | seconds
------------+---------
RESTARTING | 210
RUNNING | 107490
STARTUP | 300
(3 rows)
Unlike frequency aggregates, state aggregates are not approximations and will total the time spent in each distinct state. This makes them ideal for cases where you want to track frequent shifts between a relatively small number of states, but be careful using this on a high-cardinality state column, as it may result in a relatively large data object.
The state aggregate is very simple to use: simply create it from a time and state column as seen in the example above. You can then pass the aggregate to into_values to get the states and the microseconds spent in each state, or query a specific state using duration_in.
SELECT toolkit_experimental.duration_in(
'RESTARTING',
toolkit_experimental.state_agg(time, state))
FROM example;
For more information, please view the online documentation, and as always, please let us know if you find it useful or if there are any improvements you’d like to see.
Gauge Aggregate
We’re also very excited to be introducing Gauge aggregates in our 1.6 release. These will ultimately give users the same API that we currently provide for Prometheus-style counters (see our counter aggregate), but for gauge-style metrics.
WITH example (time, value) AS (
    SELECT * FROM (VALUES
        ('2020-01-01'::timestamptz, 45),
        ('2020-01-02'::timestamptz, 60),
        ('2020-01-03'::timestamptz, 30),
        ('2020-01-04'::timestamptz, 50),
        ('2020-01-05'::timestamptz, 35)
    ) AS v
)
SELECT toolkit_experimental.delta(
    toolkit_experimental.gauge_agg(time, value)
) FROM example;
delta
-------
-10
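The instantaneous change at either end of the gauge can also be queried. Here is a sketch over the same example data, assuming idelta_left and idelta_right take a gauge aggregate just as their counter aggregate counterparts do:
SELECT
    toolkit_experimental.idelta_left(
        toolkit_experimental.gauge_agg(time, value)),  -- change between the first two points: 60 - 45 = 15
    toolkit_experimental.idelta_right(
        toolkit_experimental.gauge_agg(time, value))   -- change between the last two points: 35 - 50 = -15
FROM example;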
Currently we support the delta, idelta_right, and idelta_left APIs on our gauge aggregates, but expect to rapidly expand this to parity with counter aggregates. Have a particular API you’d like to see prioritized? Let us know!