Nov 25, 2024
Posted by Chris Engelbert
Time-series data is often collected at a much higher granularity than is later required for display or historical storage. Having too much data never sounds like a problem until it becomes one, both for query speed and for storage. A materialized view is commonly used to precalculate data for faster access, so we roll data up (or downsample it) into lower-granularity datasets (from minutes to days, for example). Typically, the process continues through multiple stages of rollups, going from seconds to minutes, days, weeks, months, and so on.
In the past, each of these rollups had to be generated from the raw dataset. That meant the raw data had to stick around for as long as the largest rollup window, likely increasing your storage needs.
With the release of TimescaleDB 2.9, we solved this problem by adding support for hierarchical continuous aggregates. Simply put, continuous aggregates on top of continuous aggregates.
Continuous aggregates, which can be described as incremental and automatically updated materialized views, have been part of TimescaleDB for quite a while now. They are one of the most beloved features, enabling users to pre-aggregate data in the background and making it quickly available when necessary.
A common use case for continuous aggregates is dashboards, where data is often displayed at a much lower granularity than it was recorded (we covered the particular use case of real-time analytics in this blog post). Imagine a data point like CPU usage, recorded at one-second granularity. You're unlikely to display it at that granularity in Grafana. You'd normally use averages or percentiles over the course of a minute, or an even coarser window such as five minutes.
SELECT
    time_bucket('5 minutes', "time") AS "time",
    avg(cpu_usage) AS "avg_cpu_usage"
FROM cpu_usage_metrics
WHERE "time" BETWEEN now() - INTERVAL '5 days' AND now()
  AND "machine_id" = 42
GROUP BY 1
ORDER BY 1;
While you can query this time window directly from the raw data, depending on the amount of data and the ingest granularity, the query may not satisfy your response time requirements and may deliver a “laggy” user experience.
Pre-calculating the required granularity helps give an instant feel to the dashboard for an amazing user experience.
CREATE MATERIALIZED VIEW cpu_usage_metrics_avg_5min
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('5 minutes', "time") AS "time",
    "machine_id",
    avg(cpu_usage) AS "avg_cpu_usage"
FROM cpu_usage_metrics
GROUP BY 1, 2
ORDER BY 1;
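To keep the view up to date in the background, you would typically also attach a refresh policy. A minimal sketch (the offsets and schedule below are illustrative, not part of the original example):
-- Refresh the five-minute aggregate every five minutes, covering the window
-- between one hour ago and five minutes ago (illustrative values)
SELECT add_continuous_aggregate_policy(
    'cpu_usage_metrics_avg_5min',
    start_offset      => INTERVAL '1 hour',
    end_offset        => INTERVAL '5 minutes',
    schedule_interval => INTERVAL '5 minutes'
);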
Querying the same result data as above is now as simple as any other query:
SELECT
    "time",
    "avg_cpu_usage"
FROM cpu_usage_metrics_avg_5min
WHERE "time" BETWEEN now() - INTERVAL '5 days' AND now()
  AND "machine_id" = 42
ORDER BY 1;
If you need to work with yet another granularity level, let's say 15 minutes, you just create another continuous aggregate with the necessary rollup window, and you'll be fine. That is, as long as the raw data is still available for the aggregate's refresh window, which becomes a problem when you roll data up into something like a monthly window: all raw data for that month has to be available at refresh time.
Well, not anymore!
With TimescaleDB 2.9 or later, you can roll up a continuous aggregate from a previous continuous aggregate. That means the FROM clause can reference another continuous aggregate, which wasn't allowed before.
Returning to the 15-minute example, we can now implement a continuous aggregate using the already pre-aggregated five-minute one. For the sake of correctness, let's slightly change the five-minute continuous aggregate by adding an intermediate sum and count: an average of averages is only exact when every bucket holds the same number of samples, so multi-stage averages need those two building blocks instead. Since a continuous aggregate's defining query can't be altered in place, that means dropping and recreating the view.
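A quick sketch of the drop step (note that this discards the view's materialized data; the recreate below re-materializes it from the raw hypertable):
DROP MATERIALIZED VIEW cpu_usage_metrics_avg_5min;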
CREATE MATERIALIZED VIEW cpu_usage_metrics_avg_5min
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('5 minutes', "time") AS "time",
    "machine_id",
    avg(cpu_usage) AS "avg_cpu_usage",
    sum(cpu_usage) AS "sum_cpu_usage",
    count(cpu_usage) AS "count_cpu_usage"
FROM cpu_usage_metrics
GROUP BY 1, 2
ORDER BY 1;
With that out of the way, the 15-minute continuous aggregate is as simple as the following:
CREATE MATERIALIZED VIEW cpu_usage_metrics_avg_15min
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('15 minutes', "time") AS "time",
    "machine_id",
    sum(sum_cpu_usage) / sum(count_cpu_usage) AS "avg_cpu_usage"
FROM cpu_usage_metrics_avg_5min
GROUP BY 1, 2
ORDER BY 1;
As you can see, we no longer use the average function directly, but rebuild the average from the two intermediate values. For other aggregations, this may be simpler or more involved, depending on how well the algorithm decomposes into multiple stages.
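Some aggregates compose across stages with no extra bookkeeping at all. For instance, if a (hypothetical) variant of the five-minute view also carried min_cpu_usage and max_cpu_usage columns, the 15-minute level would simply take the min of mins and the max of maxes:
-- Assumes hypothetical min_cpu_usage/max_cpu_usage columns in the five-minute aggregate
SELECT
    time_bucket('15 minutes', "time") AS "time",
    "machine_id",
    min(min_cpu_usage) AS "min_cpu_usage",
    max(max_cpu_usage) AS "max_cpu_usage"
FROM cpu_usage_metrics_avg_5min
GROUP BY 1, 2;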
Anyhow, we end up with pre-aggregated 15-minute slices per machine, just as if we'd calculated them straight from the raw data. The benefit here is that you can already expire and delete the raw data once the initial five-minute window has been materialized, shrinking the stored volume dramatically: with per-second samples, every five-minute bucket replaces up to 300 raw data points. Imagine the storage savings with something like a monthly time window.
And that’s not only true for the raw data, but every single continuous aggregate can have its own retention policy, too. Just make sure the data is further aggregated before it's retired and removed.
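In TimescaleDB, retention policies handle exactly that. A minimal sketch for the tables above, assuming the raw hypertable only needs to outlive the five-minute refresh window (the intervals are illustrative):
-- Drop raw chunks older than a day; the five-minute aggregate keeps the history
SELECT add_retention_policy('cpu_usage_metrics', INTERVAL '1 day');
-- Continuous aggregates can get their own retention policies as well,
-- e.g., keep the five-minute rollup for 30 days once the 15-minute one covers it
SELECT add_retention_policy('cpu_usage_metrics_avg_5min', INTERVAL '30 days');
-- Make sure no refresh policy reaches back into data that has already been dropped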
But there is one additional benefit: speed. It's much faster to average over three values (three five-minute buckets) than over 900 raw one-second data points. While it doesn't make a massive difference at this level, more complex algorithms will be a lot faster depending on the number of data points.
As a quick side note on the term hierarchical: we called the feature hierarchical continuous aggregates because you can branch out from one continuous aggregate into many. One example would be a continuous aggregate with one-day time slices, which is then aggregated into multiple continuous aggregates, such as seven days, 14 days, one month, etc. The branching can become arbitrarily complex; the limit is your imagination.
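As a sketch of that branching (the daily aggregate cpu_usage_metrics_avg_1day is hypothetical and assumed to carry the same sum and count columns as above), two sibling views can roll up from the same daily level:
-- Weekly rollup built on top of a (hypothetical) daily continuous aggregate
CREATE MATERIALIZED VIEW cpu_usage_metrics_avg_7day
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('7 days', "time") AS "time",
    "machine_id",
    sum(sum_cpu_usage) / sum(count_cpu_usage) AS "avg_cpu_usage"
FROM cpu_usage_metrics_avg_1day
GROUP BY 1, 2;
-- A second branch from the same daily aggregate, this time over 14 days
CREATE MATERIALIZED VIEW cpu_usage_metrics_avg_14day
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('14 days', "time") AS "time",
    "machine_id",
    sum(sum_cpu_usage) / sum(count_cpu_usage) AS "avg_cpu_usage"
FROM cpu_usage_metrics_avg_1day
GROUP BY 1, 2;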
Timescale is happy to release hierarchical continuous aggregates with TimescaleDB 2.9. It is one of the most requested features, and we love making our users (you!) happy with functionality that is actually needed.
As always, this is not the only cool new addition to TimescaleDB 2.9. Other features include time zone support for time_bucket_gapfill (an extension of the time zone support added for time_bucket in 2.8) and fixed-schedule support for background jobs. For a complete list, check out the Release Notes.
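As a small taste of the gap-filling change, buckets can now be aligned to days in a specific time zone; a sketch against the example table above (the time zone and interpolation are illustrative):
-- Daily average CPU usage over the last week, with buckets aligned to Europe/Berlin days
-- and missing days interpolated from neighboring values
SELECT
    time_bucket_gapfill('1 day', "time", 'Europe/Berlin') AS "day",
    interpolate(avg(cpu_usage)) AS "avg_cpu_usage"
FROM cpu_usage_metrics
WHERE "time" BETWEEN now() - INTERVAL '7 days' AND now()
  AND "machine_id" = 42
GROUP BY 1
ORDER BY 1;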
For Timescale users, upgrades are automatic; you'll be upgraded during the next maintenance window.
If you are new to Timescale, start your free 30-day trial now, no credit card required, and get your new database journey started in five minutes.
If you’re self-hosting TimescaleDB, follow the upgrade instructions in our documentation.