Hi,
I want to implement the following data strategy:
Raw data for an hour
1 Minute downsample for a day
1 Hour downsample for 14 days
1 Day downsample for 6 months
If I could put a CAGG on top of a CAGG, that would be ideal, but I can’t. Even when I check the underlying materialized table I don’t see the value there, only an “internal” column called “agg_2_2” (the docs do say the underlying table can be used).
Thanks for joining the forum and asking for some input!
The answer, as with many things, depends on your data and the actual queries (and aggregate functions) you’re using. For a “traditional” continuous aggregate that uses common aggregation methods (MIN/MAX/AVG/LAST/etc.) with automated refresh policies, you will need to keep raw data for at least 2 days. For each downsampled period (1 minute, 1 hour, 1 day) you would need to create a separate CAGG and set up the appropriate refresh policy.
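To make that concrete, here is a rough sketch of what one of those CAGGs and its refresh policy could look like, assuming a hypothetical hypertable conditions(time, device_id, value); the 1-hour and 1-day CAGGs would follow the same pattern against the raw table:

```sql
-- Hypothetical 1-minute downsample built directly on the raw hypertable.
CREATE MATERIALIZED VIEW conditions_1m
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', time) AS bucket,
       device_id,
       avg(value) AS avg_value,
       min(value) AS min_value,
       max(value) AS max_value
FROM conditions
GROUP BY bucket, device_id
WITH NO DATA;

-- Refresh the last hour of 1-minute buckets every minute, leaving the
-- still-open bucket alone until it is complete.
SELECT add_continuous_aggregate_policy('conditions_1m',
    start_offset      => INTERVAL '1 hour',
    end_offset        => INTERVAL '1 minute',
    schedule_interval => INTERVAL '1 minute');
```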
Because your largest downsampled interval is 1 day, and a refresh policy needs raw data for at least one entire interval (1 day in this case) in order to materialize it, we require at least 2x that interval of raw data to ensure nothing is lost to data retention before it has been materialized.
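In practice (sticking with the hypothetical conditions hypertable and CAGG names above), that means the retention policy on the raw data should be no shorter than twice the largest bucket:

```sql
-- Keep raw data for at least 2x the largest bucket interval (1 day here),
-- so the daily CAGG can always materialize a complete bucket before the
-- raw rows are dropped by retention.
SELECT add_retention_policy('conditions', drop_after => INTERVAL '2 days');
-- Each CAGG (1-minute, 1-hour, 1-day) can then be given its own retention
-- window in the same way to match the 1-day / 14-day / 6-month tiers.
```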
There are other alternatives that might be worth exploring, but they would require a better understanding of your data and the aggregate queries you need to analyze it.
For instance, we’re currently developing various hyperfunctions that allow you to re-aggregate data into larger bucket intervals. If your analysis is in line with what these functions provide, then you could create only one (or maybe two) continuous aggregates and do two-step aggregation to larger intervals when you need it. One example of a function that supports this (with more coming) is stats_agg().
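As a rough sketch of the idea (assuming the timescaledb_toolkit extension and the same hypothetical conditions hypertable as above), you could store a 1-minute stats_agg() in a single CAGG and roll it up to coarser buckets at query time:

```sql
-- Hypothetical CAGG that stores a re-aggregatable partial per minute.
CREATE MATERIALIZED VIEW conditions_stats_1m
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', time) AS bucket,
       device_id,
       stats_agg(value) AS stats   -- partial state, not a final value
FROM conditions
GROUP BY bucket, device_id;

-- Two-step aggregation: roll the 1-minute partials up to 1-hour buckets
-- at query time, without touching the raw data.
SELECT time_bucket('1 hour', bucket) AS hour,
       device_id,
       average(rollup(stats)) AS avg_value,
       num_vals(rollup(stats)) AS samples
FROM conditions_stats_1m
GROUP BY hour, device_id;
```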
Thanks. Keeping 3 different CAGGs is possible, but it uses extra CPU/space and forces the client to decide which CAGG to query, which I’d prefer to avoid (in theory we could also create a single view with a CASE statement).
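For illustration, one way to express that idea (with hypothetical CAGG names and time windows, using UNION ALL branches rather than a literal CASE) would be:

```sql
-- Hypothetical view that routes each time window to the matching CAGG,
-- so the client always queries one relation.
CREATE VIEW conditions_auto AS
SELECT bucket, device_id, avg_value
  FROM conditions_1m
 WHERE bucket >= now() - INTERVAL '1 day'
UNION ALL
SELECT bucket, device_id, avg_value
  FROM conditions_1h
 WHERE bucket >= now() - INTERVAL '14 days'
   AND bucket <  now() - INTERVAL '1 day'
UNION ALL
SELECT bucket, device_id, avg_value
  FROM conditions_1d
 WHERE bucket <  now() - INTERVAL '14 days';
```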
My problem is that keeping the “high res” data makes the database work harder. Unless I’m missing something, even if I use stats_agg() it will still need to go over all the data (e.g. for SUM), so I don’t see the real value on the performance side?
You’re correct that maintaining multiple CAGGs for different downsampled periods does require choosing which one to query. That’s the functionality we currently provide, so it might not fit your use case.
I’m not sure what you mean by storing more data making the database work harder. If you want an aggregate over a 1-day period, TimescaleDB currently needs a full day of data to aggregate. If the data is properly indexed and the schema is well designed, CAGG refreshes should be very performant.
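For example (sticking with the hypothetical conditions hypertable from the sketches above), a composite index matching how the CAGGs group the data is the usual starting point:

```sql
-- Typical index for per-device time-series queries and CAGG refreshes.
CREATE INDEX IF NOT EXISTS conditions_device_time_idx
    ON conditions (device_id, time DESC);
```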
Have you had an opportunity to run a test setup in some way to see how this might function with your use case?