Hi! I've run into a situation where a CAgg takes a long time to recalculate aggregates when it should be fast. Assume a user changed data on only two days: 2022-06-05 and 2022-09-05. That equals two 1-day buckets to refresh. But in fact a single entry covering a 92-day period is written into "_timescaledb_catalog".continuous_aggs_hypertable_invalidation_log, instead of two records, one for each bucket.
Script to reproduce:
CREATE TABLE public.my_hyper(
id NUMERIC,
ts TIMESTAMP,
val1 NUMERIC
);
CREATE INDEX my_hyper_idx ON public.my_hyper USING BTREE(id, ts);
SELECT create_hypertable('public.my_hyper', 'ts', chunk_time_interval => INTERVAL '1 day');
INSERT INTO public.my_hyper
SELECT DISTINCT ROUND(random() * 1000), a1.col1, RANDOM() * 10 FROM generate_series('2022-06-01'::TIMESTAMP, '2022-09-30', INTERVAL '12 hours') AS a1(col1);
CREATE MATERIALIZED VIEW my_hyper_cagg
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 day', ts) AS bucket,
SUM(val1) AS val1
FROM
public.my_hyper
GROUP BY 1;
INSERT INTO public.my_hyper VALUES
(2000, TO_TIMESTAMP('2022-06-05', 'yyyy-mm-dd'), RANDOM() * 10),
(2001, TO_TIMESTAMP('2022-09-05', 'yyyy-mm-dd'), RANDOM() * 10);
SELECT
hypertable_id,
"_timescaledb_internal".to_timestamp(lowest_modified_value) AS lowest_modified,
"_timescaledb_internal".to_timestamp(greatest_modified_value) AS greatest_modified,
"_timescaledb_internal".to_timestamp(greatest_modified_value) - "_timescaledb_internal".to_timestamp(lowest_modified_value) AS invalidation_span
FROM "_timescaledb_catalog".continuous_aggs_hypertable_invalidation_log
WHERE hypertable_id = (SELECT raw_hypertable_id FROM "_timescaledb_catalog".continuous_agg WHERE user_view_name = 'my_hyper_cagg')
ORDER BY lowest_modified_value;
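As a workaround sketch (my assumption, not a fix for the merged log entry itself): since the invalidation log collapses the two changes into one 92-day range, you can limit the work by refreshing only the narrow windows that were actually modified with the standard refresh_continuous_aggregate procedure, instead of letting a policy refresh the whole merged range:

```sql
-- Refresh only the two modified 1-day buckets instead of the merged 92-day span.
-- Assumes the my_hyper_cagg view from the script above and known modified dates.
CALL refresh_continuous_aggregate('my_hyper_cagg', '2022-06-05', '2022-06-06');
CALL refresh_continuous_aggregate('my_hyper_cagg', '2022-09-05', '2022-09-06');
```

This only helps when the modified buckets are known to the caller; the underlying issue of the single wide invalidation entry remains.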
TimescaleDB version affected
2.8.1
PostgreSQL version used
PostgreSQL 12.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit