Hi, I’m not very familiar with PostgreSQL, but I want to know why the query for continuous aggregation in my deployment is very slow.
My TimescaleDB deployment is as follows:
- I have 1 access node (AN) and 3 data nodes (DNs); each has 8 CPU cores and 32 GB of memory.
- I have a hypertable like this:
CREATE TABLE sample
(
    smpl_time   TIMESTAMPTZ NOT NULL,
    nanosecs    INTEGER NOT NULL,
    channel_id  BIGINT NOT NULL,
    severity_id SMALLINT NOT NULL,
    status_id   SMALLINT NOT NULL,
    float_val   DOUBLE PRECISION NULL
);
SELECT create_distributed_hypertable('sample', 'smpl_time', 'channel_id',
data_nodes => '{ "dn1", "dn2", "dn3"}');
SELECT set_chunk_time_interval('sample', INTERVAL '2 hours');
SELECT set_replication_factor('sample', 2);
ALTER TABLE sample SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'channel_id'
);
SELECT add_compression_policy('sample', INTERVAL '2 hours');
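For reference, this is how I understand chunk and compression status can be inspected, assuming the standard timescaledb_information views (I have not double-checked the column names on my version):

-- list the most recent chunks of 'sample' and whether each is compressed
SELECT chunk_name, range_start, range_end, is_compressed
FROM timescaledb_information.chunks
WHERE hypertable_name = 'sample'
ORDER BY range_start DESC
LIMIT 10;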
- I want continuous aggregates at three granularities: 1 minute, 1 hour, and 1 day. I set them up like this:
CREATE MATERIALIZED VIEW sample_summary_minute
WITH (timescaledb.continuous, timescaledb.create_group_indexes=true) AS
SELECT channel_id,
time_bucket(INTERVAL '1 minute', smpl_time) AS bucket,
COUNT(1) AS total_nums,
AVG(float_val) AS avg_value,
MAX(float_val) AS max_value,
MIN(float_val) AS min_value
FROM sample
GROUP BY channel_id, bucket;
-- --
SELECT add_continuous_aggregate_policy('sample_summary_minute',
start_offset => INTERVAL '1 hour',
end_offset => INTERVAL '10 minutes',
schedule_interval => INTERVAL '10 minutes');
-- --
ALTER MATERIALIZED VIEW sample_summary_minute SET (timescaledb.compress = true);
-- --
-- --
CREATE MATERIALIZED VIEW sample_summary_hour
WITH (timescaledb.continuous, timescaledb.create_group_indexes=true) AS
SELECT channel_id,
time_bucket(INTERVAL '1 hour', smpl_time) AS bucket,
COUNT(1) AS total_nums,
AVG(float_val) AS avg_value,
MAX(float_val) AS max_value,
MIN(float_val) AS min_value
FROM sample
GROUP BY channel_id, bucket;
-- --
SELECT add_continuous_aggregate_policy('sample_summary_hour',
start_offset => INTERVAL '24 hours',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 hour');
-- --
ALTER MATERIALIZED VIEW sample_summary_hour SET (timescaledb.compress = true);
-- --
-- --
CREATE MATERIALIZED VIEW sample_summary_day
WITH (timescaledb.continuous, timescaledb.create_group_indexes=true) AS
SELECT channel_id,
time_bucket(INTERVAL '1 day', smpl_time) AS bucket,
COUNT(1) AS total_nums,
AVG(float_val) AS avg_value,
MAX(float_val) AS max_value,
MIN(float_val) AS min_value
FROM sample
GROUP BY channel_id, bucket;
-- --
SELECT add_continuous_aggregate_policy('sample_summary_day',
start_offset => INTERVAL '3 days',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 day');
-- --
ALTER MATERIALIZED VIEW sample_summary_day SET (timescaledb.compress = true);
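If it helps with diagnosis, my understanding is that the per-view settings (including whether real-time aggregation is enabled) can be listed like this, assuming the standard information view:

-- show each continuous aggregate and whether it serves materialized data only
SELECT view_name, materialized_only, compression_enabled
FROM timescaledb_information.continuous_aggregates;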
My data volume is about 18,000 rows per second, written in batches every 30 seconds. However, querying any of the three continuous aggregates is very slow, even when the result is only a few rows.
Here’s the query plan for one of the queries:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM sample_summary_day
WHERE channel_id = 110
  AND bucket >= '2023-07-28 20:00:00.000000+08'
  AND bucket <= '2023-08-01 23:00:00.000000+08';
The execution results are as follows:
GroupAggregate  (cost=79510824.54..79511135.29 rows=1636 width=48) (actual time=1520.780..1966.610 rows=3 loops=1)
  Group Key: sample.channel_id, (time_bucket('1 day'::interval, sample.smpl_time))
  Buffers: temp read=8336 written=8353
  ->  Sort  (cost=79510824.54..79510865.43 rows=16355 width=24) (actual time=1407.248..1638.489 rows=2000644 loops=1)
        Sort Key: (time_bucket('1 day'::interval, sample.smpl_time))
        Sort Method: external merge  Disk: 66688kB
        Buffers: temp read=8336 written=8353
        ->  Result  (cost=10000.00..79509679.90 rows=16355 width=24) (actual time=52.508..1004.774 rows=2000644 loops=1)
              ->  Custom Scan (DataNodeScan) on sample  (cost=10000.00..79509475.46 rows=16355 width=24) (actual time=52.505..740.500 rows=2000644 loops=1)
                    Filter: (smpl_time >= COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(123)), '-infinity'::timestamp with time zone))
Planning:
  Buffers: shared hit=1678
Planning Time: 15.020 ms
Execution Time: 1981.185 ms
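From the Filter line above, it looks to me like the query is not just reading materialized buckets but also re-aggregating about 2 million raw rows past the cagg_watermark on the fly, i.e. real-time aggregation. My untested guess is that disabling real-time aggregation on the view would avoid that raw-data scan:

-- my guess at a fix, not yet tried: serve only already-materialized buckets
ALTER MATERIALIZED VIEW sample_summary_day SET (timescaledb.materialized_only = true);

Would that be the right change, or would it just hide recent data that the refresh policy has not materialized yet?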
I would like to know where I am going wrong and how I can change the setup to improve performance.