Query time increases significantly with just a few values added to WHERE IN clause

I was running a simple SELECT query

SELECT
  time_bucket('5m', time) AS bucket,
  AVG(value)
FROM
  data_points
WHERE
  run_id IN (1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892)
  AND time >= NOW() - INTERVAL '60m'
  AND time <= NOW() - INTERVAL '30m'
GROUP BY
  bucket

which took over three minutes to return just a few rows. But when I reduce the number of values in the run_id IN list to fewer than 8, the query completes in a split second. Here are the EXPLAIN ANALYZE results:
Fast:

HashAggregate  (cost=354234.98..363680.07 rows=381379 width=44) (actual time=1.249..1.251 rows=0 loops=1)
  Group Key: time_bucket('00:05:00'::interval, d."time"), d.run_id
  Planned Partitions: 4  Batches: 1  Memory Usage: 3097kB
  ->  Result  (cost=0.58..339650.21 rows=381379 width=16) (actual time=0.323..0.325 rows=0 loops=1)
        ->  Custom Scan (ChunkAppend) on data_points d  (cost=0.58..334882.97 rows=381379 width=16) (actual time=0.322..0.323 rows=0 loops=1)
              Chunks excluded during startup: 0
              ->  Index Scan using _hyper_1_143_chunk_ix_device_time on _hyper_1_143_chunk d_1  (cost=0.58..334882.97 rows=381379 width=16) (actual time=0.317..0.317 rows=0 loops=1)
                    Index Cond: ((run_id = ANY ('{1884,1885,1886,1887,1888,1889,1890}'::integer[])) AND ("time" >= (now() - '01:00:00'::interval)) AND ("time" <= (now() - '00:30:00'::interval)))
Planning Time: 1.811 ms
Execution Time: 4.623 ms

Slow:

Finalize GroupAggregate  (cost=345807.08..387824.05 rows=425172 width=44) (actual time=208425.622..208532.798 rows=0 loops=1)
  Group Key: (time_bucket('00:05:00'::interval, d."time")), d.run_id
  ->  Gather Merge  (cost=345807.08..379570.72 rows=250101 width=44) (actual time=208425.605..208532.779 rows=0 loops=1)
        Workers Planned: 1
        Workers Launched: 1
        ->  Partial GroupAggregate  (cost=344807.07..350434.34 rows=250101 width=44) (actual time=208201.992..208201.997 rows=0 loops=2)
              Group Key: (time_bucket('00:05:00'::interval, d."time")), d.run_id
              ->  Sort  (cost=344807.07..345432.32 rows=250101 width=16) (actual time=208201.987..208201.991 rows=0 loops=2)
                    Sort Key: (time_bucket('00:05:00'::interval, d."time")), d.run_id
                    Sort Method: quicksort  Memory: 25kB
                    Worker 0:  Sort Method: quicksort  Memory: 25kB
                    ->  Result  (cost=0.58..322382.83 rows=250101 width=16) (actual time=208201.862..208201.865 rows=0 loops=2)
                          ->  Parallel Custom Scan (ChunkAppend) on data_points d  (cost=0.58..319256.56 rows=250101 width=16) (actual time=208201.859..208201.862 rows=0 loops=2)
                                Chunks excluded during startup: 0
                                ->  Parallel Index Scan using _hyper_1_143_chunk_data_points_time_idx on _hyper_1_143_chunk d_1  (cost=0.58..319256.56 rows=250101 width=16) (actual time=208201.823..208201.824 rows=0 loops=2)
                                      Index Cond: (("time" >= (now() - '01:00:00'::interval)) AND ("time" >= '2025-01-03 11:51:27.887799+07'::timestamp with time zone) AND ("time" <= (now() - '00:30:00'::interval)))
                                      Filter: (run_id = ANY ('{1884,1885,1886,1887,1888,1889,1890,1891}'::integer[]))
                                      Rows Removed by Filter: 4907252
Planning Time: 1.254 ms
Execution Time: 208533.089 ms

Why does adding a few more values make the query engine do so much more work? Any help is appreciated.


Hi @calmChowder,

Could you please share the CREATE script for the table data_points?

Hi @Ventsislav, thanks for replying. Here is the script:

CREATE TABLE
  data_points (
    "time" TIMESTAMPTZ NOT NULL,
    "run_id" INTEGER NOT NULL REFERENCES another_table (run_id),
    "value" INTEGER NOT NULL
);

SELECT
  create_hypertable(
    'data_points',
    by_range ('time', INTERVAL '12 hours')
);

CREATE INDEX ix_run_id_time ON data_points(run_id, time DESC);

where another_table is not a hypertable.
Is the problem related to partitioning? I would love to understand more about the inner workings of TimescaleDB.

Hi,
Could you please try recreating the index like this: CREATE INDEX ix_run_id_time_asc ON data_points(run_id, time);? Using time DESC might force PostgreSQL to scan all the chunks on the time index and then apply the run_id filter, which could be less efficient.
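For reference, a minimal sketch of that change (index and table names taken from this thread; whether the old index can simply be dropped depends on your setup), followed by re-running the original query under EXPLAIN ANALYZE to confirm which index the planner picks:

```sql
-- Replace the descending index with an ascending (run_id, time) index
DROP INDEX IF EXISTS ix_run_id_time;
CREATE INDEX ix_run_id_time_asc ON data_points (run_id, time);

-- Re-check the plan: in the fast case above, the Index Cond on the chunk
-- included run_id = ANY(...); in the slow case run_id was only a Filter.
EXPLAIN ANALYZE
SELECT
  time_bucket('5m', time) AS bucket,
  AVG(value)
FROM
  data_points
WHERE
  run_id IN (1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892)
  AND time >= NOW() - INTERVAL '60m'
  AND time <= NOW() - INTERVAL '30m'
GROUP BY
  bucket;
```

In the new plan, look at whether run_id appears in the Index Cond (index is driving the lookup) or only in a Filter with a large "Rows Removed by Filter" count (index is not being used for run_id).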