Hi all!
I tested the chunk skipping indexes feature on a test database and everything worked: data for multiple id values is selected very quickly.
On the production database, data is not selected nearly as quickly, even with the option enabled.
I noticed a difference in the _timescaledb_catalog.chunk_column_stats table: on the test database the valid field contains false for the chunks, while on the production database it contains true.
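A query along these lines shows the stats per chunk (the join against _timescaledb_catalog.chunk is my assumption about the catalog layout; adjust it if your version differs):

SELECT c.table_name, s.column_name, s.range_start, s.range_end, s.valid
FROM _timescaledb_catalog.chunk_column_stats AS s
JOIN _timescaledb_catalog.chunk AS c ON c.id = s.chunk_id
WHERE s.column_name = 'id'
ORDER BY c.table_name;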
Implementation steps (a consolidated sketch follows the query below):
The timescaledb.enable_chunk_skipping = on setting was enabled.
A new hypertable was created and chunk skipping was enabled with SELECT enable_chunk_skipping('hypertable', 'id');
The table was populated and its chunks were compressed.
A query was executed specifying id values from different chunks:
SELECT * FROM hypertable WHERE id IN (1, 22222, 333333333);
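For reference, a minimal sketch of the whole setup (the time column name, the column types, and the compression settings here are assumptions, not the real schema):

SET timescaledb.enable_chunk_skipping = on;

CREATE TABLE hypertable (
    ts timestamptz NOT NULL,
    id bigint NOT NULL
    -- other columns omitted
);
SELECT create_hypertable('hypertable', 'ts');
SELECT enable_chunk_skipping('hypertable', 'id');

ALTER TABLE hypertable SET (timescaledb.compress);
-- populate the table, then compress its chunks
SELECT compress_chunk(c) FROM show_chunks('hypertable') AS c;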
Everything is fine on the test database, but not on prod.
@jonatasdp , yes, this is a sequential ID that is incremental over time.
On the test db, EXPLAIN ANALYZE shows the following plan:
Append (cost=0.04..15333.86 rows=28215056 width=496)
-> Custom Scan (DecompressChunk) on _hyper_564_43152_chunk (cost=0.04..51.44 rows=1439000 width=510)
Vectorized Filter: (id = ANY ('{1, 22222, 333333333}'::bigint[]))
Rows Removed by Filter: 1432686
-> Seq Scan on compress_hyper_565_43167_chunk (cost=0.00..51.44 rows=1439 width=616)
-> Index Scan using _hyper_564_43152_chunk_hypertable_uidx on _hyper_564_43152_chunk (cost=0.06..8.97 rows=4 width=510)
Index Cond: (id = ANY ('{1, 22222, 333333333}'::bigint[]))
On the prod db, EXPLAIN shows the following (without ANALYZE because it takes too long to run):
Append (cost=0.04..15252.80 rows=28295000 width=508)
-> Custom Scan (DecompressChunk) on _hyper_566_43337_chunk (cost=0.04..51.44 rows=1441000 width=508)
Vectorized Filter: (id = ANY ('{1, 22222, 333333333}'::bigint[]))
-> Seq Scan on compress_hyper_567_43352_chunk (cost=0.00..51.44 rows=1441 width=616)
Upd.
The feature is working on prod. If I run a query like SELECT * FROM hypertable WHERE id = 123456789, the EXPLAIN plan shows only the needed chunk.
Maybe I was wrong to think that chunk skipping helps when querying by more than one value in the column?
Yes, all queried ids are from compressed chunks.
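For completeness, the compression status of the chunks can be confirmed from the standard view:

SELECT chunk_schema, chunk_name, is_compressed
FROM timescaledb_information.chunks
WHERE hypertable_name = 'hypertable';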
The query
EXPLAIN ANALYZE
SELECT * FROM hypertable WHERE id IN (5, 40);
shows a Seq Scan over all the compressed chunks:
Append (cost=0.04..15252.80 rows=28295000 width=505) (actual time=26.816..37794.588 rows=2 loops=1)
-> Custom Scan (DecompressChunk) on _hyper_566_43337_chunk t_1 (cost=0.04..51.44 rows=1441000 width=508) (actual time=26.815..1062.049 rows=2 loops=1)
Vectorized Filter: (id = ANY ('{5,40}'::bigint[]))
Rows Removed by Filter: 1435851
-> Seq Scan on compress_hyper_567_43352_chunk (cost=0.00..51.44 rows=1441 width=616) (actual time=4.273..4.888 rows=1441 loops=1)
-> Custom Scan (DecompressChunk) on _hyper_566_43338_chunk t_2 (cost=0.04..61.66 rows=1655000 width=508) (actual time=1072.343..1072.344 rows=0 loops=1)
Vectorized Filter: (id = ANY ('{5,40}'::bigint[]))
Rows Removed by Filter: 1651636
-> Seq Scan on compress_hyper_567_43353_chunk (cost=0.00..61.66 rows=1655 width=616) (actual time=0.551..1.321 rows=1655 loops=1)
-> Custom Scan (DecompressChunk) on _hyper_566_43339_chunk t_3 (cost=0.04..65.67 rows=1673000 width=508) (actual time=1234.441..1234.442 rows=0 loops=1)
and so on...
The only option I have found to speed up the query is to rewrite it as:
SELECT *
FROM hypertable
WHERE id = 5
UNION ALL
SELECT *
FROM hypertable
WHERE id = 40
This is not convenient when there are a lot of IDs.
Following this idea, I created a test function:
CREATE OR REPLACE FUNCTION hypertable_by_ids(pi_ids bigint[])
RETURNS TABLE(<the_same_structure_as_the_hypertable>)
LANGUAGE plpgsql
AS $$
DECLARE
    x bigint;
BEGIN
    -- One equality lookup per id, so each iteration can use chunk skipping
    FOREACH x IN ARRAY pi_ids
    LOOP
        RETURN QUERY SELECT t.* FROM hypertable AS t WHERE t.id = x;
    END LOOP;
END;
$$;
SELECT * FROM hypertable_by_ids((SELECT array_agg(t.id ORDER BY t.id) FROM other_table as t));
But this approach has a disadvantage: with a lot of IDs, say 500,000 or more, you have to wait quite a while.
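A set-based variant of the same idea may also be worth trying (a sketch: unnest plus LATERAL should produce one equality probe per id, like the loop, but without plpgsql; whether the planner keeps a nested loop at 500,000 ids is not guaranteed):

SELECT t.*
FROM unnest(ARRAY[1, 22222, 333333333]::bigint[]) AS u(id)
CROSS JOIN LATERAL (
    -- one id = ... probe per array element
    SELECT * FROM hypertable AS h WHERE h.id = u.id
) AS t;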
Maybe there are other ways to query a compressed hypertable for many values of a single column?
Thank you! I'm closing this topic because chunk skipping is working.
At the moment there is only one question left: how can data be selected from a compressed hypertable with an IN condition and good performance?