Hi all!
I tested the chunk skipping indexes feature on a test database and everything worked: data for multiple id values is selected very quickly.
On the production database, data is not selected nearly as quickly, even with the option enabled.
I noticed a difference in the _timescaledb_catalog.chunk_column_stats table: on the test database the valid field contains false for the chunks, while on the production database it contains true.
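A query along these lines shows the stats per chunk (the join against _timescaledb_catalog.chunk is my assumption about the catalog layout; adjust it if your version differs):

SELECT c.table_name, s.column_name, s.range_start, s.range_end, s.valid
FROM _timescaledb_catalog.chunk_column_stats AS s
JOIN _timescaledb_catalog.chunk AS c ON c.id = s.chunk_id
WHERE s.column_name = 'id'
ORDER BY c.table_name;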
Implementation steps (a consolidated sketch follows the query below):
The timescaledb.enable_chunk_skipping = on setting was enabled.
A new hypertable was created and chunk skipping was enabled with SELECT enable_chunk_skipping('hypertable', 'id');
The table was populated and its chunks were compressed.
A query was executed specifying id values from different chunks:
SELECT * FROM hypertable WHERE id IN (1, 22222, 333333333);
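For reference, a minimal sketch of the whole setup (the time column name, the column types, and the compression settings here are assumptions, not the real schema):

SET timescaledb.enable_chunk_skipping = on;

CREATE TABLE hypertable (
    ts timestamptz NOT NULL,
    id bigint NOT NULL
    -- other columns omitted
);
SELECT create_hypertable('hypertable', 'ts');
SELECT enable_chunk_skipping('hypertable', 'id');

ALTER TABLE hypertable SET (timescaledb.compress);
-- populate the table, then compress its chunks
SELECT compress_chunk(c) FROM show_chunks('hypertable') AS c;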
Everything is fine on the test database, but not on prod.
@jonatasdp , yes, this is a sequential ID that is incremental over time.
On the test db, EXPLAIN ANALYZE shows the following plan:
Append (cost=0.04..15333.86 rows=28215056 width=496)
-> Custom Scan (DecompressChunk) on _hyper_564_43152_chunk (cost=0.04..51.44 rows=1439000 width=510)
Vectorized Filter: (id = ANY ('{1, 22222, 333333333}'::bigint[]))
Rows Removed by Filter: 1432686
-> Seq Scan on compress_hyper_565_43167_chunk (cost=0.00..51.44 rows=1439 width=616)
-> Index Scan using _hyper_564_43152_chunk_hypertable_uidx on _hyper_564_43152_chunk (cost=0.06..8.97 rows=4 width=510)
Index Cond: (id = ANY ('{1, 22222, 333333333}'::bigint[]))
On the prod db, EXPLAIN shows the following (without ANALYZE because it takes too long to run):
Append (cost=0.04..15252.80 rows=28295000 width=508)
-> Custom Scan (DecompressChunk) on _hyper_566_43337_chunk (cost=0.04..51.44 rows=1441000 width=508)
Vectorized Filter: (id = ANY ('{1, 22222, 333333333}'::bigint[]))
-> Seq Scan on compress_hyper_567_43352_chunk (cost=0.00..51.44 rows=1441 width=616)
Upd.
The feature is working on prod. If I run a query like SELECT * FROM hypertable WHERE id = 123456789, the EXPLAIN plan shows only the needed chunk.
Maybe I was wrong to think that chunk skipping helps when querying by more than one value in the column?
Yes, all queried ids are from compressed chunks.
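For completeness, the compression status of the chunks can be confirmed from the standard view:

SELECT chunk_schema, chunk_name, is_compressed
FROM timescaledb_information.chunks
WHERE hypertable_name = 'hypertable';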
The query
EXPLAIN ANALYZE
SELECT * FROM hypertable WHERE id IN (5, 40);
shows a Seq Scan over all the compressed chunks:
Append (cost=0.04..15252.80 rows=28295000 width=505) (actual time=26.816..37794.588 rows=2 loops=1)
-> Custom Scan (DecompressChunk) on _hyper_566_43337_chunk t_1 (cost=0.04..51.44 rows=1441000 width=508) (actual time=26.815..1062.049 rows=2 loops=1)
Vectorized Filter: (id = ANY ('{5,40}'::bigint[]))
Rows Removed by Filter: 1435851
-> Seq Scan on compress_hyper_567_43352_chunk (cost=0.00..51.44 rows=1441 width=616) (actual time=4.273..4.888 rows=1441 loops=1)
-> Custom Scan (DecompressChunk) on _hyper_566_43338_chunk t_2 (cost=0.04..61.66 rows=1655000 width=508) (actual time=1072.343..1072.344 rows=0 loops=1)
Vectorized Filter: (id = ANY ('{5,40}'::bigint[]))
Rows Removed by Filter: 1651636
-> Seq Scan on compress_hyper_567_43353_chunk (cost=0.00..61.66 rows=1655 width=616) (actual time=0.551..1.321 rows=1655 loops=1)
-> Custom Scan (DecompressChunk) on _hyper_566_43339_chunk t_3 (cost=0.04..65.67 rows=1673000 width=508) (actual time=1234.441..1234.442 rows=0 loops=1)
and so on...
The only option I have found to speed up the query is to rewrite it as:
SELECT *
FROM hypertable
WHERE id = 5
UNION ALL
SELECT *
FROM hypertable
WHERE id = 40
This is not convenient when there are a lot of IDs.
Following this idea, I created a test function:
CREATE OR REPLACE FUNCTION hypertable_by_ids(pi_ids bigint[])
RETURNS TABLE(<the_same_structure_as_the_hypertable>)
LANGUAGE plpgsql
AS $$
DECLARE
    x bigint;
BEGIN
    -- One equality lookup per id, so each iteration can use chunk skipping
    FOREACH x IN ARRAY pi_ids
    LOOP
        RETURN QUERY SELECT t.* FROM hypertable AS t WHERE t.id = x;
    END LOOP;
END;
$$;
SELECT * FROM hypertable_by_ids((SELECT array_agg(t.id ORDER BY t.id) FROM other_table as t));
But this approach has a disadvantage: with a lot of IDs, say 500,000 or more, you have to wait quite a while.
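A set-based variant of the same idea may also be worth trying (a sketch: unnest plus LATERAL should produce one equality probe per id, like the loop, but without plpgsql; whether the planner keeps a nested loop at 500,000 ids is not guaranteed):

SELECT t.*
FROM unnest(ARRAY[1, 22222, 333333333]::bigint[]) AS u(id)
CROSS JOIN LATERAL (
    -- one id = ... probe per array element
    SELECT * FROM hypertable AS h WHERE h.id = u.id
) AS t;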
Maybe there are other ways to query a compressed hypertable for many values of a single column?
Thank you! I'm closing this topic because chunk skipping is working.
At the moment there is only one question left: how can data be selected from a compressed hypertable with an IN condition and good performance?