Partitioning questions

I have a couple of tables that are pretty big, 2+ billion rows. Now, querying by time is fast and there is no problem. But sometimes I want to use these tables for queries where I need to, for example, select all the rows that match an ID, regardless of time. Creating indexes on these columns doesn’t make it much faster. Is this where partitioning comes in? Since I can’t find much documentation on partitioning:

  • Will partitioning the table by the default dimension (time) and also by the said ID work better?
  • When you partition a table by more columns, what are the space consumption implications?
  • Can you further partition a table once it’s been populated?

Thanks!

Hey @MasterKraft, very good question!

  1. Yes, you can add an extra dimension to partition your data, and it makes sense if the column has high cardinality and you need to group the data more tightly.
  2. It will generate more chunks, which in turn generate more metadata. From previous interactions I have seen folks report that staying under 10k chunks generally does not hurt performance much, but I have even run a few tests with 5M chunks. So it depends on your business; it is just another fold of your data. You can check your current chunk count as shown below.
  3. I think so, but only new data will start using the new partitioning dimension. To be honest I have never tried it; it may not be allowed, for consistency reasons.
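
A quick way to keep an eye on the chunk count (a minimal sketch; tab1 is an illustrative hypertable name):

-- Count the chunks of a hypertable via the information view
SELECT count(*)
FROM timescaledb_information.chunks
WHERE hypertable_name = 'tab1';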
  • Creating an index on the ID column and selecting only by the ID should be much faster than without an index. Be careful with such queries, though: they must touch every chunk, which can take some time (see the sketch after this list).
  • Did you run the PostgreSQL ANALYZE command on the table after creating the index?
  • What does EXPLAIN show? Is the query using the index or not?
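
A minimal sequence covering all three points (table and column names are illustrative, borrowed from later in this thread):

-- Create the index, refresh planner statistics, then inspect the plan
CREATE INDEX IF NOT EXISTS tab1_company_id_idx ON tab1 (company_id);
ANALYZE tab1;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tab1 WHERE company_id = 42;

If the plan shows sequential scans on every chunk instead of index scans, the index is not being used.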

and

It depends on the partitioning type. For hash space partitioning, a new space dimension can be added and new chunks will be created with the additional partitioning; old data will stay untouched. But… I don’t really see much point in using “hash” partitioning: it was developed for multi-node partitioning (TimescaleDB installed on several computers, like a cluster), which is a discontinued feature as of TimescaleDB v2.14.0. Maybe hash partitioning is useful if you can make sure every chunk hits its own physical disk with tablespace management. I haven’t tested this.
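
If you want to experiment with the tablespace idea, something like this should work (a sketch only; it assumes tablespaces disk1 and disk2 already exist on separate physical disks, and I haven’t verified the actual disk placement):

-- Attach tablespaces to the hypertable; TimescaleDB spreads new chunks
-- across attached tablespaces (using the space dimension to pick one
-- when such a dimension exists)
SELECT attach_tablespace('disk1', 'tab1');
SELECT attach_tablespace('disk2', 'tab1');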

Adding “range” space partitioning (new in TimescaleDB v2.13.0) can’t be performed after the hypertable is filled with data. I tested this and it does not work. A new “range” space partition must be added after creating the hypertable and before inserting any data into it. This is actually also explained in the documentation on adding a new space dimension; see the red warning at the top of the page.
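
For illustration, the required order of operations would be something like this (a minimal sketch; tab2 and the slice width of 10 are made up for the example):

-- The range space dimension must exist before any data is inserted
CREATE TABLE tab2 (
  time         timestamptz   NOT NULL,
  company_id   smallint      NOT NULL,
  measurement  numeric(15,4)
);
SELECT create_hypertable('tab2', by_range('time'));
SELECT add_dimension('tab2', by_range('company_id', 10::smallint));  -- each slice covers 10 ids
-- Only now is it safe to start inserting data
INSERT INTO tab2 VALUES (now(), 1, 100);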

But… I don’t really see much point in using “hash” partitioning: it was developed for multi-node partitioning (TimescaleDB installed on several computers, like a cluster), which is a discontinued feature as of TimescaleDB v2.14.0

One idea where hash dimensions could be useful: if you have something like a company_id that segments your business queries most of the time, you can have dedicated chunks for it but still keep all the data in the same table. So, instead of using different schemas or databases, you use the extra dimension. It would allow deleting all data from a company by touching only “their chunks”, as sketched below.
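
For example (illustrative names; I would verify the plan before relying on this):

-- With a by_hash('company_id') dimension, chunk exclusion should keep a
-- delete for one company away from the other companies' chunks
EXPLAIN DELETE FROM tab1 WHERE company_id = 42;
-- Only chunks in company 42's hash slice should appear in the plan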

I did some tests. I created a hypertable and added a dimension partitioned with by_hash:

Commands to try out a sample.
DROP TABLE IF EXISTS tab1;

CREATE TABLE tab1 (
time timestamp with time zone not null,
company_id   smallint         not null,
measurement  numeric(15,4)
);

SELECT create_hypertable('tab1', by_range('time'));
SELECT add_dimension('tab1', by_hash('company_id', 7));
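-- Note: by_hash('company_id', 7) creates up to 7 hash slices, but actual
-- chunks are only materialized when a (time slice, hash slice) combination
-- receives data, so fewer than 7 chunks can appear.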

INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP,                     1, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '1 HOUR', 1, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '2 HOUR', 1, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '3 HOUR', 1, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '4 HOUR', 1, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP,                     2, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '1 HOUR', 2, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '2 HOUR', 2, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '3 HOUR', 2, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '4 HOUR', 2, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP,                     3, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '1 HOUR', 3, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '2 HOUR', 3, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '3 HOUR', 3, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '4 HOUR', 3, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP,                     4, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '1 HOUR', 4, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '2 HOUR', 4, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '3 HOUR', 4, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '4 HOUR', 4, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP,                     5, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '1 HOUR', 5, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '2 HOUR', 5, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '3 HOUR', 5, 100);
INSERT INTO tab1 VALUES (CURRENT_TIMESTAMP - INTERVAL '4 HOUR', 5, 100);

Then I checked the chunks that were created:
\d+ tab1
and to my surprise only 4 chunks were created:

List of chunks
                                                     Table "public.tab1"
   Column    |           Type           | Collation | Nullable | Default | Storage | Compression | Stats target | Description
-------------+--------------------------+-----------+----------+---------+---------+-------------+--------------+-------------
 time        | timestamp with time zone |           | not null |         | plain   |             |              |
 company_id  | smallint                 |           | not null |         | plain   |             |              |
 measurement | numeric(15,4)            |           |          |         | main    |             |              |
Indexes:
    "tab1_time_idx" btree ("time" DESC)
Triggers:
    ts_insert_blocker BEFORE INSERT ON tab1 FOR EACH ROW EXECUTE FUNCTION _timescaledb_functions.insert_blocker()                                                                       
Child tables: _timescaledb_internal._hyper_8_15674_chunk,
              _timescaledb_internal._hyper_8_15675_chunk,
              _timescaledb_internal._hyper_8_15676_chunk,
              _timescaledb_internal._hyper_8_15677_chunk
Access method: heap

Then I selected the data from each chunk:

Selects on each chunk

select * from _timescaledb_internal._hyper_8_15674_chunk;
select * from _timescaledb_internal._hyper_8_15675_chunk;
select * from _timescaledb_internal._hyper_8_15676_chunk;
select * from _timescaledb_internal._hyper_8_15677_chunk;

and the result is:

Result of the SELECT statements from each chunk

             time              | company_id | measurement
-------------------------------+------------+-------------
 2025-03-31 13:38:29.371534+02 |          1 |    100.0000
 2025-03-31 12:38:29.384496+02 |          1 |    100.0000
 2025-03-31 11:38:29.388986+02 |          1 |    100.0000
 2025-03-31 10:38:29.392051+02 |          1 |    100.0000   
 2025-03-31 09:38:29.394134+02 |          1 |    100.0000
(5 rows)


             time              | company_id | measurement
-------------------------------+------------+-------------
 2025-03-31 13:38:29.396605+02 |          2 |    100.0000
 2025-03-31 12:38:29.405172+02 |          2 |    100.0000
 2025-03-31 11:38:29.409638+02 |          2 |    100.0000
 2025-03-31 10:38:29.41243+02  |          2 |    100.0000   
 2025-03-31 09:38:29.41703+02  |          2 |    100.0000
 2025-03-31 13:38:29.446693+02 |          4 |    100.0000
 2025-03-31 12:38:29.44905+02  |          4 |    100.0000
 2025-03-31 11:38:29.451623+02 |          4 |    100.0000
 2025-03-31 10:38:29.454177+02 |          4 |    100.0000
 2025-03-31 09:38:29.45675+02  |          4 |    100.0000
(10 rows)

             time              | company_id | measurement
-------------------------------+------------+-------------
 2025-03-31 13:38:29.421306+02 |          3 |    100.0000
 2025-03-31 12:38:29.429895+02 |          3 |    100.0000
 2025-03-31 11:38:29.436674+02 |          3 |    100.0000
 2025-03-31 10:38:29.441035+02 |          3 |    100.0000   
 2025-03-31 09:38:29.44416+02  |          3 |    100.0000
(5 rows)

             time              | company_id | measurement
-------------------------------+------------+-------------
 2025-03-31 13:38:29.459023+02 |          5 |    100.0000
 2025-03-31 12:38:29.468107+02 |          5 |    100.0000
 2025-03-31 11:38:29.470963+02 |          5 |    100.0000
 2025-03-31 10:38:29.475036+02 |          5 |    100.0000   
 2025-03-31 09:38:29.477599+02 |          5 |    100.0000   
(5 rows)

In the second chunk table, company_id 2 and company_id 4 are saved together. @jonatasdp, it is not really true that each company_id is going to be stored in a separate table.

How does the by_hash function really work? It looks to me like some integer hash value is generated and then a modulus (or similar bucketing) is applied, and in this case both company 2 and company 4 produce the same bucket, so they are saved into the same partition. The hash values can be inspected directly, as sketched below.
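
For example (a sketch; get_partition_hash lives in the _timescaledb_functions schema in recent versions and in _timescaledb_internal in older ones, and as far as I can tell TimescaleDB splits the hash output space into equal ranges rather than taking a plain modulus):

-- Inspect the hash value each company_id produces
SELECT company_id,
       _timescaledb_functions.get_partition_hash(company_id) AS hash_value
FROM (VALUES (1::smallint), (2::smallint), (3::smallint),
             (4::smallint), (5::smallint)) AS t(company_id);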

QUESTION: We have 6 companies. Is there a way, using additional dimensions, to force each individual chunk to store data from a single company only?

Hi @ts101, my misinterpretation too. I always thought all these optimizations and details already brought the data partitioned by this scheme. It doesn’t seem to be the case.

I tested it here too, and the partitions do bring back mixed data.

I also tried to add a partitioning function:

-- Create hash function for company_id partitioning that maps to 0-4 range
CREATE OR REPLACE FUNCTION hash_company_id(value smallint)
  RETURNS integer
  LANGUAGE SQL
  IMMUTABLE
  PARALLEL SAFE AS
$func$
  SELECT value % 5;  -- This will map any company_id to 0-4 range
$func$;

-- Create hypertable with time as primary dimension
SELECT create_hypertable('tab1', 'time', chunk_time_interval => INTERVAL '1 day');

-- Add company_id as secondary dimension with hash partitioning
SELECT add_dimension('tab1', 
    by_hash('company_id', 5, 'hash_company_id'));

But it also distributes the data over the days.
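
One thing worth checking is how the function's 0-4 output is mapped onto the dimension slices; if TimescaleDB divides the full 32-bit hash space into 5 equal ranges, the values 0-4 would all land in the lowest slice. A sketch to inspect the slices (this pokes at the internal catalog, whose layout may vary across versions):

-- List the slices created for the company_id dimension
SELECT ds.id, ds.range_start, ds.range_end
FROM _timescaledb_catalog.dimension_slice ds
JOIN _timescaledb_catalog.dimension d ON d.id = ds.dimension_id
WHERE d.column_name = 'company_id';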

Let’s see if @matski can help us with other ideas here.