We are assessing a migration from InfluxDB to TimescaleDB and are testing different aspects of it. This question is about data size on disk.
So. Our data has ~1000 different “columns” or “measurements”, depending on how you would design the tables. We currently have 9 months of 1-second precision data and 5 years of 1-minute precision data (avg+min+max).
This amount of data usually takes <10 GB in InfluxDB (probably because there are a lot of similar values).
I am trying to measure the size with TimescaleDB. I am using timescale/timescaledb:2.11.2-pg15
as the Docker image.
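For reference, the ts schema used below has to exist before the table can be created (the Docker image already ships with the extension, but enabling it explicitly does no harm). A minimal one-time setup, assuming the default database:

-- one-time setup: enable the extension and create the schema
CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE SCHEMA IF NOT EXISTS ts;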
I used a single table for all measurements:

CREATE TABLE ts.measurements (
    time TIMESTAMPTZ NOT NULL,
    measurement varchar,
    value double precision
);

and created a hypertable for it:

SELECT create_hypertable('ts.measurements', 'time', chunk_time_interval => INTERVAL '1 day');
Then I tested by inserting 12 months of values:
INSERT INTO ts.measurements SELECT time, 'SYMBOL', (random()*30)::int FROM generate_series(TIMESTAMP '2023-01-01 00:00:00', TIMESTAMP '2023-02-01 00:00:00' + INTERVAL '-1 second', INTERVAL '1 second') AS time;
INSERT INTO ts.measurements SELECT time, 'SYMBOL', (random()*30)::int FROM generate_series(TIMESTAMP '2023-02-01 00:00:00', TIMESTAMP '2023-04-01 00:00:00' + INTERVAL '-1 second', INTERVAL '1 second') AS time;
INSERT INTO ts.measurements SELECT time, 'SYMBOL', (random()*30)::int FROM generate_series(TIMESTAMP '2023-04-01 00:00:00', TIMESTAMP '2023-06-01 00:00:00' + INTERVAL '-1 second', INTERVAL '1 second') AS time;
INSERT INTO ts.measurements SELECT time, 'SYMBOL', (random()*30)::int FROM generate_series(TIMESTAMP '2023-06-01 00:00:00', TIMESTAMP '2023-08-01 00:00:00' + INTERVAL '-1 second', INTERVAL '1 second') AS time;
INSERT INTO ts.measurements SELECT time, 'SYMBOL', (random()*30)::int FROM generate_series(TIMESTAMP '2023-08-01 00:00:00', TIMESTAMP '2023-10-01 00:00:00' + INTERVAL '-1 second', INTERVAL '1 second') AS time;
INSERT INTO ts.measurements SELECT time, 'SYMBOL', (random()*30)::int FROM generate_series(TIMESTAMP '2023-10-01 00:00:00', TIMESTAMP '2023-12-01 00:00:00' + INTERVAL '-1 second', INTERVAL '1 second') AS time;
INSERT INTO ts.measurements SELECT time, 'SYMBOL', (random()*30)::int FROM generate_series(TIMESTAMP '2023-12-01 00:00:00', TIMESTAMP '2024-01-01 00:00:00' + INTERVAL '-1 second', INTERVAL '1 second') AS time;
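For reference, the batches above are equivalent to a single statement over the whole year (same pattern as above, nothing new assumed):

-- one-second values for a single measurement across all of 2023
INSERT INTO ts.measurements
SELECT time, 'SYMBOL', (random()*30)::int
FROM generate_series(TIMESTAMP '2023-01-01 00:00:00',
                     TIMESTAMP '2024-01-01 00:00:00' - INTERVAL '1 second',
                     INTERVAL '1 second') AS time;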
That is ~31.5M rows (365 days × 86,400 rows per day).
Next I added compression:
ALTER TABLE ts.measurements SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'measurement'
);
SELECT add_compression_policy('ts.measurements', INTERVAL '7 days');
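Note that the policy compresses chunks in a background job, so some chunks may still be uncompressed at the moment the size is measured. To force compression of everything right away, something like this should work (standard 2.x API):

-- compress any chunk that the background job has not handled yet
SELECT compress_chunk(c, if_not_compressed => true)
FROM show_chunks('ts.measurements') AS c;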
Now the Postgres pgdata directory is 3.1 GB. After VACUUM FULL it is 2.1 GB.
SELECT hypertable_name, hypertable_size(format('%I.%I', hypertable_schema, hypertable_name)::regclass)
FROM timescaledb_information.hypertables;
hypertable_name | hypertable_size
-----------------+-----------------
measurements | 2170134528
(1 row)
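To confirm how much of that is actually compressed data, the per-hypertable compression stats can be checked as well:

-- before/after sizes as reported by TimescaleDB itself
SELECT pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes) AS after
FROM hypertable_compression_stats('ts.measurements');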
If this size scales linearly, we would need ~2,100 GB for just 1 year of data (2.1 GB × 1000 measurements). That would be orders of magnitude bigger than what InfluxDB needed.
Am I doing something wrong?