I’ve got about 435 million rows of the form `time, sensor_id, sensor_value`, where `time` is a unix timestamp rounded to one decimal place, `sensor_id` is an integer, and `sensor_value` is a float. The timestamps are one second apart, and for each timestamp there are around 250 unique sensor_ids.
In Python, I found that I got the best compression ratios by splitting the data into one parquet file per hour rather than a single large file. Using this for each hourly file:
df.sort_values(by=['sensor_id', 'time']).reset_index(drop=True).to_parquet('out.parquet', compression='brotli')
the total file size of all the parquet files combined is 120 MB.
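The hourly split itself is simple; here is a minimal sketch of what I mean, assuming everything starts in one DataFrame (the source filename is hypothetical, and the hour bucket is just the float unix timestamp integer-divided by 3600):

import pandas as pd

df = pd.read_parquet('sensor_data_raw.parquet')  # hypothetical source file

# Write one brotli-compressed parquet file per hour of data.
for hour, hour_df in df.groupby(df['time'] // 3600):
    out = hour_df.sort_values(by=['sensor_id', 'time']).reset_index(drop=True)
    out.to_parquet(f'out_{int(hour)}.parquet', compression='brotli')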
I didn’t expect compression ratios quite as good from TimescaleDB in PostgreSQL, since it doesn’t use brotli compression, but I’m still getting compressed sizes a fair bit larger than I’d like. The best I could get was 964 MB, which is about 8x bigger than the parquet files.
Here’s how I set up the table:
CREATE TABLE IF NOT EXISTS public.sensor_data
(
    "time" timestamp(1) without time zone NOT NULL, -- 0.1 s resolution
    sensor_id integer NOT NULL,
    sensor_value NUMERIC(20, 3) NOT NULL
);

SELECT create_hypertable('sensor_data', 'time');

-- (insert all data)

ALTER TABLE sensor_data
SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'sensor_id', -- group compressed batches by sensor
    timescaledb.compress_orderby = 'time'         -- order rows by time within each batch
);

SELECT compress_chunk(c) FROM show_chunks('sensor_data') c;
I’ve tried the following variations:
- Using a `double precision` type instead of `numeric(20, 3)` for `sensor_value`
- Using chunk sizes of one hour (similar to what I’m doing in Python) rather than the default 7 days (see the sketch just after this list)
- Using `numeric(10, 3)` instead of `numeric(20, 3)`
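For reference, the one-hour-chunk variant was set up roughly like this (a minimal sketch of one way to do it; `chunk_time_interval` can also be changed on an existing hypertable with `set_chunk_time_interval()`):

SELECT create_hypertable(
    'sensor_data_1hr_chunk', -- same columns as sensor_data above
    'time',
    chunk_time_interval => INTERVAL '1 hour'
);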
However, these changes either did nothing or made the compression ratios worse:
| method | before_total | after_total | before_table | after_table | before_index | after_index | before_toast | after_toast | total_chunks | number_compressed_chunks |
|---|---|---|---|---|---|---|---|---|---|---|
| sensor_data | 25 GB | 964 MB | 21 GB | 43 MB | 3990 MB | 13 MB | 0 bytes | 908 MB | 4 | 4 |
| sensor_data_double | 25 GB | 1498 MB | 21 GB | 380 MB | 3988 MB | 13 MB | 0 bytes | 1105 MB | 4 | 4 |
| sensor_data_1hr_chunk | 25 GB | 1033 MB | 21 GB | 72 MB | 4018 MB | 25 MB | 0 bytes | 936 MB | 455 | 455 |
| sensor_data_10pnt | 25 GB | 964 MB | 21 GB | 43 MB | 3988 MB | 13 MB | 0 bytes | 908 MB | 4 | 4 |
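The columns above are the fields returned by TimescaleDB’s `hypertable_compression_stats()`; a query along these lines (with `pg_size_pretty()` added for readability) produces them:

SELECT total_chunks,
       number_compressed_chunks,
       pg_size_pretty(before_compression_total_bytes) AS before_total,
       pg_size_pretty(after_compression_total_bytes) AS after_total,
       pg_size_pretty(before_compression_table_bytes) AS before_table,
       pg_size_pretty(after_compression_table_bytes) AS after_table,
       pg_size_pretty(before_compression_index_bytes) AS before_index,
       pg_size_pretty(after_compression_index_bytes) AS after_index,
       pg_size_pretty(before_compression_toast_bytes) AS before_toast,
       pg_size_pretty(after_compression_toast_bytes) AS after_toast
FROM hypertable_compression_stats('sensor_data');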
Do you have any suggestions for things I can try to improve the compression ratio, or is this as good as it’s going to get?