Improving TimescaleDB Compression Ratios: Comparing Against Brotli-Compressed Parquet

I’ve got about 435 million rows of the form:

time, sensor_id, sensor_value

where time is a unix timestamp rounded to one decimal place, sensor_id is an integer, and sensor_value is a float. The timestamps are one second apart, and for each timestamp there are around 250 unique sensor_ids.

In Python, I found that I got the best compression ratios by splitting the data into multiple parquet files, one per hour, and writing each file like this:

df.sort_values(by=['sensor_id', 'time']).reset_index(drop=True).to_parquet('out.parquet', compression='brotli')

the total file size of all the parquet files combined is 120 MB.
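
For context, the hourly split is roughly the following (a sketch: it assumes the full dataset is already in a DataFrame df with the columns above, and the output file names are just illustrative):

# df has columns: time (unix seconds, float), sensor_id (int), sensor_value (float)
# bucket rows into one-hour groups and write one brotli-compressed parquet file per hour
hour_bucket = (df['time'] // 3600).astype('int64')
for hour, chunk in df.groupby(hour_bucket):
    (chunk.sort_values(by=['sensor_id', 'time'])
          .reset_index(drop=True)
          .to_parquet(f'out_{hour}.parquet', compression='brotli'))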

I didn’t expect compression ratios quite as good from TimescaleDB in PostgreSQL, since it doesn’t use Brotli compression, but I’m still getting compressed sizes that are a fair bit larger than I would like. The best I could get was 964 MB, which is about 8x bigger than the Parquet files.

Here’s how I set up the table:

CREATE TABLE IF NOT EXISTS public.sensor_data
(
    "time" timestamp(1) without time zone NOT NULL,
    sensor_id integer NOT NULL,
    sensor_value NUMERIC(20, 3) NOT NULL
);

SELECT create_hypertable('sensor_data', 'time');

-- (insert all data)

ALTER TABLE sensor_data
SET
(
timescaledb.compress,
timescaledb.compress_segmentby='sensor_id', 
timescaledb.compress_orderby='time'
);

SELECT compress_chunk(c) from show_chunks('sensor_data') c;

I’ve tried the following variations:

  • Using a double precision type instead of a numeric(20, 3) for sensor_value
  • Using chunk sizes of one hour (similar to what I’m doing in Python) rather than the default 7 days; see the sketch after this list
  • Using numeric(10, 3) instead of numeric(20, 3)
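
For reference, the one-hour chunk variation only changes how the hypertable is created; a minimal sketch (sensor_data_1hr_chunk is the table from the results below, and everything else mirrors the setup above):

CREATE TABLE IF NOT EXISTS sensor_data_1hr_chunk
(
    "time" timestamp(1) without time zone NOT NULL,
    sensor_id integer NOT NULL,
    sensor_value NUMERIC(20, 3) NOT NULL
);

-- chunk_time_interval overrides the default 7-day chunk size
SELECT create_hypertable('sensor_data_1hr_chunk', 'time', chunk_time_interval => INTERVAL '1 hour');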

However, these changes either did nothing or made the compression ratios worse:

method                 before_total  after_total  before_table  after_table  before_index  after_index  before_toast  after_toast  total_chunks  number_compressed_chunks
sensor_data            25 GB         964 MB       21 GB         43 MB        3990 MB       13 MB        0 bytes       908 MB       4             4
sensor_data_double     25 GB         1498 MB      21 GB         380 MB       3988 MB       13 MB        0 bytes       1105 MB      4             4
sensor_data_1hr_chunk  25 GB         1033 MB      21 GB         72 MB        4018 MB       25 MB        0 bytes       936 MB       455           455
sensor_data_10pnt      25 GB         964 MB       21 GB         43 MB        3988 MB       13 MB        0 bytes       908 MB       4             4
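
(For reference, per-table figures like these can be pulled with TimescaleDB’s hypertable_compression_stats(); a minimal sketch:)

-- compression statistics for one hypertable; run once per variation
SELECT pg_size_pretty(before_compression_total_bytes) AS before_total,
       pg_size_pretty(after_compression_total_bytes)  AS after_total,
       total_chunks,
       number_compressed_chunks
FROM hypertable_compression_stats('sensor_data');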

Do you have any suggestions for things I can try to improve the compression ratio, or is this as good as it’s going to get?

Hi John, thanks for this analysis!

I don’t think comparing PostgreSQL database storage to Parquet files will ever be a solid apples-to-apples comparison.

It’s worth noting that achieving the same compression ratio as Parquet with Brotli compression might be challenging in a database system designed for both storage and quick data access. The 8x difference you’re seeing isn’t unusual when comparing specialized file formats to database storage.

You can compare against the original size in the PostgreSQL database as a reference point, but it’s still hard to set up a totally isolated schema that makes for a like-for-like comparison.

OK. I was hoping I had something configured wrong or there was something else I could try. Guess not.

I’m not entirely sure what you mean by your last sentence. The hypertable was 25 GB before compressing, as shown above. I also inserted the data into another regular table (not a hypertable) and it was also reported as 25 GB.

If instead you were implying I should compare compression ratios, this is what I found:
TimescaleDB in Postgres:
25 GB / 964 MB = 26.6x ratio

Original data in a csv format versus parquet with brotli:
10.6 GB / 120 MB = 90.5x ratio
(In the CSV file, the time is stored as a unix timestamp with one decimal place.)

(Example rows in the csv)

unix_timestamp,sensor_id,sensor_value
1725494400.9,0,600.000
1725494401.9,0,600.000
1725494402.9,0,600.000

So I guess this is multifaceted; the table’s uncompressed size is almost 2.5x larger (25 versus 10.6 GB), but regardless, the compression ratio is still significantly lower (26.6x versus 90.5x).


That’s true, John!

That’s great, because it means we have a lot of room for improvement :smiley:

I hope the algorithms keep getting better at compression. I remember a few months ago a friend generated a sine wave in the database and compressed it; the compression ratio was very high because the data was a clean, repeating pattern, but it still didn’t reach anything like 90x.

This looks really great, and we’re continuously looking for improvements and even exploring new algorithms. We store parquet files in our tiered storage system; I’m not sure whether we use Brotli for compression there, but that would be a good case to look at.

@sven is always improving our compression, so he may want to get involved in this conversation :nerd_face: