Improving TimescaleDB Compression Ratios: Comparing Against Brotli-Compressed Parquet

I’ve got about 435 million rows of the form:

time, sensor_id, sensor_value

where time is a unix timestamp rounded to one decimal place, sensor_id is an integer, and sensor_value is a float. The timestamps are one second apart, and for each timestamp there are around 250 unique sensor_ids.

In Python, I found that I got the best compression ratios by splitting the data into multiple parquet files, one per hour, and writing each file like this:

df.sort_values(by=['sensor_id', 'time']).reset_index(drop=True).to_parquet('out.parquet', compression='brotli')

the total file size of all the parquet files combined is 120 MB.
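
For context, the hourly split is roughly the following (a sketch: it assumes the full dataset is already in a DataFrame df with the columns above, and the output file names are just illustrative):

# df has columns: time (unix seconds, float), sensor_id (int), sensor_value (float)
# bucket rows into one-hour groups and write one brotli-compressed parquet file per hour
hour_bucket = (df['time'] // 3600).astype('int64')
for hour, chunk in df.groupby(hour_bucket):
    (chunk.sort_values(by=['sensor_id', 'time'])
          .reset_index(drop=True)
          .to_parquet(f'out_{hour}.parquet', compression='brotli'))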

I didn’t expect compression ratios quite as good from TimescaleDB in PostgreSQL, since it doesn’t use Brotli compression, but I’m still getting compressed sizes that are a fair bit larger than I would like. The best I could get was 964 MB, which is about 8x bigger than the Parquet files.

Here’s how I set up the table:

CREATE TABLE IF NOT EXISTS public.sensor_data
(
    "time" timestamp(1) without time zone NOT NULL,
    sensor_id integer NOT NULL,
    sensor_value NUMERIC(20, 3) NOT NULL
);

SELECT create_hypertable('sensor_data', 'time');

-- (insert all data)

ALTER TABLE sensor_data
SET
(
timescaledb.compress,
timescaledb.compress_segmentby='sensor_id', 
timescaledb.compress_orderby='time'
);

SELECT compress_chunk(c) from show_chunks('sensor_data') c;

I’ve tried the following variations:

  • Using a double precision type instead of a numeric(20, 3) for sensor_value
  • Using chunk sizes of one hour (similar to what I’m doing in Python) rather than the default 7 days; see the sketch after this list
  • Using numeric(10, 3) instead of numeric(20, 3)
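
For reference, the one-hour chunk variation only changes how the hypertable is created; a minimal sketch (sensor_data_1hr_chunk is the table from the results below, and everything else mirrors the setup above):

CREATE TABLE IF NOT EXISTS sensor_data_1hr_chunk
(
    "time" timestamp(1) without time zone NOT NULL,
    sensor_id integer NOT NULL,
    sensor_value NUMERIC(20, 3) NOT NULL
);

-- chunk_time_interval overrides the default 7-day chunk size
SELECT create_hypertable('sensor_data_1hr_chunk', 'time', chunk_time_interval => INTERVAL '1 hour');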

However, these changes either did nothing or made the compression ratios worse:

method                 before_total  after_total  before_table  after_table  before_index  after_index  before_toast  after_toast  total_chunks  number_compressed_chunks
sensor_data            25 GB         964 MB       21 GB         43 MB        3990 MB       13 MB        0 bytes       908 MB       4             4
sensor_data_double     25 GB         1498 MB      21 GB         380 MB       3988 MB       13 MB        0 bytes       1105 MB      4             4
sensor_data_1hr_chunk  25 GB         1033 MB      21 GB         72 MB        4018 MB       25 MB        0 bytes       936 MB       455           455
sensor_data_10pnt      25 GB         964 MB       21 GB         43 MB        3988 MB       13 MB        0 bytes       908 MB       4             4
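
(For reference, per-table figures like these can be pulled with TimescaleDB’s hypertable_compression_stats(); a minimal sketch:)

-- compression statistics for one hypertable; run once per variation
SELECT pg_size_pretty(before_compression_total_bytes) AS before_total,
       pg_size_pretty(after_compression_total_bytes)  AS after_total,
       total_chunks,
       number_compressed_chunks
FROM hypertable_compression_stats('sensor_data');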

Do you have any suggestions for things I can try to improve the compression ratio, or is this as good as it’s going to get?

Hi John, thanks for this analysis!

I don’t think comparing PostgreSQL database storage to Parquet files will ever be a solid apples-to-apples comparison.

It’s worth noting that achieving the same compression ratio as Parquet with Brotli compression might be challenging in a database system designed for both storage and quick data access. The 8x difference you’re seeing isn’t unusual when comparing specialized file formats to database storage.

You can compare against the original size in the PostgreSQL database as a reference point, but it’s still hard to set up a totally isolated schema that makes for a like-for-like comparison.

OK. I was hoping I had something configured wrong or there was something else I could try. Guess not.

I’m not entirely sure what you mean by your last sentence. The hypertable was 25 GB before compressing, as shown above. I also inserted the data into another regular table (not a hypertable) and it was also reported as 25 GB.

If instead you were implying I should compare compression ratios, this is what I found:
TimescaleDB in Postgres:
25 GB / 964 MB = 26.6x ratio

Original data in a csv format versus parquet with brotli:
10.6 GB / 120 MB = 90.5x ratio
(In the CSV file, the time is stored as a unix timestamp with one decimal place.)

(Example rows in the csv)

unix_timestamp,sensor_id,sensor_value
1725494400.9,0,600.000
1725494401.9,0,600.000
1725494402.9,0,600.000

So I guess this is multifaceted; the table’s uncompressed size is almost 2.5x larger (25 versus 10.6 GB), but regardless, the compression ratio is still significantly lower (26.6x versus 90.5x).


That’s true, John!

That’s great, because it means we have a lot of room for improvement :smiley:

I hope the algorithms keep getting better at compression. I remember a few months ago a friend generated a sine wave in the database and compressed it; the compression ratio was very high because the data was a clean, repeating pattern, but it still didn’t reach anything like 90x.

This looks really great, and we’re continuously looking for improvements and even exploring new algorithms. We store parquet files in our tiered storage system; I’m not sure whether we use Brotli for compression there, but that would be a good case to look at.

@sven is always improving our compression, so he may want to get involved in this conversation :nerd_face: