Sep 19, 2024
Posted by Sven Klemm
When we introduced columnar compression for Postgres in 2019, our goal was to help developers scale Postgres and efficiently manage growing datasets, such as IoT sensor readings, financial ticks, product metrics, and even vector data. Compression quickly became a game-changer, saving users significant storage costs and boosting query performance, all while keeping their data in Postgres. With many users seeing compression rates of over 95%, the impact was immediate.
But we didn’t stop there. Recognizing that many real-time analytics workloads demand flexibility for updating and backfilling data, we slowly but surely enhanced our compression engine to support INSERT, UPDATE, and DELETE (DML) operations directly on compressed data. This allowed users to work with compressed data almost as easily as they do with uncompressed data.
However, it also created a problem. While we had originally expected mutations of compressed chunks to be rare, users were now pushing the feature to its limits with frequent inserts, updates, and deletes. Seeing our customers go all in on this capability confirmed that we were on the right track, but we had to double down on performance.
Today, we’re proud to announce significant improvements as of TimescaleDB 2.16.0, delivering up to 500x faster updates and deletes and 10x faster upserts on compressed data. These optimizations make compressed data behave even more like uncompressed data—without sacrificing performance or flexibility.
Let’s dive into how we achieved these performance gains and what they mean for you. To check this week’s previous launches and keep track of upcoming ones, head to this blog post or our launch page.
To understand our latest improvements, it helps to revisit how we initially threw away the rule book and enabled DML operations on compressed data.
Working with compressed data is tricky. Imagine trying to update a zipped file. You’d need to unzip the file, make your changes, and then zip it back up. Similarly, updating or deleting data in a compressed database often involves decompressing and reprocessing large chunks (potentially full tables) of data, which can slow things down significantly.
Our solution was to segment and group records into batches of 1,000, so instead of working with millions of records at once, we operate on smaller, more manageable groups. We also used techniques like segment indexes and sparse indexes on batch metadata to identify only the relevant batches of compressed data that contain the values to be updated or deleted.
However, a challenge remained: if no metadata is stored for a batch, we have to assume there could be a row inside that needs to be updated. This assumption requires decompressing the batch and materializing it into an uncompressed row format, which takes up precious time and disk space.
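To make that batching model concrete, here is a minimal sketch of how such a hypertable might be configured; the metrics table, its columns, and the settings are hypothetical examples for illustration, not taken from this post.

```sql
-- Hypothetical example table; names and types are illustrative.
CREATE TABLE metrics (
    time      timestamptz      NOT NULL,
    device_id int              NOT NULL,
    value     double precision
);

-- Turn it into a hypertable partitioned by time.
SELECT create_hypertable('metrics', 'time');

-- Enable columnar compression: rows sharing a device_id are grouped into
-- batches of up to 1,000 rows, ordered by time within each batch.
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby   = 'time DESC'
);

-- Compress all existing chunks.
SELECT compress_chunk(c) FROM show_chunks('metrics') AS c;
```

With this configuration, each batch carries its device_id uncompressed and min/max metadata for time, which is exactly the per-batch metadata the filtering described above relies on.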
With the release of TimescaleDB 2.16.0, we focused on reducing the number of batches that need to be decompressed and materialized. This improvement, known as compressed tuple filtering, allows us to filter out unnecessary data at various processing stages, dramatically speeding up DML operations.
Here’s what that looks like in practice:
- UPDATE and DELETE operations can now be completed significantly faster.
- Upserts (INSERT combined with UPDATE) are optimized to avoid unnecessary decompression (and that’s not the only boost we made for upserts).

These gains translate to major real-world performance improvements, particularly for users dealing with large datasets that require high-frequency updates of data that has already been compressed.
To achieve these optimizations, we filter data at multiple stages during DML operations. Previously, filtering was only possible using constraints on segmentby columns or columns with metadata, such as orderby columns or columns with a chunk skipping index.
With TimescaleDB 2.16.0, we’ve taken a microscope to our decompression pipeline and added an additional layer of inline filtering. When running DML operations, if a column doesn’t have metadata, we now apply constraints incrementally during decompression. If a batch is fully filtered out by the constraint, it’s skipped entirely and never materialized to disk or passed to Postgres to continue evaluation. This saves significant resources and time by reducing the amount of work to do further down the query pipeline.
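As a quick illustration against the hypothetical metrics table from above, which column a statement filters on determines how much of this work can be avoided:

```sql
-- Filtering on the segmentby column (device_id): batch metadata alone
-- identifies the affected batches, so only those are decompressed.
DELETE FROM metrics WHERE device_id = 42;

-- Filtering on a column with no batch metadata (value): as of 2.16.0 the
-- constraint is evaluated while each batch is decompressed, and batches
-- containing no matching rows are skipped before being materialized.
UPDATE metrics SET value = 0 WHERE value > 1e6;
```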
Let’s break it down by DML operation, starting with INSERT.

- INSERT operations: When inserting data into compressed chunks, no decompression is required unless you have a UNIQUE constraint on the hypertable. This makes inserts almost as fast as with uncompressed data.
- UNIQUE constraints: When a UNIQUE constraint is in place, there are three possible scenarios:
  - No ON CONFLICT clause: We check for constraint violations during decompression. If a violation is found, it’s flagged before any data is materialized to disk.
  - ON CONFLICT DO NOTHING: Similar to the first case, violations are flagged during decompression, and the INSERT is skipped without materializing data.
  - ON CONFLICT DO UPDATE (UPSERT): In cases of conflict, we decompress only the batch containing the conflicting tuple. If no conflict is detected, there’s no need to materialize anything (a sketch of this upsert path follows below).

For UPDATE and DELETE operations, we first check if any tuples in a batch match the query constraints. If none do, the batch is skipped, and no decompression or materialization is needed. This skip leads to dramatically faster update and delete operations.
Let’s look at a concrete example to illustrate the performance difference with tuple filtering.
Assume you have a hypertable with 1,000 batches, each containing 1,000 tuples, for a total of 1,000,000 rows. Now, you want to delete all rows where value < 10. In this example, those rows all happen to live in a single batch (perhaps values lower than 10 are rare and occur only during a short period of time).
DELETE FROM metrics WHERE value < 10;
Before TimescaleDB 2.16.0, we would have to decompress all 1,000 batches and materialize 1,000,000 tuples to disk—even if only a small portion matched the query constraints.
In TimescaleDB 2.16.0, this changes with compressed tuple filtering. We now only need to decompress the relevant batches, avoiding unnecessary materialization.
In our example query, this reduces the total query time from 829,517.741 ms to 1,487.494 ms, roughly 557x faster! As you can see, tuple filtering drastically reduces the work needed to execute this query, resulting in a huge speed-up. The effect also scales with selectivity: the more tuples you can skip decompressing, the faster your query becomes.
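If you want to gauge the effect on your own data, wrapping the statement in EXPLAIN (ANALYZE) inside a transaction is a simple way to measure it without actually removing rows; the exact decompression counters shown in the plan vary by TimescaleDB version.

```sql
-- Execute the delete and report timing, then roll it back so no data
-- is actually removed while we inspect the plan.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
    DELETE FROM metrics WHERE value < 10;
ROLLBACK;
```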
When working with massive datasets, every millisecond counts, and resource efficiency becomes crucial. The improvements introduced in TimescaleDB 2.16.0 directly address these needs by minimizing the amount of data that must be decompressed, written to disk, and then evaluated by Postgres. This not only reduces disk I/O and CPU usage but also significantly lowers execution time. The result? More headroom to handle larger datasets, scale your systems seamlessly, and improve overall application performance.
For developers managing frequent updates, inserts, or deletes in TimescaleDB hypertables, these optimizations mean that less thought has to be given to the current format of the data. Regardless of whether the data is currently stored on disk as rows or compressed columns, DML operations will work as expected.
With the release of TimescaleDB 2.16.0, our columnar compression engine takes another big leap forward. Users can now benefit from up to 500x faster updates and deletes and 10x faster upserts—all while continuing to enjoy the storage savings and performance gains of compression.
Looking ahead, we’re committed to further enhancing our compression engine, delivering even more flexibility and performance gains. Stay tuned—there’s more to come.
Want to see these improvements in action? Sign up for Timescale today and experience the power of our compression engine for your real-time analytics workloads.