Oct 23, 2024
Time-series data (in the form of IoT sensor data, energy metrics, financial tick data, crypto prices, IT metrics, or vector data) has become the lifeblood of numerous industries, causing a bit of a data management crisis. If you’re running a time-series-heavy use case in PostgreSQL and your dataset grows with each passing second, minute, or hour, you’ll soon experience two challenges: your storage footprint (and bill) keeps climbing, and your queries slow down as the data piles up.
PostgreSQL doesn’t offer a good native solution for these problems, so we at Timescale created one by developing high-performance columnar compression for PostgreSQL all the way back in 2019.
Our columnar compression, available to all PostgreSQL databases via the TimescaleDB extension, transforms automatically created time-based partitions into a columnar format, optimizing storage space and query performance. By storing data values from the same column together and using specialized algorithms for compression based on the data type of such columns, this method capitalizes on the natural tendencies of time-series data.
This approach resulted in tremendously efficient compression (95%+ compression rates), allowing developers to run very fast analytical queries while storing large volumes of time-series data cheaply in PostgreSQL.
There was only one problem with our original method: developers couldn’t run INSERT / UPDATE / DELETE (the so-called DML operations) over compressed data—a.k.a. compressed data was immutable. And even if time-series data is mostly append-only on paper, the reality of modern time-series data use cases proved to be, ::ahem::—slightly more complex.
Let’s elaborate.
Time-series data is usually the quintessential example of append-only. Whether it's tracking stock prices, sensor readings, or website metrics, once a data point is recorded for a specific timestamp in a database, it usually remains unchanged, as the main mission of time-series data is to provide a chronological account.
Many time-series storage systems take advantage of this theoretical immutability to compress the data on the way into the system. The designers assume the data is immutable—therefore, a fairly high compression rate should be possible without too many side effects on performance. Once the data has been serialized to disk, it is considered settled.
But the reality is that when you’re running a database in production, there are things that happen outside of that cycle. While it's true that time-series data predominantly remains append-only and immutable, most production use cases come with scenarios where backfilling is necessary.
A sensor might transmit corrupted data due to a temporary malfunction, and once it’s fixed, the historical data may need to be corrected with accurate values. In financial settings, there might be restatements or corrections to historical data. A calibration error in a temperature sensor might require backfilling the previously recorded data with accurate temperature readings. Or perhaps new IoT devices require older data to be replaced with quality-controlled data, as was the case for @srstsavage in this GitHub issue.
Another classic example of backfilling is production migrations. To migrate large databases with minimal downtime, we at Timescale often recommend following what we call the dual-write and backfill migration method, which consists of writing to the target and source database for some time while backfilling the necessary time-series data to run the production application (e.g., data going back three months to enable user analytics) into the target database. Once the user is ready for the switch, they can do it with minimal downtime.
Backfilling scenarios challenge the very essence of traditional compression methods for time series. They break the assumption of immutability, and if data is immutable post-compression, then any necessary corrections or additions require manual decompression and recompression or complex workarounds.
If we want to add a modern compression mechanism for time-series data in PostgreSQL that truly helps developers, it has to account for these disruptions in the traditional time-series data lifecycle, painting a more realistic picture according to what happens in practice in a production setting.
As we said previously, we’ve been working on this problem for a while at Timescale.
We launched the first version of compression at the end of 2019 with TimescaleDB 1.5. This release laid down our foundational compression design: recognizing that time-series workloads access data in temporal order, we built an efficient columnar storage system by converting many wide rows of data (1,000) into a single row of data, compressing each field (column) using dedicated algorithms.
This columnar compression engine is based on hypertables, which automatically partition your PostgreSQL tables by time. At the user level, you would simply indicate which partitions (chunks in Timescale terminology) are ready to be compressed by defining a compression policy.
To allow efficient and convenient querying, this compression design also implemented groupings:
SEGMENT BY defines columns that remain uncompressed to act as metadata, enabling more efficient query filtering.
ORDER BY determines the order of rows within a compressed segment, optimizing data patterns for better compression ratios.
This first version of PostgreSQL columnar compression performed wonderfully: users saw 95%+ compression rates with query performance improvements on top of it.
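To make this concrete, here’s a minimal sketch of what setting up that first version of compression looks like. The metrics table, its columns, and the seven-day interval are hypothetical; the functions and settings are TimescaleDB’s standard compression API:

```sql
-- A regular PostgreSQL table turned into a hypertable, partitioned by time
CREATE TABLE metrics (
    time        timestamptz NOT NULL,
    device_id   int         NOT NULL,
    temperature double precision
);
SELECT create_hypertable('metrics', 'time');

-- Enable columnar compression: segment by device, order rows by time
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby   = 'time DESC'
);

-- Compress chunks once they are older than seven days
SELECT add_compression_policy('metrics', INTERVAL '7 days');
```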
On its first version, though, Timescale compression didn’t allow any modifications of the data once compressed (e.g., no DML operations such as INSERTs, UPDATEs, or DELETEs, and no schema modifications)—but this quickly became a highly requested addition by the community, so we got to work.
In TimescaleDB 2.3, we started to improve the flexibility of this high-performing columnar compression engine by allowing INSERTs directly into compressed data. Our first approach worked like this: when new rows were inserted into a previously compressed chunk, they were immediately compressed row-by-row and stored in the internal chunk. The new data, compressed as individual rows, was periodically merged with the existing compressed data and recompressed. This batched, asynchronous recompression was handled automatically within TimescaleDB’s job scheduling framework, ensuring that the compression policy continued to run efficiently.
This meant a much better user experience compared to manually decompressing and compressing chunks to insert data, but there was still some room for improvement. For example, insert performance over compressed data was not bad but not optimal (there was a performance penalty), and updates and deletes still required manually pausing the compression policy, decompressing the data, making the changes, and re-enabling the compression policy—a.k.a. too much work.
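For illustration, this is roughly what that manual workaround looked like. It’s a sketch: the metrics hypertable, the job ID, and the cutoff date are hypothetical, while alter_job, decompress_chunk, compress_chunk, and show_chunks are TimescaleDB functions:

```sql
-- Pause the compression policy job (the job ID 1000 is hypothetical)
SELECT alter_job(1000, scheduled => false);

-- Decompress the chunks covering the range you need to modify
SELECT decompress_chunk(c, true)
FROM show_chunks('metrics', older_than => '2023-01-01'::timestamptz) c;

-- ...run the UPDATE / DELETE / INSERT statements you actually need...

-- Recompress those chunks and re-enable the policy
SELECT compress_chunk(c, true)
FROM show_chunks('metrics', older_than => '2023-01-01'::timestamptz) c;
SELECT alter_job(1000, scheduled => true);
```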
The game-changer improvement in the experience came recently with TimescaleDB 2.11. To make data modification over compressed data as seamless as possible for the developer, we changed our compression approach by introducing a “staging area”—essentially, a chunk that remains uncompressed and in which we do the operations “over uncompressed data” under the hood.
From the user’s point of view, the DML operations are done over compressed data. All the operations needed to compress/decompress/recompress are automated via our job scheduler, working in sync with the compression policy. As a user, you don’t have to do anything manually: you can modify your data directly while our engine takes care of everything automatically under the covers.
This new design made INSERTs over compressed data very performant. They’re now as fast as inserts into uncompressed chunks, since that is really what is happening: when you insert into a compressed chunk, you’re now writing to the staging area.
This staging area approach also allowed us to introduce:
UPDATE / DELETE support on compressed data
ON CONFLICT DO UPDATE support
ON CONFLICT DO NOTHING support
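As a sketch (reusing the hypothetical metrics hypertable from above, and assuming a unique index on (time, device_id) for the upsert), these statements now run directly against compressed chunks, with the staging area and recompression handled under the hood:

```sql
-- Backfill a reading into an already-compressed chunk
INSERT INTO metrics (time, device_id, temperature)
VALUES ('2023-01-15 08:00:00+00', 42, 21.7);

-- Upsert: assumes a unique index on (time, device_id)
INSERT INTO metrics (time, device_id, temperature)
VALUES ('2023-01-15 08:00:00+00', 42, 21.9)
ON CONFLICT (time, device_id) DO UPDATE SET temperature = EXCLUDED.temperature;

-- Correct or prune historical values without any manual decompression
UPDATE metrics
SET temperature = temperature - 0.5
WHERE device_id = 42 AND time < '2023-02-01';

DELETE FROM metrics
WHERE device_id = 42 AND time < '2022-01-01';
```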
The newly introduced ability to modify compressed data breaks the traditional trade-off of having to plan your compression strategy around your data lifecycle. You can now change already-compressed data without significantly impacting data ingestion, database designers no longer need to consider updates and deletes when creating a data model, and the data is directly accessible to application developers without post-processing.
You can now have time-series data (with compression enabled) in your OLTP database. The traditional paradigm is to treat time-series data as OLAP data, mainly because the underlying technology required immutable data to work at scale. That implies (or outright demands) that the data is first staged elsewhere—whether as part of your OLTP solution or secondarily staged in OLAP. Thanks to the flexibility added to our compression design, this staging is no longer required, and any running application can immediately access the data.
This, of course, comes with storage savings. This was the main problem we were aiming to solve when we started working on compression years ago: developers need a way to handle their ever-accumulating volumes of time-series data without breaking the bank. This updated columnar compression makes it easier for our customers to do more with fewer compute and storage resources. It is not uncommon to see 95%+ compression rates among our customers, which in Timescale’s usage-based storage model means an immediate reduction of the database bill.
In practice, this not only means a much better developer experience but also that our columnar compression can be used in many more use cases. We'll now dive into a few examples inspired by our users.
Consider a manufacturing company that has an IoT infrastructure in place to monitor machine efficiency in real time. The company decides to expand its IoT network by adding new sensors to monitor additional parameters, such as machine vibration, noise levels, and air quality.
These new IoT devices are added mid-quarter, but for consistency in analysis and reporting, the company wants its datasets to reflect a full quarter’s worth of data from these new sensors. In a traditional setting with immutable compressed data, this would mean either keeping separate datasets or dealing with the complexities of decompressing, modifying, and recompressing data to incorporate the new readings.
However, with the advanced capabilities of TimescaleDB 2.11, backfilling becomes a straightforward process. The company can simulate or estimate the data for the new parameters for the preceding months and seamlessly insert this data into the already compressed historical dataset.
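As a sketch of what that backfill could look like (the machine_metrics hypertable, the estimated_sensor_backfill staging table, and all columns are hypothetical), it’s just a regular INSERT, even though the target chunks are already compressed:

```sql
-- Backfill estimated readings for the new sensors into compressed history
INSERT INTO machine_metrics (time, machine_id, vibration, noise_db, air_quality)
SELECT time, machine_id, vibration, noise_db, air_quality
FROM estimated_sensor_backfill
WHERE time >= '2023-01-01' AND time < '2023-04-01';
```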
A national grid system collects energy consumption metrics from every household and industry, aiming to optimize power distribution and predict future demands. At some point, recalibrations in some sectors reveal that energy consumption was overestimated for a certain period.
Traditional immutable compression would mean this data remains unadjusted since there’d be way too much work and too much data to change manually. Now, the company can directly modify the compressed data, ensuring that analysts keep working with the most accurate datasets for their predictive models and optimization algorithms.
Imagine a global financial institution planning to migrate huge volumes of historical trading data from an outdated system to a new, more scalable Timescale database. Given the sheer volume and complexity of the financial data, a direct "big bang" migration is deemed risky.
The financial institution, therefore, adopts a "dual-write and backfill" migration strategy, ensuring minimal downtime and data integrity. As the first step, the institution begins by duplicating writes; every transaction is recorded simultaneously in both the old and the new databases. This ensures that the new database is always in sync with the live data, offering an opportunity for thorough testing and validation without affecting the ongoing operations.
In the meantime, the task of backfilling years of historical trading data takes place. The use of TimescaleDB 2.11 becomes a game-changer here: the financial institution can efficiently insert historical data into the new database, thanks to Timescale’s capability to handle DML operations over compressed data seamlessly. This efficiency accelerates the backfilling process, ensuring that the migration team can focus on data integrity and validation rather than being bogged down by the technical complexities of handling compressed data.
As the historical data is progressively backfilled, the new database becomes a comprehensive repository of both real-time and historical financial data. Eventually, a seamless transition occurs where the old system is decommissioned, marking the successful completion of a complex migration that was made significantly more manageable and efficient by the flexibility of Timescale's compression.
We no longer need to stick to the age-old paradigm of immutability in time-series data management. The new capabilities of Timescale compression are paving the way for more efficient, flexible, and fair time-series data management in PostgreSQL, which now comfortably allows for the DML operations that are often needed in production settings.
Compression becomes the norm, not a nice-to-have—making it cheaper and faster to ingest and query large volumes of time-series data and allowing you to build top-notch applications for your own customers.
You no longer need to think twice. Just enable compression and start saving money.