Sep 20, 2024
Posted by
Gayathri Ayyappan
If you’re managing a PostgreSQL database, more data often means scalability hurdles. As your data expands, the stakes rise: storage costs soar, access speed slows down, and resource efficiency looks like a distant dream. Suddenly, you find yourself asking how on earth you’ll scale your PostgreSQL database without breaking the bank. Data tiering, archiving, and deleting spring to mind.
Timescale's tiered storage architecture was designed to address this challenge head-on. It enables seamless management of massive datasets by moving older, less frequently accessed data to cheaper, slower storage without sacrificing query performance.
Only available in Timescale’s PostgreSQL cloud platform, Timescale Cloud, this tiered storage backend allows you to move data from the high-performance tier (block storage) to the low-cost bottomless storage tier (Amazon S3 object storage). With a simple data tiering policy, you can significantly reduce storage costs while maintaining high performance for your most recent data. And you can also transparently query across all your data, regardless of whether it’s still in PostgreSQL or S3.
However, there is a downside: compared to the blazing-fast high-performance tier, querying older, tiered data can feel a lot slower. So we decided to work on a performance boost. Today, we’re happy to announce that querying tiered data is now up to 400x faster thanks to optimizations like chunk exclusion, row-group exclusion, and column exclusion. These improvements will help you save on storage costs without compromising query speed, even as your data grows to petabytes and beyond.
Before getting into the improvements we developed, let’s understand why data tiering is so important for modern, data-intensive applications. Data tiering is the process of dynamically managing the data’s lifecycle by moving it between different storage types based on age, usage, and relevance.
In Timescale, the standard data lifecycle (although not all steps are compulsory) looks like this: recent, frequently queried data lives in high-performance storage; as it ages, it is compressed into columnar form; older data is tiered to low-cost object storage; and data you no longer need can eventually be dropped by a retention policy.
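For a rough idea of how those steps translate into SQL, here’s a minimal sketch assuming a hypothetical conditions hypertable; the intervals are placeholders you’d tune to your own access patterns, and only the tiering policy requires Timescale Cloud:
-- Enable compression on the hypertable (settings discussed later in this post)
ALTER TABLE conditions SET (timescaledb.compress);
-- Compress chunks once their data is a week old
SELECT add_compression_policy('conditions', INTERVAL '7 days');
-- Tier chunks to low-cost object storage after three months (Timescale Cloud only)
SELECT add_tiering_policy('conditions', INTERVAL '3 months');
-- Optionally, drop data entirely after a year
SELECT add_retention_policy('conditions', INTERVAL '1 year');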
Data tiering becomes especially important when you’re handling massive amounts of time-series data or other demanding workloads. In these cases, databases will most likely be the main cost driver of your application due to increasing storage costs. But keeping all data in fast, high-performance storage is expensive and often unnecessary. Most organizations need to access recent data quickly while storing older data more cost-effectively.
This is where Timescale Cloud’s tiered storage really makes an impact, streamlining data tiering for users. Data movement happens in the background, with minimal impact on your production database. Your data remains fully queryable throughout the process, whether it's stored on the high-performance tier or cheap object storage.
By making use of different storage types and automating the data lifecycle, you can significantly reduce costs without sacrificing performance. Instead of paying a premium for expensive storage for every row of data, you only keep hot data on a high-performance disk, while cold data resides in cost-effective object storage like S3.
For those working with massive datasets, particularly in industries like IoT, financial services, or monitoring, the cost and efficiency gains of tiered storage can be game-changing. And with our recent optimizations, querying tiered data doesn’t mean taking a huge performance hit—you can still run powerful, complex queries across hot and cold data.
Let’s see how.
As mentioned, one of the key strengths of our tiered storage architecture is its automation and minimal operational overhead. Users can set up a tiering policy for a hypertable or continuous aggregate (our version of always up-to-date materialized views).
The tiering policy runs in the background, identifying data chunks that match your criteria (such as data older than a specific threshold) and adding them to a queue to be moved to Amazon S3. This is an asynchronous process, where chunks are moved one at a time, ensuring that your production database is not overwhelmed.
SELECT add_tiering_policy(hypertable REGCLASS, move_after INTERVAL);
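For example, to tier chunks of a hypothetical device_telemetry hypertable once their data is more than three months old (both the table name and the interval are illustrative):
SELECT add_tiering_policy('device_telemetry', INTERVAL '3 months');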
Once the data is moved to S3, it’s stored in the Parquet format, an open-source columnar format designed for efficient querying and data storage.
Parquet organizes data into row groups, typically batches of 100,000 rows, with each row group further divided into column chunks. This columnar structure means that only the specific columns required by your query are read from storage instead of loading entire rows, significantly improving query performance. While this is similar to Timescale’s in-database columnar format, Parquet is optimized for object storage.
For instance, if a query only needs three columns from a dataset, Parquet allows the execution engine to process just those three column chunks. This exclusion is particularly useful when querying large datasets, as it reduces the amount of unnecessary data read from storage.
Each row group in Parquet also includes the minimum/maximum information for each column, allowing Timescale to apply row group exclusion—the process of filtering out entire row groups that don’t match your query conditions. For example, if you’re filtering by device_id = 1, Timescale can skip over row groups whose device_id values don’t fall within the query’s range.
This combination of chunk exclusion, row-group exclusion, and column exclusion makes Timescale’s tiering incredibly efficient, ensuring that even cold data in S3 can be queried quickly and cost-effectively.
Let’s look at how this works with an example query on a tiered hypertable called device_telemetry, where observed_at is the time partitioning column:
SELECT device_id, MAX(temperature), MIN(humidity)
FROM device_telemetry
WHERE device_id = 1
AND observed_at BETWEEN '2024-01-01' AND '2024-02-01'
GROUP BY device_id;
Here’s how Timescale Cloud optimizes this query:
1. Chunk exclusion: Timescale first excludes all tiered chunks that don’t match the observed_at condition.
2. Row-group exclusion: within the remaining chunks, the per-column min/max metadata is used to skip row groups that can’t contain device_id = 1. If each chunk has 20 row groups but only two are relevant, this further reduces the scope of the query.
3. Column exclusion: since the query only needs device_id, temperature, and humidity, it reads just these three columns from the Parquet files on S3.
By applying these optimizations, the database reads far fewer rows and columns from object storage, leading to much faster query execution.
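If you want to see these exclusions at work on your own data, standard PostgreSQL tooling is enough: run the query under EXPLAIN and compare the plans as you vary the filters. The sketch below reuses the hypothetical device_telemetry hypertable from above; the exact plan nodes you’ll see for tiered chunks depend on your Timescale Cloud service, so treat this as a starting point rather than a definitive recipe.
EXPLAIN (ANALYZE, VERBOSE)
SELECT device_id, MAX(temperature), MIN(humidity)
FROM device_telemetry
WHERE device_id = 1
AND observed_at BETWEEN '2024-01-01' AND '2024-02-01'
GROUP BY device_id;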
Another performance enhancement comes from applying compression settings to tiered data. Timescale Cloud allows users to define compression settings for their hypertables, such as compress_segmentby (which determines how data is grouped) and compress_orderby (which defines the order of compressed data).
When moving data to S3 in Parquet format, we use these compression settings to optimize queries further. For example, if your hypertable is compressed by device_id and ordered by observed_at, Parquet will store the data in the same order, making it easier to apply row-group exclusion and reducing query time. In some cases, these optimizations have led to up to 400x performance improvements for point queries.
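As a sketch of what those settings look like in practice, again using the hypothetical device_telemetry hypertable (pick segmentby and orderby columns that match your most common query patterns):
-- Segment compressed data by device and order it by time, so the Parquet files
-- produced when chunks are tiered keep a layout that favors row-group exclusion
ALTER TABLE device_telemetry
  SET (timescaledb.compress,
       timescaledb.compress_segmentby = 'device_id',
       timescaledb.compress_orderby = 'observed_at DESC');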
Timescale’s tiering also benefits from runtime optimizations for queries whose filter values are only known at execution time, such as those using now(). Previously, chunk exclusion wouldn’t apply if a query used such dynamic expressions (e.g., observed_at <= now() - '3 months'::interval). However, with recent updates, we now apply chunk exclusion even for these queries, ensuring that tiered data is filtered efficiently without needing hard-coded time ranges.
For example, the following query can now take full advantage of chunk exclusion:
SELECT MAX(temperature)
FROM device_telemetry
WHERE observed_at >= NOW() - '4 months'::interval
AND observed_at < NOW() - '3 months'::interval
AND device_id = '0xxacedf';
This improvement makes it easier to write flexible queries while still benefiting from a performance boost.
Timescale Cloud’s tiered storage delivers significant cost and performance advantages by automatically managing your data's lifecycle, freeing up your time so you can focus on your application—not your database. From chunk exclusion to row-group filtering and Parquet format optimizations, you can rest assured that your queries will run efficiently (up to 400x faster!), whether the data is stored locally or in Amazon S3.
By fine-tuning your compression settings and with optimizations like runtime chunk exclusion, you can take full advantage of tiering while maintaining top-tier query performance.
Visit our docs to learn more about our multi-tiered backend architecture and check if it fits your use case. To start saving on costs while easily accessing and querying your old data, create a free Timescale account today.