Sep 16, 2024
Products like Amazon RDS for PostgreSQL are fine for smaller deployments, but scaling PostgreSQL is a different story. Once a project grows to many terabytes, these managed databases become slow and expensive, and data management gets much more complex. Performance suffers once tables reach billions of rows, and while there are ways to improve it, the data keeps growing. Without proper data management, developers can only watch their disks (and bills) grow.
But not anymore. Today, we’re happy to announce the General Availability of tiered storage, a multi-tiered storage architecture engineered to enable infinite, low-cost scalability for your time series and analytical databases in the Timescale platform. You can now store your older, infrequently accessed data in a low-cost storage tier while still being able to access it—without ever sacrificing performance for your frequently accessed data.
This is what happens when you insert data into a Timescale time-series database with our new multi-tiered storage backend:
Developers need a cheap way to scale their large PostgreSQL databases in AWS without compromising performance. While object stores are amazingly scalable and affordable, they are not the fastest, and developers also need to get millisecond query responses for their applications.
However, once data becomes older and is rarely accessed, real-time performance is often not as essential. Developers still need access to this historical data for ad hoc queries, data analysis, or regulatory compliance, but they can tolerate some latency for these types of queries. What developers want now is the ability to store this historical data as affordably and efficiently as possible.
This new tiered storage architecture frees developers from choosing between storage costs and performance trade-offs for real-time applications. By keeping their recent and frequently accessed data in the high-performance tier, they’ll leverage the millisecond query speed and ingestion capabilities of Timescale, and by tiering their older data, they can keep as many TBs as they need in their PostgreSQL databases for less.
Timescale’s tiered storage architecture leverages the flexibility of PostgreSQL and hypertables for effective data management. A hypertable can now seamlessly stretch across both storage tiers: when you run a query, Timescale transparently figures out which storage tiers to access to generate the response, boosting performance for your most recent data and lowering storage costs for older data. There’s no extra charge per query or data read and no hidden fees, enabling straightforward, cost-effective data management.
This storage architecture also eliminates any storage limitations in Timescale services: since our low-cost storage tier is infinite, you can store as many TBs as you’d like. For example, we’re leveraging tiered storage internally to store the huge Timescale database powering our Insights product.
This Insights database constantly collects and analyzes query statistics from across our customer fleet; it has exceeded 350 TB today and is growing quickly. Of those 350 TB, 250 TB are tiered to low-cost storage.
Let’s do the math:
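Using the prices quoted later in this post ($0.177 per GB/month effective for high-performance storage, $0.021 per GB/month for the low-cost tier), a quick back-of-the-envelope sketch shows what tiering those 250 TB saves each month:

```python
# Back-of-the-envelope cost math for the Insights database figures
# above (350 TB total, 250 TB tiered), using the prices Timescale
# quotes later in this post.
HIGH_PERF_PER_GB = 0.177   # effective $/GB/month after compression
LOW_COST_PER_GB = 0.021    # flat $/GB/month, same in all regions

tiered_gb = 250_000        # 250 TB moved to the low-cost tier

tiered_cost = tiered_gb * LOW_COST_PER_GB        # cost in the low-cost tier
high_perf_cost = tiered_gb * HIGH_PERF_PER_GB    # cost if kept in high-perf

print(f"low-cost tier:    ${tiered_cost:,.0f}/month")     # ~$5,250/month
print(f"high-performance: ${high_perf_cost:,.0f}/month")  # ~$44,250/month
print(f"monthly savings:  ${high_perf_cost - tiered_cost:,.0f}")  # ~$39,000
```

Storing those 250 TB in the low-cost tier costs roughly an eighth of what the high-performance tier would.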
Our customers with large deployments are also already making use of tiered storage:
"We perform a lot of analysis on market data, and the sheer volume of data we need to store makes a normal disk-based database solution unfeasible (it's just too expensive). Timescale’s tiered storage seamlessly allows us to move large volumes of data to the object storage layer. This is a great solution to store large volumes of historical data and perform post-analysis. Without this, we'd be forced to develop a solution in-house."
Chief Technology Officer at a Proprietary Digital Assets Trading Company
Simplicity is really the key feature of tiered storage. To move your data from the high-performance tier to the low-cost object tier, all you need to do is use our simple API to define policies to tier your data as it ages on a particular hypertable. Data tiering policies work on a chunk basis (the policy will tier complete chunks once they reach the age defined by the policy). There’s no need for any ETL (extract-transform-load) processes or infrastructure changes.
For example, this policy would move all data older than one month in your events hypertable to object storage:
SELECT add_tiering_policy('events', INTERVAL '1 month');
That’s all you need! Everything else happens automatically, including intelligent query planning that executes each SQL query against the appropriate tier.
To drop data currently stored in the low-cost storage tier, you can define a data retention policy so data is deleted automatically after a certain period of time (e.g., after five years). You can also manually delete specific chunks.
Also, if you want to “move back” a particular chunk from the low-cost storage tier to the high-performance tier (for example, if you need to backfill or update data), you can easily “untier” it.
-- Untier a particular chunk
CALL untier_chunk('_hyper_1_1_chunk');
You can track how much data has been tiered (and how much it will cost at the end of the month) in the Timescale console:
And speaking of billing estimations…
Our high-performance storage tier has an effective price of $0.177 per GB/month of data after compression (considering the expected compression rate we see across our fleet). This is now joined by a low-cost storage layer with a flat fee of $0.021 per GB/month of data, with the same price across all cloud regions.
When tiering data, you will only pay for the data you store, not for the queries executed or the amount of data scanned: this is truly a flat price. Our goal with this pricing structure was to provide a transparent, unambiguous, and simple way to calculate the total data storage cost, making it easier to manage your data.
As a quick example, let’s say you have a hypertable with 5.2 TB of storage, with recent hypertable chunks and other Postgres tables taking up approximately 200 GB and about 5 TB of hypertable data older than a month. You do not access or query this older data frequently, meaning you don’t need it for the daily operations of your application. Still, you’d like to keep it accessible in your database for ad hoc queries or compliance requirements (we see many such cases among our customers).
As a cost-effective data management strategy, you can tier all chunks older than a month to the low-cost tier and decrease the cost of storing that 5 TB of data from around $478/month to around $105/month, a 78% decrease in your storage bill. (For this estimation, we assume you enabled compression for your hypertable and consider the median overall compression in all Timescale services).
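The arithmetic behind those numbers is straightforward; this small Python sketch takes the $478/month starting estimate above as given (it already factors in compression) and computes the tiered cost and the percentage savings:

```python
# Savings math for the 5.2 TB example above: 5 TB of historical data
# is tiered to low-cost storage, while ~200 GB stays high-performance.
LOW_COST_PER_GB = 0.021      # flat $/GB/month for the low-cost tier

historical_gb = 5_000        # 5 TB of data older than a month
before = 478                 # $/month with everything in high-perf (post's estimate)
after = historical_gb * LOW_COST_PER_GB

print(f"after tiering: ${after:,.0f}/month")               # ~$105/month
print(f"savings: {100 * (before - after) / before:.0f}%")  # ~78%
```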
The savings will increase alongside your data: when moving multiple terabytes to this low-cost tier, your storage bill will decrease from thousands of dollars to a few hundred. The following reference table illustrates how affordable our low-cost storage tier actually is.
You get the point!
There is one more thing that makes tiered storage even more amazing: when you keep data in the low-cost storage tier, you pay for this data only once, regardless of whether you have a high-availability replica or read replicas running in your service.
The same applies to forks. In Timescale, you can create copies of your primary database (we call them forks) by clicking one button from the UI, for example, for running tests or creating dev environments. When creating one (or more) forks, you won’t be billed for data shared with the primary in the low-cost storage. If you decide to tier more data that’s not in the primary, you will pay to store it in the low-cost tier, but you will still benefit from substantial savings by moving it from the high-performance tier of the fork to the cheaper tier.
To make this crystal clear, let’s walk through an example. Imagine that you have a primary service with 6.5 TB of data, and that you have also set up two replicas and a fork of that service.
From a billing perspective, if you kept the 6.5 TB of data in your primary service in the high-performance storage tier, your storage bill would reflect 6.5 TB × 4 to account for the primary service, the two replicas, and the fork: 26 TB in total. Assuming our median compression rate, this would be costly: around $4,602/month.
But what happens if your application needs to actively access only the most recent 500 GB of data? You could tier 6 TB to low-cost storage and keep only 500 GB on your high-performance storage tier. And since you pay for the data in your low-cost storage tier only once, this is how your new storage bill would look:
The above storage bill would come to approximately $480/month: you would save over $4,000/month! This is the multiplier effect of tiered storage savings. Especially if you’re running replicas or forks, it is a great idea to take advantage of the low-cost storage tier: the savings in your overall bill will be very significant.
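For the curious, the arithmetic can be sketched in a few lines of Python, using the post’s prices and the rule that high-performance data is billed once per copy while low-cost data is billed only once in total:

```python
# Rough arithmetic behind the replica-and-fork example above.
# High-performance data is billed per copy (primary + two replicas
# + one fork = 4 copies); low-cost data is billed only once.
HIGH_PERF_PER_GB = 0.177   # effective $/GB/month after compression
LOW_COST_PER_GB = 0.021    # flat $/GB/month, billed once across copies
COPIES = 4                 # primary, two replicas, one fork

before = 6_500 * COPIES * HIGH_PERF_PER_GB  # all 6.5 TB in high-perf, x4
after = 500 * COPIES * HIGH_PERF_PER_GB + 6_000 * LOW_COST_PER_GB

print(f"before: ${before:,.0f}/month")          # ~$4,602
print(f"after:  ${after:,.0f}/month")           # ~$480
print(f"saved:  ${before - after:,.0f}/month")  # ~$4,122
```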
We released an early version of our tiered storage functionality in the Timescale platform as Early Access in March. After many improvements, it has now reached General Availability. Since that first release, we have worked hard to make tiered storage stable, reliable, and fast.
We also thought long and hard about our pricing model, iterating over multiple ways to price our low-cost storage tier. In the end, we settled on the simplest: a flat rate on the data volume pre-compression*. We’ve already mentioned how much we value simple and predictable pricing, but it was also important to us to offer the lowest possible price per GB/month. Our price landed at $0.021; for comparison, storing files on Amazon S3 costs $0.023 per GB/month.
Personally, I (Yannis) must admit that after more than a decade leading teams that build cloud-native solutions, I still occasionally have to go back and recheck the rate tables on various cloud pricing pages, watching for extra fees, whenever I want to accurately compute the total cost of our services.
At Timescale, we believe that you should not have to build a complicated spreadsheet to be able to run a database service or make informed choices about storage layers.
Our commitment to making developers’ lives easier led us to the flat rate of $0.021 per GB/month—no guesswork, hidden costs, or charges per query or data reads.
All data inserted into Timescale is initially written to our high-performance storage tier. Using faster disks for your most recent data brings top insert and query performance, a usage pattern that fits the needs of data-intensive applications.
In contrast, the low-cost storage tier is an object store built on Amazon S3. This object store is much more than an external bucket for archiving your data, though: it’s an integral part of your database. When you move data to this object storage tier, your database remains fully aware of all its semantics and metadata, and you can keep querying it as usual with standard SQL (albeit with slower performance).
Behind the scenes, we’re storing the data in a compressed columnar format (specifically, Apache Parquet). When data is tiered, chunks stored in Timescale’s native internal database format (typically in our native columnar compression) are asynchronously converted to Parquet format and stored in the underlying S3 object. We’ve built several mechanisms to ensure that the tiered chunks are durably stored in the low-cost storage tier before transactionally removing them from the high-performance tier.
When you run your SQL query, it will transparently pull data from the high-performance storage tier, the object storage tier, or both, as required. When we say transparent, we mean transparent: Timescale supports arbitrarily complex queries across its standard and tiered data, including complex predicates, JOINs, CTEs, windowing, hyperfunctions, and more.
In the example query below (with an EXPLAIN clause), you can see how the query plan includes a Foreign Scan when the database is accessing data from the object storage tier. In this example, devices and sites are standard Postgres tables (residing only in high-performance storage), while metrics is a hypertable that stretches across both storage tiers. When executing this query against the metrics hypertable, three of its chunks were read from standard storage, and five chunks were read from object storage.
EXPLAIN
SELECT time_bucket('1 day', ts) as day,
max(value) as max_reading,
device_id
FROM metrics
JOIN devices ON metrics.device_id = devices.id
JOIN sites ON devices.site_id = sites.id
WHERE sites.name = 'DC-1b'
GROUP BY day, device_id
ORDER BY day;
QUERY PLAN
----------------------------------------------------------
GroupAggregate
  Group Key: (time_bucket('1 day'::interval, _hyper_5666_706386_chunk.ts)), _hyper_5666_706386_chunk.device_id
  ->  Sort
        Sort Key: (time_bucket('1 day'::interval, _hyper_5666_706386_chunk.ts)), _hyper_5666_706386_chunk.device_id
        ->  Hash Join
              Hash Cond: (_hyper_5666_706386_chunk.device_id = devices.id)
              ->  Append
                    ->  Seq Scan on _hyper_5666_706386_chunk
                    ->  Seq Scan on _hyper_5666_706387_chunk
                    ->  Seq Scan on _hyper_5666_706388_chunk
                    ->  Foreign Scan on osm_chunk_3334
              ->  Hash
                    ->  Hash Join
                          Hash Cond: (devices.site_id = sites.id)
                          ->  Seq Scan on devices
                          ->  Hash
                                ->  Seq Scan on sites
                                      Filter: (name = 'DC-1b'::text)
In the above query plan, the Foreign Scan on osm_chunk_3334 corresponds to fetching data from the bottomless object storage tier.
To avoid processing chunks falling outside the query’s time window, we perform chunk exclusion to only touch the chunks that are needed to satisfy the query. The database stores various forms of metadata to build a “map” of row groups and columnar offsets within object storage.
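As a rough illustration of the idea (not Timescale’s actual implementation), chunk exclusion boils down to an overlap test between each chunk’s time range and the query’s time window; the chunk names and ranges below are hypothetical:

```python
# Simplified sketch of chunk exclusion: given each chunk's time range,
# the planner only touches chunks overlapping the query's time window.
from datetime import datetime

# Hypothetical chunk catalog: name -> (range_start, range_end)
chunks = {
    "_hyper_1_1_chunk": (datetime(2024, 6, 1), datetime(2024, 6, 8)),
    "_hyper_1_2_chunk": (datetime(2024, 6, 8), datetime(2024, 6, 15)),
    "osm_chunk_1":      (datetime(2024, 1, 1), datetime(2024, 6, 1)),  # tiered
}

def chunks_for_window(chunks, start, end):
    """Return only the chunks whose time range overlaps [start, end)."""
    return [name for name, (lo, hi) in chunks.items() if lo < end and hi > start]

# A query over the most recent week skips the tiered chunk entirely.
print(chunks_for_window(chunks, datetime(2024, 6, 10), datetime(2024, 6, 17)))
# ['_hyper_1_2_chunk']
```

A query whose window reaches back into January would instead include osm_chunk_1, and only then would object storage be read.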
Furthermore, when a query is run, it is further selective about the data it reads. If your query only touches a range of rows and a few columns, only that subset of the data is read from the S3 object behind the low-cost storage tier.
In the above, for example, only the device_id and value columns are read from the object storage tier. If an additional time-based WHERE filter had been included, the database would first use its metadata (stored in high-performance storage and cached in memory) to further narrow down which Parquet files and row groups need to be read to execute the query. All with the goal of reducing query latency, even when accessing this bottomless storage transparently through PostgreSQL.
Storage decisions around historical data don’t have to be costly. In Timescale, you now have access to a low-cost, infinite storage tier with no pricing gotchas that allows you to scale your database storage for an affordable price without compromising performance for your application.
If you’re wondering whether this is a good fit for your use case, check out our Docs for more information. If you’re looking for a more hands-on perspective, you can also experiment with tiered storage yourself in the Timescale platform via our free trial (no credit card required).