Scaling PostgreSQL to Petabyte Scale

Last year, we launched our Insights feature, which provides users with in-depth statistics about how the queries in their database are performing, including timing information, memory and I/O usage, and more. We discussed how that feature was powered by our largest dogfooding effort, in which we scaled PostgreSQL to hundreds of terabytes of data and billions of records ingested daily.

Now that it's been a year since we launched the feature, we thought it would be a good time to revisit things and see where they stand. We also want to detail some of the Timescale features and architecture choices that let us keep running this feature smoothly (with its ever-growing dataset) on our own Timescale Cloud. Put another way: you, too, could run a PostgreSQL database at this scale with the offerings on Timescale Cloud. Nothing else is needed.

This update is part of our September Launch Week. To check this week’s previous launches and keep track of upcoming ones, head to this blog post or our launch page.

Insights Recap: Scaling PostgreSQL for Query Monitoring

To understand the scale of the problem we're trying to solve, let's quickly recap the feature Timescale is powering here. As part of Timescale Cloud, we offer users statistics that help them understand how queries on their databases are performing. This feature is called Insights.

Insights provides detailed monitoring information, including timing (e.g., how long it takes to plan and execute a query), memory and I/O information (e.g., bytes found in shared buffers vs. read from disk), and the TimescaleDB features used (e.g., whether a query touched a hypertable, continuous aggregate, or compressed chunks).

To do this, we capture every query that runs in an instance in our Cloud (in normalized form to remove sensitive data), gather the relevant statistics, and store them in a Timescale database service that can be queried as part of the Insights page. When we launched last year, we were capturing about a dozen metrics per query; that number has tripled as we look to improve the amount of information users can get about the queries on their instances.
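To make that concrete, here is a minimal sketch of what a hypertable for per-query statistics could look like. The table and column names are illustrative only (our production schema is considerably wider), but the `create_hypertable` call is exactly how you'd partition such a table by time in TimescaleDB:

```sql
-- Illustrative only: a trimmed-down stand-in for a per-query stats table.
CREATE TABLE query_stats (
    ts                TIMESTAMPTZ NOT NULL,   -- when the query ran
    service_id        UUID        NOT NULL,   -- which database service it ran on
    query_hash        BIGINT      NOT NULL,   -- hash of the normalized query text
    plan_time_ms      DOUBLE PRECISION,       -- planning time
    exec_time_ms      DOUBLE PRECISION,       -- execution time
    shared_blks_hit   BIGINT,                 -- buffers served from shared_buffers
    shared_blks_read  BIGINT,                 -- buffers read from disk
    used_hypertable   BOOLEAN,                -- TimescaleDB feature flags
    used_cagg         BOOLEAN,
    used_compression  BOOLEAN
);

-- Partition the table on its time column so it behaves like any other hypertable.
SELECT create_hypertable('query_stats', 'ts');
```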

Given all that, you can see that the volume of data can grow along three axes: as the number of customers grows, as customers mature and increase their query loads, and as we add new metrics to track. And yet we continue to track all of this data in Timescale Cloud, on a PostgreSQL-based database, fulfilling Timescale's original goal of creating a faster, more scalable PostgreSQL.

Just Give Me the TL;DR (a.k.a. the Big Numbers)

When we talked about building Insights last year, the headline numbers were storing 350+ TB of data and ingesting 10 billion records a day. This year, we're keeping the first headline number (the amount of data stored) but swapping out the second.

The amount of data we're storing on a Timescale Cloud database service

After publishing, we realized that the term “record” is a bit underspecified. A record is a set of metrics for a query, but a record today (fall 2024) contains a much larger set of metrics than one did when we last posted (fall 2023), because we've increased the scope of what we track. So instead, we'd rather talk about the number of metrics per day.

That said, we've moved from terabytes of stored data to petabytes (the title gave this away), adding roughly a petabyte of new data over the course of the year! As mentioned in the previous blog post, we store much of this data in Timescale’s tiered storage architecture.

In fact, we typically keep only a few weeks of data in active storage, so we have also tiered over a petabyte of data. And with the many optimizations and improvements made to our tiered storage architecture over the past year, that data is more easily accessible and queryable than ever before (more on this tomorrow, so watch this space).

As for our ingest rate, when we announced Insights last year, we were collecting approximately 100 billion metrics per day across all instances. Today, that number stands at around 800 billion metrics per day due to a combination of more databases, more queries being run per database, and more metrics collected per query. We’ve collected 100 trillion metrics over the lifespan of Insights. Yet, even though our biggest dogfooding effort continues to grow in size, we still use the same size bowl: a vanilla Timescale Cloud instance.

How We Scale PostgreSQL and Stay Fast

Much of our architecture remains the same as described in our original post last year. We still ingest two main types of objects: a detailed “raw” record and a set of UDDSketches that represent the distribution of values for a given metric (the “sketch” record).

A raw record contains the metrics for a single query invocation, along with some more detailed information, like a full breakdown of the query plan nodes used to execute the query. Conversely, a set of UDDSketches represents many query invocations, which lets us store stats for orders of magnitude more queries than if we stored only raw records. Since last year, we have generally sampled fewer raw records, now collecting only about 25% of queries in this form. The node-level breakdown of execution is useful for understanding how the custom plan nodes we’ve created for TimescaleDB are performing across the fleet.
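As a rough illustration of why sketches are so much cheaper than raw records: a single UDDSketch summarizes the value distribution for many invocations of a query, and you can read approximate percentiles straight out of it. The query below uses the `percentile_agg` and `approx_percentile` functions from the timescaledb_toolkit extension (which build a UddSketch under the hood) against the hypothetical `query_stats` table sketched above:

```sql
-- The toolkit extension provides the sketch-based aggregates.
CREATE EXTENSION IF NOT EXISTS timescaledb_toolkit;

-- One sketch per query per minute summarizes every invocation in that minute;
-- approx_percentile() then reads an estimate (here, p95 execution time) out of it.
SELECT
    time_bucket('1 minute', ts) AS bucket,
    query_hash,
    approx_percentile(0.95, percentile_agg(exec_time_ms)) AS p95_exec_time_ms
FROM query_stats
WHERE ts > now() - INTERVAL '1 hour'
GROUP BY bucket, query_hash;
```

In our pipeline the sketches arrive pre-built at ingest time rather than being computed from raw records on the fly like this, but the rollup and percentile-extraction behavior is the same.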

Adding new metrics to track has been straightforward: just new columns on our existing hypertables. But because we’ve essentially tripled the number of metrics we collect, this does put more pressure on storage. For raw records, as mentioned above, we have simply reduced the sampling rate while continuing to tier data aggressively. For the sketch records, we’ve also begun tiering that table. This lets us keep the database's active dataset at around 10 TB (roughly 80 TB before compression with Timescale's row-columnar storage engine), with the rest (1 PB+) tiered.
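The policies behind that look roughly like the following. The intervals are illustrative rather than our actual settings, and the `add_tiering_policy` call assumes the tiered-storage API available on Timescale Cloud:

```sql
-- Enable columnar compression on the raw-stats hypertable, segmenting by service
-- so per-service scans over compressed chunks stay cheap.
ALTER TABLE query_stats SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'service_id',
    timescaledb.compress_orderby   = 'ts DESC'
);

-- Compress chunks once they're a few days old (interval is illustrative).
SELECT add_compression_policy('query_stats', INTERVAL '3 days');

-- Move chunks to low-cost object storage a few weeks later
-- (Timescale Cloud tiered storage; interval is illustrative).
SELECT add_tiering_policy('query_stats', INTERVAL '3 weeks');
```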

To allow for aggressive tiering and quick responses to queries from the Insights page, we lean heavily on continuous aggregates (our enhanced version of PostgreSQL materialized views). UDDSketches “roll up” very nicely: you can combine a set of UDDSketches into a new UDDSketch that represents the entire group. This lets us feed the ingested UDDSketches into a hierarchical tree of continuous aggregates with groupings at several granularities (minutes, hours, days).
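Here is a sketch of that pattern, again with hypothetical names: a continuous aggregate builds per-minute UDDSketches, and a second, hierarchical continuous aggregate rolls those up into hourly sketches (a daily level would follow the same shape):

```sql
-- Minute-level sketches, built directly from the ingested stats.
CREATE MATERIALIZED VIEW query_stats_1m
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 minute', ts) AS bucket,
    service_id,
    query_hash,
    percentile_agg(exec_time_ms) AS exec_time_sketch
FROM query_stats
GROUP BY bucket, service_id, query_hash;

-- Hierarchical continuous aggregate: roll the minute-level sketches up to hours.
CREATE MATERIALIZED VIEW query_stats_1h
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', bucket) AS bucket,
    service_id,
    query_hash,
    rollup(exec_time_sketch) AS exec_time_sketch
FROM query_stats_1m
GROUP BY 1, service_id, query_hash;

-- Keep the minute-level aggregate fresh in the background (intervals illustrative).
SELECT add_continuous_aggregate_policy('query_stats_1m',
    start_offset      => INTERVAL '1 hour',
    end_offset        => INTERVAL '1 minute',
    schedule_interval => INTERVAL '1 minute');
```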

With a bit of planning, we’ve been able to have stats available at all the granularities we need to serve users without needing to go to the original hypertables. Inserts stay fast, queries stay fast, and we can tier without fear.

In the future, we may need to deploy read replicas to scale the solution further, separating the write-heavy ingest and aggregation workload from the read-heavy workload driven by customer usage. But as it stands today, we don’t need that: this hundreds-of-billions-of-metrics-a-day pipeline runs smoothly without scaling out.

Final Words

In just a year, Insights has grown in both scale and impact, proving that PostgreSQL—when engineered for scale—can handle immense workloads. 

We’ve gone from tracking roughly a hundred billion metrics daily to collecting hundreds of billions of metrics while storing petabytes of data—all on a Timescale instance. The power of Timescale’s tiered storage, hypertables, and continuous aggregates has allowed us to not just scale but to stay fast and efficient.

If you’ve been thinking about taking your Timescale Cloud database to the next level, rest assured, we’re showing it’s entirely possible—our Cloud is your Cloud. And remember, you will never walk alone. Top-notch support is available for free for all Timescale Cloud customers, and our expert team is ready to guide you every step of the way, all the way to petabyte scale. 

Start scaling—create a free Timescale account today.
