Nov 15, 2024
Posted by
Ryan Booz
Last week, we shared some advice on maximizing your data ingestion rates, and today, we’re discussing a Timescale classic: chunk size. Many users don’t realize that the size of the data partitions—we call them chunks—within a PostgreSQL table that has been converted into a hypertable is an essential parameter that can affect the performance of their databases and impact how some features work. So first, we’ll explain why this is something you should care about and then show you how to pick your optimal chunk size.
Hypertables are the foundation of TimescaleDB. They add time-series superpowers to PostgreSQL by automatically partitioning data into smaller units—chunks—based on a timestamp in your data.
One of the best things about hypertables is that you, the user, interact with a hypertable just like a regular PostgreSQL table; we put a lot of effort into this abstraction, as we think this is the best experience for users. You shouldn’t constantly be thinking about this internal partitioning; we believe this is something that your database should be doing for you, and that’s what TimescaleDB does.
Because TimescaleDB creates chunks based on time, a hypertable has a setting that determines the earliest and latest timestamp stored in each chunk. We call this setting chunk_time_interval
and set it to 7 days
unless you configure it otherwise.
Default settings like this are helpful so that you can start using a hypertable quickly. However, every use case is different, and default settings are useful to get generally good performance. But, if you want to fine-tune TimescaleDB to perform well for your application, you may want to tweak this setting manually.
The first reason why this matters is that TimescaleDB uses chunks (partitioning) to write and read data quickly. If your chunks are too big (i.e., you store too much data per partition), the benefit of partitioning diminishes because you have more data in the partition than the PostgreSQL cache available to manage it, similar to having one large regular PostgreSQL table. But if you have too many chunks (i.e., you store only a little data per partition), you may overwhelm the query planner or create extra overhead in other management areas.
The second reason why this setting matters is because many TimescaleDB features are affected by the chunk size, including our top features: compression, continuous aggregates, and data retention policies. All these features consider the chunk interval (e.g., seven days of data) or chunk constraints (earliest timestamp and latest timestamp).
For instance, data is compressed by chunks, not entire hypertables or individual rows. Retention policies delete chunks while continuous aggregates (which store their data in special hypertables as well) automatically set their chunk size based on the parent hypertable setting. Therefore, these features are affected by how you configure your chunk size within the context of your hypertable.
With that background, let's consider a few specifics when modifying the chunk_time_interval
of your hypertable.
TimescaleDB is a powerful time-series database with lots of 💯 features that help you ingest and query massive amounts of time-series data. Most of these features depend on each other to work best, and your priorities (or expectations) might not fully align with the features’ design, specifically how chunk_time_interval
impacts how they work.
Do you want to optimize for the best compression ratio? Larger chunks that store more rows per device will probably be better, generally speaking.
Do you want to drop raw data as quickly as possible after materializing data in a continuous aggregate? Smaller chunks that more closely align with the time_bucket
interval of the continuous aggregate definition are a good starting place.
Do you want to ensure that your server resources are using hypertables efficiently? You'll probably have to find a compromise of chunk_size
that aligns with other priorities and still ensures active chunks from all hypertables reside in about 25 percent of your PostgreSQL memory allocation.
There is admittedly a process to nail down what interval setting is best for a specific hypertable based on your data, application, and query patterns. It's easiest to modify chunk_time_interval
if you have some level of understanding of the data that you have.
If you have an existing database that stores data similar to what will be inserted into the hypertable, look for how many rows you track each hour (or day). Within that same period, estimate the density of readings per device. And finally, if possible, import a representative set of data into a hypertable with the default setting of '7 days' and then look at how big each chunk is using the chunks_detailed_size(
)
.
If the chunks only take a fraction of your available memory, drop the hypertable and try again with '14 days' or '30 days' and evaluate again. If the chunks are very large and one or two chunks would take up a lot of your available memory, try again with '1 day' and see how it goes.
If you don't yet have example data, try to do some rough estimates using the same criteria.
It's often very helpful to create sample data using PostgreSQL to help you "see" what your final data might look like in a hypertable, how big chunks are, etc. See our series (part 1, part 2, part 3) on using generate_series()
to create lots of sample data for easy testing!
With information about your data and sample chunk size in hand, let's walk through an example of how to merge all of this into a "sweet spot" setting for a sample dataset, talking about some of the compromises you may have to make when working with all of the TimescaleDB features.
Our hypothetical server and dataset are outlined in the following table. For our purposes, assume that each row of data records 10 columns of data.
Server Resources and Sample Dataset
Application Data Expectations
With a clear understanding of our server, dataset, and expectations for how we want to manage the data, let's look at some optimal settings for chunk_time_interval
for this example dataset.
Our database server has 16 GB of memory available, and we want our active chunks to be around 4 GB total. The average size of a seven-day chunk is approximately 1 GB currently, and we know that our application typically queries for the last 14 days. Other TimescaleDB features aside, it seems that ‘7 days' is a good starting point for chunk_time_interval
because more than 14 days of data could reside in 4 GB of memory. We might even consider increasing chunk_time_interval
to '10 days' or even '14 days' if other features below would benefit from having more data per chunk, but we probably wouldn't go smaller.
Compression is one of the TimescaleDB features that is impacted by both the size of the chunk (e.g., how many rows per device are stored in the chunk) and the chunk constraints based on time, something we track as range_begin
and range_end
in TimescaleDB.
First, when you're looking to maximize the compression ratio, having more rows per device will almost always provide better compression overall. TimescaleDB compression works by grouping up to 1,000 rows of data at a time and performing various types of compression against the groups of column data. If your chunks are too small and you're not saving many values per chunk, per device, your compression ratio will probably be smaller.
In our case, each chunk should have approximately 10,000 rows per device, more than enough to get decent compression. This also shows that we could easily have '14 day' or '30 day' chunks and still not have an overwhelming amount of data in each chunk, particularly after compression.
The second consideration is how quickly data in chunks will be compressed after our '1 month' setting, a value that's calculated as now()-'1 month'::interval'
each time the policy runs. Chunks will not be compressed until all of the data in the chunk is older than the policy interval specifies.
Since we want data to be compressed after one month, the value of now() - '1 month'::interval'
will be compared to the range_end
of a chunk. If range_end
is older than the policy interval, the chunk will be compressed. If it's not, TimescaleDB knows that some rows in the chunk are still newer than the compression policy specifies, and it will not compress that chunk. If the policy doesn't run for another day (or another week), you'll have data older than '30 days', which is still uncompressed.
If your requirements allow that kind of "flex" in data compression, then a chunk_time_interval
of '7 days' (or '14 days') will allow you to pack the chunks more densely with device readings while still compressing data close to the 30-day window. It will almost always be compressed a little later than 30 days (unless all of the stars align), but we'll still have most of our oldest data in a compressed chunk.
With data retention, TimescaleDB drops entire chunks. It doesn't DELETE
data. Therefore, data retention, like compression, relies on the range_end
of a chunk to determine when to drop a chunk.
Again, the chances of the chunk range_end
and the timestamp that's calculated by the policy (now()-'3 months'::interval'
) aligning perfectly is unlikely to happen at exactly three months, but the chunk will be dropped within a few days of that marker in most cases. If dropping data at exactly three months is critical, you'll need a smaller chunk size, but that will probably impact the efficiency of compression.
Data retention expectation is typically the one that needs to compromise because of how the other TimescaleDB features interact with the hypertable and chunk size. Lowering the chunk_time_interval
to '1 day' so that chunks are dropped as close to the 3 month
mark as possible will reduce the effectiveness of compression efficiency to a minimum and impact how continuous aggregates are sized as well.
If you're focused on precise data retention as an end goal but not using compression, you're not using the full power of TimescaleDB and should re-evaluate your data management expectations.
Finally, our application will have three continuous aggregates that use a time_bucket
of 30 minutes, one hour, and one day, each with a different interval of time that we want to drop the aggregated data using a data retention policy. This is where the hypertable chunk_time_interval
impacts continuous aggregates the most.
If we keep the chunk_time_interval
set to '7 days', any continuous aggregate we create will automatically use the setting of the underlying hypertable to set the chunk_time_interval
of the materialized hypertable. By default, TimescaleDB sets this to 10x the chunk size of the underlying hypertable.
With our example data, that means that all three continuous aggregates will have a chunk_time_interval
of '70 days'. This is probably a reasonable setting from a data density perspective (our aggregates are reducing the total number of rows in each successive continuous aggregate), but that '70-day' interval will impact our ability to drop data from the aggregate in the timeframe we specified.
The 30-minute' continuous aggregate can have data drop after seven days. Unfortunately, since the chunk_time_interval
of the continuous aggregate is 70 days, in this case, that data won't actually be dropped for another two months later than I was expecting. The same holds for the '1-hour' aggregate. If we try to drop data from the continuous aggregate after '1 month', the most recent chunk still has another month or more until it's eligible to be dropped based on the chunk range_end
value.
This is the most important aspect of the hypertable chunk_time_interval
that impacts continuous aggregates. It's worth considering how any change in chunk size of the hypertable will necessitate a change to the chunk_time_interval
of the continuous aggregate, at least based on your data requirements.
With our requirements in hand and having examined how `chunk_time_interval` would impact various TimescaleDB features, here is where we would probably start the settings for this hypertable. Recognize that this is an example and each dataset is specific, so use the information above, a sample of your data, and features like Timescale database forking to quickly iterate changes in your settings to see if they will have the intended impact.
Hypertable chunk_time_interval
For this table, we'd recommend starting at '14 days' for this application. Our reasoning for this includes:
Continuous Aggregate chunk_time_interval
Based on our data retention settings for our continuous aggregates, we'll set the chunk_time_interval
of each slightly different to accommodate data retention better.
Modify this setting immediately after creating the aggregate but before any data is materialized through a refresh policy (or manually). For example:
-- Create the continuous aggregate WITH NO DATA so that we can
-- modify the chunk_time_interval before any data is added
CREATE MATERIALIZED VIEW data_view_30min
WITH (timescaledb.continuous)
AS
SELECT time_bucket('30 minutes', time) as bucket,
…
WITH NO DATA;
SELECT set_chunk_time_interval('data_view_30min', INTERVAL '14 days');
-- Start the refresh policy after setting the new chunk time
-- to start populating the continuous aggregate
SELECT add_continuous_aggregate_policy('data_view_30min',
start_offset => INTERVAL '1 day',
end_offset => INTERVAL '1 min',
schedule_interval => INTERVAL '10 min');
These changes are made specifically to accommodate data retention on continuous aggregates. You should verify that queries perform as expected.
Optimizing the chunk_time_interval
parameter of your hypertables can dramatically impact the efficiency and performance of many key TimescaleDB features.
Spending time understanding your data, server resources, and application data requirements will go a long way in helping you fine-tune this setting. Also, remember to take advantage of forks for testing in Timescale—you can create a Timescale account here and use it for free for 30 days.
If you have any further questions, the Timescale Community Forum is always open. Don’t hesitate to ask!