Written by James Blackwood-Sewell and Carlota Soto
PostgreSQL is one of the most popular relational databases on the planet. Its rich feature set and open-source nature make it a popular choice for businesses of all sizes. But it can run into trouble as data grows, especially with single large tables.
One popular strategy to tackle these issues is table partitioning. In this article, we'll explore when and why you should consider Postgres partitioning (especially your large tables), sharing some best practices and common pitfalls if you decide to partition your data.
Table partitioning is a database design technique that divides a large table into smaller, more manageable pieces known as partitions. Each partition is a sub-table that holds a subset of the data, with each row existing in exactly one partition. Partitions can have their own indexes, but any unique indexes (including primary keys) must include the column used to partition the table.
This technique can offer significant advantages when managing large tables. By breaking down a large table into smaller partitions, you can benefit from improved query performance, optimized index sizes, and more efficient data maintenance operations, keeping your large-scale databases more agile and responsive overall.
But partitioning is not a one-size-fits-all solution: sometimes, it won't benefit you. It can even degrade your performance, as we’ll see later on.
PostgreSQL supports three partitioning strategies:
Range partitioning: this strategy is ideal for time-series data or incrementing sequences (maybe a BIGINT primary key), where you partition data based on a range of values (e.g., by day or number of keys).
List partitioning: you would use list partitioning when you want to partition data based on specific values in a column (e.g., country or department).
Hash partitioning: hash partitioning is suitable when there is no clear partitioning key, as it distributes data evenly across partitions based on a hash function.
One crucial characteristic of partitioning in PostgreSQL is that it does not support “global indexes”: there is no way to create an index across partitions. Instead, when querying across partitions, PostgreSQL will evaluate the WHERE
clause, and if there are constraints on the partition column, it will use those to exclude partitions that don’t hold data relevant to the query.
It’s probably clear to you at this point that if you were to implement partitioning in a production setup, you'd need automation in place to create and maintain partitions, especially if the partitioned table expects to receive data continuously. There are multiple ways to do this, the most common being the following:
You can write SQL procedures/functions to check if the necessary partitions for upcoming days/weeks/months exist, using cron (on Unix-like systems) or Task Scheduler (on Windows) to run the script regularly.
You can use a PostgreSQL extension dedicated to this, the most popular being pg_partman (which will probably need to be paired with a scheduler).
You can use Timescale, whose hypertables have built-in automatic partitioning (reducing the creation and maintenance of partitions to a single command).
So, yes, partitioning can be very powerful, but it is certainly not for every use case. The decision to partition a PostgreSQL table is not strictly based on the table's absolute size. Evaluating your database's characteristics and requirements before implementing partitioning is essential.
Generally speaking, you should start thinking about partitioning in one (or more) of the following situations:
You have large tables. As we mentioned, the size of your tables is not the only factor determining whether you may see benefits from partitioning; this said, if you have large tables (tens of millions to billions of rows), you would probably benefit from partitioning.
Your ingestion rate is very high. Even if the current table size isn't massive, a high data ingestion rate can indicate that the table will grow significantly soon. It might be beneficial to preemptively manage this growth with a partitioning strategy before it starts affecting your performance and maintenance operations.
You’re beginning to notice query performance degradation. Partitioning may also be beneficial if your queries are slowing down, especially those that will only touch a subset of your data. You can benefit from this even when your tables are smaller. For example, partitioning can significantly enhance query performance when your daily queries include searches based on a specific range or criteria. Let's say you're dealing with time-series data: partitioning by date can help you quickly retrieve records within a particular time frame without scanning the entire table.
You’re dealing with maintenance overhead. As a table grows, maintenance operations like VACUUM
, ANALYZE
, and indexing can take longer and might start impacting your operational efficiency. Partitioning can simplify these operations because you can focus on maintaining smaller partitions independently, reducing the impact on your database's overall performance.
You’re managing data retention policies. If your dataset has built-in obsolescence that periodically purges older data, partitioning can make these operations much more efficient. Dropping an old partition is much faster and less resource-intensive than deleting rows.
You want to use less memory. If you operate with limited memory, you might benefit from partitioning, as smaller indexes and data chunks fit better in memory and improve cache hit rates. In most cases, this will also improve performance.
There are situations where introducing partitioning may be counterproductive, and you may want to look at other optimizations instead, such as tuning your indexes or queries:
Your tables are small, and you’re not ingesting at high rates.
If your tables are not large and are unlikely to grow significantly, partitioning would add unnecessary complexity without delivering noticeable benefits.
You’re just starting with PostgreSQL. Implementing and maintaining partitioned tables introduces a certain level of operational and architectural complexity: managing such tables is more challenging, so make sure you feel ready to go there—or use a hosted service like Timescale to manage this for you.
You have uniform data access patterns. If your queries usually access data uniformly across the table rather than focusing on specific subsets (like recent data, ranges of data, specific categories, and so on), then your queries could actually perform worse.
Frequent full scans. If most of your queries don’t use your partition key in the WHERE
clause, you’ll scan every partition every time. This scan will be slow and will get slower the more partitions you have.
List partitioning is useful when you want to partition data based on a finite set of values, such as categories or regions. Each partition will contain data that matches a particular list of values.
Example:
Suppose we have a table called orders
and we want to partition it based on different regions: 'North', 'South', 'East', and 'West'.
-- Create the parent table
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
order_date DATE,
region TEXT,
amount NUMERIC
) PARTITION BY LIST (region);
-- Create partitions for each region
CREATE TABLE orders_north PARTITION OF orders FOR VALUES IN ('North');
CREATE TABLE orders_south PARTITION OF orders FOR VALUES IN ('South');
CREATE TABLE orders_east PARTITION OF orders FOR VALUES IN ('East');
CREATE TABLE orders_west PARTITION OF orders FOR VALUES IN ('West');
In this example, any INSERT
statement will automatically route data to the correct partition based on the region
value.
Range partitioning is useful when you want to partition data based on a continuous range of values, such as dates, numerical values, or other ordered data.
Example:
Suppose we want to partition the orders
table by the year of the order_date
.
-- Create the parent table
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
order_date DATE,
region TEXT,
amount NUMERIC
) PARTITION BY RANGE (order_date);
-- Create partitions for each year
CREATE TABLE orders_2023 PARTITION OF orders FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2022 PARTITION OF orders FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
CREATE TABLE orders_2021 PARTITION OF orders FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');
Here, records are routed based on their order_date
. For example, a record with an order_date
of 2023-05-15
will be placed in the orders_2023
partition.
Hash partitioning is useful when you want to distribute rows across partitions in a way that balances the load. Data is divided using a hash function that evenly distributes data across the specified number of partitions.
Example:
Let’s partition the orders
table by a hash of the order_id
, spreading the rows across four partitions.
-- Create the parent table
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
order_date DATE,
region TEXT,
amount NUMERIC
) PARTITION BY HASH (order_id);
-- Create 4 partitions
CREATE TABLE orders_p1 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE orders_p2 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE orders_p3 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE orders_p4 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 3);
In this case, each row will be placed in one of the four partitions based on the result of a hash function applied to order_id
. PostgreSQL will compute the hash and then distribute the rows across the partitions.
Now let’s see how you can create a partitioned PostgreSQL table in Timescale using hypertables. Hypertables automatically partition your data—you can learn more about them in the Timescale documentation.
Log into your Timescale Cloud account. If you don’t have an account, you can create one for free and try it for 30 days. Alternatively, you can install TimescaleDB on your machine and then enable it.
-- Enable the TimescaleDB extension
CREATE EXTENSION IF NOT EXISTS timescaledb;
Next, create a similar case from the previous, with an orders
table that will store the time-series data.
-- Create the sensor_data table
CREATE TABLE orders (
customer_id BIGINT NOT NULL,
order_time TIMESTAMPTZ NOT NULL,
region TEXT,
amount NUMERIC,
PRIMARY KEY (order_time, customer_id) -- Composite key for uniqueness
);
This is a standard table that will later be converted into a hypertable. The order_time
column is required for time-series data, and we have replaced the primary key with a combined order_time
+ customer_id
column to uniquely identify the orders from which customer.
Convert the orders
table into a hypertable using the create_hypertable()
function. Here, we will partition the data by time into chunks of one (1) day.
-- Convert the table to a hypertable partitioned by time (daily chunks)
SELECT create_hypertable('orders', by_range('order_time', INTERVAL '1 day'));
Similar to hashing partition previously presented, you can use space partitioning on the region
to distribute the data across multiple partitions using add_dimension.
SELECT add_dimension('orders', by_hash('region', 4));
Explanation:
The order_time
column is used for time-based partitioning.
chunk_time_interval => INTERVAL '1 day'
specifies that the time-based partitioning should create chunks of data for each day.
The region
column is used for space partitioning, with 4
(four) partitions.
You can now insert data into the hypertable, and TimescaleDB will automatically manage the partitioning and chunking of the data.
-- Insert some sample data
INSERT INTO orders (order_time, customer_id, region, amount)
VALUES
('2024-08-01 00:00:00', 1, 'Region A', 600),
('2024-08-01 00:00:00', 2, 'Region A', 580),
('2024-08-01 01:00:00', 1, 'Region B', 610),
('2024-08-01 01:00:00', 2, 'Region C', 590),
('2024-08-02 00:00:00', 1, 'Region B', 620),
('2024-08-02 00:00:00', 2, 'Region C, 570);
Each row will be routed to the appropriate chunk based on the time
column. The data will be further distributed across the space partitions based on the sensor_id
.
You can query the sensor_data
hypertable just like a normal PostgreSQL table. TimescaleDB will automatically optimize the query by accessing only the relevant chunks.
-- Query data for a specific time range
SELECT * FROM orders
WHERE order_time BETWEEN '2024-08-01 00:00:00' AND '2024-08-01 23:59:59';
You can view the chunks that TimescaleDB has created using the show_chunks()
function.
-- Show the chunks created by TimescaleDB
SELECT show_chunks('orders');
This will show the time ranges and partitions that TimescaleDB has automatically created to store your data.
For more advanced tips on how to use hypertables in TimescaleDB, check out this video.
But if you decide to do it, here's how to do it properly:
Choose the right partition size. What’s your ideal partition size? It depends, but you should aim for a balance. While PostgreSQL can handle a high number of partitions, having too many will increase planning time and could negatively affect query performance. At the same time, if your partitions are too large, you won’t be able to use ranges to exclude data, undermining the effectiveness of partition pruning.
Keep your partition size consistent. Aim to keep partitions relatively uniform in size, ensuring that maintenance tasks and query performances are consistent across partitions.
Choose the right partitioning key. Opt for a key that aligns with your query patterns. For instance, if most of your queries filter by date, a timestamp or date column would be an ideal partitioning key.
Create partitions in advance. Be sure to create partitions for upcoming periods (e.g., future days or months) ahead of time so there's no interruption in data ingestion. While you can use a default partition to catch orphaned records, in practice, this introduces a maintenance burden and does not perform well.
Take advantage of data retention policies to maintain old partitions. For example, if you're partitioning by time and data has a limited useful life, schedule regular tasks to drop or archive old partitions.
Optimize your queries. If you’re especially interested in optimizing query performance, analyze and understand the query execution plan to validate that it is only scanning necessary partitions.
Properly place partitions across different storage mediums. If you're using tablespaces to place partitions on different storage engines (e.g., EBS or S3), ensure that frequently accessed partitions are on faster storage and older or less accessed partitions can be on slower, cheaper storage. Timescale makes this implementation easy via data tiering.
Similarly, make sure to avoid these common mistakes:
Over-partitioning. Creating many small partitions is tempting, but this won’t work well—you’ll get into query planning and management challenges.
Inefficient indexing. Avoid creating unnecessary indexes on your partitions. Only index the columns that are frequently filtered or joined on.
Unoptimized query pattern. Queries spanning multiple partitions or not using the partition key in the WHERE
clause might suffer in performance. Be sure to optimize the majority of your queries for the partitioning scheme.
Running out of partitions. If you insert data with no partition to live in, it will either be rejected or stored in a DEFAULT
partition. Ensure you either pre-create partitions at a quiet time (as this will lock your table) or use an extension that creates new partitions on the fly.
Monitor disk usage (partitions need extra space). If you're creating many partitions, especially on different tablespaces or disks, monitor disk usage to avoid out-of-space issues.
Partitioning PostgreSQL tables can be a great ally in your production database, especially if your tables are getting big, your ingestion is growing, or your query performance is degrading. Still, partitioning is not a one-size-fits-all solution: consider if it’s the right fit for your use case before deciding to implement it.
If you’ve decided to partition your large PostgreSQL tables by time, check out Timescale, which completely automates the creation and maintenance of partitions: you will interact with your tables as if they were regular PostgreSQL tables, while Timescale handles partitioning under the hood. You will get the extra performance—without the extra work.
Want to keep reading? Learn more about determining the optimal Postgres partition size.