The Problem With Locks and PostgreSQL Partitioning (and How to Actually Fix It)
In my career, I have frequently worked as a software engineer focused on PostgreSQL databases at companies with large amounts of time-partitioned data.
We’d already grown past the point where deleting data row by row was practical, so we needed to use PostgreSQL partitioning to manage data retention. In brief, dropping a whole partition lets PostgreSQL remove an entire file from disk for a subset of your data rather than going through the rows and removing them individually, so it’s much faster. But if you are doing partitioning natively in PostgreSQL, you do have to make sure to add new partitions where you’re ingesting new data and to drop the old ones.
This was a frequent cause of outages for us, even if we had reasonably well-tested scripts for adding and removing partitions. Unfortunately, the interactions around the scripts were less well-tested, and new, frequent, and long-running queries prevented the partition management scripts from getting the locks required and creating new partitions. We didn’t see the problem at first because we’d created partitions a few days in advance, but then we ran out of time, and with no new partitions, we couldn’t insert, and whoops, down goes the app.
These types of problems are particularly hard to debug and disentangle because they are often caused by totally unrelated pieces of code in combination with changes in load. PostgreSQL has begun to address this with newer approaches for attaching and detaching partitions under weaker locks, but they’re quite complex.
I’ve seen the outages caused by partitions failing to be created, disks filling up because partitions can’t be dropped, and partition management code pausing otherwise normal queries. I know how difficult these problems can be. This is why TimescaleDB's hypertables were so exciting to me when I discovered them, especially their lock minimization strategies.
Understanding PostgreSQL Locks
What are PostgreSQL locks?
PostgreSQL locks are mechanisms that control concurrent access to data in the database to ensure consistency, integrity, and isolation of database transactions.
PostgreSQL, like most other relational database management systems, is a concurrent system, which means that multiple queries can be processed at the same time. Locks help in managing multiple transactions attempting to access the same data simultaneously, avoiding conflicts and potential data corruption.
Why are PostgreSQL locks necessary?
Concurrency is essential for optimizing the performance and responsiveness of the database. However, concurrency introduces several challenges that need careful handling to ensure the database’s integrity, consistency, and reliability:
- When multiple queries are executed concurrently, there's a risk that one transaction might view inconsistent or uncommitted data modified by another ongoing transaction. This can lead to erroneous results and inconsistencies in the database.
- Queries executing simultaneously can interfere with each other, leading to performance degradation, locking issues, or inconsistent data.
- When two transactions try to modify the same data simultaneously, it can lead to conflicts, data corruption, or loss of data.
Locks are necessary to prevent these problems.
Types of PostgreSQL locks
PostgreSQL supports many different types of locks, but the three relevant to this article are ACCESS SHARE, SHARE UPDATE EXCLUSIVE, and ACCESS EXCLUSIVE locks.
ACCESS SHARE locks are the least restrictive and are intended to prevent the database schema from changing underneath a query (and the related caches from being cleared mid-query). Access share locks are acquired for database read operations; their purpose is to block access exclusive locks.

SHARE UPDATE EXCLUSIVE locks allow concurrent writes to a table but block operations that change the database schema in ways that might interfere with running queries. These are used for some forms of concurrent schema changes in PostgreSQL, though two concurrent transactions cannot both take this lock on the same table. For example, you cannot concurrently detach and attach the same partition to/from the same parent table in different sessions; one must complete before the other starts. These locks are generally used for concurrency-safe schema changes that do not clear cached relation information.

ACCESS EXCLUSIVE locks are the most restrictive and are intended to prevent other queries from operating across a schema change. An access exclusive lock conflicts with every other lock mode, so it blocks all other transactions on the locked table.
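A quick way to get a feel for how these modes conflict is to take a lock by hand and watch the result in pg_locks. This is only an illustrative sketch (it assumes a table named partition_test, like the example created later in this article):

-- Session 1: take the most restrictive lock explicitly
BEGIN;
LOCK TABLE partition_test IN ACCESS EXCLUSIVE MODE;

-- Session 2: even a plain read now waits, because the ACCESS SHARE lock
-- it needs conflicts with the ACCESS EXCLUSIVE lock held by session 1
SELECT count(*) FROM partition_test;

-- Session 3: see who holds the lock and who is waiting for it
SELECT pid, mode, granted
FROM pg_locks
WHERE relation = 'partition_test'::regclass;

-- Session 1: release the lock and let the reader proceed
ROLLBACK;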
Cache invalidation and ACCESS EXCLUSIVE locks
For performance reasons, PostgreSQL caches information about tables and views (which we call “relations”) and uses this cached information in query execution. This strategy is instrumental for PostgreSQL's efficiency, ensuring that data retrieval is quick and resource utilization is optimized.
A critical scenario that needs meticulous handling lock-wise is when the structure of tables is altered. When the schema is altered (e.g., by adding or dropping columns, changing data types, or modifying constraints), the cached information related to that table might become outdated or inconsistent. It therefore needs to be invalidated and refreshed to ensure that query execution reflects the modified schema.
To make this work, PostgreSQL takes an access exclusive lock on the table in question before the cached information for that relation can be invalidated.
Using PostgreSQL Partitioning to Simplify Data Management
In PostgreSQL declarative partitioning, PostgreSQL tables are used both for empty parent tables and for partitions holding the data. Internally, each partition is a table, and there is mapping information used by the planner to indicate which partitions should be looked at for each query. This information is cached in the relation cache.
When tables are partitioned based on time, it allows for an organized structure where data is segmented into specific time frames. This makes data management much faster, since dropping a whole partition allows PostgreSQL to remove the entire partition from disk rather than going through each row and removing them individually.
In PostgreSQL, you can follow two general approaches for managing partitions and data retention, which, as we'll see later, come with different concurrency considerations and problems.
Approach #1: Dropping partitions
In the first approach, we simply drop partitions from a partitioned table when we want to delete data.
CREATE TABLE partition_test (
event_time timestamp,
sensor_id bigint,
reported_value float
) partition by range (event_time);
-- Create partition
CREATE TABLE partition_test_2022 PARTITION OF partition_test
    FOR VALUES FROM ('2022-01-01 00:00:00') TO ('2023-01-01 00:00:00');

-- Drop partition
DROP TABLE partition_test_2022;
Approach #2: Concurrent workflow
PostgreSQL also offers a more concurrent workflow for these operations: since PostgreSQL 12, attaching a partition requires only a SHARE UPDATE EXCLUSIVE lock on the parent table, and PostgreSQL 14 added DETACH PARTITION ... CONCURRENTLY.
CREATE TABLE partition_test_2022 (LIKE partition_test);
ALTER TABLE partition_test ATTACH PARTITION partition_test_2022
    FOR VALUES FROM ('2022-01-01 00:00:00') TO ('2023-01-01 00:00:00');
To remove a partition concurrently, we can:
ALTER TABLE partition_test DETACH PARTITION partition_test_2022 CONCURRENTLY;
DROP TABLE partition_test_2022;
The Problem With Locks and PostgreSQL Partitioning
From a database administration perspective, neither of these approaches is very safe.
Both partition creation and dropping require an access exclusive lock on partition_test, meaning that once the query is issued, no other queries can run against that table until the query concludes and the transaction is committed or rolled back.
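To make this concrete, here is a rough sketch (reusing the example tables from above) of how an ordinary long-running read is enough to block partition creation, and how that blocked DDL then stalls every later query on the table:

-- Session 1: a long-running report holds an ACCESS SHARE lock on
-- partition_test for as long as it runs
SELECT sensor_id, avg(reported_value)
FROM partition_test
GROUP BY sensor_id;

-- Session 2: creating a new partition needs ACCESS EXCLUSIVE on the
-- parent table, so it queues behind session 1...
CREATE TABLE partition_test_2023 PARTITION OF partition_test
    FOR VALUES FROM ('2023-01-01 00:00:00') TO ('2024-01-01 00:00:00');

-- ...and because an ACCESS EXCLUSIVE request is now queued, every new
-- query against partition_test (including inserts) waits behind session 2.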
As for the concurrent approach, it still has to address the issue of clearing the relation cache. It does so in two stages: first, a share update exclusive lock is taken on partition_test, and then information is written to the catalogs indicating that the table will be removed from the partition list. The backend then waits until all running queries have concluded (and all transactions guaranteeing repeatable reads have concluded) before removing the table from the partition map.
This approach does not rely on locks to signal that the process is complete, only to prevent multiple concurrent updates for the status of the same set of partitions. As a result, even unrelated queries can block the detach operation. If the partition management script’s connection is interrupted for any reason, cleanup processes must be performed by the database administrator.
Once the partition is removed from the partition list, it is locked in access exclusive mode and dropped.
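Roughly, this plays out as follows in practice (again a sketch using the example tables from above): any transaction that is still using the partitioned table makes the concurrent detach wait.

-- Session 1: an open transaction that has read from the partitioned table
BEGIN;
SELECT count(*) FROM partition_test;
-- ...the transaction stays open...

-- Session 2: the first phase finishes quickly, but the command then waits
-- for session 1 (and any other transactions using the table) to finish
ALTER TABLE partition_test DETACH PARTITION partition_test_2022 CONCURRENTLY;

-- Session 1: only once this commits (or rolls back) does session 2 return,
-- after which the detached table can be dropped while locking it alone
COMMIT;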
In conclusion,
- The first approach (manually creating and dropping partitions) involves relatively quick operations but forces hard synchronization points on partitioned tables, which in time-series workloads are usually partitioned precisely because they are heavily used. Problems here can cause database outages fairly quickly.
- The concurrent workflow doesn’t always solve these problems. In mixed-workload applications, waiting for all running queries to complete (which can include long-running automatic maintenance tasks) can lead to long delays, dropped connections, and general difficulties in actually managing data retention. Particularly under load, these operations may not perform well enough to be useful.
Common Advice on How to Fix This Problem (and Why It's Not the Best)
The overall problems of partition management with time-series data fall into two categories:
1) Failure to create partitions before they are needed can block inserts.
2) Dropping partitions when needed for regulatory or cost reasons not only can fail but can also block reading and writing to the relevant tables.
If you ask for advice, you'll probably hear one of these two things:
Use custom scripts
Many companies begin their partition-management journey with custom scripts. This has the advantage of simplicity, but the disadvantage is that the operations can require heavy locks, and there is often a lack of initial knowledge on how to address these.
Custom scripts are the most flexible approach to the lock problems of partition management because the entire toolkit is available (lock escalation, time-out and retry, and more). This allows knowledgeable teams to build solutions that work around existing database workloads with the best chance of success.
On the other hand, this problem space is full of landmines, and teams often do not begin with the knowledge needed to navigate these hazards successfully.
A second major problem with custom scripts is that database workloads can change over time, and this is often out of the hands of the responsible team. For example, a data science team might run workloads that interfere with production in ways the software engineering teams had not considered.
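To illustrate the time-out-and-retry part of that toolkit, a custom script can set lock_timeout so that a blocked partition creation fails fast and is simply retried on the next run, instead of queueing behind long-running queries and stalling the whole table. This is a minimal sketch; the partition name and boundaries are only examples:

-- Give up quickly if the ACCESS EXCLUSIVE lock cannot be obtained
SET lock_timeout = '2s';

DO $$
BEGIN
    CREATE TABLE IF NOT EXISTS partition_test_2024 PARTITION OF partition_test
        FOR VALUES FROM ('2024-01-01 00:00:00') TO ('2025-01-01 00:00:00');
EXCEPTION WHEN lock_not_available THEN
    -- Log the failure and let the calling script retry later
    RAISE NOTICE 'partition creation timed out waiting for a lock; will retry';
END
$$;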
Use pg_partman
pg_partman provides a general toolkit for partition management, which can mitigate the problem on some workloads. pg_partman takes a time-out-and-retry approach to partition creation and removal, meaning that, depending on the configuration and how things are run, the functions will run in an environment where a lock time-out is set. This prevents a failed lock from leading to an outage, but there is no guarantee that the lock will be obtained before the partitions are required.
In most cases, you can tune these features to provide reasonable assurance that problems will usually be avoided. Still, workloads exist that prevent the partition management functions from successfully running in such an environment.
pg_partman is a good tool and an important contribution to this topic, but at scale and under load, it will only work in cases where you have a real opportunity to get the locks required within the lock time-out. I have personally worked in environments where important services would have to be briefly disabled to allow this to happen.
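For reference, a minimal pg_partman setup looks roughly like this. Function arguments have changed between pg_partman versions (the call below follows the older v4-style signature), so treat it as a sketch rather than copy-paste configuration:

CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;

-- Register the partitioned table and keep a few future partitions premade
SELECT partman.create_parent(
    p_parent_table => 'public.partition_test',
    p_control      => 'event_time',
    p_type         => 'native',
    p_interval     => 'monthly',
    p_premake      => 4
);

-- Run maintenance on a schedule (cron, pg_cron, or the background worker),
-- ideally with a lock time-out so a blocked run fails fast
SET lock_timeout = '1s';
SELECT partman.run_maintenance();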
How TimescaleDB Solves the Problem of Locking in PostgreSQL Partitioning
Instead of using PostgreSQL native partitioning, you can install the TimescaleDB extension and use hypertables, which are PostgreSQL tables that are automatically partitioned. This solves the problems caused by locking, since hypertables minimize locks by design.
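As a brief sketch of what that looks like (using an illustrative sensor_data table with the same columns as the earlier example), you create the table as usual, convert it into a hypertable, and attach a retention policy; no manual partition management is needed after that:

CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE sensor_data (
    event_time timestamptz NOT NULL,
    sensor_id bigint,
    reported_value float
);

-- Chunks (partitions) are created automatically as data arrives
SELECT create_hypertable('sensor_data', 'event_time');

-- Old chunks are dropped in the background by a retention policy
SELECT add_retention_policy('sensor_data', INTERVAL '1 year');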
TimescaleDB automatically partitions hypertables into chunks, organized by various partitioning criteria, usually time. This implementation is independent of PostgreSQL’s partitioning strategies and has been optimized as an independent add-on to PostgreSQL rather than a part of PostgreSQL core. TimescaleDB does not use inheritance as a table partitioning structure either, nor does TimescaleDB rely on the relation cache mentioned above for determining which chunks to scan.
Within a TimescaleDB hypertable, chunks are added transparently as needed and removed asynchronously without intrusive locks on the parent table. TimescaleDB then uses various strategies to hook into the planner and execute TimescaleDB-specific approaches to partition selection and elimination. These strategies require locking the chunk table with intrusive locks but not locking the parent.
This approach can lead to problems at the serializable transaction isolation level because once the underlying partition is gone, it is gone. If a serializable transaction starts and chunks are then dropped, this can result in serialization errors or isolation violations.
Lock minimization
PostgreSQL has traditionally taken the view that concurrency is not extremely important for database operations while Data Definition Language (DDL) commands are run. Historically, this was true. Even today, DDL commands are usually run sufficiently infrequently that the database can absorb the performance hit of treating DDL commands as synchronization points.
The emerging problems of heavy PostgreSQL users today are not usually performance problems but the fact that applications are often not written with an awareness of what these added synchronization points will mean. In my experience, these synchronization points themselves are a significant cause of database outages among large-scale PostgreSQL users.
Timescale has been built to avoid the sort of locking problems that currently exist with PostgreSQL’s declarative partitioning simply because this is a common problem in time-series workloads.
TimescaleDB maintains its own chunk catalogs and only locks the partitions that will be removed. The catalog entry is removed, then the chunk table is locked and dropped. Only an access share lock is taken on the top-level table. This means that reads and even writes can be done to other chunks without interfering with dropping or adding chunks.
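To make that concrete, here is a short, illustrative sketch (using the hypothetical sensor_data hypertable from above) of what retention looks like from the user's side; chunks can be listed and dropped while other sessions keep reading and writing recent data:

-- List the chunks backing the hypertable; each one is an ordinary table
SELECT show_chunks('sensor_data');

-- Drop everything older than a year; only the affected chunks are locked
SELECT drop_chunks('sensor_data', older_than => now() - INTERVAL '1 year');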
TimescaleDB’s current approach has one limitation when used under serializable transactions. Currently, if you use serializable transactions, there are certain circumstances where a transaction could go to read dropped chunks and no longer see them, resulting in a violation of the serialization guarantees. This is only a problem under very specific circumstances, but in this case, TimescaleDB behaves differently than PostgreSQL’s concurrent DDL approaches.
In general, though, you should only drop chunks when you are reasonably sure they are not going to be accessed if you use serializable transaction isolation.
Why is PostgreSQL not doing this?
TimescaleDB’s solution cannot be perfectly replicated with stock PostgreSQL at the moment because dropping partitions requires active invalidation of cached data structures, which other concurrent queries might be using.
Offering some sort of lazy invalidation infrastructure (via message queues, etc.) would go a long way to making some of this less painful, as would allowing more fine-grained invalidations to caching.
Conclusion
TimescaleDB’s approach to the problem of locking is the best solution today, better than the options available in stock PostgreSQL. But it's not yet perfect; it operates in between the two options given in terms of concurrency capabilities. We cannot drop a chunk that a serializable transaction has read until that transaction concludes, regardless.
Getting there is likely to require some changes to how the table and view characteristics are cached by PostgreSQL and how this cache invalidation works. I think that such improvements would help us on our way toward more transactional DDL, however.
Many ALTER TABLE commands are limited in concurrency largely because of these caching considerations. I think the general success of our approach here is also evidence of a need to address these limitations generally.
In the meantime, if you're planning to partition your tables, check out Timescale. If you're running your PostgreSQL database on your own hardware, you can simply add the TimescaleDB extension. If you're running in AWS, try the Timescale platform for free.