Nov 28, 2024
Posted by
Kirk Laurence Roybal
Timescale employs PostgreSQL contributors who are working feverishly to mitigate a problem with PostgreSQL. That problem is commonly referred to as “transaction ID wraparound” and stems from design decisions in the PostgreSQL project that have been around for decades.
Because this design decision was made so early in the project history, it affects all branches and forks of PostgreSQL, with Amazon RDS PostgreSQL, Greenplum, Netezza, Amazon Aurora, and many others suffering from transaction ID wraparound failures.
In this article, the second in a series of posts tackling PostgreSQL errors or issues, we’ll explain what transaction ID wraparound is, why it fails, and how you can mitigate or resolve it. But let’s start with a bit of PostgreSQL history.
To fully understand the problem of transaction ID wraparound (or XID wraparound), a bit of history is in order. The idea of a transaction counter in PostgreSQL originated as a very simple answer to transaction tracking. We need to know the order in which transactions are committed to a PostgreSQL database, so let's enumerate them. What is the simplest way to give transactions a concept of order? That would be a counter. What is the simplest counter? An integer. Tada!
So, form follows function, and we have an integer counter. Seems like an obvious, elegant, and simple solution to the problem, doesn't it?
At first glance (and second and third, honestly), this rather simple solution stood up very well. Who would ever need more than 231 (just over 2 billion) transactions in flight? That was an astronomical number for 1985.
Since this is such a huge number, we should only need a single counter for the entire database cluster. That will keep the design simple, prevent the need to coordinate multiple transaction counters and allow for efficient (just four bytes!) storage. We simply add this small counter to each row, and we know exactly what the high watermark is for every row version in the entire cluster.
This simple method is row-centric and yet cluster-wide. So, our backups are easy (we know exactly where the pointer is for the entire cluster), and the data snapshot at the beginning and end of our transaction is stable. We can easily tell within the transaction if the data has changed underneath us from another transaction.
We can even play peek-a-boo with other transaction data in flight. That lets us ensure that transactions settle more reasonably, even if we are wiggling the loose electrical connectors of our transaction context a bit.
We can stretch that counter quite a bit by making it a ring buffer. That is, we'll OR
the next value to the end rather than add it there. That way, 231 or 1 = 1. So, our counter can wrap around the top (231) and only becomes problematic when it reaches the oldest open transaction at the bottom.
This "oldest" transaction is an upwardly moving number also, which then wraps around the top. So, we have the head (current transaction) chasing the tail (oldest transaction) around the integer, with 2,147,483,648 spaces from the bottom to the top. This makes our solution even look like a puppy dog, so now it's cute as well as elegant.
The idea is that this would make the counter almost infinite, as the head should never catch the tail. At that point, who could possibly need more transactions than that? Brilliant!
Transaction counters are obviously the way to go here. They just make everything work so elegantly.
For many years, PostgreSQL raged forward with the XID wraparound transaction design. Quite a few features were added along the way that were based on this simple counter. Backups (pg_basebackup
and its cousins), replication (both physical and logical), indexes, visibility maps, autovacuum, and defragmentation utilities all sprouted up to enhance and support this central concept.
All of these things worked together brilliantly for quite some time. We didn't start seeing the stress marks in the fuselage until the hardware caught up with us. As much as PostgreSQL wants to turn a blind eye to the reality of the hardware universe, the time came upon us when systems had the capacity to create more than 231 transactions at a time.
High-speed ETL, queuing systems, IoT, and other machine-generated data could actually keep the system busy long enough that the counter could be pushed to its inherent limit.
These processes weren't exactly showstoppers, though. We came up with band-aids for much of it.
COPY
got its own transaction context, reducing the load significantly. So did VACUUM
. VACUUM
sprouted the ability to just freeze transactions without having to do a full row cleanup. That made the tail move forward a bit more quickly. External utilities gained features, individual tables gained VACUUM
settings so they could be targeted separately.
Okay, that helped. But did it help enough? These features were never designed to fundamentally fix the issue. The issue is that size matters. But to be a bit more descriptive...
In the early aughts, I was involved in building a data center for a private company. We spent some $5M creating a full-featured data center, complete with Halon, a 4K generator, Unisys ES7000, and a Clarion array. For the sake of our XID wraparound article, I'll focus on the Clarion array. It cost just a bit over $2M and could hold 96 drives for a whopping total of 1.6 TB! In 2002, that was incredible.
It doesn't seem so incredible now, does it? Kinda disappointing even. A few weeks ago, I needed an additional drive for my home backup unit. As I was walking through a Costco, I absent-mindedly threw a 2 TB drive into my cart that retailed for $69. It wasn't until I got home and was in the middle of installing it that it dawned on me how far we've come in the storage industry.
Some of the young whippersnappers don't even care about storage anymore. They think the "cloud" storage is effectively infinite. They're not wrong.
To bring this around to PostgreSQL, tables with 2M rows were a big deal in 2002. Now that's not even on the radar of "big data." A VLDB (very large database) at the time was 2 TB. Now it's approaching 1 PB.
"A lot" of transactions in 2002 was 2M. Now, I would place that number at somewhere around 2B. Oops. Did I just say 2B? Isn't that close to the same number I said a few paragraphs ago was the limit of our transaction space? Let me see, that was 231, which is 2,147,483,648.
Ouch.
To be fair, not everybody has this problem. 2,147,483,648 is still a really big number, so a fairly small number of systems will ever reach this limit, even in the transaction environment of 2023.
It also represents the number of transactions that are currently in flight, as the autovacuum process will latently brush away transaction counters that are no longer visible to the postmaster (pg_stat_activity
). But if the number of phone calls to consultants is any indication, this limitation is nonetheless becoming quite an issue. It certainly isn't going away any time soon.
Everybody in the PostgreSQL ecosystem is painfully aware of the limitation. This problem affects more than just the core of PostgreSQL, it affects all of the systems that have grown around it also. Do you know what it also affects? All the PostgreSQL-based databases, such as Amazon RDS and Aurora.
To make any changes to the core of PostgreSQL, all of the ramifications of those changes have to be thought out in advance. Fortunately, we have a whole community of people (some of them proudly part of our own organization) that are really, really good at thinking things out in advance.
Query to show your current transaction ages:
with overridden_tables as (
select
pc.oid as table_id,
pn.nspname as scheme_name,
pc.relname as table_name,
pc.reloptions as options
from pg_class pc
join pg_namespace pn on pn.oid = pc.relnamespace
where reloptions::text ~ 'autovacuum'
), per_database as (
select
coalesce(nullif(n.nspname || '.', 'public.'), '') || c.relname as relation,
greatest(age(c.relfrozenxid), age(t.relfrozenxid)) as age,
round(
(greatest(age(c.relfrozenxid), age(t.relfrozenxid))::numeric *
100 / (2 * 10^9 - current_setting('vacuum_freeze_min_age')::numeric)::numeric),
2
) as capacity_used,
c.relfrozenxid as rel_relfrozenxid,
t.relfrozenxid as toast_relfrozenxid,
(greatest(age(c.relfrozenxid), age(t.relfrozenxid)) > 1200000000)::int as warning,
case when ot.table_id is not null then true else false end as overridden_settings
from pg_class c
join pg_namespace n on c.relnamespace = n.oid
left join pg_class t ON c.reltoastrelid = t.oid
left join overridden_tables ot on ot.table_id = c.oid
where c.relkind IN ('r', 'm') and not (n.nspname = 'pg_catalog' and c.relname <> 'pg_class')
and n.nspname <> 'information_schema'
order by 3 desc)
SELECT *
FROM per_database;
Many enhancements have already been made to PostgreSQL to mitigate the transaction ID wraparound problem and solve it permanently. Here are the steps on the way to the solution.
pg_basebackup
) that previously assumed 32-bit integer transactions.VACUUM
, and other processes have been enhanced for 64-bit transactions.The goal of all of these changes is to eventually move to a 64-bit transaction counter for the entire system.
There's a bit of bad news. I'm going to close my eyes while I write this, so I won't have to look at your face while you read it.
Updating the user tables in your database to use 64-bit transaction counters will require rewriting all of your data. Remember at the beginning, where I said the transaction counter was a per-row solution? Oh, yeah.
That means that its limitations are also per row. There are only eight bytes reserved for xmin
and eight bytes for xmax
in the row header. So, every single row of data in the database is affected.
At some point, there will be a major version of PostgreSQL that requires a data dump, replication, pg_upgrade
or another such process to re-create every row in the database in the new format. It is true that every major version of PostgreSQL could change the format of data on disk.
The pg_upgrade
utility will not be able to use symlinks or hardlinks for the upgrade. These links usually allow for some efficiency while upgrading. There will be no such shortcuts when the "fix" for transaction ID wraparound is put into place.
Okay, now for the good news. We will all be in retirement (if not taking a dirt nap) when the next bunch of suckers engineers has to deal with this issue again. 263 is not double the number of transactions. It is 9,223,372,034,707,292,160 (nine quintillion) more.
You can still make use of some basic mitigation strategies for transaction ID wraparound failures:
TOAST
tables separately so the autovacuum has a better chance of catching up.REINDEX CONCURRENTLY
more frequently so that the autovacuum has less work to do.CLUSTER ON INDEX
will re-order the data in the table to the same order as an index, thus "vacuuming" the table along the way.VACUUM FULL
, which blocks updates while vacuuming but will finish without interruption. Let me say that again. There will be no writes while VACUUM FULL
is running, and you can't interrupt it. 😠INSERT
, UPDATE
, and DELETE
operations. The counter is issued per transaction (not row), so grouping operations helps reserve counters.All of these strategies are basically the same thing. The objective is to ensure the tail number (oldest transaction) moves forward as quickly as possible. This will prevent you from ending up in a "transaction ID wraparound" scenario. 🙂 ♥️ 👍
While Timescale—also built on the rock-solid foundation of PostgreSQL— does not solve transaction ID wraparound failure, it can help you prevent it since our ingestion inherently batches the data by design after you install timescaledb-parallel-copy
.
Of course, you can do this for yourself with transaction blocks, but our tools will do the right thing automatically.
We also provide a general-purpose job scheduler that can be useful for adding VACUUM
and CLUSTER
operations.
So, if you want to mitigate the chances of ever dealing with XID wraparound problems while enjoying superior query performance and storage savings compared to vanilla PostgreSQL or Amazon RDS for PostgreSQL, try Timescale. Sign up now (30-day free trial, no credit card required) for fast performance, seamless user experience, and the best compression ratios.