Hypertable data corruption and autovacuuming

Hey,

We are running an EC2 Ubuntu 24.04 x86_64 instance with timescaledb-ha:pg15.8-2.16.1, and we have run into an issue where some hypertable chunks suddenly end up in a bad state and TimescaleDB starts throwing errors like "uncommitted xmin 1259817527 from before xid cutoff 1405335377 needs to be frozen":

2024-09-19 01:50:00.189 UTC [16] LOG:  checkpoint complete: wrote 64695 buffers (6.2%); 0 WAL file(s) added, 0 removed, 32 recycled; write=232.084 s, sync=0.088 s, total=232.262 s; sync files=66, longest=0.013 s, average=0.002 s; distance=529034 kB, estimate=535583 kB
2024-09-19 01:50:01.844 UTC [16] LOG:  checkpoint starting: wal
2024-09-19 01:50:12.081 UTC [224410] ERROR:  uncommitted xmin 1259817527 from before xid cutoff 1405335135 needs to be frozen
2024-09-19 01:50:12.081 UTC [224410] CONTEXT:  while scanning block 172 offset 13 of relation "_timescaledb_internal._hyper_18_3048_chunk"
        automatic vacuum of table "common._timescaledb_internal._hyper_18_3048_chunk"
2024-09-19 01:50:14.419 UTC [224412] ERROR:  uncommitted xmin 1259817527 from before xid cutoff 1405335377 needs to be frozen
2024-09-19 01:50:14.419 UTC [224412] CONTEXT:  while scanning block 172 offset 13 of relation "_timescaledb_internal._hyper_18_3048_chunk"
        automatic vacuum of table "common._timescaledb_internal._hyper_18_3048_chunk"
2024-09-19 01:50:15.889 UTC [16] LOG:  checkpoint complete: wrote 53169 buffers (5.1%); 0 WAL file(s) added, 0 removed, 33 recycled; write=13.681 s, sync=0.276 s, total=14.045 s; sync files=74, longest=0.067 s, average=0.004 s; distance=536104 kB, estimate=536104 kB
2024-09-19 01:50:16.894 UTC [16] LOG:  checkpoints are occurring too frequently (15 seconds apart)
2024-09-19 01:50:16.894 UTC [16] HINT:  Consider increasing the configuration parameter "max_wal_size".

At the moment, we are working around this by inspecting the “problematic” chunk tables, running UPDATE table SET column = column to rewrite all the rows, and then vacuuming the table manually. After that, the log messages go away until the next time. This has now happened three times in the span of three weeks, but we are not able to reproduce it intentionally.
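Roughly what we run against an affected chunk (some_column stands in for any column of that chunk, this is just a sketch of the workaround):

-- Rewrite every row so the tuples get a fresh xmin, then vacuum the chunk.
UPDATE _timescaledb_internal._hyper_18_3048_chunk
SET some_column = some_column;

VACUUM _timescaledb_internal._hyper_18_3048_chunk;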

The instance has about 1.3 TB of data in hypertables, without any compression enabled.

What can we do about it?

We have changed the following configuration settings (most of them to the values recommended by the tune app):

timescaledb.telemetry_level=off 	
log_min_duration_statement=5000 
max_connections=150 
shared_buffers=8GB 
effective_cache_size=22GB 
maintenance_work_mem=1GB 
max_worker_processes=46 
max_parallel_workers_per_gather=4 
max_parallel_workers=8 
default_statistics_target=100 
max_locks_per_transaction=128 
work_mem=32MB

Thank you in advance, cheers, J.

Hello @jernej, just a reminder: there’s no in-place update in the database :nerd_face:

An UPDATE always generates a new row version, consuming space and invalidating the old rows. I’d suggest you create a new table with the new structure, backfill it, erase the old data in paginated batches, and then UNION the data from both tables until the migration is complete.
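A rough sketch of that kind of migration (table and column names here are placeholders, not your actual schema):

-- Hypothetical new hypertable with the desired structure.
CREATE TABLE readings_new (LIKE readings INCLUDING DEFAULTS);
SELECT create_hypertable('readings_new', 'time');

-- Backfill in time-bounded batches instead of one huge statement.
INSERT INTO readings_new
SELECT * FROM readings
WHERE time >= '2024-08-01' AND time < '2024-09-01';

-- While the migration is in progress, reads can union both tables.
SELECT * FROM readings_new
UNION ALL
SELECT * FROM readings WHERE time < '2024-08-01';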

Remember, the problem with updates is not generating the new rows, but the vacuuming and statistics updates that follow.

Another alternative is to paginate your updates, reworking them chunk by chunk with a limit; see the sketch below. I still think it will be much more efficient if you just insert into a new hypertable with defaults and avoid reworking the entire DB.
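For example, something along these lines (purely illustrative; the time bounds and some_column are assumptions about your schema):

-- Rewrite one small slice at a time instead of the whole chunk in one transaction,
-- vacuuming between slices.
UPDATE _timescaledb_internal._hyper_18_3048_chunk
SET some_column = some_column
WHERE "time" >= '2024-09-01 00:00' AND "time" < '2024-09-01 06:00';

VACUUM _timescaledb_internal._hyper_18_3048_chunk;

-- ...repeat with the next time slice until the chunk is covered.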

I am a bit puzzled here. We run UPDATE ... SET column = column on the chunk tables (e.g. _hyper_18_3048_chunk) for the whole table after the errors (uncommitted xmin...) pop up in the logs. Running the UPDATE fixed the problem, and I am unsure how pagination would help with the original problem (which seems to be some kind of data corruption).

Do you have any idea what could cause these kinds of errors in the first place?

2024-09-19 01:50:12.081 UTC [224410] ERROR:  uncommitted xmin 1259817527 from before xid cutoff 1405335135 needs to be frozen
2024-09-19 01:50:12.081 UTC [224410] CONTEXT:  while scanning block 172 offset 13 of relation "_timescaledb_internal._hyper_18_3048_chunk"
        automatic vacuum of table "common._timescaledb_internal._hyper_18_3048_chunk"
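
If it helps with debugging, I assume we could inspect the exact row version the error points at, since the CONTEXT gives block 172 offset 13 and ctid is (block, offset):

-- Look at the tuple the autovacuum error complains about (block 172, offset 13).
SELECT ctid, xmin, xmax
FROM _timescaledb_internal._hyper_18_3048_chunk
WHERE ctid = '(172,13)';

-- Compare the chunk's freeze horizon with the stuck xmin from the error.
SELECT relname, relfrozenxid, age(relfrozenxid)
FROM pg_class
WHERE relname = '_hyper_18_3048_chunk';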

Thank you in advance, br, Jernej

Thanks for the details! Let’s see if @sven can give some hints about it.