Hey,
We are running an EC2 Ubuntu 24.04 x86_64 instance with timescaledb-ha:pg15.8-2.16.1
and we have experienced issues where some of the hypertable chunks suddenly end up in a strange state and TimescaleDB
starts throwing errors similar to: uncommitted xmin 1259817527 from before xid cutoff 1405335377 needs to be frozen
…
2024-09-19 01:50:00.189 UTC [16] LOG: checkpoint complete: wrote 64695 buffers (6.2%); 0 WAL file(s) added, 0 removed, 32 recycled; write=232.084 s, sync=0.088 s, total=232.262 s; sync files=66, longest=0.013 s, average=0.002 s; distance=529034 kB, estimate=535583 kB
2024-09-19 01:50:01.844 UTC [16] LOG: checkpoint starting: wal
2024-09-19 01:50:12.081 UTC [224410] ERROR: uncommitted xmin 1259817527 from before xid cutoff 1405335135 needs to be frozen
2024-09-19 01:50:12.081 UTC [224410] CONTEXT: while scanning block 172 offset 13 of relation "_timescaledb_internal._hyper_18_3048_chunk"
automatic vacuum of table "common._timescaledb_internal._hyper_18_3048_chunk"
2024-09-19 01:50:14.419 UTC [224412] ERROR: uncommitted xmin 1259817527 from before xid cutoff 1405335377 needs to be frozen
2024-09-19 01:50:14.419 UTC [224412] CONTEXT: while scanning block 172 offset 13 of relation "_timescaledb_internal._hyper_18_3048_chunk"
automatic vacuum of table "common._timescaledb_internal._hyper_18_3048_chunk"
2024-09-19 01:50:15.889 UTC [16] LOG: checkpoint complete: wrote 53169 buffers (5.1%); 0 WAL file(s) added, 0 removed, 33 recycled; write=13.681 s, sync=0.276 s, total=14.045 s; sync files=74, longest=0.067 s, average=0.004 s; distance=536104 kB, estimate=536104 kB
2024-09-19 01:50:16.894 UTC [16] LOG: checkpoints are occurring too frequently (15 seconds apart)
2024-09-19 01:50:16.894 UTC [16] HINT: Consider increasing the configuration parameter "max_wal_size".
At the moment, we work around this by inspecting the “problematic” chunks, running a dummy UPDATE table SET column = column
to rewrite all the rows, and then vacuuming the table manually (roughly as sketched below). After that, the log messages go away until the next occurrence. This has now happened three times in the span of three weeks, but we are not able to reproduce it intentionally.
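For reference, this is roughly what the inspection and the manual fix look like; the chunk name is taken from the log above, and some_column is just a placeholder for any column of that chunk:

-- Find rows on the chunk that still carry the old xmin reported in the log
SELECT ctid, xmin, xmax
FROM _timescaledb_internal._hyper_18_3048_chunk
WHERE xmin::text = '1259817527';

-- Check how far the chunk is from the freeze cutoff
SELECT relname, age(relfrozenxid)
FROM pg_class
WHERE relname = '_hyper_18_3048_chunk';

-- Dummy update so every row gets a fresh xmin (some_column is a placeholder)
UPDATE _timescaledb_internal._hyper_18_3048_chunk SET some_column = some_column;

-- Manual vacuum on the chunk
VACUUM (VERBOSE) _timescaledb_internal._hyper_18_3048_chunk;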
The instance has about 1.3 TB of data in hypertables, without any compression enabled.
What can we do about it?
We have changed the following configuration (most of it set to the values recommended by the tuning tool):
timescaledb.telemetry_level=off
log_min_duration_statement=5000
max_connections=150
shared_buffers=8GB
effective_cache_size=22GB
maintenance_work_mem=1GB
max_worker_processes=46
max_parallel_workers_per_gather=4
max_parallel_workers=8
default_statistics_target=100
max_locks_per_transaction=128
work_mem=32MB
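If it helps, the effective values can be double-checked with something like this (sketch, listing a few of the parameters above):

SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size', 'maintenance_work_mem',
               'work_mem', 'max_connections', 'max_locks_per_transaction');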
Thank you in advance, cheers, J.