This is a database running in Kubernetes, using PostgreSQL 13.6 and TimescaleDB 2.6, and we have a compression policy with a 1 month threshold. In the affected time range we expected an uncompressed chunk size of <4GB, and after compression we would usually get a chunk size of around 400MB. At the time the db had about 10GB RAM, so less than ideal for the chunk size, and it was quite prone to OOM restarts. The problems we observed were slow performance when querying certain time ranges, and continuous aggregate policies also failing in the same ranges.
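For reference, the compression setup looks roughly like this (details of the compress settings are omitted as they aren't relevant to the problem):
ALTER TABLE vector_events SET (timescaledb.compress);
SELECT add_compression_policy('vector_events', INTERVAL '1 month');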
Looking at the time-series chunks in the affected ranges, I can see that some of the chunk sizes are incredibly large, e.g. 50GB+ (there are about 10 of these).
I think the compression policy probably failed in such a way that chunks were left in this kind of bloated state, and I expect it was caused by our general server instability. Since we've improved in that regard, and provided more resources to the server, I don't think it will happen again; I just need to clean up and move on.
I will try to illustrate the problem, first by looking at details for the chunks on 3 consecutive days in July - all should have a similar number of rows, and all should have been compressed.
SELECT chunkInfo.chunk_name, chunkInfo.range_start, chunkInfo.range_end, pg_size_pretty(sizeInfo.total_bytes) AS total_size
FROM chunks_detailed_size('vector_events') AS sizeInfo
INNER JOIN timescaledb_information.chunks AS chunkInfo ON sizeInfo.chunk_name = chunkInfo.chunk_name
WHERE chunkInfo.range_start >= '20220710' AND chunkInfo.range_start <= '20220712'
ORDER BY chunkInfo.range_start;
chunk_name | range_start | range_end | total_size
---------------------+------------------------+------------------------+------------
_hyper_1_2010_chunk | 2022-07-10 00:00:00+00 | 2022-07-11 00:00:00+00 | 374 MB
_hyper_1_2012_chunk | 2022-07-11 00:00:00+00 | 2022-07-12 00:00:00+00 | 66 GB
_hyper_1_2014_chunk | 2022-07-12 00:00:00+00 | 2022-07-13 00:00:00+00 | 3191 MB
Note the 66GB ‘megachunk’ but also the differences in size between the 3 chunks. Look also at the row counts:
SELECT count(*) from vector_events where event_time >= '20220710' and event_time < '20220711'; -- 61936536 (took 5 sec to count)
SELECT count(*) from vector_events where event_time >= '20220711' and event_time < '20220712'; -- 60385707 (took 4 minutes to count)
SELECT count(*) from vector_events where event_time >= '20220712' and event_time < '20220713'; -- 58562431 (took 1 sec)
All 3 have a similar number of rows, although it was far slower to execute a count query on the megachunk.
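As an aside, approximate_row_count() is a much cheaper way to sanity-check the row counts here, since it reads planner statistics instead of scanning the bloated chunk (it's only as accurate as the last analyze):
SELECT approximate_row_count('_timescaledb_internal._hyper_1_2012_chunk');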
So something is clearly wrong here. Now let's look at chunks_detailed_size in more detail:
SELECT * FROM chunks_detailed_size('vector_events')
WHERE chunk_name in ('_hyper_1_2010_chunk','_hyper_1_2012_chunk','_hyper_1_2014_chunk')
ORDER BY chunk_name;
chunk_schema | chunk_name | table_bytes | index_bytes | toast_bytes | total_bytes | node_name
-----------------------+---------------------+-------------+-------------+-------------+-------------+-----------
_timescaledb_internal | _hyper_1_2010_chunk | 11911168 | 1597440 | 378617856 | 392126464 |
_timescaledb_internal | _hyper_1_2012_chunk | 68586569728 | 1904730112 | 0 | 70491299840 |
_timescaledb_internal | _hyper_1_2014_chunk | 3068485632 | 1540096 | 275857408 | 3345883136 |
(3 rows)
So the toast bytes for our megachunk are 0, and both its index and table bytes are huge. There's also a big difference in the table bytes for the other 2 chunks, although their index sizes are similar. Remember that _hyper_1_2014_chunk actually has fewer rows than _hyper_1_2010_chunk, yet _hyper_1_2014_chunk's table size is roughly 250x that of _hyper_1_2010_chunk. There must be something wrong with compression here…
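As a quick cross-check of what timescale itself believes about these chunks, the chunks view also exposes an is_compressed flag:
SELECT chunk_name, is_compressed
FROM timescaledb_information.chunks
WHERE hypertable_name = 'vector_events'
  AND range_start >= '20220710' AND range_start <= '20220712'
ORDER BY range_start;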
Next, examine the chunks using chunk_compression_stats():
SELECT chunk_name, compression_status, before_compression_table_bytes, before_compression_index_bytes, before_compression_toast_bytes, before_compression_total_bytes,
after_compression_table_bytes, after_compression_index_bytes, after_compression_toast_bytes, after_compression_total_bytes
FROM chunk_compression_stats('vector_events')
WHERE chunk_name in ('_hyper_1_2010_chunk','_hyper_1_2012_chunk','_hyper_1_2014_chunk')
ORDER BY chunk_name;
chunk_name | compression_status | before_compression_table_bytes | before_compression_index_bytes | before_compression_toast_bytes | before_compression_total_bytes | after_compression_table_bytes | after_compression_index_bytes | after_compression_toast_bytes | after_compression_total_bytes
---------------------+--------------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------
_hyper_1_2010_chunk | Compressed | 6465093632 | 1953816576 | 0 | 8418910208 | 11902976 | 1589248 | 378601472 | 392093696
_hyper_1_2012_chunk | Uncompressed | | | | | | | |
_hyper_1_2014_chunk | Compressed | 70183092224 | 1847230464 | 0 | 72030322688 | 12009472 | 1531904 | 275841024 | 289382400
(3 rows)
According to this, the megachunk _hyper_1_2012_chunk is uncompressed, while the other 2 chunks are supposedly compressed. But the after_compression_table_bytes for _hyper_1_2014_chunk is 12009472, which does not agree with the table_bytes of 3068485632 reported by chunks_detailed_size.
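One thing worth checking here is whether _hyper_1_2014_chunk really has a compressed counterpart table on disk, and how big it is. This pokes at the internal catalog, so I'm assuming the 2.6 layout where a compressed chunk references its compress_hyper_* table via compressed_chunk_id:
SELECT c.table_name AS chunk,
       comp.table_name AS compressed_table,
       pg_size_pretty(pg_total_relation_size(
           format('%I.%I', comp.schema_name, comp.table_name)::regclass)) AS compressed_size
FROM _timescaledb_catalog.chunk c
JOIN _timescaledb_catalog.chunk comp ON comp.id = c.compressed_chunk_id
WHERE c.table_name IN ('_hyper_1_2010_chunk', '_hyper_1_2012_chunk', '_hyper_1_2014_chunk');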
So I suspect that there was some kind of crash during compression on these chunks which has left them in this weird state. There are 2 separate issues here:
- Megachunks like _hyper_1_2012_chunk are not compressed, and are 20x larger than their expected uncompressed size.
- Chunks like _hyper_1_2014_chunk, which timescale thinks are compressed, but whose on-disk size suggests they are not actually compressed.
I am a bit curious how this could have happened, especially how a chunk can end up 20x larger than its expected uncompressed size, but mostly I would like to know how best to fix it.
I have already tried taking a backup of the production db and restoring it in a test environment. We used pgBackRest to take the backup, and interestingly this backup/restore technique carried the weird chunk sizes over to our test env as well. So I can't just fix it with a simple backup/restore.
I’m considering a couple of approaches to clean this up:
“nuke it all and start over”
Back up the entire db's raw event data to CSV or similar, and restore that into a new db. I think this would work, but it's a 1.5TB database, so the size and time involved might be a problem.
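Roughly, per time slice, the export/import would look something like this (the file name and month range are just an example; timescaledb-parallel-copy is another option for the reload):
\copy (SELECT * FROM vector_events WHERE event_time >= '2022-07-01' AND event_time < '2022-08-01') TO 'vector_events_2022_07.csv' WITH (FORMAT csv)
-- ...then, against a freshly created hypertable in the new db:
\copy vector_events FROM 'vector_events_2022_07.csv' WITH (FORMAT csv)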
“identify and fix the broken bits”
Write a script to do the following (rough sketch below):
- Fix megachunks: identify the megachunks and their time ranges, copy their event data to a holding table, drop the megachunks, copy from the holding table back into the hypertable, then call compress_chunk().
- Fix chunks that timescale wrongly thinks are compressed: identify chunks with compression status 'Compressed' but total_bytes > 1000000000, call decompress_chunk(), then call compress_chunk().
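To sketch what that script might do (time ranges and the holding-table name below are just placeholders for one affected day; the real script would loop over the chunks identified via chunk_compression_stats() and chunks_detailed_size()):
-- Fix 1: rebuild a megachunk (example for the 2022-07-11 chunk)
CREATE TABLE vector_events_holding AS
    SELECT * FROM vector_events
    WHERE event_time >= '2022-07-11' AND event_time < '2022-07-12';

-- sanity check: this should return only the megachunk before anything is dropped
SELECT show_chunks('vector_events',
                   older_than => '2022-07-12 00:00:00+00'::timestamptz,
                   newer_than => '2022-07-11 00:00:00+00'::timestamptz);

SELECT drop_chunks('vector_events',
                   older_than => '2022-07-12 00:00:00+00'::timestamptz,
                   newer_than => '2022-07-11 00:00:00+00'::timestamptz);

INSERT INTO vector_events SELECT * FROM vector_events_holding;
DROP TABLE vector_events_holding;

-- compress whichever new chunk(s) were created for that day
SELECT compress_chunk(c, if_not_compressed => true)
FROM show_chunks('vector_events',
                 older_than => '2022-07-12 00:00:00+00'::timestamptz,
                 newer_than => '2022-07-11 00:00:00+00'::timestamptz) AS c;

-- Fix 2: force a decompress/recompress cycle on a chunk that is flagged as compressed but still huge
SELECT decompress_chunk('_timescaledb_internal._hyper_1_2014_chunk', if_compressed => true);
SELECT compress_chunk('_timescaledb_internal._hyper_1_2014_chunk', if_not_compressed => true);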
So I was just wondering if there are any better ways of fixing this, and really, how can a ~3GB table become 60GB+?!
Thanks in advance, and wishing you a happy new year!