Here you can see dn1 holds a replica of _dist_hyper_1_2_chunk. If I try to copy that chunk from dn1 to another node, I get this:
timescale=# CALL timescaledb_experimental.copy_chunk('_timescaledb_internal._dist_hyper_1_2_chunk', 'dn1', 'dn4');
ERROR: [dn1]: relation "_timescaledb_internal._dist_hyper_1_2_chunk" does not exist
DETAIL: Chunk copy operation id: ts_copy_432_2.
So if I am not mistaken, the access node thinks this chunk is on DN1, but in fact it is not.
Is there any way to correct the metadata on the access node for this chunk? I am not sure which operation it was in the _timescaledb_catalog.chunk_copy_operation table, so I'm not sure whether I can use timescaledb_experimental.cleanup_copy_chunk_operation() here.
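For reference, this is roughly how I've been checking what the access node believes (a sketch; I'm assuming _timescaledb_catalog.chunk_data_node is the catalog table that maps chunks to data nodes):

-- list the data nodes the access node has recorded for this chunk
SELECT c.schema_name, c.table_name, cdn.node_name
FROM _timescaledb_catalog.chunk c
JOIN _timescaledb_catalog.chunk_data_node cdn ON cdn.chunk_id = c.id
WHERE c.table_name = '_dist_hyper_1_2_chunk';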
Hi @jamessalzman, thanks for posting!
Could you post more details on the automation, such as the commands you were using prior to this issue? move_chunk may be more suitable, depending on what you're trying to achieve.
I was trying to think how this could happen. First of all, could you please show the contents of the _timescaledb_catalog.chunk_copy_operation table? This table is used internally to keep the state of each copy/move chunk operation; we are interested here in the non-completed copy operations.
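Something like this is enough (a plain select; the exact columns can vary a bit between versions):

-- show all recorded chunk copy/move operations on the access node
SELECT * FROM _timescaledb_catalog.chunk_copy_operation;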
In case of a copy operation failure, we expect cleanup_copy_chunk_operation() to always be executed before running any other copy commands on the same chunk. For ease of use, you can specify your own copy operation id in the copy/move chunk command for later use with the cleanup function.
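For the failed operation from your error message, the cleanup call would look roughly like this (using the id from the DETAIL line):

-- clean up the leftover state of the failed copy operation
CALL timescaledb_experimental.cleanup_copy_chunk_operation('ts_copy_432_2');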
I was just using the copy_chunk command as an example to show that there is bad metadata on the access node. The access node thinks DN1 has a copy of the chunk while it does not, so the command fails.
I am using both move_chunk and copy_chunk in my automation.
I probably continued running my automation without noticing that the error occurred. Can you provide an example of supplying the operation id to the function? I did not see that listed in the documentation. It would be helpful so I can automatically call cleanup_<copy/move>_chunk_operation() on any failures.
It looks like there are some non-completed operations in the list. The problem with those operations is that they can consume vital resources (such as replication slots), so they need to be cleaned up.
I would assume the reason for the failure could be that the system ran out of replication slots or background workers, so checking the PostgreSQL log would be a good idea here; unfortunately, not all errors are returned back to the user.
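For example, you can check on each data node whether the replication slots are exhausted (standard PostgreSQL views and settings, nothing TimescaleDB-specific):

-- run on each data node: slots currently in use vs. the configured limits
SELECT count(*) AS slots_in_use FROM pg_replication_slots;
SHOW max_replication_slots;
SHOW max_worker_processes;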
You are right about the documentation: the copy/move chunk operation id is not documented yet, since this functionality was introduced recently.
You can use it by providing an additional argument to the function:
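A sketch (the exact parameter name may differ between versions, so check the signature with \df timescaledb_experimental.copy_chunk):

-- pass your own operation id as the extra, last argument
CALL timescaledb_experimental.copy_chunk('_timescaledb_internal._dist_hyper_1_2_chunk', 'dn1', 'dn4', 'my_copy_op_1');
-- the same id can then be used for cleanup if the operation fails
CALL timescaledb_experimental.cleanup_copy_chunk_operation('my_copy_op_1');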