I'm having a challenge that makes me question whether Timescale is the right fit for this:
The DB is roughly 2 TB, a few hundred million rows. It is recording financial trade data.
We are recording trade activity on roughly 140 symbols.
Data arrives in a live stream and gets written to a table.
We regularly receive authoritative data that should overwrite the live stream data. It always arrives in slices of 24h, and always in order, day by day: never earlier than 24h after a day has passed, but sometimes later.
So we may receive days 1, 2, 3, then nothing for 3 days, then days 4 and 5, then nothing for a day, then day 6, etc.
When one of these 24h slices arrives, we erase the live data for that day and insert the authoritative data.
The delay is bounded: authoritative data arrives at least 24h after the day has passed, and at most 120h after.
The crucial part is that we receive this authoritative data PER symbol, so at any point in time symbol A may be at day 5 while symbol B is only up to date through day 3, etc.
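For reference, the replace step we run per symbol per day is essentially the following; the table and column names (`trades`, `ts`, `symbol`, the staging table) are placeholders rather than our actual schema:

```sql
-- Replace one 24h slice of live data for one symbol with the authoritative feed.
-- This runs independently per symbol, whenever that symbol's day arrives.
BEGIN;

DELETE FROM trades
WHERE symbol = 'AAA'                        -- the symbol whose day just arrived
  AND ts >= '2024-01-05 00:00:00+00'
  AND ts <  '2024-01-06 00:00:00+00';

INSERT INTO trades (ts, symbol, price, volume)
SELECT ts, symbol, price, volume
FROM authoritative_staging                  -- loaded from the authoritative feed
WHERE symbol = 'AAA'
  AND ts >= '2024-01-05 00:00:00+00'
  AND ts <  '2024-01-06 00:00:00+00';

COMMIT;
```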
So, this part works… But when we tried to add compression, things started to fail.
The chunks have a duration of 1 day, and this is the center of the problem, because all symbols share the same chunk for a given day.
Authoritative data for the different symbols arrives at different times, so compression can only be applied up to the timestamp for which all symbols have been received.
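For context, the hypertable and compression settings look roughly like this (illustrative names; the segmentby/orderby choice is an assumption on my part):

```sql
-- 1-day chunks: every symbol's rows for a given day land in the same chunk.
SELECT create_hypertable('trades', 'ts',
                         chunk_time_interval => INTERVAL '1 day');

-- Compression segmented by symbol, ordered by time within each segment.
ALTER TABLE trades SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'symbol',
  timescaledb.compress_orderby   = 'ts'
);
```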
I implemented a remove/add of the compression policy to keep moving the point up to which data may be compressed, but the delete operations (on live data not yet covered by compression) fail.
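Concretely, the policy shuffling looks like this; the interval is recomputed from the oldest symbol still waiting for authoritative data (7 days here is just an example value):

```sql
-- Drop the current policy, then re-add it so compression only reaches
-- chunks for which every symbol has already received authoritative data.
SELECT remove_compression_policy('trades', if_exists => true);
SELECT add_compression_policy('trades', compress_after => INTERVAL '7 days');
```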
I tried to implement backfilling, but the script on GitHub has errors.
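What I was trying to reproduce by hand is the decompress → rewrite → recompress pattern, along these lines (my own sketch, not the GitHub script):

```sql
-- Decompress the chunk(s) covering the day being backfilled...
SELECT decompress_chunk(c, true)   -- true = skip chunks that aren't compressed
FROM show_chunks('trades',
                 newer_than => '2024-01-05 00:00:00+00',
                 older_than => '2024-01-06 00:00:00+00') AS c;

-- ...run the same DELETE + INSERT as for uncompressed data, then recompress:
SELECT compress_chunk(c, true)     -- true = skip chunks that are already compressed
FROM show_chunks('trades',
                 newer_than => '2024-01-05 00:00:00+00',
                 older_than => '2024-01-06 00:00:00+00') AS c;
```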
Likewise, I tried partitioning by symbol, with 140 partitions (one per symbol). My expectation was to get 1 chunk / day / symbol, so that each could run its own compression, but it doesn't look like it works that way.
The delete operations, which always target data ahead of the last time that should be compressed, still fail.
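For completeness, the partitioned attempt was created roughly like this (placeholder names again):

```sql
-- Add symbol as a space dimension with 140 partitions, hoping to get one
-- chunk per symbol per day. Note that space partitioning hashes the symbol
-- into buckets rather than creating one named partition per value.
SELECT create_hypertable('trades_by_symbol', 'ts',
                         partitioning_column => 'symbol',
                         number_partitions   => 140,
                         chunk_time_interval => INTERVAL '1 day');
```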
I’m running out of ideas…
So, in short: I need to replace live data with authoritative data on a regular basis, in slices of 24h at a time. These slices arrive per symbol (we have 140) and share the same chunks (1 day each).
- I tried to keep compression limited to chunks where all symbols have been updated, by repeatedly updating the compression policy, but the delete operation that removes the live data for the day being rewritten fails.
- I tried to get 1 chunk / symbol / day with partitioning, so that each symbol can have its own compression going without caring about the other chunks, but the delete operation fails there too.