I’m looking to ingest TiBs of “raw” data into a timescale hypertable and then run some analysis on it. As is often the case though, that “raw” data is sometimes incorrect or needs minor adjustments.
There is no “single” way to “fix” the raw data though.
Sometimes when it’s simply missing, flatlining or interpolation are valid options (which I think timescale supports at query-time). Some other times though, the fix is extremely ad-hoc: replace specific columns/rows with specific values. It’s important to me that I keep visibility over what kind of manual fixes have been applied to the raw data, but also, ideally that I keep a full copy of the raw data (to apply different fixes, in potentially different places).
My initial thinking is that I could just duplicate the raw data into a secondary hypertable and make any fixes I need there. But as the raw dataset grows and the ways of “fixing” it multiply, I am concerned that my storage bill is just going to explode. Also, the initial cost of duplicating the raw dataset sounds like an issue if I want to quickly test a new fix.
Given this, I am wondering if timescale supports (or could support) some kind of “partial” copies, such that a new hypertable would reference data from another hypertable (VIEW comes to mind) except for a few chunks.
I’m borrowing the idea from BTRFS’ reflink feature. Hence the name of topic.
You can refer to the existent data using a pair of columns like time and your identifier to make it unique.
If I was going to develop a feature to take care of this, I’d try the following approaches:
create a raw data hypertable with time, identifier and content. The time and identifier should become a unique element to allow you to refer to it again.
create a transformations table to work on the overrides and replace/clean whatever is necessary to fix.
create a view to joining raw data + transformations.
The transformations can also be tracked in a jsonb column directly in the raw hypertable. You can create a function to apply the transformations and keep everything in the same hypertable.
If you don’t need the raw data or elect some transformations as official. You can clean them after you update the original content.
If you want to go deeper with a more advanced mechanism like BTRFS reflink, I will encourage you to open an issue on GitHub to discuss it directly with the database core developers.
Thanks @jonatasdp, this is already very helpful. I think I’d like to understand what a tightly integrated solution would take/cost. I’ll move the conversation to GitHub as soon as I get an opportunity.
Trying to log the only concern I have with the approach you suggested: a view for joining raw and transformed data sounds good for the case of reading data, but I’d like to abstract writes as well.
Thinking about it, maybe an INSTEAD OF trigger might be enough to put updates in the “transformed” table and update the view.
Very good idea! Let us know how did you manage to solve the issue! That would give a fantastic journey for a blog post or a conference talk! Let me know if you want to collaborate on some learning material when you find the solution!