Time-Travel Appropriate Use Case

I am new to timescaledb and we are evaluating a potential use case. We would like to create time-travel versions of some of our tables. The idea is that each row would have two time columns - start and end. When a row is first added, its start is the timestamp of creation and its end is NULL. When a row is updated, its end value becomes the modification timestamp and a new row is added with a start value 1ms ahead of the update time and, again, a NULL end. 95% of queries would only need rows where end is NULL, since they just need to retrieve the ‘current’ version. The other 5% would look for rows where the time frame being queried falls between start and end (or after start, where end is NULL).
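For reference, a minimal sketch of the kind of table I have in mind (table and column names are placeholders, not our real schema):

-- Hypothetical versioned table: end_ts IS NULL marks the 'current' version of a row
CREATE TABLE my_versioned_table (
    business_id text        NOT NULL,   -- logical key of the row being versioned
    payload     jsonb,                  -- whatever the row actually stores
    start_ts    timestamptz NOT NULL,   -- when this version became current
    end_ts      timestamptz             -- NULL while this version is still current
);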

My worry is that this would not be performant at all since, in all likelihood, the 95% of queries that were not time traveling would result in a scan across all partitions. I assume that adding a “where end is NULL” wouldn’t help much.

Alternatively, we could have a standard table, and then an ‘audit table’ that contained all the changes over time. Non-time-travel queries would hit the standard table, and any time travel queries would hit the audit table.

Are my assumptions above correct? Would option 1 run into serious performance issues on large tables (10 million plus rows), or is timescale able to partition in such a way that the 95% of queries looking for end=null would perform close to as well as a standard PG query on a system where timescale was not installed?

I think the biggest issue with that approach may be that you actually have to update the hypertable. How long would you expect a record to be the “latest”, and do you expect to use compression?

An index will help quite a bit, but it really depends on the query you’re running. For the index, I’d go with a partial index: filtered on end IS NULL, with the start column as the indexed column. That way, if you add both columns to the WHERE clause, you’ll have a “perfect” index for 95% of the queries.
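Something like this, assuming the start/end columns are called start_ts and end_ts (placeholder names):

-- Partial index covering only the 'current' rows, keyed on the start column
CREATE INDEX ON my_versioned_table (start_ts DESC) WHERE end_ts IS NULL;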

Oh I had not even considered partial indexes. That would likely solve the problem from the query/indexing perspective.

How long something would be the latest would ‘depend’ on each tenant’s use case, and that would be variable, but it’s not something I would expect to occur more than a couple of times a day at most - it’s not an action that would be expected to occur in rapid succession throughout the day. We also do not anticipate modifications to 95% of tables occurring more than 1-2x per minute at peak.

Compression and updating data aren’t the best of friends. While a lot of use cases work, compression really likes unmodifiable data, since compressed chunks have to be rewritten on data changes.

Hmm. Another potential approach is to just not record the end_time.

Assuming you had a table like this:

CREATE TABLE time_travel (ts timestamptz, id text, value integer);
CREATE INDEX ON time_travel(id, ts DESC);
SELECT create_hypertable('time_travel', 'ts');

You can use a window function to calculate the end time like so:

SELECT
  value, 
  ts as time_start,
  LEAD(ts,1) OVER ( ORDER BY ts ) time_end
FROM time_travel
WHERE 
  id='whatever' AND 
  ts > now() - interval '2 days' AND
  ts < now() - interval '1 day'
ORDER BY ts;

This will only scan the partitions which match the WHERE. You will probably end up with one record at the end which has a start but no end value, because the next record was outside the time window, and you won’t catch records which were “in flight” at the start of the window. This may or may not be acceptable for you. You could fix this with a lateral join, but it will become increasingly unwieldy.
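For completeness, a hedged sketch of the lateral-join direction, which pulls in, per id, the newest version that started before the window opened:

-- For each id, fetch the row that was 'in flight' when the window opened
SELECT i.id, t.value, t.ts AS time_start
FROM (SELECT DISTINCT id FROM time_travel) i
CROSS JOIN LATERAL (
  SELECT value, ts
  FROM time_travel
  WHERE id = i.id
    AND ts <= now() - interval '2 days'
  ORDER BY ts DESC
  LIMIT 1
) t;

You would then UNION ALL this with the windowed query above before applying the window function, which is where it starts to get unwieldy.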

You could get the latest value like this:

SELECT last(value, ts)
FROM time_travel
WHERE id = 'whatever';

This will plan all partitions (which might be slow if you have many, many chunks), but it will only actually scan as many indexes as it needs to find a single record. Most of the time this will be just the latest chunk’s index (unless ‘whatever’ has stopped ingesting data; if the match is often not in the latest chunk, maybe this isn’t the solution for you).

You can verify this with an EXPLAIN ANALYZE, which should show “(never executed)” for most chunks.
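For example (plan output elided):

-- Chunks that the planner included but never had to read are marked "(never executed)"
EXPLAIN ANALYZE
SELECT last(value, ts)
FROM time_travel
WHERE id = 'whatever';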

Here is a question: is there value or a drawback to applying timescale to any table, even if it is not clear that there will be appropriate timeseries data in the table? I think I read somewhere that timescale had other benefits for all PG tables even if they did not have timeseries data. However, now I can’t find that and wonder if I imagined it ;).

Also, will timescale be more efficient at these queries than if I simply imposed the appropriate indexing on a range type?

I think we are going to go with an approach where clients can activate time travel on specific tables. When they do, it will create a mirror table that will be populated via a trigger. This table would have the start and end times and would be used to serve all time travel requests. It keeps the primary table clean.
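Roughly what I’m picturing, as a sketch (table and column names are hypothetical, not our real schema):

-- Primary table holds only the current version
CREATE TABLE items (
    id    text PRIMARY KEY,
    value integer
);

-- Mirror table holds every version, partitioned on the start time
CREATE TABLE items_history (
    id       text,
    value    integer,
    start_ts timestamptz NOT NULL,
    end_ts   timestamptz
);
SELECT create_hypertable('items_history', 'start_ts');

-- Trigger closes the previous version and opens a new one on every write
CREATE OR REPLACE FUNCTION items_time_travel() RETURNS trigger AS $$
BEGIN
    UPDATE items_history
       SET end_ts = now()
     WHERE id = NEW.id AND end_ts IS NULL;
    INSERT INTO items_history (id, value, start_ts, end_ts)
    VALUES (NEW.id, NEW.value, now(), NULL);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_time_travel_trg
AFTER INSERT OR UPDATE ON items
FOR EACH ROW EXECUTE FUNCTION items_time_travel();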

The approach you describe is the traditional Postgres time-travel method. If you’re happy with double-handling writes (slowing down ingest), then this will work. You would make the audit table a hypertable, but not the current table. This makes sense if you want to allow customers to “activate” time-travel - though I do wonder if a better customer experience would be to manage this using chunk retention (i.e. non-time-travel tables have a retention of 1 day on the audit table; when you activate time-travel you extend that AND you magically see the last chunk of data for free).
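As a rough sketch of the retention idea, using the built-in retention policy functions (the table name is hypothetical):

-- Default: keep only a short tail of history on the audit hypertable
SELECT add_retention_policy('items_history', INTERVAL '1 day');

-- When a customer activates time-travel, remove (or lengthen) the policy
SELECT remove_retention_policy('items_history');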

Another approach is to insert the data only into the audit table and then use a continuous aggregate to update the current table. This would get the same result but remove the overhead from the write path; the overhead would instead be split between the cagg background process and query time (only for the most recent chunk). This would be more performant than the third option below if many items didn’t have an update in the last chunk.

The third approach, which I described above, is to have only one table: the audit table. You get the current data from that table as well. You have to scan however many chunk indexes it takes to track back to the last time the value was updated. This would presumably be quick most of the time (current chunk), but could be slower if a value wasn’t updated for a year.

Also, will timescale be more efficient at these queries than if I simply imposed the appropriate indexing on a range type?

Timescale is primarily about timeseries data. By definition this has a timestamp (or integer which represents a timestamp, like an epoch value) as part of the primary key. We do bring benefits to non-timeseries data - but you don’t actually need to use hypertables to get these, see Improving DISTINCT Query Performance Up to 8,000x on PostgreSQL.

I just realised the CAGG option above won’t work at the moment - although we have some work which is in early stages which will help it work.

So you’re left with option 1 and option 3.