Creating time series from filtered raw data

samuele · August 21, 2024, 7:41am

Hello everybody,

This is my first time using Timescale so I don’t have a lot of experience with its usage patterns.

Here’s the context: I’m building an analytics application that uses Timescale to generate time series of aggregated data computed from the data of projects in another application.

The basic flow is:

a periodic job calls the API of the main application to get the latest projects (since the most recent one fetched)
the analytics application has a database table where it saves the raw data of every project retrieved from the main application (this is in order to keep the full history available in case any projects are deleted on the main application, and to cache the data locally, preventing the analytics app from having to query for all the projects at once every time)
when the job is done fetching new projects, if a new project was fetched, it uses a function that takes all the projects, computes some aggregate statistics on those data (you can think of it as returning a JSON object with some fields in it, whose values depend on all the projects), and saves them with the current timestamp to a table named aggregated_statistics. This table has a TimeScaleDB hypertable, and it is the one used by the analytics application to expose via API the time series for the aggregated statistics.

I have the following requirement, which is subject to becoming more complex in the future: the project may belong to a group, and the consumer of the analytics API needs to be able to access the aggregated statistics filtering by group.

In order to implement this, here’s what I thought:

the aggregated_statistics gets added a group_id, which is nullable
each time new tasks projects fetched, the aggregated statistics are computed using all the projects in the database; plus, for each group of the new projects, the statistics are re-computed only for the projects in those groups, adding an entry to aggregated_statistics for each group_id

This effectively yields a “global” time series, plus one time series per group.

Unfortunately, this doesn’t work for more complex requirements, such as AND’ing or OR’ing the groups (e.g. filtering by two groups).

I can’t think of anything else other than generating the time series on the fly with each request, which would be pretty taxing performance-wise. I guess the problem lies in the fact that, rather than having to filter a time series (which would be pretty easy), I need to present in a time-series manner data that is computed as a function of filtered data.

Is there a better way?

jonatasdp · August 26, 2024, 5:04pm

Hello Samuele!

Thanks for providing such a detailed context!

You can try to think of your minimal bucket statistics as your “source of truth” and use hierarchical views on top of it to represent the other statistics per group or global.