Hello everybody,
This is my first time using Timescale so I don’t have a lot of experience with its usage patterns.
Here’s the context: I’m building an analytics application that uses Timescale to generate time series of aggregated data computed from the data of projects in another application.
The basic flow is:
- a periodic job calls the API of the main application to get the latest projects (since the most recent one fetched)
- the analytics application has a database table where it saves the raw data of every project retrieved from the main application (this is in order to keep the full history available in case any projects are deleted on the main application, and to cache the data locally, preventing the analytics app from having to query for all the projects at once every time)
- when the job is done fetching new projects, if a new project was fetched, it uses a function that takes all the projects, computes some aggregate statistics on those data (you can think of it as returning a JSON object with some fields in it, whose values depend on all the projects), and saves them with the current timestamp to a table named
aggregated_statistics
. This table has a TimeScaleDB hypertable, and it is the one used by the analytics application to expose via API the time series for the aggregated statistics.
I have the following requirement, which is subject to becoming more complex in the future: the project may belong to a group, and the consumer of the analytics API needs to be able to access the aggregated statistics filtering by group.
In order to implement this, here’s what I thought:
- the
aggregated_statistics
gets added agroup_id
, which is nullable - each time new tasks projects fetched, the aggregated statistics are computed using all the projects in the database; plus, for each group of the new projects, the statistics are re-computed only for the projects in those groups, adding an entry to
aggregated_statistics
for eachgroup_id
This effectively yields a “global” time series, plus one time series per group.
Unfortunately, this doesn’t work for more complex requirements, such as AND’ing or OR’ing the groups (e.g. filtering by two groups).
I can’t think of anything else other than generating the time series on the fly with each request, which would be pretty taxing performance-wise. I guess the problem lies in the fact that, rather than having to filter a time series (which would be pretty easy), I need to present in a time-series manner data that is computed as a function of filtered data.
Is there a better way?