I have been trying to track down an issue that we’re seeing in our database, and after some time debugging, I believe it has to do with Timescale. However, it is not clear to me yet whether it’s a misunderstanding/misuse of Timescale which is contributing to the problem as I am relatively new to it.
I will start by describing the issue. In our database, we have a table which is used to store time-series data for a number of “feeds” as well as some other related things. This table is a Timescale hypertable. Here’s an overview of its schema:
- id: PK UUID
- feed_id: FK
- foo_id: FK
- bar_id: FK
- timestamp: used for hypertable
- value: to be aggregated
With this table, we perform queries like:
-- time-series data
select time_bucket_gapfill('1 day', timestamp) as bucket, interpolate(sum(value))
from our_hypertable
where feed_id = '...' and foo_id = '...'
group by bucket;
and
-- non-time-series summary
select sum(value)
from our_hypertable
where feed_id = '...' and foo_id = '...'
group by bucket;
On certain pages there are a lot of queries like these that are fired off for various “feeds” and over time and growth these pages have slowed down. One hot-spot optimization we have considered making is periodically calculating the non-time-series summary and caching it for various page views. So we set up a job to run periodically over all “foos” (a small table) to calc and cache their summaries.
This is where the problem began. We noticed that when running these jobs, the queries would execute very quickly (a few milliseconds) but the db CPU would spike up 20-30% after a few jobs and never recover. By the middle of the run (100 or so jobs) the CPU would peg out at 100% and not recover until the work was cancelled or done. The queries run in these jobs were the same as are run on page views, so we did a little more testing and noticed that even a few rapid refreshes of the page would get CPU utilization up to 30-50% (note: this is a relatively low-traffic site, so it seems we only happened to not have encountered the issue by chance during general page usage)
In summary, it is strange to us that the db is able to quickly execute our queries while having a sort of residual CPU affect from them. Forgive me for speaking in somewhat vague generalities here. I can provide more details to specific questions, but my hope was to get the conversation started and go from there. I am also somewhat hopeful that this is a “you don’t know what you don’t know” situation w.r.t. timescaledb usage, e.g. “oh, tsdb does this specific background work shortly after being queries, you should use a continuous aggregate” or some such.
Thanks for taking the time to read! Looking forward to hearing your thoughts
Tech details:
- self-hosted
- tsdb v2.8.1
- pgdb v14