Strange CPU consumption after querying hypertable

iamvery · July 19, 2023, 12:51pm

I have been trying to track down an issue that we’re seeing in our database, and after some time debugging, I believe it has to do with Timescale. However, it is not clear to me yet whether it’s a misunderstanding/misuse of Timescale which is contributing to the problem as I am relatively new to it.

I will start by describing the issue. In our database, we have a table which is used to store time-series data for a number of “feeds” as well as some other related things. This table is a Timescale hypertable. Here’s an overview of its schema:

id: PK UUID
feed_id: FK
foo_id: FK
bar_id: FK
timestamp: used for hypertable
value: to be aggregated

With this table, we perform queries like:

-- time-series data
select time_bucket_gapfill('1 day', timestamp) as bucket, interpolate(sum(value))
from our_hypertable
where feed_id = '...' and foo_id = '...'
group by bucket;

and

-- non-time-series summary
select sum(value)
from our_hypertable
where feed_id = '...' and foo_id = '...'
group by bucket;

On certain pages there are a lot of queries like these that are fired off for various “feeds” and over time and growth these pages have slowed down. One hot-spot optimization we have considered making is periodically calculating the non-time-series summary and caching it for various page views. So we set up a job to run periodically over all “foos” (a small table) to calc and cache their summaries.

This is where the problem began. We noticed that when running these jobs, the queries would execute very quickly (a few milliseconds) but the db CPU would spike up 20-30% after a few jobs and never recover. By the middle of the run (100 or so jobs) the CPU would peg out at 100% and not recover until the work was cancelled or done. The queries run in these jobs were the same as are run on page views, so we did a little more testing and noticed that even a few rapid refreshes of the page would get CPU utilization up to 30-50% (note: this is a relatively low-traffic site, so it seems we only happened to not have encountered the issue by chance during general page usage)

In summary, it is strange to us that the db is able to quickly execute our queries while having a sort of residual CPU affect from them. Forgive me for speaking in somewhat vague generalities here. I can provide more details to specific questions, but my hope was to get the conversation started and go from there. I am also somewhat hopeful that this is a “you don’t know what you don’t know” situation w.r.t. timescaledb usage, e.g. “oh, tsdb does this specific background work shortly after being queries, you should use a continuous aggregate” or some such.

Thanks for taking the time to read! Looking forward to hearing your thoughts

Tech details:

self-hosted
tsdb v2.8.1
pgdb v14

noctarius · July 21, 2023, 6:19am

Have you considered updating your timescaledb version? I remember there was a bug with some effect on the cpu, but I don’t remember the version anymore. Anyhow, I’d try to update the database first and see if it’s a gone issue.

iamvery · July 21, 2023, 11:36am

Thank you for your reply! Yes, in fact that’s exactly what I’m working on today. I will report back here once we get the extension updated.

After iterating on this more yesterday, I’ve begun to come to the conclusion that our particular case is just CPU-intensive and we need to 1) update dependencies and 2) consider scaling our db instances. I’m growing less enthusiastic that “this is a timescaledb problem”, but I figured it worth raising here in case there’s obvious performance expectations that would be known by the community and not myself.

Thank you!

iamvery · July 26, 2023, 12:45pm

We got through the update, and ended up scaling our db a little bit in the process. One important learning was timescale-specific process required for restoring a database. See Timescale Documentation | Migrations using pg_dump and pg_restore

Other than that, we’re running OK. Still seeing CPU and memory patterns that we’re not used to for non-tsdbs, however we’re concluding at this point that it may just be normal operational semantics for tsdb. Calling this “good” for now. Ty!