Hi everyone,
We have a query that uses time_bucket_gapfill, but it's slower than the user experience we want to deliver, and we are discussing internally the best options to improve it.
For context, we have a device that only logs temperature changes in a room. If the temperature is constant, we don't record anything in the database. How can we display the temperature in a graph over time without using time_bucket_gapfill?
One of the ideas we're discussing is to leverage a continuous aggregate (CAGG) on a materialised view and move that same query into it. Some people are concerned about whether that would compromise the whole CAGG concept, as time_bucket_gapfill requires a constant time range.
- Is it true that, because of that constant time range, we would compromise the whole concept of the CAGG, and that the query would force a recalculation of every single record on each refresh?
- What are the negative consequences of that?
Thank you!
If you materialise with time_bucket_gapfill, you'll lose the storage benefit of only storing the deltas, right?
The negative consequences are probably related to more storage being used. It will also need more I/O to scan all the repeated values on disk. Doing it in real time instead fabricates the missing points to fill the gaps and uses less I/O, because that data is not on disk, but it uses more processing to generate the points.
Have you run any benchmark scenarios to test where things are really slowing down?
Any chance you could share more of your scenario, queries and query plans?
The query with time_bucket_gapfill takes ±4s:
SELECT time_bucket_gapfill('1 hour', p.created) AS period_time, p.device_id, locf(MIN(p.close), '0') AS temperature
FROM device_p AS p
WHERE p.device_id = 18549
  AND p.created BETWEEN '2020-01-01T00:00:00'::timestamp AND NOW()
GROUP BY period_time, p.device_id
ORDER BY period_time ASC;
We are also spending some time on presenting the information to the user, as we execute another query before the one above to find when the device got its first record, so that we can use that date in the time_bucket_gapfill query.
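Roughly, that preliminary lookup is just something like this (a simplified sketch, using the same table and columns as the query above):

SELECT MIN(p.created) AS first_record
FROM device_p AS p
WHERE p.device_id = 18549;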
In contrast, the query using time_bucket is much quicker and takes ±1s:
SELECT time_bucket('1 hour', p.created) AS period_time, p.device_id, MIN(p.close) AS quantity
FROM device_p AS p
WHERE p.device_id = 18549
GROUP BY period_time, p.device_id
ORDER BY period_time ASC;
We even discussed the possibility of forcing the devices to record the temperature every minute, or every hour, but that would create an issue at scale, as we have millions of these devices sold.
Thanks for the details!
Have you thought about keeping a temporary layer of data around for those who are processing and requesting it? For example, if someone views the last 10 days you store that result, and discard it after a few days if nobody visits it again.
Mind sharing more about it please?
The minimal example you could build is an engine that automatically creates a continuous aggregate based on the user's search, and then you call refresh_continuous_aggregate only for the specific timeframe your user is looking at.
You could also keep a caggs_track table to check which searches are repeated, which of them are useful and which are abandoned, and simply drop the abandoned views.
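As a minimal sketch of that idea (names are illustrative, and note that time_bucket_gapfill itself cannot go inside the continuous aggregate, so the gapfill would still be applied at query time on top of it):

-- hourly continuous aggregate over the raw table, created empty
CREATE MATERIALIZED VIEW device_p_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', created) AS period_time,
       device_id,
       MIN(close) AS temperature
FROM device_p
GROUP BY period_time, device_id
WITH NO DATA;

-- refresh only the window the user actually requested
CALL refresh_continuous_aggregate('device_p_hourly', '2020-01-01'::timestamptz, '2020-01-11'::timestamptz);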
But coming back to your issue, another team member said that it's surprisingly long for 26k points, and a huge difference compared to the time_bucket one. Gapfill may well be slower, but it feels like something else is going on here.
Can you share the output of EXPLAIN ANALYZE for the query in question, or maybe file an issue on the toolkit with a POC example to make it easier for us to reproduce?
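For example, wrapping the slow query like this gives the full plan with timing and I/O details:

EXPLAIN (ANALYZE, BUFFERS)
SELECT time_bucket_gapfill('1 hour', p.created) AS period_time, p.device_id, locf(MIN(p.close), '0') AS temperature
FROM device_p AS p
WHERE p.device_id = 18549
  AND p.created BETWEEN '2020-01-01T00:00:00'::timestamp AND NOW()
GROUP BY period_time, p.device_id
ORDER BY period_time ASC;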
I couldn't paste both query plans here, so I'm sharing them via links.
time_bucket query plan: Pastebin.com
time_bucket_gapfill query plan: Pastebin.com
The execution time didn't surprise me much, as the machine is also a small one. I assumed the difference was due to the gapfill manipulations. But if we can find any underlying issue so we can improve this same query, that would be ideal!
Hi @blackout12 -
So looking at your query plans I have a few comments / questions:
- I looked at the time_bucket vs time_bucket_gapfill plans here, and I'm a bit surprised to find that you're upsampling the data from 105 points in the time_bucket plan to more than 450,000 points in the gapfill plan. Are you sure you need this many points? I think that's a huge source of the extra time; can you cut it down a bit? Also, how many years does this cover? It seems like you might be going a bit too far back in time: if you're only going back to 2020, it should be only about 26,000 points at 1 per hour. I'd love to know what you're doing with this data, because going from ~100 to 450k points seems a bit odd, and maybe we can suggest a better way of solving it.
- The time_bucket query is only taking tens of milliseconds, and the gapfill query is only taking on the order of ~500 ms in the explain plans, which tells me that at least some of the latency you're seeing is due to the network or some sort of effect on the client side.
- Even with the gapfill, most of the time is in the sort after gapfilling. For some reason it looks like we're sorting after the gapfill, which I don't think we need to do here; I suspect it's because the sort key doesn't match the GROUP BY. Can you add device_id to the sort key and see if that changes the plan and makes the gapfill faster? Also try removing the sort entirely; I think you may still end up with the data in the order you expect anyway (see the sketch after this list).
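For instance, just a sketch of what I mean, keeping everything else in your query the same:

-- same gapfill query, but with the sort key matching the GROUP BY
SELECT time_bucket_gapfill('1 hour', p.created) AS period_time, p.device_id, locf(MIN(p.close), '0') AS temperature
FROM device_p AS p
WHERE p.device_id = 18549
  AND p.created BETWEEN '2020-01-01T00:00:00'::timestamp AND NOW()
GROUP BY period_time, p.device_id
ORDER BY period_time ASC, p.device_id;
-- alternatively, try dropping the ORDER BY entirely and see how the plan changes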
Hi @davidkohn thanks for the insights here!
-
As I said, these devices only send a signal when the temperature changes; that's the reason we go from ~100 to 450k points. Can we cut down this sample? That's what we are looking into by tweaking the query, without compromising the business requirements we have to meet.
Let me try to give you as much context as I can share. These devices are inside very large cold rooms, and we do need to record any change in temperature. We are building a dashboard where the user can analyse the temperature variance in detail. To make things even more complex, we have to display those records for a set of rooms at the same time, so it's hard for us to be specific about the start date.
This query feeds a line graph that shows the temperature of the room correlated with the variation of elements entering/leaving the room. As the start date is required to be written in the query (per the documentation), we are thinking about using the date of the first temperature-change record ever written by the devices associated with the room.
As we have gaps in the data, we don't know when the last valuable data point was recorded for each device, so we are somewhat forced to use a large date range to ensure we capture all the data points for each device.
-
We are definitely seeing some latency here, as this database is remote and it's a small staging database. That said, both queries are run against the same database from the same machine.
-
We have tried a couple of indexes, but none of them seemed to help and none are currently in place; what we see is most likely the ORDER BY. Do you have any recommendation on the most effective index for this query?
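For reference, the kind of index we experimented with looks roughly like this (illustrative only, it's not currently in place):

CREATE INDEX IF NOT EXISTS device_p_device_created_idx
    ON device_p (device_id, created DESC);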