Time_bucket with excact number of samples

Jake · February 4, 2023, 9:56pm

Hi. I’m trying to figure out, whether it is possible (and makes sense) to use time_bucket to calculate the average for for instance 1 hour of sensor readings with the caveat that the average should only be calculated if there is exactly 60 samples in the bucket.

The sensors do their reading once per minute and in my case it doesn’t make sense to calculate the average if a sensor reading is missing.

I imagine it is doable using a combination of time_bucket, Postgres’ CASE and then COUNT = 60. But I am unsure as to whether this is an efficient and meaningful approach. Any inputs would be much appreciated.

noctarius · February 5, 2023, 8:17am

I guess if your use case requires exactly 60 samples per bucket, I’d cladd a count aggregation and have a filter clause on it, removing all result rows which are not made off of 60 samples.

davidkohn · February 5, 2023, 5:23pm

I think there are a couple answers here, if you just want to do this, you can do it with a count and a HAVING clause like so:

SELECT time_bucket('1 hour', time) as bucket, avg(foo), count(foo) 
FROM bar
WHERE id = 'x'
GROUP BY bucket
HAVING count(foo) = 60;

But that leaves the question of why you want to do this. Is it because you might be missing data? If it’s only a few points it might not matter for the average, but you can also use a time weighted average , which will account for measurements being valid for more time if you miss messages and if you use ‘Linear’ weighting you can linearly interpolate between the measurements you have.

So that’d look something like this:

SELECT 
    time_bucket('1 hour', time) as bucket, 
    time_weight('Linear', time, foo) -> average() as time_wt_avg
FROM bar
WHERE id = 'x'
GROUP BY bucket;

I’d worry more about missing points if you are trying to do say, the sum of the points and so if you don’t have all 60 with a normal sum the value would be way off and you couldn’t really compare between different buckets well. For that, I’d use the integral accessor function of time_weight and that will give you something that’s comparable even if you’re missing a few data points:


SELECT 
    time_bucket('1 hour', time) as bucket, 
    time_weight('Linear', time, foo) -> average() as time_wt_avg, 
    time_weight('Linear', time, foo) -> toolkit_experimental.integral('hour') as hourly_sum
FROM bar
WHERE id = 'x'
GROUP BY bucket;

So anyway, there are a couple options for you! Let me know if that helps!

PS You can still use the HAVING clause to say that you should avoid samples where you’re missing really big sections of data. Say if count(foo) < 30 or something.
If you want to get really fancy, you could use our new heartbeat_agg for that as well to see if there are any sections that don’t have data for a period longer than say 5 or 10 minutes. Timescale Documentation | heartbeat_agg()

Jake · February 7, 2023, 1:17pm

Hi. Thank you very much for some really useful info. That definitely gives me something to move forward with.

Might do all the the calculations at the sensors which will guarantee that I have the necessary data points in the database, but it will also give me a bit less flexibility when it comes to presenting data for the end user.

Anyway: Thank you.

davidkohn · February 7, 2023, 4:20pm

Hmmm…I’m not sure why you think you need to do that, it’s often better to write raw data and figure the rest out later. Much more possibility of data loss on the sensor side than at the database. But maybe I’m missing something…

Jake · February 15, 2023, 12:11pm

Hello again. I’m doing logarithmic averaging (decibel calculations). So for instance if I set the device to give me the logarithmically weighed sound pressure in decibels for one minute I can use that value to calculate the average sound pressure of for instance 1 hour when I have 60 consecutive minute-samples - it doesn’t make sense to do the 1 hour calculation based on for instance only 35 minutes.

It’s roughly explained here.

So my take is that it makes the most sense to do the calculations client side and send a for instance 1 hour average when 60 consecutive minute samples have been collected.

That however means that I have to predefine the sample size (i.e. 15 minutes, 30 minutes, 60 minutes etc) client side instead of doing the calculations dynamically server side where the obvious benefit would be what I could do ad hoc time weighed averages based on the minute inputs.

But as far as I can tell doing the whole thing server side would potentially mean a lot of CPU usage as I wouldn’t know beforehand if there is enough consecutive minutes to do for instance a 1 hour-bucket average for a given period.

Sorry for the rant. Hope it makes sense. And if I’ve got something wrong please let me know.

davidkohn · February 15, 2023, 3:11pm

I think if you just do the calculation on the server side like in the link you gave, so the inverse log and then do the time weighted average, and then take the log again, you should be fine. Another option would be to just store the data “raw” ie without the decibel logarithmic scale and then do all of your calculations and apply the decibel transformation after the average.

Additionally, I’d say that the time_weight() -> integral() function might be incredibly helpful as it should be able to help you calculate things like time weighted cumulative exposure to sound levels for things like OSHA compliance.

If you want to provide a schema, maybe a little sample data and the specs for the calculation I’m happy to write an example query to help clarify.