Hello,
I am trying to use time_bucket_gapfill() to fill gaps in missing sensor data, either together with a continuous aggregate or on top of one. Let me build some context:
My raw data could look as follows; note the gaps from 09:00:40 to 09:01:00 and from 09:02:00 to 09:03:00:
sensor_id | timestamp | value |
---|---|---|
a | 2023-02-15 09:00:00.022 | 0 |
a | 2023-02-15 09:00:10.019 | 10 |
a | 2023-02-15 09:00:20.019 | 20 |
a | 2023-02-15 09:00:30.028 | 30 |
a | 2023-02-15 09:00:40.017 | 40 |
a | 2023-02-15 09:01:00.022 | 100 |
a | 2023-02-15 09:01:10.019 | 110 |
a | 2023-02-15 09:01:20.019 | 120 |
a | 2023-02-15 09:01:30.028 | 130 |
a | 2023-02-15 09:01:40.017 | 140 |
a | 2023-02-15 09:01:50.019 | 150 |
a | 2023-02-15 09:03:00.018 | 300 |
a | 2023-02-15 09:03:10.014 | 310 |
a | 2023-02-15 09:03:20.016 | 320 |
a | 2023-02-15 09:03:30.022 | 330 |
a | 2023-02-15 09:03:40.017 | 340 |
a | 2023-02-15 09:03:50.019 | 350 |
a | 2023-02-15 09:04:00.018 | 400 |
a | 2023-02-15 09:04:10.014 | 410 |
a | 2023-02-15 09:04:20.016 | 420 |
a | 2023-02-15 09:04:30.022 | 430 |
a | 2023-02-15 09:04:40.017 | 440 |
a | 2023-02-15 09:04:50.019 | 450 |
It’s basically an ever-increasing index value that is reported every 10 seconds; the index never decreases. For simplicity’s sake, in this example it just increases by 10 with every data point, with the minute value prepended (minute 3 gives 300, 310, and so on).
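For reference, a minimal schema matching this data might look like the following. The column types are assumptions on my part; the examples only show the column names:

```sql
-- Hypothetical schema for the raw data shown above.
CREATE TABLE raw_data (
    sensor_id TEXT      NOT NULL,
    timestamp TIMESTAMP NOT NULL,
    value     BIGINT    NOT NULL
);

-- Turn it into a TimescaleDB hypertable, partitioned on the timestamp column.
SELECT create_hypertable('raw_data', 'timestamp');
```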
On this data I have a few continuous aggregates running that aggregate over different time resolutions, e.g. minute by minute or day by day.
Continuing with the minute-by-minute as an example, I want to find the min() and max() values for each minute, i.e. with the following continuous aggregate:
CREATE MATERIALIZED VIEW data_aggregation_minute_min_max
WITH (timescaledb.continuous, timescaledb.materialized_only = false)
AS
SELECT sensor_id,
       time_bucket(INTERVAL '1 minute', timestamp) AS timestamp,
       min(timestamp) AS first_data_received,
       max(timestamp) AS last_data_received,
       min(value)     AS min_value,
       max(value)     AS max_value
FROM raw_data
GROUP BY sensor_id, time_bucket(INTERVAL '1 minute', timestamp)
WITH DATA;
SELECT add_continuous_aggregate_policy('data_aggregation_minute_min_max',
start_offset => INTERVAL '6 days',
end_offset => INTERVAL '2 minutes',
schedule_interval => INTERVAL '1 minute');
This will give me a table that looks like this with the example raw data:
sensor_id | timestamp | first_data_received | last_data_received | min_value | max_value |
---|---|---|---|---|---|
a | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:40.000000 | 0 | 40 |
a | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:50.000000 | 100 | 150 |
a | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:50.000000 | 300 | 350 |
a | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:50.000000 | 400 | 450 |
Now I want to calculate the amount the index increased between each time_bucket.
I need the max() values to calculate the difference between two rows of this cagg in order to arrive at the amount the index increased over time, i.e. with the following query:
SELECT sensor_id,
       timestamp,
       CASE
           -- First row per sensor, or the previous bucket ended too long before
           -- this one started: fall back to max - min within the bucket.
           WHEN lag(max_value, 1) OVER (PARTITION BY sensor_id ORDER BY timestamp) IS NULL OR
                lag(last_data_received, 1) OVER (PARTITION BY sensor_id ORDER BY timestamp) < first_data_received - INTERVAL '19 seconds'
               THEN max_value - min_value
           -- Otherwise diff against the previous bucket's max.
           ELSE max_value - lag(max_value, 1) OVER (PARTITION BY sensor_id ORDER BY timestamp)
       END AS value_diff
FROM data_aggregation_minute_min_max
WHERE sensor_id = 'a';
This would give me the following result:
sensor_id | timestamp | value_diff |
---|---|---|
a | 2023-02-15 09:00:00.000000 | 40 |
a | 2023-02-15 09:01:00.000000 | 50 |
a | 2023-02-15 09:03:00.000000 | 50 |
a | 2023-02-15 09:04:00.000000 | 100 |
If I only used the max() values to arrive at these results, a gap in the data would shift an increase of the index into the wrong time period: in the given raw data, the sensor was not reporting from 09:02:00 to 09:02:50. So if I used the lag() function on the max_value column alone, the increase of the index that happened in the non-reported window would land in the 09:03:00 time_bucket, resulting in a value_diff of 200 for that bucket, which is wrong and produces a ‘peak’ in that data point.
That is why I fall back to the reported min() value whenever the gap between the previous row’s last_data_received and the current row’s first_data_received is too big. I still want to diff two max() values as often as I can, or else the increase of the index during the jump between two buckets is lost: the value_diff for the 09:04:00 bucket would then report an increase of 50 instead of the 100 that I want.
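For contrast, the naive diff that only uses max_value, and therefore produces the misplaced 200, would be a sketch like this:

```sql
-- Naive version: always diff against the previous bucket's max_value.
-- For the 09:03:00 bucket this yields 350 - 150 = 200, because the increase
-- from the unreported 09:02 window is folded into the 09:03 bucket.
SELECT sensor_id,
       timestamp,
       max_value - lag(max_value, 1)
           OVER (PARTITION BY sensor_id ORDER BY timestamp) AS value_diff
FROM data_aggregation_minute_min_max
WHERE sensor_id = 'a';
```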
Now, I have been experimenting with time_bucket_gapfill() and interpolate() to arrive at a more “complete” diff table/query, but I haven’t managed to get the result that I want, which would be a table like the following, given the raw data and cagg above:
First idea:
sensor_id | timestamp | first_data_received | last_data_received | min_value | max_value | value_diff |
---|---|---|---|---|---|---|
a | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:40.000000 | 0 | 40 | 40 |
a | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:50.000000 | 100 | 150 | 50 |
a | 2023-02-15 09:02:00.000000 | null | null | null | null | null |
a | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:50.000000 | 300 | 350 | 50 |
a | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:50.000000 | 400 | 450 | 100 |
This would allow me to easily detect gaps in the raw data / cagg and treat them further in Java, for instance with interpolation. It’s not perfect, but it would let me circumvent some problems and at least interpolate where data is missing.
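One way to get close to this “first idea” table, as a plain query rather than a cagg, might be to run time_bucket_gapfill() on top of the cagg itself: since each bucket holds exactly one row there, the aggregates just pass the values through, and the gap buckets come back as null rows. A sketch, not verified:

```sql
-- Gapfill over the cagg: real buckets keep their values, missing buckets
-- (e.g. 09:02:00) appear as rows with null aggregates.
SELECT sensor_id,
       time_bucket_gapfill('1 minute', timestamp) AS timestamp,
       min(first_data_received) AS first_data_received,
       max(last_data_received)  AS last_data_received,
       min(min_value)           AS min_value,
       max(max_value)           AS max_value
FROM data_aggregation_minute_min_max
WHERE sensor_id = 'a'
  AND timestamp >= '2023-02-15 09:00:00'::timestamp
  AND timestamp <  '2023-02-15 09:05:00'::timestamp
GROUP BY sensor_id, time_bucket_gapfill('1 minute', timestamp);
```

The value_diff column could then be layered on top with the same CASE/lag() logic as above in an outer query.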
An ideal table would look as follows but I don’t think it’s possible to achieve in the database:
sensor_id | timestamp | first_data_received | last_data_received | min_value | max_value | value_diff | interpolation |
---|---|---|---|---|---|---|---|
a | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:40.000000 | 0 | 40 | 40 | 60 |
a | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:50.000000 | 100 | 150 | 50 | null |
a | 2023-02-15 09:02:00.000000 | null | null | null | null | null | 150 |
a | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:50.000000 | 300 | 350 | 50 | 50 |
a | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:50.000000 | 400 | 450 | 100 | null |
And then have these as a cagg too, so I can query the tables with minimum effort from Java and differentiate between actual and interpolated values.
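As far as I understand the documented usage, interpolate() wraps an aggregate inside a gapfill query and linearly interpolates its value across the null buckets. A sketch of that shape, not verified against this data:

```sql
-- interpolate() fills the aggregate for gap buckets by linear interpolation
-- between the neighbouring real buckets; the plain max(value) column stays
-- null there, so the two columns together distinguish real from filled rows.
SELECT sensor_id,
       time_bucket_gapfill('1 minute', timestamp) AS bucket,
       max(value)              AS max_value,
       interpolate(max(value)) AS max_value_interpolated
FROM raw_data
WHERE sensor_id = 'a'
  AND timestamp >= '2023-02-15 09:00:00'::timestamp
  AND timestamp <  '2023-02-15 09:05:00'::timestamp
GROUP BY sensor_id, bucket;
```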
The trouble I am having is that I do not know how to integrate time_bucket_gapfill() into a cagg, nor do I yet know how to use interpolate() to get the values I gave in my example. I have managed to use time_bucket_gapfill() in a query like this one, for instance:
SELECT sensor_id,
       time_bucket_gapfill('1 minute', timestamp) AS bucket_timestamp,
       max(value)
FROM raw_data
WHERE sensor_id = 'a'
  AND timestamp > '2023-02-15 08:55:00.000000'::timestamp
  AND timestamp < '2023-02-15 09:05:00.000000'::timestamp
GROUP BY sensor_id, bucket_timestamp
ORDER BY bucket_timestamp DESC;
But it’s not giving me the results that I am looking for, since it also produces redundant rows between the timestamps required in the WHERE clause. As far as I can tell this is the only way to use time_bucket_gapfill() currently, but maybe I am wrong.
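One thing that may help with the redundant rows: time_bucket_gapfill() also accepts optional start/finish arguments, so the fill range can be pinned independently of how the WHERE clause is written. A sketch, assuming the named-argument form works here:

```sql
-- start/finish bound the gapfilled range explicitly; the WHERE clause still
-- restricts which raw rows feed the aggregates.
SELECT sensor_id,
       time_bucket_gapfill('1 minute', timestamp,
                           start  => '2023-02-15 09:00:00'::timestamp,
                           finish => '2023-02-15 09:05:00'::timestamp) AS bucket,
       max(value)
FROM raw_data
WHERE sensor_id = 'a'
  AND timestamp >= '2023-02-15 09:00:00'::timestamp
  AND timestamp <  '2023-02-15 09:05:00'::timestamp
GROUP BY sensor_id, bucket;
```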
I hope that with the example given above the problem I am trying to solve is clear, and that someone can help me out, or even tell me that what I am trying to solve is not possible with the tools I am looking at.