Hello,
I am trying to use time_bucket_gapfill() to fill gaps in missing sensor data, either together with a continuous aggregate or on top of one. Let me build some context:
My raw data could look as follows; note the gaps from 09:00:40 to 09:01:00 and from 09:02:00 to 09:03:00:
sensor_id | timestamp | value |
---|---|---|
a | 2023-02-15 09:00:00.022 | 0 |
a | 2023-02-15 09:00:10.019 | 10 |
a | 2023-02-15 09:00:20.019 | 20 |
a | 2023-02-15 09:00:30.028 | 30 |
a | 2023-02-15 09:00:40.017 | 40 |
a | 2023-02-15 09:01:00.022 | 100 |
a | 2023-02-15 09:01:10.019 | 110 |
a | 2023-02-15 09:01:20.019 | 120 |
a | 2023-02-15 09:01:30.028 | 130 |
a | 2023-02-15 09:01:40.017 | 140 |
a | 2023-02-15 09:01:50.019 | 150 |
a | 2023-02-15 09:03:00.018 | 300 |
a | 2023-02-15 09:03:10.014 | 310 |
a | 2023-02-15 09:03:20.016 | 320 |
a | 2023-02-15 09:03:30.022 | 330 |
a | 2023-02-15 09:03:40.017 | 340 |
a | 2023-02-15 09:03:50.019 | 350 |
a | 2023-02-15 09:04:00.018 | 400 |
a | 2023-02-15 09:04:10.014 | 410 |
a | 2023-02-15 09:04:20.016 | 420 |
a | 2023-02-15 09:04:30.022 | 430 |
a | 2023-02-15 09:04:40.017 | 440 |
a | 2023-02-15 09:04:50.019 | 450 |
It’s basically an ever-increasing index value that is reported every 10 seconds; the index never decreases. For simplicity’s sake, in this example it just increases by 10 with every data point, with the minute value prepended (minute 3 gives 300, 310, and so on).
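For reference, a minimal schema matching this data might look like the following. The column types are assumptions on my part; the examples only show the column names:

```sql
-- Hypothetical schema for the raw data shown above.
CREATE TABLE raw_data (
    sensor_id TEXT      NOT NULL,
    timestamp TIMESTAMP NOT NULL,
    value     BIGINT    NOT NULL
);

-- Turn it into a TimescaleDB hypertable, partitioned on the timestamp column.
SELECT create_hypertable('raw_data', 'timestamp');
```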
On this data I have a few continuous aggregates running that aggregate over different time resolutions, e.g. minute by minute or day by day.
Continuing with the minute-by-minute as an example, I want to find the min() and max() values for each minute, i.e. with the following continuous aggregate:
CREATE MATERIALIZED VIEW data_aggregation_minute_min_max
WITH (timescaledb.continuous, timescaledb.materialized_only = false)
AS
SELECT sensor_id,
       time_bucket(INTERVAL '1 minute', timestamp) AS timestamp,
       min(timestamp) AS first_data_received,
       max(timestamp) AS last_data_received,
       min(value)     AS min_value,
       max(value)     AS max_value
FROM raw_data
GROUP BY sensor_id, time_bucket(INTERVAL '1 minute', timestamp)
WITH DATA;
SELECT add_continuous_aggregate_policy('data_aggregation_minute_min_max',
start_offset => INTERVAL '6 days',
end_offset => INTERVAL '2 minutes',
schedule_interval => INTERVAL '1 minute');
This will give me a table that looks like this with the example raw data:
sensor_id | timestamp | first_data_received | last_data_received | min_value | max_value |
---|---|---|---|---|---|
a | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:40.000000 | 0 | 40 |
a | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:50.000000 | 100 | 150 |
a | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:50.000000 | 300 | 350 |
a | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:50.000000 | 400 | 450 |
Now I want to calculate the amount the index increased between each time_bucket.
I need the max() values to calculate the difference between two rows of this cagg in order to arrive at the amount the index increased over time, i.e. with the following query:
SELECT sensor_id,
       timestamp,
       CASE
           -- First row per sensor, or the previous bucket ended too long before
           -- this one started: fall back to max - min within the bucket.
           WHEN lag(max_value, 1) OVER (PARTITION BY sensor_id ORDER BY timestamp) IS NULL OR
                lag(last_data_received, 1) OVER (PARTITION BY sensor_id ORDER BY timestamp) < first_data_received - INTERVAL '19 seconds'
               THEN max_value - min_value
           -- Otherwise diff against the previous bucket's max.
           ELSE max_value - lag(max_value, 1) OVER (PARTITION BY sensor_id ORDER BY timestamp)
       END AS value_diff
FROM data_aggregation_minute_min_max
WHERE sensor_id = 'a';
This would give me the following result:
sensor_id | timestamp | value_diff |
---|---|---|
a | 2023-02-15 09:00:00.000000 | 40 |
a | 2023-02-15 09:01:00.000000 | 50 |
a | 2023-02-15 09:03:00.000000 | 50 |
a | 2023-02-15 09:04:00.000000 | 100 |
If I only used the max() values to arrive at these results, a gap in the data would shift an increase of the index into the wrong time period: in the given raw data, the sensor was not reporting from 09:02:00 to 09:02:50. So if I used the lag() function on the max_value column alone, the increase of the index that happened in the non-reported window would land in the 09:03:00 time_bucket, resulting in a value_diff of 200 for that bucket, which is wrong and produces a ‘peak’ in that data point.
That is why I fall back to the reported min() value whenever the gap between the previous row’s last_data_received and the current row’s first_data_received is too big. I still want to diff two max() values as often as I can, or else the increase of the index during the jump between two buckets is lost: the value_diff for the 09:04:00 bucket would then report an increase of 50 instead of the 100 that I want.
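For contrast, the naive diff that only uses max_value, and therefore produces the misplaced 200, would be a sketch like this:

```sql
-- Naive version: always diff against the previous bucket's max_value.
-- For the 09:03:00 bucket this yields 350 - 150 = 200, because the increase
-- from the unreported 09:02 window is folded into the 09:03 bucket.
SELECT sensor_id,
       timestamp,
       max_value - lag(max_value, 1)
           OVER (PARTITION BY sensor_id ORDER BY timestamp) AS value_diff
FROM data_aggregation_minute_min_max
WHERE sensor_id = 'a';
```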
Now, I have been experimenting with time_bucket_gapfill() and interpolate() to arrive at a more “complete” diff table/query, but I haven’t managed to get the result that I want, which would be a table like the following, given the raw data and cagg above:
First idea:
sensor_id | timestamp | first_data_received | last_data_received | min_value | max_value | value_diff |
---|---|---|---|---|---|---|
a | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:40.000000 | 0 | 40 | 40 |
a | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:50.000000 | 100 | 150 | 50 |
a | 2023-02-15 09:02:00.000000 | null | null | null | null | null |
a | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:50.000000 | 300 | 350 | 50 |
a | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:50.000000 | 400 | 450 | 100 |
This would allow me to easily detect gaps in the raw data / cagg and treat them further in Java, for instance with interpolation. It’s not perfect, but it would let me circumvent some problems and at least interpolate where data is missing.
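One way to get close to this “first idea” table, as a plain query rather than a cagg, might be to run time_bucket_gapfill() on top of the cagg itself: since each bucket holds exactly one row there, the aggregates just pass the values through, and the gap buckets come back as null rows. A sketch, not verified:

```sql
-- Gapfill over the cagg: real buckets keep their values, missing buckets
-- (e.g. 09:02:00) appear as rows with null aggregates.
SELECT sensor_id,
       time_bucket_gapfill('1 minute', timestamp) AS timestamp,
       min(first_data_received) AS first_data_received,
       max(last_data_received)  AS last_data_received,
       min(min_value)           AS min_value,
       max(max_value)           AS max_value
FROM data_aggregation_minute_min_max
WHERE sensor_id = 'a'
  AND timestamp >= '2023-02-15 09:00:00'::timestamp
  AND timestamp <  '2023-02-15 09:05:00'::timestamp
GROUP BY sensor_id, time_bucket_gapfill('1 minute', timestamp);
```

The value_diff column could then be layered on top with the same CASE/lag() logic as above in an outer query.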
An ideal table would look as follows but I don’t think it’s possible to achieve in the database:
sensor_id | timestamp | first_data_received | last_data_received | min_value | max_value | value_diff | interpolation |
---|---|---|---|---|---|---|---|
a | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:40.000000 | 0 | 40 | 40 | 60 |
a | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:50.000000 | 100 | 150 | 50 | null |
a | 2023-02-15 09:02:00.000000 | null | null | null | null | null | 150 |
a | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:50.000000 | 300 | 350 | 50 | 50 |
a | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:50.000000 | 400 | 450 | 100 | null |
And then have these as a cagg too, so I can query the tables with minimum effort from Java and differentiate between actual and interpolated values.
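As far as I understand the documented usage, interpolate() wraps an aggregate inside a gapfill query and linearly interpolates its value across the null buckets. A sketch of that shape, not verified against this data:

```sql
-- interpolate() fills the aggregate for gap buckets by linear interpolation
-- between the neighbouring real buckets; the plain max(value) column stays
-- null there, so the two columns together distinguish real from filled rows.
SELECT sensor_id,
       time_bucket_gapfill('1 minute', timestamp) AS bucket,
       max(value)              AS max_value,
       interpolate(max(value)) AS max_value_interpolated
FROM raw_data
WHERE sensor_id = 'a'
  AND timestamp >= '2023-02-15 09:00:00'::timestamp
  AND timestamp <  '2023-02-15 09:05:00'::timestamp
GROUP BY sensor_id, bucket;
```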
The trouble I am having is that I do not know how to integrate time_bucket_gapfill() into a cagg, nor do I yet know how to use interpolate() to get the values I gave in my example. I have managed to use time_bucket_gapfill() in a query like this one, for instance:
SELECT sensor_id,
       time_bucket_gapfill('1 minute', timestamp) AS bucket_timestamp,
       max(value)
FROM raw_data
WHERE sensor_id = 'a'
  AND timestamp > '2023-02-15 08:55:00.000000'::timestamp
  AND timestamp < '2023-02-15 09:05:00.000000'::timestamp
GROUP BY sensor_id, bucket_timestamp
ORDER BY bucket_timestamp DESC;
But it’s not giving me the results that I am looking for, since it also produces redundant rows between the timestamps required in the WHERE clause. As far as I can tell this is the only way to use time_bucket_gapfill() currently, but maybe I am wrong.
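One thing that may help with the redundant rows: time_bucket_gapfill() also accepts optional start/finish arguments, so the fill range can be pinned independently of how the WHERE clause is written. A sketch, assuming the named-argument form works here:

```sql
-- start/finish bound the gapfilled range explicitly; the WHERE clause still
-- restricts which raw rows feed the aggregates.
SELECT sensor_id,
       time_bucket_gapfill('1 minute', timestamp,
                           start  => '2023-02-15 09:00:00'::timestamp,
                           finish => '2023-02-15 09:05:00'::timestamp) AS bucket,
       max(value)
FROM raw_data
WHERE sensor_id = 'a'
  AND timestamp >= '2023-02-15 09:00:00'::timestamp
  AND timestamp <  '2023-02-15 09:05:00'::timestamp
GROUP BY sensor_id, bucket;
```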
I hope that with the example given above the problem I am trying to solve is clear, and that someone can help me out, or even tell me that what I am trying to solve is not possible with the tools I am looking at.