Result weirdness with window query above

rmk · October 6, 2024, 5:09pm

First, apologies if this is not the correct place to post this.

I’m seeing weirdness with some window queries in TimescaleDB. I am inserting data into the table at one minute intervals, the time stamp is aligned to the minute. The data consists of a sensor id and its value, in this case the temperature of a tank of water in Celsius. I’m computing the power for a change in temperature over 5 minute periods.

For this, I’m using the query:

SELECT
  time_bucket('5m',time) AS time,
  id,
  (AVG(value) - LAG(AVG(value)) OVER w) * 55 * 4186 / (5 * 60)
 FROM sensor
 WHERE (id = '3c32e38150f2' OR id = '3ce104571ea2') AND
    time BETWEEN '2024-10-06T09:00:00Z' AND '2024-10-06T13:10:00Z'
 GROUP BY 1, 2
 WINDOW w AS (ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
 ORDER BY 1, 2;

This was converted from my Influx v1 query:

SELECT
 derivative(mean("value"), 1s) * 55 * 4186
FROM "rp_2d"."sensor"
WHERE "id"::tag =~ /3c32e38150f2|3ce104571ea2/ AND
 time >= 1728169200000ms AND time <= 1728255599999ms
GROUP BY time(5m), "id"::tag
FILL(null)
ORDER BY time ASC

which I’m plotting with Grafana - and while transitioning, I’ve been plotting the results from Influx and TimescaleDB on the same graph allowing me to confirm my queries are properly converted. This highlighted the following issue as the results were different depending on the time range.

If I request a time period of 12:00 to 13:10, then I get:

          time          |      id      |      ?column?
------------------------+--------------+---------------------
...
 2024-10-06 13:20:00+01 | 3ce104571ea2 | -19.032346666665934
 2024-10-06 13:25:00+01 | 3ce104571ea2 |   9.516173333332967
 2024-10-06 13:30:00+01 | 3ce104571ea2 |   9.516173333332967
 2024-10-06 13:35:00+01 | 3ce104571ea2 |                   0
 2024-10-06 13:40:00+01 | 3ce104571ea2 |                   0
 2024-10-06 13:45:00+01 | 3ce104571ea2 |   9.516173333332967
 2024-10-06 13:50:00+01 | 3ce104571ea2 |  -9.516173333332967
 2024-10-06 13:55:00+01 | 3ce104571ea2 |                   0
 2024-10-06 14:00:00+01 | 3ce104571ea2 |                   0
 2024-10-06 14:05:00+01 | 3ce104571ea2 |                   0
 2024-10-06 14:10:00+01 | 3ce104571ea2 |                   0

This corresponds to the same data loaded in Influx and queried using derivative(value,1s) * 55 * 4186, and is what I expect. Bringing the start time earlier to 11:00 gives the same values for these times. As does 10:00. So, everything seems fine.

However, if I query from 09:00, then the results from Postgresql/TimescaleDB go haywire:

          time          |      id      |        ?column?
------------------------+--------------+------------------------
...
 2024-10-06 13:20:00+01 | 3ce104571ea2 |      76.89682000000073
 2024-10-06 13:25:00+01 | 3ce104571ea2 |                      0
 2024-10-06 13:30:00+01 | 3ce104571ea2 |       47.5808666666662
 2024-10-06 13:35:00+01 | 3ce104571ea2 |      86.25950666666603
 2024-10-06 13:40:00+01 | 3ce104571ea2 |                      0
 2024-10-06 13:45:00+01 | 3ce104571ea2 |      47.73435333333251
 2024-10-06 13:50:00+01 | 3ce104571ea2 |                      0
 2024-10-06 13:55:00+01 | 3ce104571ea2 |
 2024-10-06 14:00:00+01 | 3ce104571ea2 |                      0
 2024-10-06 14:05:00+01 | 3ce104571ea2 |      95.92916666666666
 2024-10-06 14:10:00+01 | 3ce104571ea2 |     -9.516173333332967

It’s the same query, only the start time has been changed in the FROM clause. Notice that not only are the values for each time completely different, but also there is an apparent NULL value at 13:55.

Trying different start times, it seems the critical point is 09:50 vs 09:55. 09:55 gives the first results, 09:50 gives the second set of results - trying within the 5 minute period, 09:54 produces the same weirdness as 09:50.

Verifying the underlying data, no data is missing from the table - each minute in these ranges is populated. To prove that the number of data points is correct, there should be 251 datapoints, and:

SELECT COUNT(time), COUNT(id) FROM sensor
 WHERE (id = '3c32e38150f2' OR id = '3ce104571ea2') AND
   time BETWEEN '2024-10-06T09:00:00Z' AND '2024-10-06T13:10:00Z';
 count | count
-------+-------
   251 |   251
(1 row)

This is using Postgresql 15.8-0+deb12u1 with the pre-built TimescaleDB 2.16.1~debian12.

Can someone help me understand what is going on please? Thanks.

rmk · October 6, 2024, 5:48pm

Some further investigations show that the problem is centred around LAG().

SELECT
  time_bucket('5m',time) AS time, id,
  AVG(value), LAG(AVG(value)) OVER w
 FROM sensor WHERE (id = '3c32e38150f2' OR id = '3ce104571ea2') AND
       time BETWEEN '2024-10-06T09:00:00Z' AND '2024-10-06T13:10:00Z'
 GROUP BY 1, 2
 WINDOW w AS (ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
 ORDER BY 1, 2;

Gives me:

 2024-10-06 13:15:00+01 | 3ce104571ea2 |              12.25 | 12.099799999999998
 2024-10-06 13:20:00+01 | 3ce104571ea2 | 12.225200000000001 |             12.125
 2024-10-06 13:25:00+01 | 3ce104571ea2 |            12.2376 |            12.2376
 2024-10-06 13:30:00+01 | 3ce104571ea2 |              12.25 |             12.188
 2024-10-06 13:35:00+01 | 3ce104571ea2 |              12.25 |            12.1376
 2024-10-06 13:40:00+01 | 3ce104571ea2 |              12.25 |              12.25
 2024-10-06 13:45:00+01 | 3ce104571ea2 |            12.2624 |            12.2002
 2024-10-06 13:50:00+01 | 3ce104571ea2 |              12.25 |              12.25
 2024-10-06 13:55:00+01 | 3ce104571ea2 |              12.25 |
 2024-10-06 14:00:00+01 | 3ce104571ea2 |              12.25 |              12.25
 2024-10-06 14:05:00+01 | 3ce104571ea2 |              12.25 |             12.125
 2024-10-06 14:10:00+01 | 3ce104571ea2 |              12.25 |            12.2624

Clearly, LAG() is not giving the value of AVG(value) from the previous row as I understand it should.

If I give LAG() a default value (in other words, LAG(AVG(value), 1, 999) then the NULL value of course changes to 999.

I’ve also noticed that the threshold that this weirdness happens seems to change - the start time threshold has increased sporadically to (at the time of writing) 10:17… and after having written that, it’s now at 10:20. Now it’s back to 09:55 being okay, and then write that and re-execute the same query and it isn’t okay.

jonatasdp · October 7, 2024, 12:42pm

Hi Russel,

Thank you for bringing this issue to our attention. I’ve analyzed your query and created a minimal Proof of Concept (POC) to investigate the behavior you’re seeing. While I couldn’t fully reproduce the exact issue you’re experiencing, I can offer some insights and suggestions that might help resolve the problem.

Window Function Behavior: The window function in your query (LAG(AVG(value)) OVER w) is sensitive to the number of rows in the result set. When you change the time range, you’re changing the number of rows the window function operates on, which can lead to different results for the same time periods in different queries.
Minimal POC: I’ve created a minimal POC that demonstrates how changing the time range can affect results. You can find it attached to this message. It includes:

A sample table structure
Sample data insertion
A function that runs your query with different time ranges
Test queries to demonstrate the behavior

-- Create a simple table to simulate the sensor data
CREATE TABLE sensor_data (
    time TIMESTAMPTZ NOT NULL,
    id TEXT NOT NULL,
    value FLOAT NOT NULL
);

-- Insert sample data with some irregularities
INSERT INTO sensor_data (time, id, value) VALUES
    ('2024-10-06 09:00:00+00', 'sensor1', 20.0),
    ('2024-10-06 09:05:00+00', 'sensor1', 20.5),
    ('2024-10-06 09:10:00+00', 'sensor1', 21.0),
    ('2024-10-06 09:15:00+00', 'sensor1', 21.5),
    ('2024-10-06 09:20:00+00', 'sensor1', 22.0),
    ('2024-10-06 09:25:00+00', 'sensor1', 22.5),
    ('2024-10-06 09:30:00+00', 'sensor1', 23.0),
    ('2024-10-06 09:35:00+00', 'sensor1', 23.5),
    ('2024-10-06 09:40:00+00', 'sensor1', 24.0),
    ('2024-10-06 09:45:00+00', 'sensor1', 24.5),
    ('2024-10-06 09:50:00+00', 'sensor1', 25.0),
    ('2024-10-06 09:55:00+00', 'sensor1', 25.5),
    ('2024-10-06 10:00:00+00', 'sensor1', 26.0);

-- Query function
CREATE OR REPLACE FUNCTION get_power(start_time TIMESTAMPTZ, end_time TIMESTAMPTZ)
RETURNS TABLE (
    bucket TIMESTAMPTZ,
    id TEXT,
    power FLOAT
) AS $$
BEGIN
    RETURN QUERY
    SELECT
        time_bucket('5m', time) AS bucket,
        sensor_data.id,
        (AVG(value) - LAG(AVG(value)) OVER w) * 55 * 4186 / (5 * 60) AS power
    FROM sensor_data
    WHERE time BETWEEN start_time AND end_time
    GROUP BY 1, 2
    WINDOW w AS (ORDER BY time_bucket('5m', time))
    ORDER BY 1, 2;
END;
$$ LANGUAGE plpgsql;

-- Test queries
SELECT * FROM get_power('2024-10-06 09:00:00+00', '2024-10-06 10:00:00+00');
SELECT * FROM get_power('2024-10-06 09:50:00+00', '2024-10-06 10:00:00+00');

Potential Solutions: a. Use a fixed window size: Instead of ROWS BETWEEN 1 PRECEDING AND CURRENT ROW, you could use RANGE BETWEEN INTERVAL '5 minutes' PRECEDING AND CURRENT ROW. This ensures you’re always looking at the same time range, regardless of how many rows are in your result set. b. Use a subquery or CTE: Calculate the AVG(value) for each bucket first, then apply the LAG function. This can help ensure consistency across different query ranges. c. Consider using TimescaleDB’s time_bucket_gapfill(): This function can help ensure you have consistent buckets even when there are gaps in your data.
Additional Considerations:

Time zones: Ensure all your timestamps are in the same time zone to avoid unexpected behavior.
Data distribution: Be aware of any significant changes in data density or distribution that might affect your results.
Continuous aggregates: If you’re using these, make sure they’re properly configured and up-to-date.

Next Steps: If you’re still seeing unexpected results after trying these suggestions, it would be helpful if you could provide:

A small sample of your actual data, particularly from time periods where you’re seeing inconsistencies
The EXPLAIN ANALYZE output for your query with different time ranges
More details about the exact differences you’re seeing in the results

I hope this information helps you troubleshoot the issue. Feel free to run the POC and let me know if you have any questions or if you need further assistance. We’re here to help you resolve this and ensure you get consistent, accurate results from your queries.

rmk · October 7, 2024, 11:46pm

Thanks for the response.

I’ve tried modifying the query in Grafana to:

SELECT
  time_bucket('5m',time) AS time,
  id,
  (AVG(value) - LAG(AVG(value)) OVER w) * 55 * 4186 / (5 * 60) AS "Calorifier power (timescale)"
 FROM sensor
 WHERE $__timeFilter("time") AND (id = '3c32e38150f2' OR id = '3ce104571ea2')
 GROUP BY 1, 2
 WINDOW w AS (ORDER BY sensor.time RANGE BETWEEN INTERVAL '5 minutes' PRECEDING AND CURRENT ROW)
 ORDER BY 1, 2

The $__timeFilter() part is expanded to the appropriate time BETWEEN ... clause.

PostgreSQL complained about the lack of ORDER BY:

RANGE with offset PRECEDING/FOLLOWING requires exactly one ORDER BY column

so I added ORDER BY time and then tried ORDER BY sensor.time as per the above, but PostgreSQL still doesn’t want to know:

column "sensor.time" must appear in the GROUP BY clause or be used in an aggregate function

GROUP BY doesn’t seem to be usable inside the window, and the time-bucketed column is already in the GROUP BY. Either I don’t understand your suggestion properly, or the syntax is wrong.

Moving on to your second potential solution. I’ve tried:

WITH subquery AS (
  SELECT
   time_bucket('5m',time) AS time,
   id,
   AVG(value) AS average
  FROM sensor
  WHERE $__timeFilter("time") AND (id = '3c32e38150f2' OR id = '3ce104571ea2')
  GROUP BY 1, 2
  ORDER BY 1, 2)
SELECT
  time,
  (average - LAG(average) OVER w) * 55 * 4186 / (5 * 60) AS "Power"
 FROM subquery
 GROUP BY 1
 WINDOW w AS (ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
 ORDER BY 1

Both with my original window and your window, both cause PostgreSQL to spit out:

column "subquery.average" must appear in the GROUP BY clause or be used in an aggregate function

As a last ditch attempt, I tried:

WITH subquery AS (
  SELECT
   time_bucket('5m',time) AS time,
   id,
   AVG(value) AS average
  FROM sensor
  WHERE $__timeFilter("time") AND (id = '3c32e38150f2' OR id = '3ce104571ea2')
  GROUP BY 1, 2
  ORDER BY 1, 2)
SELECT
  time,
  (average - LAG(average) OVER (PARTITION BY id ORDER BY time)) * 55 * 4186 / (5 * 60) AS "Power"
 FROM subquery
 GROUP BY 1
 ORDER BY 1

and that gives me the same error. This is very similar to what I’ve seen elsewhere, such as https://www.timescale.com/forum/t/migrate-grafana-dashboards-from-influxdb-datasource-to-timescaledb/2395/3

You suggest using time_bucket_gapfill() - yes, this may be useful in the future, but at the moment there are no gaps in the data as previously confirmed. There are also no duplicates either, and there are the correct number of data points for the time range - I already confirmed that. At one minute intervals, there should be 251 data points between 9:00 and 13:10 inclusive (that’s 4 * 60 + 10 + 1 data points). However, I will probably need to add the gap filling at some point, thanks for the suggestion.

As for time zones, data is inserted using UTC. It is also my understanding from having read the PostgreSQL documentation that timestamptz columns are stored as UTC with microseconds.

For continuous aggregates, I attempted to configure an materialized aggregate view, but I haven’t been able to get it to work - so it’s now been dropped, and dropping it hasn’t made any difference.

I’ve now tried:

SELECT
  time_bucket('5m',time) AS time,
  (AVG(value) - LAG(AVG(value)) OVER w) * 55 * 4186 / (5 * 60) AS "Power"
 FROM sensor
 WHERE $__timeFilter("time") AND (id = '3c32e38150f2' OR id = '3ce104571ea2')
 GROUP BY 1
 WINDOW w AS (ORDER BY time_bucket('5m',time))
 ORDER BY 1

after reading your minimal implementation - and this appears to mostly work. It looks like there’s differences between the original Influx query and this query for the first and last data points, but the rest of the data seems correct, which is a vast improvement.

Thanks!

jonatasdp · October 8, 2024, 2:10pm

That’s great Russel. Thanks for sharing!