SQL for returning time series as contiguous tranches?

leisurehound · April 21, 2024, 7:06pm

Is there a way to select from a hypertable data grouped by tranches of contiguous data?

I have a table that has 10hz data, but that data sent via RF transmission and then written to timescale, so there are often gaps between groups of data (The transmission is best effort, and gives up easily so as to not create a retry storm, and frankly I’m just a consumer of the data once its received by the producer, have not control to improve or change that strategy).

If I select from the table the data between two timestamps, can SQL group the data into an array of tranches that have all rows within 1s? So basically any time gap in the data larger than 1s will start a new tranch.

I’m currentlyl gathering the data in one select statement and then dividing it into tranches at the application layer, but this seems inefficient. I suspect my SQLfu is not up to snuff here, and curious if there is a way that I can make Postgresql/Timescale do this for me.

TIA

jonatasdp · April 22, 2024, 1:33pm

Yes, just use the time_bucket('1 second', time) and use array_agg to aggregate the data in the batches that you want.

jonatasdp · April 22, 2024, 1:35pm

I also see you started the same topic in the slack.

Please, avoid cross-posting same issue because the solution may break parallel discussions and sources.

leisurehound · April 24, 2024, 6:45am

Thanks. Can you share an example of how to make that work. I’m finding in the docs something like
SELECT time_bucket('5 minutes', time) AS five_min, avg(cpu) but the examples are all scalars, not arrays.

I’m basically trying to extend a pretty simple select statement into returning an array of buckets:

SELECT <ordered_comma_separated_column_names> 
FROM <hypertable_name> 
WHERE time 
BETWEEN <begin_timestamp> AND <end_timestamp>

Do I SELECT time_bucket('1 second', time) as array_agg(tranch) from the result of the previous select?
e.g.:

SELECT ARRAY_AGG(tranch) FROM (
  SELECT time_bucket('1 second', time) AS tranch FROM (
    SELECT <ordered_comma_separated_column_names> 
    FROM <hypertable_name> 
    WHERE time 
    BETWEEN <begin_timestamp> AND <end_timestamp>
  ) AS derivedTable1
) AS derivedTable2

This seems to always return null.

Does time_bucket('1 second', time) break them into 1s buckets? Or into multi-second arrays where the distance between any two adjacent rows is always less than 1s?
TIA.

Tim
(sorry for the cross post, am new here, not sure which was the more suitable venue).

jonatasdp · April 25, 2024, 1:30pm

Just try:

SELECT time_bucket('5 minutes', time) AS five_min, array_agg(cpu) ...

Remember that it’s all about learning how group by works. check a small generated series:

tsdb=> select a from generate_series(now(), now()+interval  '1 min', interval '20 s') a;
               a
-------------------------------
 2024-04-25 13:27:01.436482+00
 2024-04-25 13:27:21.436482+00
 2024-04-25 13:27:41.436482+00
 2024-04-25 13:28:01.436482+00
(4 rows)

Now let’s build a bucket of 30 seconds:

tsdb=> select time_bucket('30 seconds',a) from generate_series(now(), now()+interval  '1 min', interval '20 s') a;
      time_bucket
------------------------
 2024-04-25 13:27:00+00
 2024-04-25 13:27:30+00
 2024-04-25 13:28:00+00
 2024-04-25 13:28:00+00
(4 rows)

now let’s use array_agg to explore:

tsdb=> select time_bucket('30 seconds',a), array_agg(1) from generate_series(now(), now()+interval  '1 min', interval '20 s') a;
ERROR:  column "a.a" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select time_bucket('30 seconds',a), array_agg(1) from genera...
                                        ^

Note that aggregation functions combined with other tuples forces you to group by:

tsdb=> select time_bucket('30 seconds',a), array_agg(1) from generate_series(now(), now()+interval  '1 min', interval '20 s') a group by 1;
      time_bucket       | array_agg
------------------------+-----------
 2024-04-25 13:27:30+00 | {1}
 2024-04-25 13:28:30+00 | {1}
 2024-04-25 13:28:00+00 | {1,1}
(3 rows)

Note that you can also use elements in the array_agg:

tsdb=> select time_bucket('30 seconds',a), array_agg(a) from generate_series(now(), now()+interval  '1 min', interval '20 s') a group by 1;
      time_bucket       |                             array_agg
------------------------+-------------------------------------------------------------------
 2024-04-25 13:27:30+00 | {"2024-04-25 13:27:53.916687+00"}
 2024-04-25 13:28:30+00 | {"2024-04-25 13:28:33.916687+00","2024-04-25 13:28:53.916687+00"}
 2024-04-25 13:28:00+00 | {"2024-04-25 13:28:13.916687+00"}

leisurehound · April 26, 2024, 2:45am

Thanks. I don’t think that gets me to where I’m going tho. That breaks them into 30s buckets, , if the bucket has a row at 13:28:15m 13:28:28 and the next one at 13:28.48, they’d be in two different buckets with your example instead of in one group of ‘contiguous readings that have at most 30s separation between rows’.

Again, I’m trying to find gaps in the data and return each of the tranches of rows that are contiguous between those gaps. So I’m not really trying to bucket them into 1 second tranches, I’m trying to create multisecond tranches which have no adjacent rows more than 1s apart.

With your example I’d want (assuming your adjacency example of 30s instead of my 1s):

      time_bucket       |                             array_agg
------------------------+-------------------------------------------------------------------
1                       | {"2024-04-25 13:27:33.916687+00"}
2                       | {"2024-04-25 13:28:13.916687+00","2024-04-25 13:28:33.916687+00","2024-04-25 13:28:53.916687+00","2024-04-25 13:29:14.916687+00"}
3                       | {"2024-04-25 13:29:54.916687+00","2024-04-25 13:30:14.916687+00"}

jonatasdp · April 26, 2024, 3:07pm

Have you tried LAG?

 WITH timed_data AS (
  SELECT
    timestamp,
    -- Calculate the gap to the previous timestamp; mark if it's more than 1 second
    EXTRACT(EPOCH FROM (timestamp - LAG(timestamp) OVER (ORDER BY timestamp))) > 1 AS is_gap
  FROM
    (VALUES
      ('2024-04-26 14:49:50.959899+00'::timestamp),
      ('2024-04-26 14:49:51.459899+00'::timestamp), -- less than 1 second from previous
      ('2024-04-26 14:49:51.959899+00'::timestamp), -- less than 1 second from previous
      ('2024-04-26 14:50:00.959899+00'::timestamp), -- more than 1 second from previous
      ('2024-04-26 14:50:01.959899+00'::timestamp)  -- less than 1 second from previous
    ) AS t(timestamp)
),
grouped_data AS (
  SELECT
    timestamp,
    -- Sum the gaps to form group_ids
    SUM(CASE WHEN is_gap THEN 1 ELSE 0 END) OVER (ORDER BY timestamp) AS group_id
  FROM
    timed_data
)
SELECT
  group_id,
  MIN(timestamp) AS start_time,
  MAX(timestamp) AS end_time,
  ARRAY_AGG(timestamp) AS timestamps
FROM
  grouped_data
GROUP BY
  group_id
  start_time;

does the result looks right?

 group_id |         start_time         |          end_time          |                                        timestamps
----------+----------------------------+----------------------------+------------------------------------------------------------------------------------------
        0 | 2024-04-26 14:49:50.959899 | 2024-04-26 14:49:51.959899 | {"2024-04-26 14:49:50.959899","2024-04-26 14:49:51.459899","2024-04-26 14:49:51.959899"}
        1 | 2024-04-26 14:50:00.959899 | 2024-04-26 14:50:01.959899 | {"2024-04-26 14:50:00.959899","2024-04-26 14:50:01.959899"}
(2 rows)

leisurehound · April 27, 2024, 6:01am

Thanks. Was not a ware of LAG. This does look like what I need. I will integrate this into my code and let you know how it goes.

Thanks for your help. Very helpful and appreciate the time you’ve put in to helping me.

Tim

jonatasdp · April 29, 2024, 1:11pm

Awesome Tim, if you have the chance, paste back the final example that worked for you!

Happy coding