Querying from continous aggregate is orders of magnitude slower

Sankar_P · July 15, 2022, 4:20pm

I have a wide table with 40 columns. I have a continuous aggregate for 1min on this table. The SQL that I used for the table creation and the continous aggregate creation is at: timescale-commands.sql · GitHub

I then insert about 2 million records to the test_cagg table. I then call the refresh function to ensure that the cagg is caught up with all the latest data in the table. There are about 1 million records in the count(*) of the continuous aggregate.

Then I do:

select * from public.test_cagg LIMIT 5;
select * from public.test_cagg_1min limit 5;

The select count(*) on the raw table completes in less than 1 second. However, the select count(*) on the materialized view, takes ~10 seconds.

As the table size grows, the continuous aggregate also grows and the delay is even more. When I have about 70 million records in the parent table, and even when the materialized view is all refreshed, the LIMIT 5 query (just 5 random records) takes 30+ seconds. The same query on the raw table however, completes in less than 1 second, irrespective of the number of records there.

This makes me suspicious of the MAX function in the aggregate definition. But when I have ensured refresh_materialized view is done, I expect the call to finish instantaneously.

I accidentally stumbled on to this 5 random records query. The actual query that I was finding slowness and trying to optimize was different, but on the same test_cagg_1min continuous aggregate.

Any pointers on what I can do to speed this up ? Thanks.

ryanbooz · July 15, 2022, 5:26pm

Thanks for following up here on the forum, @Sankar_P!

When I look at your scripts a few things come to mind, but could you first please confirm what version of TimescaleDB you’re using and any relevant server configuration.

Also, in the meantime, give these two blog articles a read if you can to build some context for the next set of answers.

Thanks!

Sankar_P · July 18, 2022, 5:02am

I am using timescaledb:2.0.2-pg12 and the server has 8 GB RAM and 4 cpu cores. It is deployed as a statefulset on a kubernetes cluster, with a 200GB EBS volume mounted as a persistentvolume.

Sankar_P · July 18, 2022, 6:23pm

To give some more context: I tried creating a table with 2 million records. Then I did the CALL refresh_continuous_aggregate('test_cagg_1min', '2022-07-14', '2022-07-18'); and it has been running for about 6 hours now and not yet completed.

The CPU usage is quite less and it is not even using 1 core. The disk is also not thrashed. This makes me feel that there is something wrong with the query that we used for creating the materialised-view/continuous-aggregate. May be are we missing some indexes ?

I will test again with an updated timescaledb version tomorrow and will update here. In the meantime, if there is anything obviously wrong with the cagg creation, please let me know. Thanks.

ryanbooz · July 18, 2022, 6:33pm

OK, thanks for the follow-up.

The first two things I’ll mention should provide some pretty quick improvement, although I realize the first point (upgrading TimescaleDB) will take a little effort.

Upgrade to TimescaleDB 2.7
If you had a chance to read both blog posts above, you’ll remember that prior to TimescaleDB 2.7, the materialized results are not stored in their “finalized” form. Therefore, whenever you query data with something like SELECT * FROM [cagg]... without a time predicate, the underlying view query must be run in its entirety to generate the finalized data before it can LIMIT the result. TimescaleDB 2.7 changed that architecture and results in much better performance and works the way that most users expect (even with most SELECT * FROM... queries).

One caveat: to gain the advantage of the new finalized materialized data, you will need to create a new CAGG first. In your case, you can upgrade to 2.7+, drop the test table and CAGG, and then re-run your script.

Update the refresh policy
By default, TimescaleDB continuous aggregates are real-time by default. This means that whenever you query a continuous aggregate, it grabs all of the data from the materialized view that matches the query and then appends (with a UNION) any data that has been INSERTED into the hypertable after the last materialized bucket.

Your example refresh policy currently looks like this in the script:

SELECT
    add_continuous_aggregate_policy ('test_cagg_1min', start_offset := INTERVAL '1 day', end_offset := INTERVAL '6 hours', schedule_interval := INTERVAL '6 hours');

This policy will run every 6 hours and only materialize/update data in the CAGG from between 1 day ago and 6 hours ago (~18 hours of time). Therefore, the most recent 6 hours of time will never have any data materialized, which will require TimescaleDB to UNION the CAGG data and a real-time query over the last six hours and then sort all fo the data before it can determine the five rows to return. With a GROUP BY of ~20 columns, that’s just going to be a slower process.

Without knowing what you’re trying to test, my suggestion would be to materialize the data more frequently given your small bucket size. You’ll have to test the settings a little bit to find the optimal values for this policy. The example below tries to increase the refresh rate so that the CAGG is only 10 minutes behind the hypertable, while still allowing data that might get inserted/updated over the last 24 hours to be updated as necessary. If you don’t have much late arriving data (ie. an append-only application), then there’s often no reason to calculate the CAGG over the last 24 hours every time the process runs.

SELECT
    add_continuous_aggregate_policy ('test_cagg_1min', start_offset := INTERVAL '1 day', end_offset := INTERVAL '10 minutes', schedule_interval := INTERVAL '10 minutes');

Disable real-time aggregation
Totally dependent on your application needs, you don’t have to use continuous aggregates in real-time mode. If you keep the materialized data up-to-date within a few minutes, turning off real-time aggregation will prevent TimescaleDB from doing the UNION and simplifies the queries in most cases.

Again, this is very dependent on your application needs and not an indication that something is broken with real-time aggregation. The materialized data will continue to be updated with the refresh policy, it just won’t append the most recent set of raw data (based on your start/end offset intervals).

To turn real-time aggregates on/off, just modify the materialized_only setting in the following SQL command.

ALTER MATERIALIZED VIEW test_cagg_1min SET (materialized_only=true);

Sankar_P · July 18, 2022, 6:36pm

Thank you. I will try these tomorrow and will update you.

Sankar_P · July 21, 2022, 5:48am

I have tested and found that changing the refresh_interval to 5 minutes, has considerably reduced the query time. I added the timescaledb.materialized_only also which helped in speeding up most of the queries.

However, there are some select queries on the view, that still take time, which I expect to happen immediately, as it is a materialized view. I will spend some more time on these queries to see if there is something that I am missing in those. Otherwise, I will ask here again with a reproducible script.

Thanks a lot @ryanbooz for all your help.

ryanbooz · July 21, 2022, 5:23pm

Glad to hear about the progress. Certainly provide more detail on the other queries if you get stuck and we’ll see how we can help!