We run a simple SELECT * FROM table LIMIT 1 on a Hypertable with around 12 million rows.
From my understanding, this query should take close to no time since it can just take the first record it finds. However, it takes around 40 seconds to complete.
If I understand the query plan correctly, it looks like the database is iterating over each chunk.
If you run it with EXPLAIN ANALYZE rather than just EXPLAIN, what’s the output? Also, I haven’t used pgAdmin for a long time; is it possible to get the raw text of the EXPLAIN output instead of this row form?
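For reference, that just means prefixing the statement; something like this, with your_hypertable as a placeholder for the table from your first post:

EXPLAIN ANALYZE SELECT * FROM your_hypertable LIMIT 1;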
It seems it only executes the query on the first chunk, as one would expect, I guess.
I am not sure how to get the raw text. Are you looking for something specific?
As I said, I haven’t used pgAdmin for a very long time, but I think you can just highlight all the rows in that grid and copy them, which will give you the text version and is easier to read in this forum.
One thing to keep in mind is that a hypertable is partitioned on time (at least). This means that any query that doesn’t include a time predicate will cause the query planner to assume it has to look in every partition (chunk) even if it doesn’t. It will stop early once the data is retrieved (exactly what you saw with never executed), but it still has to plan that work.
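As a rough sketch of the difference (your_hypertable and the time column are placeholders here):

-- no time predicate: the planner has to include every chunk in the plan,
-- even though most of them show up as "never executed" at runtime
EXPLAIN ANALYZE SELECT * FROM your_hypertable LIMIT 1;

-- with a time predicate, chunks outside the range are excluded up front,
-- so far fewer relations are planned at all
EXPLAIN ANALYZE SELECT * FROM your_hypertable WHERE time > now() - '1 day'::interval LIMIT 1;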
However, looking at this output, a few things seem off to me, and I’m wondering if this is impacting your query in some way.
You mentioned in the first message that the query was taking ~40 seconds. In the second screenshot for the EXPLAIN, it shows 23ms. Any input there?
I also noticed that you said there were only 12 million rows in the hypertable, but TimescaleDB/Postgres seem to think there are 154+ million rows based on the plan estimate. Did I misunderstand how many rows there are or does this table just need to be ANALYZED to update statistics?
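If it does need a refresh, something like this (table name is a placeholder) should bring the estimate back in line:

ANALYZE your_hypertable;
-- and to sanity-check the estimate TimescaleDB works from:
SELECT approximate_row_count('your_hypertable');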
Regarding the first point:
I think I took the screenshot on a second run to get the query plan, and the result was still cached. Hence the short run time.
Regarding the second point:
The table seems to be growing rapidly. It now has 246,864,405 rows. I think I just remembered an old number. Sorry for that.
At the moment I have a problem reproducing the issue. But I think it could be because of the cache.
We also upgraded from Postgres 11 to 13; not sure if this could have fixed the problem.
Sorry for not being specific enough. I will try to reproduce the problem again.
Here is the output in raw format. It worked as you said.
Sometimes it’s hard to tell what change had the specific impact.
I will add again, however, that as your data grows and more chunks are added, that planning time will go up unless you add a WHERE filter on the time column in some way. And, this planning time is incurred on every run. Assuming this is real-time data and you’re looking to get the most recent row, even doing something like this will help:
...WHERE time > now()-'1 day'::interval...
As of TimescaleDB 2.7.2, using a dynamic time predicate like that will still pre-exclude the older chunks and significantly reduce execution time.
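Put together, the “most recent row” query would look something like this (table and column names are placeholders):

SELECT *
FROM your_hypertable
WHERE time > now() - '1 day'::interval
ORDER BY time DESC
LIMIT 1;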
I’m having a similar issue with SELECT * FROM ht_06 ORDER BY irowversion DESC LIMIT 1; I’d like to get the values from the row with the most recent irowversion when the chunk is compressed.
I can’t use the WHERE filter on the time column because the irowversion (bigint) column values come from a sequence and can be in any chunk.
I tried setting the irowversion column in the compress_orderby and compress_segmentby parameters, but the query did not perform well.
With uncompressed chunks, or on a plain (non-partitioned) table, the query performance is excellent: just a few ms.
Is there a way to make this type of query perform well on compressed chunks without using a WHERE filter on the time column?
I am using timescaledb version 2.8.0, and PostgreSQL 12.12 (Ubuntu 12.12-1.pgdg20.04+1)
The full table has more than 9 billion rows, across 74 compressed chunks and 7 uncompressed ones.
table_size | index_size | toast_size | total_size
-----------+------------+------------+------------
258 GB     | 1176 GB    | 153 GB     | 1587 GB
Unfortunately, there isn’t in this specific use case. When the chunks aren’t compressed, you’re able to take advantage of the index on each chunk to some level. Without a time filter (the partition value), I would assume that your query plan still looks at each chunk to make sure it’s getting the biggest irowversion and then taking one final LIMIT. Because the chunks are not compressed, that’s a simple index lookup.
In compressed chunks, indexes can only be created on segmentby columns, but depending on the cardinality/density of irowversion in each chunk, that might mean you get poor compression. Even if you did, TimescaleDB will have to request a compressed row (in theory) from each chunk, uncompress all columns to form the rows (up to 1,000), and then filter to the one that you’re looking for. That’s a lot more work, and I (again) assume you see something like this in the query plan.
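To illustrate the trade-off (syntax per the TimescaleDB compression docs; table and column names are from your post):

ALTER TABLE ht_06 SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'irowversion'
);
-- this gives each compressed chunk an index on irowversion, but a
-- sequence-generated bigint is effectively unique, so each segment
-- holds ~1 row and the compression ratio collapses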
IMO, your best bet if you need this type of query often is to keep a tracking table that is updated each time a new record comes in, so that you can more easily query that table for the latest irowversion. Something akin to “Option 5” as outlined in this blog article from earlier this year.
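A minimal sketch of that idea, assuming an AFTER INSERT trigger on the ingest path is acceptable (the tracking table and function names here are hypothetical):

-- single-row table that always holds the latest record
CREATE TABLE ht_06_latest (
    onerow      boolean PRIMARY KEY DEFAULT true CHECK (onerow),
    irowversion bigint      NOT NULL,
    time        timestamptz NOT NULL
    -- ...any other columns you need to read back quickly
);

CREATE OR REPLACE FUNCTION ht_06_track_latest() RETURNS trigger AS $$
BEGIN
    INSERT INTO ht_06_latest (onerow, irowversion, time)
    VALUES (true, NEW.irowversion, NEW.time)
    ON CONFLICT (onerow) DO UPDATE
        SET irowversion = EXCLUDED.irowversion,
            time        = EXCLUDED.time
        WHERE ht_06_latest.irowversion < EXCLUDED.irowversion;  -- keep only the newest
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER track_latest
    AFTER INSERT ON ht_06
    FOR EACH ROW EXECUTE FUNCTION ht_06_track_latest();

After that, SELECT * FROM ht_06_latest; is a single-row lookup no matter how many chunks are compressed.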