Best practices for storing forecasts

I work with various kinds of forecasts that are frequently revised. A typical example is weather forecasts.

Let’s say that every 6 hours, a new curve is published for each location (e.g., a predefined list of cities). Each curve represents the forecasted temperature at every hour for the next 240 hours (10 days). In a classical SQL DB, we store these in a table that looks like this:

CREATE TABLE WeatherForecasts (
  PublicationTime DATETIME,  -- when the forecast was published
  TargetTime DATETIME,       -- when the temperature is forecasted to occur
  LocationId INT,            -- where the temperature is forecasted to occur
  Temperature DECIMAL
);

We end up with many rows for the same TargetTime and LocationId pair.
So one of the key requirements is to rapidly select the latest forecast WHERE LocationId=$someLocationId AND PublicationTime<=$someTimeReference. We generally achieve this with a window function such as ROW_NUMBER() OVER (PARTITION BY LocationId, TargetTime ORDER BY PublicationTime DESC), which is often quite slow. The filter on PublicationTime is required for backtesting (obtaining the latest forecast that was known at a given time in the past).
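
For concreteness, here is a minimal sketch of that query against the schema above ($someLocationId and $someTimeReference are placeholders for actual parameters):

-- Latest forecast per TargetTime for one location, as known at a reference time
SELECT TargetTime, Temperature
FROM (
  SELECT
    TargetTime,
    Temperature,
    ROW_NUMBER() OVER (
      PARTITION BY LocationId, TargetTime
      ORDER BY PublicationTime DESC
    ) AS rn
  FROM WeatherForecasts
  WHERE LocationId = $someLocationId
    AND PublicationTime <= $someTimeReference
) AS ranked
WHERE rn = 1;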

With many locations and several years of forecasts, this table rapidly grows to billions of rows. We create various indexes to help with SELECT performance, but they are tricky to design, take up a lot of space, and sometimes still fall short of improving query performance.
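
To make that concrete, this is the kind of composite index we try (the index name is illustrative, and exact syntax varies by engine); it matches the window function’s partitioning and ordering so the engine can avoid a full sort:

-- Covers the partition/order columns of the latest-forecast query
CREATE INDEX IX_Forecasts_LatestLookup
ON WeatherForecasts (LocationId, TargetTime, PublicationTime DESC);

Even with such an index, ranking billions of rows on every query remains expensive.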

DuckDB, a column-oriented OLAP database, can achieve very high compression on this kind of data, depending on the order in which the rows are inserted (ideally with many repeated or incremental values within each column), but DuckDB is an in-process database that cannot handle multiple read-write connections from multiple users.
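
To illustrate the insertion-order point, a sketch in DuckDB SQL (the sorted table name is hypothetical): rewriting the table in sorted order produces long runs of repeated or incremental values within each column, which columnar lightweight compression exploits:

-- Rewrite rows so LocationId repeats in long runs and TargetTime
-- increases incrementally, which compresses far better
CREATE TABLE WeatherForecastsSorted AS
SELECT *
FROM WeatherForecasts
ORDER BY LocationId, TargetTime, PublicationTime;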

I am wondering whether a time-series database in general, and TimescaleDB in particular, is a good fit for this use case.

The cases I always see are ones where the data is essentially a log of observations forming a collection of independent curves (one curve per location, for example) to which you “simply” append new observations, in contrast to a collection of sets of curves to which we continuously add new curves that overlap with existing ones.

Any insights and best practices you can offer when working with this kind of data are much appreciated.

You can use TimescaleDB for this purpose by leveraging hypertables; a minimal setup sketch follows the list below. In our experience, we manage two hypertables with data growth of 1–2 TB per month. However, there are some important considerations to keep in mind:

  1. Replication Limitations: Native PostgreSQL replication does not fully support TimescaleDB hypertables, which may impact your high-availability setup.
  2. Backup Tools Compatibility: Common PostgreSQL backup tools, such as pgBackRest or Barman, are not compatible with TimescaleDB hypertables.
  3. TimescaleDB Cloud: To overcome these limitations, you may need to use TimescaleDB Cloud, which offers managed solutions but comes with additional costs.
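
For reference, a minimal hypertable sketch (not a full recommendation: names are adapted to PostgreSQL snake_case, TIMESTAMPTZ replaces DATETIME, and the chunk interval is an assumption to tune):

CREATE TABLE weather_forecasts (
  publication_time TIMESTAMPTZ NOT NULL,  -- when the forecast was published
  target_time      TIMESTAMPTZ NOT NULL,  -- when the temperature is forecasted to occur
  location_id      INT NOT NULL,
  temperature      DECIMAL
);

-- Partition by publication time; partitioning on target_time instead
-- is a design choice to evaluate against your query patterns
SELECT create_hypertable('weather_forecasts', 'publication_time',
                         chunk_time_interval => INTERVAL '1 week');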

Very interesting comments regarding replication and backup limitations.
I forgot to mention that the data is for internal use only, at a single office, so there is no hard requirement for replication or high availability.
Thank you very much for your insights.

I’m not sure why @Ventsislav insists that replication and backup tools are not compatible with TimescaleDB.
We use TimescaleDB, pgBackRest and Postgres replication without any issues.

As for your question, @bouvierr: using TimescaleDB would improve your performance enormously. You could even use the table schema you provided unchanged.
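
If you do adopt it, a sketch of where further gains can come from (reusing the snake_case names from the hypertable sketch above; the segmentby/orderby choices and the 7-day interval are assumptions to tune): TimescaleDB’s native compression exploits exactly the repeated and incremental values described earlier.

-- Each compressed batch holds one location's curves in time order
ALTER TABLE weather_forecasts SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'location_id',
  timescaledb.compress_orderby   = 'target_time, publication_time DESC'
);

-- Automatically compress chunks once they stop receiving new forecasts
SELECT add_compression_policy('weather_forecasts', INTERVAL '7 days');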

Hi, @Snorri_Bergmann

Thanks for the insights. We don’t use master-slave replication in our setup; instead, we use bi-directional replication (BDR). We’ve encountered an issue with hypertables when using this approach with TimescaleDB, but we’ve solved the consistency and performance problems when replicating hypertables across nodes using a tool we developed in C++.

However, we are currently facing challenges with pgBackRest during backup and restore operations on hypertables (10+ TB). The backups aren’t functioning as expected, and we’re struggling to restore the tables correctly. Please note that we don’t use a cloud environment; everything is stored in our on-prem data center.

If you have any advice or guidance on resolving this issue with PgBackRest, I would be very grateful. I’m open to paying for professional advice if needed.

Hi @Ventsislav.
I have not used BDR, but I have read that logical replication does not work well with TimescaleDB, and BDR is probably similar.
If you have issues with pgBackRest, I suggest contacting Timescale support. They use pgBackRest for all their cloud infrastructure and are probably backing up and restoring at the TB+ scale.
Good luck,
-S
