I work with various kinds of forecasts that are frequently revised. A typical example is weather forecasts.
Let’s say that every 6 hours, a new curve is published for each location (e.g., a predefined list of cities). Each curve represents the forecasted temperature at every hour for the next 240 hours (10 days). In a classical SQL DB, we store these in a table that looks like this:
```sql
CREATE TABLE WeatherForecasts (
    PublicationTime DATETIME, -- when the forecast was published
    TargetTime      DATETIME, -- when the temperature is forecasted to occur
    LocationId      INT,      -- where the temperature is forecasted to occur
    Temperature     DECIMAL
);
```
We end up with many rows for the same (`TargetTime`, `LocationId`) pair.
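For illustration, here are two made-up rows where successive publications both forecast the same target hour for the same location:

```sql
-- Two publications, 6 hours apart, forecasting the same target hour
-- for LocationId = 1 (illustrative values only):
INSERT INTO WeatherForecasts (PublicationTime, TargetTime, LocationId, Temperature) VALUES
    ('2024-01-01 00:00', '2024-01-01 12:00', 1, 3.5),  -- older forecast
    ('2024-01-01 06:00', '2024-01-01 12:00', 1, 2.9);  -- revised forecast
```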
So one of the key requirements is to be able to rapidly select the latest forecast `WHERE LocationId = $someLocationId AND PublicationTime <= $someTimeReference`. We generally achieve this with a window function, such as `ROW_NUMBER() OVER (PARTITION BY LocationId, TargetTime ORDER BY PublicationTime DESC)`, which is often pretty slow. The filter on `PublicationTime` is required for backtesting (obtaining the latest forecast that was known at a given time in the past).
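For concreteness, here is a sketch of the kind of query we run (using the `$` placeholders from above):

```sql
-- Latest forecast known as of $someTimeReference, per target hour,
-- for one location; rn = 1 keeps the most recent publication.
SELECT PublicationTime, TargetTime, LocationId, Temperature
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY LocationId, TargetTime
               ORDER BY PublicationTime DESC
           ) AS rn
    FROM WeatherForecasts
    WHERE LocationId = $someLocationId
      AND PublicationTime <= $someTimeReference
) AS ranked
WHERE rn = 1;
```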
With many locations and several years of forecasts, this table quickly grows to billions of rows. We create various indexes to help with SELECT performance, but they are tricky to get right, take a lot of space, and sometimes still fall short of improving query performance.
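As an example, one plausible index we might try; the column order and the `DESC`/`INCLUDE` options are judgment calls and engine-specific:

```sql
-- One possible composite index to support the latest-forecast lookup;
-- INCLUDE syntax exists in SQL Server and PostgreSQL 11+, adjust for your engine.
CREATE INDEX IX_Forecasts_LatestLookup
    ON WeatherForecasts (LocationId, TargetTime, PublicationTime DESC)
    INCLUDE (Temperature);
```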
DuckDB, a column-oriented OLAP DB, can achieve very high compression with this kind of data, depending on the order in which rows are inserted (ideally with many repeated or incrementing values within a column). However, DuckDB is an in-process DB that cannot handle concurrent read-write connections from multiple users.
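As a rough sketch of what “insertion order matters” means here: a full sorted rewrite so each column contains long runs of repeated or increasing values (untested, and assumes a one-off rewrite is acceptable):

```sql
-- DuckDB sketch: materialize a sorted copy so LocationId and TargetTime
-- form long runs that compress well.
CREATE TABLE WeatherForecastsSorted AS
SELECT *
FROM WeatherForecasts
ORDER BY LocationId, TargetTime, PublicationTime;
```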
I am wondering whether a time-series database in general, and TimescaleDB in particular, is a good fit for this use case.
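For reference, here is my rough mental model of what the TimescaleDB layout would look like; `create_hypertable` is the documented call, but whether `publication_time` is the right partitioning column is exactly part of my question:

```sql
-- Hypothetical TimescaleDB layout: a hypertable partitioned on the
-- publication time (choice of partitioning key is an open question).
CREATE TABLE weather_forecasts (
    publication_time TIMESTAMPTZ NOT NULL,
    target_time      TIMESTAMPTZ NOT NULL,
    location_id      INT         NOT NULL,
    temperature      NUMERIC
);
SELECT create_hypertable('weather_forecasts', 'publication_time');
```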
The cases I always see involve data that is essentially a log of observations, forming a collection of independent curves (one curve per location, for example) to which you “simply” append new observations; in contrast, our data is a collection of sets of curves to which we continuously add new curves that overlap with existing ones.
Any insights and best practices you can offer when working with this kind of data are much appreciated.