Using SkipScan to retrieve the most recent value with multiple dimensions
A question was recently asked on Twitter about finding the most recent set of values based on two columns. While one of the fastest ways to do this in TimescaleDB can often be to use SkipScan, facilitated by DISTINCT ON(), the current implementation only allows skipping over one indexed column, not two.
As an example, consider a simple table that tracks (at least) these columns for financial data.
CREATE TABLE ticks (
ts timestamptz NOT NULL,
supply_rate float4 NULL,
borrow_rate float4 NULL,
manual float4 null,
exchange TEXT NOT NULL,
symbol TEXT NOT NULL
);
With data populated, it might be tempting to create a corresponding index and then query the DISTINCT of both columns for the timestamp like this:
-- Create supporting index
CREATE INDEX test_ss ON ticks (exchange, symbol, ts DESC);
SELECT DISTINCT ON (exchange, symbol) * FROM ticks
ORDER BY exchange, symbol, ts DESC;
Unfortunately, this DISTINCT query will chose a regular Index Scan instead of the TimescaleDB SkipScan implementation. Even though there is an index that appears to support a distinct query on multiple columns, SkipScan will only consider the first column of the index (as of TimescaleDB 2.7).
Depending on the cardinality of your data, there are a few options for querying this data using SkipScan and both columns.
First, letās create the indexes that we want TimescaleDB to use regardless of which approach we take to query the data.
CREATE INDEX exchange_ss ON ticks (exchange, ts DESC);
CREATE INDEX symbol_ss ON ticks (symbol, exchange, ts desc);
With the indexes created, letās look at each option.
In both examples below, weāre pulling the values for exchange
and symbol
out of the time-series hypertable rather than a supporting dimension table. This could be simplified if there was a small, already ādistinctā table that tracks exchanges for instance, so YMMV depending on your schema design.
Two DISTINCT queries using LATERAL
In this example, we can use a DISTINCT ON query to get the list of exchanges that exist in the ticks
table and then āloopā (LATERAL) over the inner DISTINCT ON query using the value of the exchange. This SQL is just a little bit longer to write, but works well as long as the proper indexes exist.
SELECT * FROM
(SELECT DISTINCT ON (exchange) exchange FROM ticks ORDER BY exchange, ts DESC) a,
LATERAL (SELECT DISTINCT ON (symbol) symbol, ts, supply_rate, borrow_rate, manual FROM ticks
WHERE exchange = a.exchange ORDER BY symbol, ts DESC) b;
Function to query DISTINCT rows per exchange
This approach does hide some SQL in a function to clean up the day-to-day querying. Notice that we donāt have to use a LATERAL query here because the function is run once per exchange and returns multiple rows, which creates an implicit CROSS JOIN
so that many rows are produced for each exchange.
-- first create the function that will return all rows for a specific
-- exchange/symbol combination.
CREATE OR REPLACE FUNCTION symbol_array(TEXT) RETURNS
TABLE (ts timestamptz, supply_rate float4, borrow_rate float4, manual float4, symbol text)
AS $$
select distinct on (symbol) ts, supply_rate, borrow_rate, manual, symbol from ticks
where exchange=$1
ORDER BY symbol, ts DESC;
$$
LANGUAGE SQL
-- Now query the function with a DISTINCT on exchange
WITH s1 AS (
SELECT DISTINCT (exchange) exchange FROM ticks ORDER BY exchange, ts DESC
)
SELECT exchange, (symbol_array(exchange)).* FROM s1;
In this example, we first get a list of exchanges from the data table using SkipScan (in case there are many of them!) and then use that value to query the function we created which uses skip scan and filters the data to a specific exchange using the second index symbol_ss
using SkipScan further.
The key is that for SkipScan to be selected, DISTINCT ON()
can only contain one column and the ORDER BY
columns must batch an existing index and index order.