Sure thing
Schema
Below is our (simplified) main hypertable, which stores when each product from an e-shop was dynamically repriced by which repricing strategy in our application (“groups”). Compression is segmented on product_id, because a given product is expected to remain in the same group for a while, so group_id should compress well.
CREATE TABLE "repricings" (
"client_id" integer NOT NULL,
"product_id" text NOT NULL,
"group_id" bigint DEFAULT NULL,
"time" timestamp NOT NULL,
UNIQUE ("client_id", "product_id", "time")
);
SELECT create_hypertable('repricings', 'time', chunk_time_interval => INTERVAL '1 week');
ALTER TABLE "repricings" SET (timescaledb.compress, timescaledb.compress_orderby = 'time ASC', timescaledb.compress_segmentby = 'client_id, product_id');
SELECT add_compression_policy('repricings', INTERVAL '3d');
We’re building analytics for our clients (e-shops), and our first goal is to be able to tell them how much revenue/cost/profit each “group” generated in a given time period. In addition to the repricings hypertable, we also have a purchases hypertable, which tells us when a product was purchased, with which quantity, and for what total price (“revenue”):
CREATE TABLE "purchases" (
"client_id" integer NOT NULL,
"product_id" text NOT NULL,
"quantity" integer,
"revenue" double precision,
"time" timestamp NOT NULL,
UNIQUE ("client_id", "product_id", "time")
);
SELECT create_hypertable('purchases', 'time', chunk_time_interval => INTERVAL '1 week');
CREATE INDEX "purchases_client_id_index" ON "purchases" ("client_id", "time");
SELECT add_reorder_policy('purchases', 'purchases_client_id_index');
Finally, in order to calculate cost & profit, we need to know the purchase prices of products at any given moment. For this we have a third hypertable, products, with compression segmented on product_id, because the purchase price shouldn’t change too often:
CREATE TABLE "products" (
"client_id" integer NOT NULL,
"product_id" text NOT NULL,
"purchase_price" double precision,
"price" double precision,
"time" timestamp NOT NULL,
UNIQUE ("client_id", "product_id", "time")
);
SELECT create_hypertable('products', 'time', chunk_time_interval => INTERVAL '1 week');
ALTER TABLE "products" SET (timescaledb.compress, timescaledb.compress_orderby = 'time ASC', timescaledb.compress_segmentby = 'client_id, product_id');
SELECT add_compression_policy('products', INTERVAL '3d');
We’re storing this data in 3 tables because, in our application, it is imported via independent processes. To give an idea of the data volume: we currently have about 40 active clients, our biggest client has about 67k active products, and we reprice all of their products 12 times a day. So we ingest about 800k new repricings & products records each day.
Query
In order to determine which group to assign revenue/profit/costs from which product at which time, we start by aggregating the time intervals for each group & product, so we know exactly how much time each product spent in each group. We do so with the following query:
SELECT "product_id", range_agg(tsrange("entered_at", "exited_at")) AS "period", "group_id"
FROM (
  SELECT *, lead("entered_at", 1) OVER (PARTITION BY "product_id" ORDER BY "entered_at" ASC) AS "exited_at"
  FROM (
    SELECT "product_id", "group_id", min("time") AS "entered_at"
    FROM (
      SELECT *, (row_number() OVER (PARTITION BY "product_id" ORDER BY "time" ASC) - row_number() OVER (PARTITION BY "product_id", "group_id" ORDER BY "time" ASC)) AS "partition" -- gaps and islands
      FROM "repricings"
      WHERE (("client_id" = 29) AND ("time" >= '2022-02-03 11:14:13.763475+0000') AND ("time" < '2022-03-02 23:00:00.000000+0000'))
    ) AS "t1"
    GROUP BY "product_id", "group_id", "partition"
  ) AS "t2"
) AS "t3"
GROUP BY "product_id", "group_id"
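To illustrate what this query computes, here is the gaps-and-islands step sketched outside SQL, in Python (all data is hypothetical): consecutive repricings of a product into the same group form one "island", the earliest time per island is when the product entered the group, and the next island's start is when it exited — the role played by the row_number() difference, min("time"), and lead() in the query above.

```python
from itertools import groupby

# Hypothetical repricing events for one product: (time, group_id), sorted by time.
events = [(1, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "A")]

# Gaps & islands: consecutive events with the same group form one island.
# groupby over the time-sorted events plays the role of the
# row_number() difference trick in the SQL query.
islands = []
for group_id, run in groupby(events, key=lambda e: e[1]):
    run = list(run)
    islands.append({"group_id": group_id, "entered_at": run[0][0]})

# lead(entered_at): the next island's start is this island's end;
# the last island is open-ended (None, i.e. an unbounded tsrange).
for island, nxt in zip(islands, islands[1:] + [None]):
    island["exited_at"] = nxt["entered_at"] if nxt else None

print(islands)
# Three islands: group A from 1 to 3, B from 3 to 5, A from 5 onward.
```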
We used subselects each time we needed to work with the result of window aggregates (gaps & islands, min, lead), though I’m guessing CTEs would be more readable.
We use the same approach to get time ranges of unchanged purchase prices from the products table. We then join these aggregated products into purchases, in order to determine profit & costs from purchase prices:
SELECT "purchases"."time", "purchases"."product_id", "revenue", ("quantity" * "purchase_price") AS "costs"
FROM "purchases"
LEFT JOIN (
  -- products aggregation similar to repricings
) AS "products" ON (("products"."product_id" = "purchases"."product_id") AND ("products"."period" @> "purchases"."time"))
WHERE (("client_id" = 29) AND ("time" >= '2022-03-02 23:00:00.000000+0000') AND ("time" < '2022-03-08 23:00:00.000000+0000'))
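The period-containment join ("period" @> "time") amounts to: for each purchase, look up the price interval that contains its timestamp, then compute costs as quantity times that purchase price. A minimal Python sketch of the matching, with hypothetical names and values:

```python
# Hypothetical purchase-price islands for one product, from the products
# aggregation: half-open intervals [start, end) with the price in effect.
price_periods = [(0, 10, 4.0), (10, 20, 5.0), (20, None, 6.0)]

# Hypothetical purchases: (time, quantity, revenue).
purchases = [(5, 2, 12.0), (15, 1, 8.0), (25, 3, 30.0)]

def price_at(t):
    """Return the purchase price whose interval contains t (the @> check)."""
    for start, end, price in price_periods:
        if start <= t and (end is None or t < end):
            return price
    return None  # LEFT JOIN semantics: no matching interval

# revenue stays as-is; costs = quantity * purchase_price at purchase time.
rows = [(t, revenue, qty * price_at(t)) for t, qty, revenue in purchases]
print(rows)  # [(5, 12.0, 8.0), (15, 8.0, 5.0), (25, 30.0, 18.0)]
```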
Finally, we join the purchases (with their revenue/costs) into the repricings aggregate, in order to assign revenue/profit/costs/margin to the appropriate “group”:
SELECT *, coalesce(("profit" / nullif("revenue", 0)), 0) AS "margin"
FROM (
  SELECT "group_id", coalesce(sum("revenue"), 0) AS "revenue", coalesce(sum(("revenue" - "costs")), 0) AS "profit", count(DISTINCT "locations"."product_id") AS "product_count"
  FROM (
    -- the repricings aggregate (product_id, period, group_id)
  ) AS "locations"
  LEFT JOIN (
    -- the purchases aggregate (time, product_id, revenue, costs)
  ) AS "purchases" ON (("purchases"."product_id" = "locations"."product_id") AND ("locations"."period" @> "purchases"."time"))
  GROUP BY "group_id"
) AS "stats"
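The final step sums revenue and profit per group and derives margin, guarding against division by zero the way coalesce(profit / nullif(revenue, 0), 0) does. The same aggregation sketched in Python, on hypothetical already-attributed purchases:

```python
from collections import defaultdict

# Hypothetical purchases already attributed to a group: (group_id, revenue, costs).
attributed = [("A", 100.0, 60.0), ("A", 50.0, 30.0), ("B", 0.0, 0.0)]

totals = defaultdict(lambda: {"revenue": 0.0, "profit": 0.0})
for group_id, revenue, costs in attributed:
    totals[group_id]["revenue"] += revenue
    totals[group_id]["profit"] += revenue - costs

# margin = profit / revenue, or 0 when revenue is 0 (the nullif/coalesce guard).
for stats in totals.values():
    stats["margin"] = stats["profit"] / stats["revenue"] if stats["revenue"] else 0.0

print(dict(totals))
# Group A: revenue 150, profit 60, margin 0.4; group B: all zeros.
```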
This is pretty much what we have for now. We haven’t been able to figure out how to use continuous aggregates, due to the use of window functions & JOINs, but I would like to find a divide & conquer approach (possibly with materialized views).
For a time period of 28 days, the query for a smaller customer (5k products, 1 repricing per day) executes in about 1 second, while for a larger customer (67k products, 12 repricings per day) it takes about 50 seconds. I tried various combinations of indices (without compression), but I couldn’t find a way for the gaps & islands step to use indices for both row_number() window functions, and from my profiling that’s one of the bottlenecks.
I hope this gives you enough information without overwhelming you.