For context, I am building a database that ingests a large amount of crawled data every day. My schema is similar to the following:
CREATE TABLE public.data (
timestamp TIMESTAMPTZ NOT NULL,
id INT REFERENCES public."user"(id),
rank INT,
score INT,
mode MODE_ENUM,
category CATEGORY_ENUM
);
SELECT create_hypertable('public.data', 'timestamp');
ALTER TABLE public.data SET (timescaledb.compress);
The queries I need to be able to do are:
- Query records by user ID within some time duration (day granularity)
- Query records by mode and category within a certain rank interval, within some time duration (ordering by score doesn’t work because ties are broken by means not captured in the data, hence a rank column is needed). Rank values are unique within a category per mode and run from 1 to N.
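In SQL terms (with placeholder literals; the actual filter values vary per request), the two query shapes look roughly like this:

-- (1) one user's records over a day-granularity window
SELECT *
FROM public.data
WHERE id = 12345
  AND timestamp >= '2024-05-01'
  AND timestamp <  '2024-05-08';

-- (2) a mode/category slice over a rank interval and a time window
SELECT *
FROM public.data
WHERE mode = 'some_mode'
  AND category = 'some_category'
  AND rank BETWEEN 1 AND 100
  AND timestamp >= '2024-05-01'
  AND timestamp <  '2024-05-08';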
For the sake of this example, we’ll assume there are 10 modes and 100 categories per mode (the set of categories is the same for every mode). I tried normalizing the table out both ways (by mode and by category), but that involved plenty of joins and indexes, and queries of type (1) were still very slow (30-60 s or more). I am currently experimenting with compression, segment-by/order-by settings, and different indexes on this table.
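One of the variants I have been testing (the segment/order columns and the 7-day compression delay are guesses on my part, not settled choices) looks roughly like this:

-- segment compressed chunks by the filter columns of query (2), and order
-- rows within each segment so rank-range scans stay sequential
ALTER TABLE public.data SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'mode, category',
    timescaledb.compress_orderby = 'rank, timestamp DESC'
);
SELECT add_compression_policy('public.data', INTERVAL '7 days');

-- indexes matching the two query shapes on uncompressed chunks
CREATE INDEX ON public.data (id, timestamp DESC);
CREATE INDEX ON public.data (mode, category, rank, timestamp DESC);

Segmenting by id instead would presumably favor query (1); that trade-off is part of what I am still experimenting with.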
Insert timing is far more flexible: data is loaded in bulk with COPY once a day, and the load can take several hours without issue.
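The daily load itself is nothing special, essentially (path and format are illustrative):

COPY public.data (timestamp, id, rank, score, mode, category)
FROM '/path/to/daily_crawl.csv'
WITH (FORMAT csv);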
Given all the information above, is there an indexing/compression scheme that would speed these queries up? On my current hardware (HDD), I am hoping to get each of the queries above down to around 5 s. The queried data will be cached aggressively further upstream, and inserted data is never updated.