300m row ingestion into a hypertable per day

For context, I am building a database that has to deal with ingesting lots of crawled data per day. My schema is similar to the following:

CREATE TABLE public.data (
    timestamp TIMESTAMPTZ,
    id INT REFERENCES public.user(id),
    rank INT,
    score INT,
    mode MODE_ENUM,
    category CATEGORY_ENUM
);

SELECT create_hypertable('public.data', 'timestamp');

ALTER TABLE public.data SET (timescaledb.compress);

The queries I need to be able to do are:

  1. Query records by user ID within some time duration (day granularity)
  2. Query records by mode and category within a certain rank interval and within some time duration (ordering by score doesn't work because ties are broken by other means not captured in the data, hence a rank column is needed). Rank values are unique within a category per mode and run from 1 to N. Both query shapes are sketched right below.
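
Roughly, the two queries look like this (the user ID, enum values, rank interval, and time ranges are placeholders):

-- (1) records for one user within a day-granularity time range
SELECT *
FROM public.data
WHERE id = 42
  AND timestamp >= '2023-01-01'
  AND timestamp < '2023-01-02';

-- (2) records for a mode/category within a rank interval and time range
SELECT *
FROM public.data
WHERE mode = 'some_mode'
  AND category = 'some_category'
  AND rank BETWEEN 1 AND 100
  AND timestamp >= '2023-01-01'
  AND timestamp < '2023-01-08';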

For the sake of this example, we'll assume that there are 10 modes and 100 categories per mode (the categories are the same for every mode). I tried normalizing out both ways (by mode and by category), but it involved plenty of joins and indexes, and queries of type (1) were still very slow (30s to 60s or more). I am still experimenting with compression, segmenting, and ordering on this table, together with different indexes; one combination I'm trying is sketched below.
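
For example, one combination I'm currently trying (the segmentby/orderby choices and index columns are guesses at this point, not settled):

ALTER TABLE public.data SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'mode, category',
    timescaledb.compress_orderby = 'rank, timestamp DESC'
);

-- aimed at query (1): user ID + time
CREATE INDEX ON public.data (id, timestamp DESC);

-- aimed at query (2): mode/category + rank + time
CREATE INDEX ON public.data (mode, category, rank, timestamp DESC);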

Insert times are much more flexible: they happen in bulk with COPY once a day and can take several hours without issue.
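
The daily load is roughly the following (the file path and CSV options are illustrative):

COPY public.data (timestamp, id, rank, score, mode, category)
FROM '/path/to/daily_crawl.csv'
WITH (FORMAT csv, HEADER true);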

Given all the information above, is there an indexing/compression scheme that would speed these queries up? On the current hardware I have (HDD), I am hoping to get it down to around 5s for either of the queries above. The queried data will go through aggressive caching further upstream. Inserted data is never updated.

Hi @ajaypillay, please consider using continuous aggregates for your most frequent queries.
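
A minimal sketch against your schema, assuming a daily bucket fits your queries (the view name and the chosen aggregates are just examples):

CREATE MATERIALIZED VIEW data_daily
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', timestamp) AS bucket,
       id,
       mode,
       category,
       min(rank)  AS best_rank,
       max(score) AS best_score
FROM public.data
GROUP BY bucket, id, mode, category;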

Also, if you need to fetch raw data and there is too much of it to present on screen, look into downsampling to improve performance and reduce bandwidth, IO, and memory usage.
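
For example, a simple downsample can bucket the raw rows into a coarser interval before returning them (the interval, filters, and aggregates here are illustrative):

SELECT time_bucket('1 hour', timestamp) AS bucket,
       id,
       min(rank)  AS best_rank,
       avg(score) AS avg_score
FROM public.data
WHERE id = 42
  AND timestamp >= '2023-01-01'
  AND timestamp < '2023-01-08'
GROUP BY bucket, id
ORDER BY bucket;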

As your data comes in controlled batches, you may also want to trigger your continuous aggregate refreshes right after you load the data, instead of using a refresh policy.
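
With the example view above, that would be a call like this right after the daily COPY finishes (the window should cover the day you just loaded):

CALL refresh_continuous_aggregate('data_daily', '2023-01-01', '2023-01-02');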

If you can control the load and insert in timestamp order, it will also make things faster.
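
For example, if you load into a staging table first (staging_data here is hypothetical), sorting before the final insert keeps the chunks written largely in time order:

INSERT INTO public.data
SELECT *
FROM staging_data
ORDER BY timestamp, mode, category, rank;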

Also, if your system becomes profitable, consider replacing the HDD with SSD/NVMe hardware, because read/write throughput on spinning disks is very low compared to modern storage.

Just leaving some useful ratios here too:

  • SATA SSDs vs HDDs (sequential): 3-4x faster.
  • SATA SSDs vs HDDs (random): ~600x faster.
  • NVMe SSDs vs HDDs (sequential): Up to 44x faster.

If you have a specific problem with a query, please run EXPLAIN ANALYZE <your query>; then you can use the output to see which indexes and other techniques could be applied to improve performance.
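
For example, for query type (1) (values are placeholders):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM public.data
WHERE id = 42
  AND timestamp >= '2023-01-01'
  AND timestamp < '2023-01-02';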