In brief:
Hi all, I’ve been trying to set up a multi-node TimescaleDB cluster in Kubernetes and I’m having a few issues with ingestion and space partitioning.
We’re currently running a single-node cluster successfully, which can ingest around 3.5x our incoming message rate. With the multi-node cluster we can match those numbers if we use time partitioning only, but I’d like to add space partitioning on a specific UUID to speed up query times. With space partitioning enabled we only achieve around 1.5x our incoming message rate with 2 or 4 nodes, or 0.9x with 16 nodes. CPU and memory consumption are well within limits, and I don’t believe the disks are the issue, as IOPS and throughput are very low and turning fsync off yielded no improvement.
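For context, the space-partitioned hypertables are created roughly like this. It’s a minimal sketch with made-up table and column names (and psycopg2 just as an example driver), since I can’t share the real schema:

```python
# Rough sketch of how a space-partitioned distributed hypertable is created.
# "readings" and "device_id" are placeholders; device_id stands in for the
# UUID we partition on, and number_partitions here is only illustrative.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS readings (
    ts        TIMESTAMPTZ      NOT NULL,
    device_id UUID             NOT NULL,
    value     DOUBLE PRECISION
);
SELECT create_distributed_hypertable(
    'readings', 'ts',
    partitioning_column => 'device_id',
    number_partitions   => 4,
    chunk_time_interval => INTERVAL '1 hour'
);
"""

with psycopg2.connect("dbname=metrics") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```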
Would anyone be able to give me a hand with this? I’ll post more details in the thread
We ingest dynamic data from our users, so I can’t be too specific around database design and usage patterns, but these are some statistics I’ve been able to gather:
Num tables: 207
- Partitioned by ID: 161
- Partitioned by time only: 46
Distinct IDs
- max: 7,859
- mean: 1,408.90
Rows per day: 5,997,405,714
- max: 593,721,087
- mean: 28,833,681.32
Database growth per day
- before compression: 800.61GB
- after compression: 121.99GB
Chunk sizes
- before compression: max 4.41GB, mean 0.17GB
- after compression: max 0.49GB, mean 0.03GB
Num chunks: 132,882
Message batches per day: 111,000,000
Average rows per batch: 54.03
We’re using at most 120 apps for inserting, sharing 80 connections through PgBouncer transaction pooling; we found that more connections than that caused performance degradation.
We’re also using 1-hour chunks, with a compression policy that kicks in after 48 hours and a 30-day retention policy.
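Against the same illustrative table as above, those policies look roughly like this (TimescaleDB 2.x syntax; the segmentby column is only an example):

```python
# Sketch of the compression and retention policies described above.
# Table and column names are placeholders.
import psycopg2

POLICIES = """
ALTER TABLE readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id'  -- example segmentby column
);
SELECT add_compression_policy('readings', INTERVAL '48 hours');
SELECT add_retention_policy('readings', INTERVAL '30 days');
"""

with psycopg2.connect("dbname=metrics") as conn, conn.cursor() as cur:
    cur.execute(POLICIES)
```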
Hi @adam. Typically, space partitioning should give better performance for ingest since time-only partitioning often means that you are only writing to one data node at a time, which is the one hosting the current time chunk. Adding space partitioning should ensure you write to multiple nodes in the same time range, thus spreading the load much better.
Therefore, your results are surprising, and I am wondering what is really going on. Can you somehow assess how well the ingest is spread across the data nodes?
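One rough way to check is to count how the chunks land on each data node via `timescaledb_information.chunks`. This is just a sketch (the hypertable name is a placeholder, and the `data_nodes` column is only populated for distributed hypertables):

```python
# Sketch: count chunks per data node as a rough proxy for write distribution.
import psycopg2

QUERY = """
SELECT node, count(*) AS chunks
FROM timescaledb_information.chunks,
     LATERAL unnest(data_nodes) AS node
WHERE hypertable_name = %s
GROUP BY node
ORDER BY chunks DESC;
"""

with psycopg2.connect("dbname=metrics") as conn, conn.cursor() as cur:
    cur.execute(QUERY, ("readings",))  # placeholder hypertable name
    for node, chunk_count in cur.fetchall():
        print(node, chunk_count)
```

Even chunk counts do not guarantee even write rates, but a heavy skew would show up here.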
General things to think about for good ingest performance (a rough illustration follows the list):
- Try to use COPY instead of INSERT when possible
- Insert in large batches (many rows in a single transaction and statement)
- Avoid ON CONFLICT, triggers and RETURNING clauses if possible
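For illustration only (placeholder table and columns, and psycopg2 just as an example driver), batched multi-row inserts and COPY could look something like this:

```python
# Two ingest paths: a single multi-row INSERT vs. COPY from an in-memory buffer.
# Table and column names are placeholders.
import io

import psycopg2
from psycopg2.extras import execute_values

def insert_batch(conn, rows):
    # Many rows in one INSERT statement, one transaction.
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO readings (ts, device_id, value) VALUES %s",
            rows,
            page_size=1000,
        )
    conn.commit()

def copy_batch(conn, rows):
    # Same rows via COPY, streamed from an in-memory CSV buffer.
    buf = io.StringIO("".join(f"{t},{d},{v}\n" for t, d, v in rows))
    with conn.cursor() as cur:
        cur.copy_expert(
            "COPY readings (ts, device_id, value) FROM STDIN WITH (FORMAT csv)",
            buf,
        )
    conn.commit()

conn = psycopg2.connect("dbname=metrics")
insert_batch(conn, [("2024-01-01 00:00:00+00",
                     "00000000-0000-0000-0000-000000000001", 1.5)])
```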
As to those guidelines: unfortunately we’re pulling our data from SNS and leverage the safety guarantees it provides to simplify the app code. In practice this means we’re limited to inserting up to 10 message batches at once, which skirts those safety guarantees but increases ingest performance slightly over inserting one batch at a time (still not enough to keep up with single-node ingest performance). However, because the data is completely dynamic, we can’t rule out that every message in a batch is destined for a different table. Would COPY still provide a benefit in this case?
From running a couple of tests on the above: combining 10 message batches into a single insert gets us from ~1.5x to ~2x, while COPY doesn’t work at all for our workload, yielding only ~0.03x.
Edit: the figure above for COPY is incorrect; it’s closer to ~0.5x.
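For reference, the “10 batches as a single batch” test does roughly the following. This is a heavily simplified sketch rather than our actual code, since the real schemas are dynamic and the names here are placeholders:

```python
# Rough sketch: combine several message batches into one multi-row INSERT per
# destination table, all within a single transaction. Names are placeholders.
from collections import defaultdict

import psycopg2
from psycopg2 import sql
from psycopg2.extras import execute_values

def insert_combined(conn, batches):
    """batches: iterable of (table_name, column_names, rows) tuples."""
    # Group rows from all batches by their destination table.
    grouped = defaultdict(list)
    for table, columns, rows in batches:
        grouped[(table, tuple(columns))].extend(rows)

    # One multi-row INSERT per table, all in the same transaction.
    with conn.cursor() as cur:
        for (table, columns), rows in grouped.items():
            stmt = sql.SQL("INSERT INTO {} ({}) VALUES %s").format(
                sql.Identifier(table),
                sql.SQL(", ").join(map(sql.Identifier, columns)),
            ).as_string(cur)
            execute_values(cur, stmt, rows)
    conn.commit()
```

That grouping is what the ~2x figure above refers to.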