Hello,
I’m one of the developers of Armada (https://armadaproject.io/), which is a high-throughput job scheduler for Kubernetes. We’re investigating TimescaleDB to see if it will meet our needs for storing job-level metrics (CPU, memory, IO, etc.).
Because Armada is high throughput, one of the real challenges here is making sure it scales. In terms of our numbers:
- At any one time we might have 300K jobs running.
- Every minute, each job produces 10 distinct metrics, all at the same timestamp.
- Average job runtime is 30 minutes, so an average of 30 rows per job. Some jobs may run for much longer and have up to 100K data points; other jobs may fail immediately, leaving just a single data point.
- Jobs have around 10 tags associated with them (queue, jobset, dag, stage, etc.); we need to be able to query and aggregate by tag.
- We’re looking to store the data in a database that can either retrieve the time series for a given job or retrieve an aggregate over a time range, with latencies appropriate for displaying in a UI (i.e. most queries should be sub-second, and nothing should really take more than about 10 seconds).
- We want to be able to store a year’s worth of data at full resolution.
That works out as follows:
- 300K rows a minute is around 5K rows a second.
- Storing a year’s worth of history means around 157 billion rows.
- 157 billion rows of 10 doubles is around 12.5TB. Tags and the primary key will increase this, but on the other hand the data should compress quite well, as metrics data is usually highly compressible (e.g. because many jobs peg the CPU at 100% for long periods).
- The cardinality of the tags is huge.
What I’ve Done So Far:
I’ve performed the following tests on a single node with 32 physical cores and 256GB memory, running Postgres 15. The schema looks as follows:
CREATE TABLE job_metrics (
time TIMESTAMPTZ NOT NULL,
jobId TEXT NOT NULL, -- these are uuids.
tags JSONB NOT NULL,
cpu DOUBLE PRECISION NULL,
memory DOUBLE PRECISION NULL,
diskRead DOUBLE PRECISION NULL,
diskWrite DOUBLE PRECISION NULL,
networkReceive DOUBLE PRECISION NULL,
networkSend DOUBLE PRECISION NULL,
gpu DOUBLE PRECISION NULL,
nfsRead DOUBLE PRECISION NULL,
nfsWrite DOUBLE PRECISION NULL
);
CREATE INDEX ix_jobid_time ON job_metrics (jobId, time DESC);
CREATE INDEX idxgin ON job_metrics USING GIN (tags);
SELECT create_hypertable('job_metrics', 'time', chunk_time_interval => INTERVAL '1 day');
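On the compression point from the numbers above, the sort of thing I have in mind is TimescaleDB’s native compression, roughly like this (just a sketch; the segmentby/orderby choices and the 7-day threshold are placeholder guesses, nothing I’ve benchmarked yet):

-- Sketch: enable native compression, segmenting by job so each job's
-- series compresses together (column choices are guesses, not tested).
ALTER TABLE job_metrics SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'jobId',
  timescaledb.compress_orderby = 'time DESC'
);

-- Compress chunks older than 7 days (threshold is an arbitrary placeholder).
SELECT add_compression_policy('job_metrics', INTERVAL '7 days');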
Note that I’m storing the tags at the row level as a JSONB column. This has a serious penalty in terms of the amount of data written, but my initial attempt to normalise these out into a separate tags table led to performance problems when joining back to create continuous aggregates.
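For context, the kind of tag-filtered aggregate query I want to support looks roughly like this (a sketch; the queue value and bucket size are just examples, and I’m assuming the @> containment operator will use the GIN index):

-- Sketch: aggregate CPU over a time range for one queue, filtering the
-- JSONB tags via containment (queue value is made up).
SELECT time_bucket(INTERVAL '5 minutes', time) AS bucket,
       SUM(cpu) AS cpu
FROM job_metrics
WHERE tags @> '{"queue": "some-queue"}'::jsonb
  AND time > now() - INTERVAL '1 day'
GROUP BY bucket
ORDER BY bucket;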
I’ve then created some continuous aggregates:
CREATE MATERIALIZED VIEW one_min_aggregates
WITH (timescaledb.continuous) AS
SELECT tags#>'{queue}' as queue, tags#>'{jobset}' as jobset, tags#>'{stage}' as stage,
time_bucket(INTERVAL '1 minute', time) AS bucket_1min,
SUM(cpu) as cpu,
SUM(memory) as memory,
SUM(diskRead) as diskRead,
SUM(diskWrite) as diskWrite,
SUM(networkReceive) as networkReceive,
SUM(networkSend) as networkSend,
SUM(gpu) as gpu,
SUM(nfsRead) as nfsRead,
SUM(nfsWrite) as nfsWrite
FROM job_metrics
GROUP BY tags#>'{queue}', tags#>'{jobset}', tags#>'{stage}', bucket_1min;
CREATE INDEX ix_one_min_aggregates ON one_min_aggregates (bucket_1min, queue, jobset, stage);
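To keep this aggregate up to date, I’d expect to attach a refresh policy along these lines (a sketch; the offsets and schedule interval are assumptions rather than tuned values):

-- Sketch: refresh the last hour of the 1-minute aggregate every 5 minutes
-- (offsets and schedule are placeholder values, not tuned).
SELECT add_continuous_aggregate_policy('one_min_aggregates',
  start_offset      => INTERVAL '1 hour',
  end_offset        => INTERVAL '5 minutes',
  schedule_interval => INTERVAL '5 minutes');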
Using this one-minute aggregate, I can create other aggregates (5 min, 15 min, 1 hr, 4 hr, 1 day), each derived from the previous one.
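For example, the 5-minute level built on top of the 1-minute aggregate would look roughly like this (a sketch assuming hierarchical continuous aggregates, i.e. TimescaleDB 2.9+; only a couple of the metric columns are shown):

-- Sketch: 5-minute rollup derived from the 1-minute continuous aggregate
-- (assumes support for continuous aggregates on continuous aggregates).
CREATE MATERIALIZED VIEW five_min_aggregates
WITH (timescaledb.continuous) AS
SELECT queue, jobset, stage,
       time_bucket(INTERVAL '5 minutes', bucket_1min) AS bucket_5min,
       SUM(cpu) AS cpu,
       SUM(memory) AS memory
FROM one_min_aggregates
GROUP BY queue, jobset, stage, bucket_5min;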
So far my testing has shown the following:
- Write performance is good. I can write at around 70-80K rows/sec sustained. This is an order of magnitude more than we need.
- Looking up the time series for a given job is, as expected, fast (tens of milliseconds) as it’s using a plain Postgres index. Filtering this to a time range is also fast.
- Creating the aggregations has been challenging. The one-minute aggregation above takes 17 seconds to run for a 5-minute window and almost 2 hours for a 1-day window. This is borderline acceptable: it’s probably fast enough for now, but it doesn’t give us much headroom if data volumes increase or we need more aggregations.
- On multiple occasions I seem to have been able to hose the database by running a big aggregation, e.g. the 1-minute aggregation for a whole day (see the sketch below). In these cases the DBAs have had to come in and kill the offending query.
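For reference, the kind of manual refresh involved looks roughly like the following (dates are just examples); breaking it into smaller windows is something I’ve considered but not yet tested:

-- Sketch: a one-day manual refresh of the 1-minute aggregate; this is the
-- kind of call that has hosed the database (dates are examples only).
CALL refresh_continuous_aggregate('one_min_aggregates',
     '2023-06-01 00:00:00+00'::timestamptz, '2023-06-02 00:00:00+00'::timestamptz);

-- Possible alternative (untested assumption): refresh in smaller slices,
-- e.g. one hour at a time, potentially from several sessions.
CALL refresh_continuous_aggregate('one_min_aggregates',
     '2023-06-01 00:00:00+00'::timestamptz, '2023-06-01 01:00:00+00'::timestamptz);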
Based on the above, I wonder if the community has any advice for me. My specific questions would be:
- Am I right in thinking that Timescale would be a good fit for my use case? Specifically, can I get the performance I want with the data volume I have?
- Is there a way of making the continuous aggregations more performant? One thing I have noticed is that the continuous aggregation seems to be single-threaded, although the underlying query plan itself seems to be (in theory) parallelisable. I did wonder if we could potentially increase performance by getting the aggregation to run in parallel.
- Is there a better way of storing my tags than a JSONB on every row? The size of the tags is a lot bigger than the size of the metrics themselves, so it’s a bit sad that I’m doing this, but when I tried to normalise them out, the continuous aggregation performance suffered; I presume because (for a one-day window) it’s now trying to join 413 million rows to 18 million rows, and that’s never going to be quick.
- So far I’ve only been able to get 10 days’ worth of data into the DB, as even ingesting at 70K rows a second it takes a long time to fill up. Do you think I’ll see any issues I’ve not already encountered at higher data volumes? I’m hoping not, as the hypertable is partitioned by day.
- My continuous aggregate above is just one of many we may want to produce, e.g. that aggregate is just totals, and we may want different aggregates for max, min, avg, etc. (a sketch of what I mean is below). Given that a single CA is causing performance issues, is it reasonable to want to produce so many aggregates?
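To make that last question concrete, the sort of additional aggregate I mean would be something like the following (a sketch only, with the column list trimmed):

-- Sketch: a second 1-minute aggregate tracking max/avg rather than totals
-- (illustrative only; most metric columns omitted).
CREATE MATERIALIZED VIEW one_min_max_avg_aggregates
WITH (timescaledb.continuous) AS
SELECT tags#>'{queue}' AS queue,
       time_bucket(INTERVAL '1 minute', time) AS bucket_1min,
       MAX(cpu) AS max_cpu,
       AVG(cpu) AS avg_cpu,
       MAX(memory) AS max_memory,
       AVG(memory) AS avg_memory
FROM job_metrics
GROUP BY tags#>'{queue}', bucket_1min;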
Sorry for such a long post; any views (even partial ones) would be much appreciated!
many thanks,
Chris