We’re just getting our feet wet with TimescaleDB. What optimizations could we use to speed up downloading large amounts of time-series data into application RAM? We’re using the pqxx C++ SQL client.
Hi @sbeer, what type of optimization do you need?
Using hypertables you’ll touch the disk in a smarter way: your data is partitioned into chunks by time, so you only access the slices of time-series data you need.
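For example, something like this through pqxx (the `weather` table and its columns are just made up for illustration; `create_hypertable` is the TimescaleDB call that does the partitioning):

```cpp
#include <pqxx/pqxx>

// Create a plain table, then turn it into a hypertable partitioned by time.
void setup_hypertable(pqxx::connection &c)
{
    pqxx::work tx{c};
    tx.exec(R"(CREATE TABLE IF NOT EXISTS weather (
                   time          timestamptz      NOT NULL,
                   station_id    integer          NOT NULL,
                   temperature   double precision,
                   humidity      double precision,
                   precipitation double precision))");
    tx.exec("SELECT create_hypertable('weather', 'time', if_not_exists => TRUE)");
    tx.commit();
}

// Queries constrained on `time` then only touch the chunks covering that range.
pqxx::result one_day(pqxx::work &tx)
{
    return tx.exec_params(
        "SELECT time, temperature FROM weather "
        "WHERE time >= $1::timestamptz AND time < $2::timestamptz",
        "2024-01-01", "2024-01-02");
}
```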
Can you share a bit more of what you’re trying to achieve?
PostgreSQL is a transactional database, which means everything goes to disk for persistence first.
If you’re looking for ways to optimize your ingestion rate and insert more data in parallel, you may take a look at timescaledb-parallel-copy, which lets you ingest large amounts of data asynchronously.
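timescaledb-parallel-copy is written in Go, but the idea is simple enough to mimic from pqxx if you prefer to stay in C++. A rough sketch of the same pattern (not the tool’s actual code; assumes libpqxx 7 and the made-up `weather` table from the earlier snippet): several workers, each with its own connection, COPYing disjoint batches and committing one transaction per batch.

```cpp
#include <pqxx/pqxx>
#include <functional>
#include <string>
#include <thread>
#include <tuple>
#include <vector>

using Row = std::tuple<std::string, int, double>;   // time, station_id, temperature

// COPY one batch over its own connection; one transaction per batch.
void copy_batch(std::string const &conninfo, std::vector<Row> const &rows)
{
    pqxx::connection c{conninfo};
    pqxx::work tx{c};
    auto stream = pqxx::stream_to::table(
        tx, {"weather"}, {"time", "station_id", "temperature"});
    for (auto const &[time, station, temperature] : rows)
        stream.write_values(time, station, temperature);
    stream.complete();
    tx.commit();
}

// Hand each batch of a few thousand rows to its own worker thread.
void ingest_parallel(std::string const &conninfo,
                     std::vector<std::vector<Row>> const &batches)
{
    std::vector<std::thread> workers;
    for (auto const &batch : batches)
        workers.emplace_back(copy_batch, conninfo, std::cref(batch));
    for (auto &w : workers)
        w.join();
}
```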
Yes, we have terabytes of time-series data that we need to store and then ingest into memory when our application restarts. We need Timescale for disk persistence of the data, but then we try to download everything into RAM and keep it in some internal representation (we might skip this step if Timescale proves fast enough to hand out application data directly). Essentially our data consists of weather parameters like precipitation, temperature, humidity, etc.
Hi @sbeer, in this case, just batch the information and split the large file into manageable ones. In my personal benchmarks, weather data can easily be ingested at 5k to 15k rows on a medium machine with 8 CPUs.
You just need to commit the open transaction often.
Make transactions of a few thousand rows. I also recommend checking the code of timescaledb-parallel-copy; it’s quite easy to understand and implements the same batching and parallelization concepts. My guess is that you’re not paginating your process into chunks that your hardware can handle.
Just start by making it a single process, paginate your data, and load it year by year or month by month. That way memory is released and reused by each subsequent batch.
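Something along these lines, as a minimal single-process sketch (the CSV layout and the `weather` table are again made up; libpqxx 7): read a few thousand rows, COPY them, commit, and repeat, so every transaction stays small and the buffers get reused.

```cpp
#include <pqxx/pqxx>
#include <fstream>
#include <sstream>
#include <string>

// Ingest a CSV file in small committed chunks: one transaction per batch.
// (Assumes a plain `time,station_id,temperature` layout with no header line.)
void ingest_csv_in_batches(pqxx::connection &c, std::string const &path,
                           std::size_t batch_size = 5000)
{
    std::ifstream in{path};
    std::string line;
    while (in) {
        pqxx::work tx{c};
        auto stream = pqxx::stream_to::table(
            tx, {"weather"}, {"time", "station_id", "temperature"});
        std::size_t n = 0;
        while (n < batch_size && std::getline(in, line)) {
            std::istringstream fields{line};
            std::string time, station, temperature;
            std::getline(fields, time, ',');
            std::getline(fields, station, ',');
            std::getline(fields, temperature, ',');
            stream.write_values(time, std::stoi(station), std::stod(temperature));
            ++n;
        }
        stream.complete();
        tx.commit();   // commit often: memory for the batch can now be reused
    }
}
```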
Thank you! We typically don’t use pagination, as we offer the data to customers in CSV or binary format. So the only thing that counts is that we can assemble those files in the shortest possible time. Is there such a thing as binary or CSV downloads from a Timescale database as well?
Got it! That seems like a COPY issue, maybe not exclusively related to TimescaleDB.
Can you confirm you’re using COPY? Maybe not fully related to the previous case, but could you share more details about how you’re exporting the full dataset?
Have you run EXPLAIN ANALYZE on the query and checked the output?
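In case it helps, EXPLAIN just returns rows of plan text, so you can run it straight from pqxx; the query below is only an example against the made-up `weather` table.

```cpp
#include <iostream>
#include <pqxx/pqxx>

// Print the execution plan plus runtime statistics for a sample query.
void show_plan(pqxx::connection &c)
{
    pqxx::work tx{c};
    for (auto const &row : tx.exec(
             "EXPLAIN (ANALYZE, BUFFERS) "
             "SELECT * FROM weather WHERE time >= now() - interval '7 days'"))
        std::cout << row[0].c_str() << '\n';
}
```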
Hi @jonatasdp, thanks for the hints. Unfortunately our project got suspended by management, but we might return to it in the future.
We don’t use COPY; in fact we’re using a cursor through the pqxx library that iterates over the database in batches of around 10k rows and stores the results in a custom data structure. There are of course obvious inefficiencies: I think there is one indirection too many in copying from the library-owned buffer into our data structure, and we could also already be downloading the next batch while the copying is dispatched to a different thread. For the former I don’t exactly know how to work around it or what pqxx offers in this regard: can we somehow hook into the data extraction process so it writes straight into our data structure instead of building the result object?
I’d just build my own parser from scratch if the data is not that complex, stream the changes, and keep flushing the batches to guarantee it neither locks nor keeps allocating more and more memory.
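If you’d rather stay inside pqxx than write a parser from scratch, pqxx::stream_from might be worth a look: it uses COPY ... TO STDOUT under the hood and parses each row straight into your own variables, so no pqxx::result object gets built at all. A rough sketch (libpqxx 7; the table, columns and the `Sample` struct are placeholders):

```cpp
#include <pqxx/pqxx>
#include <string>
#include <tuple>
#include <vector>

struct Sample {                 // placeholder for your internal representation
    std::string time;
    double temperature;
    double humidity;
};

// Stream one time slice directly into the application's own container.
std::vector<Sample> load_slice(pqxx::connection &c,
                               std::string const &from, std::string const &to)
{
    std::vector<Sample> out;
    pqxx::work tx{c};
    auto stream = pqxx::stream_from::query(
        tx,
        "SELECT time, temperature, humidity FROM weather "
        "WHERE time >= " + tx.quote(from) + " AND time < " + tx.quote(to));
    std::tuple<std::string, double, double> row;
    while (stream >> row)
        out.push_back({std::get<0>(row), std::get<1>(row), std::get<2>(row)});
    stream.complete();
    tx.commit();
    return out;
}
```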
You can do something that emits a COPY (SELECT ... WHERE <filter by month>) TO <month-csv>, and then you’re already limiting memory use and just need to issue one more call per month.
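Roughly like this (the table, date ranges and output path are placeholders; the server-side COPY ... TO '<file>' form needs a role with pg_write_server_files or superuser, otherwise stream the rows to the client and write the file there):

```cpp
#include <pqxx/pqxx>
#include <string>
#include <utility>
#include <vector>

// Export one CSV per month via server-side COPY; each month is its own
// short-lived transaction, so the client never holds the whole dataset.
void export_monthly_csv(pqxx::connection &c)
{
    std::vector<std::pair<std::string, std::string>> months = {
        {"2024-01-01", "2024-02-01"},          // placeholder month boundaries
        {"2024-02-01", "2024-03-01"},
    };
    for (auto const &[from, to] : months) {
        pqxx::work tx{c};
        tx.exec(
            "COPY (SELECT * FROM weather"
            " WHERE time >= " + tx.quote(from) +
            " AND time < " + tx.quote(to) + ") "
            "TO " + tx.quote("/var/lib/postgresql/export/weather_" + from + ".csv") +
            " WITH (FORMAT csv, HEADER)");
        tx.commit();
    }
}
```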