PostgreSQL

Nov 28, 2024

Benchmarking PostgreSQL Batch Ingest

Hi, I'm James. I work in Developer Advocacy at Timescale. Recently, I’ve been diving deep into performance topics and thinking about the kind of content I enjoy reading and writing. This is a new series that will focus on using data visualizations to spotlight single concepts in Timescale or PostgreSQL performance. No marketing fluff, no unrealistic benchmarks, just straightforward explorations of performance topics I find interesting, with some notes on why I think they matter.

Benchmarking PostgreSQL Batch Ingest

In a previous article in this series, I explored the magic of INSERT...UNNEST for improving PostgreSQL batch INSERT performance. While it’s a fantastic technique, I know it’s not the fastest option available (although it is very flexible). Originally, I hadn't intended to loop back and benchmark all the batch ingest methods, but I saw a lot of confusion out there, so I'm back, and this time I'm looking at COPY too. As usual for this series, it’s not going to be a long post, but it is going to be an informative one. 

I flipped my approach for this post, comparing not just the PostgreSQL database performance in isolation but the practical performance from an application. To do this, I built a custom benchmarking tool in Rust to measure the end-to-end performance of each method. In this article, I’ll walk you through the batch ingest options you’ve got, and how they stack up (spoiler alert, the spread is over 19x!).

The Introduction: Batch Ingest in PostgreSQL

I’m defining batch ingest as writing a dataset to PostgreSQL in batches or chunks. You’d usually do this because the data is being collected in (near) real time (think a flow of IoT data from sensors) before being persisted into PostgreSQL (hopefully with TimescaleDB, although that's out of scope for this post).

Writing a single record at a time is incredibly inefficient, so writing batches makes sense (the size probably depends on how long you can delay writing). Just to be clear this isn't about loading a very large dataset in one go, I’d call that bulk ingest not batch ingest (and you'd usually do that from a file).

Broadly speaking, there are two methods for ingesting multiple values at once in PostgreSQL: INSERT and COPY. Each of these methods has a few variants, so let's look at the differences.

INSERT: VALUES and UNNEST

The most common method to ingest data in PostgreSQL is the standard INSERT statement using the VALUES clause. Everyone recognizes it, and every language and ORM (object-relational mapper) can make use of it. While you can insert a single row, we are interested in batch ingest here, passing multiple values using the following syntax (this example is a batch of three for a table with seven columns).

INSERT INTO sensors
VALUES 
    ($1,  $2,  $3,  $4,  $5,  $6,  $7),
    ($8,  $9,  $10, $11, $12, $13, $14),
    ($15, $16, $17, $18, $19, $20, $21);

The other method is INSERT...UNNEST. Here, instead of passing a value per attribute (so batch size * columns total values), we pass an array of values per column.

INSERT INTO sensors
SELECT * FROM unnest(
    $1::int[],
    $2::timestamp[],
    $3::float8[],
    $4::float8[],
    $5::float8[]
    $6::float8[]
    $7::float8[]
);

If you’re after a discussion on the difference between the two, then check out Boosting Postgres INSERT Performance by 2x With UNNEST.

Each of these queries can be actually sent to the database in a few ways:

  • You could construct the query string manually with a literal value in place of the $ placeholders. I haven’t benchmarked this because it’s bad practice and can open you up to SQL injection attacks (never forget about Little Bobby Tables). 
  • You could use your framework to send a parameterized query (which looks like the ones above with $ placeholders) which sends the query body and the values to Postgres as separate items. This protects against SQL injection and speeds up query parsing.
  • You could use a prepared statement (which would also be parameterized) to let the database know about your query ahead of time, then just send the values each time you want to run it. This provides the benefits of parameterization, and also speeds up your queries by reducing the planning time.

Most frameworks implement prepared statements using the binary protocol directly, but you can use the PREPARE and EXECUTE SQL commands to do the same thing from SQL.

Keep in mind that PostgreSQL has a limit of 32,767 parameterized variables in a query. So if you had seven columns, then your maximum batch size for INSERT…VALUES would be 4,681. When you’re using INSERT…UNNEST, you’re only sending one parameter per column. Because PostgreSQL can support at most 1,600 columns, you'll never hit the limit.

💡
When using PREPARED statements for batch ingest, ensure that plan_cache_mode is not set to force_custom_plan. This setting is designed for queries that benefit from being re-planned for each execution, which isn’t the case for batch inserts.

By default, plan_cache_mode is set to auto, meaning PostgreSQL will use custom plans for the first five executions before switching to a generic plan. To optimize performance, you could consider changing your session to force_generic_plan, ensuring the query is planned just once and reused for all subsequent executions.

COPY: Text and Binary

COPY is a PostgreSQL-specific extension to the SQL standard for bulk ingestion (strictly speaking we are talking about COPY FROM here because you can also COPY TO which moves data from a table to a file).

COPY shortcuts the process of writing multiple records in a number of ways, with two of the most critical ones being:

  1. WAL MULTI-INSERT: PostgreSQL optimizes COPYwrite-ahead operations by writing  MULTI_INSERT records to the WAL (write-ahead log) instead of logging each row individually. This results in less data being written to the WAL files, which means less I/O (input/output) on your database.
  2. COPY ring buffer: To avoid polluting shared buffers, COPY uses a dedicated ring buffer for its I/O operations. This minimizes the impact on the buffer cache used for regular queries, preserving performance for other database operations. So, less about raw speed and more about not being a noisy neighbor.

COPY can read data from multiple sources, local files, local commands or standard input. For batch ingestion, standard input makes the most sense as the data can be sent directly from the client without an intermediate step. I was actually surprised by the amount of people who reached out on Reddit following my last post saying they couldn’t use COPY because they would need to write out their stream of data as a file, that’s 100 percent what the STDIN setting is for!

COPY can use two different protocols, text and binary.

  • Text supports formats like CSV and involves sending text strings over the wire to be parsed by the server before ingestion. You can actually just dump raw CSV records into COPY.
  • Binary supports writing data in the native PostgreSQL format from the client, removing the need for parsing on the server side. It’s much faster but also much less flexible, with limited support in many languages. To do this you need to type your data in your client, so you know the format to write it in.

The two variants of COPY we'll be testing are the text version using:

COPY sensors FROM STDIN;

And the binary version using:

COPY sensors FROM STDIN WITH (FORMAT BINARY);

COPY isn't a normal SQL statement, so it can’t exist within a larger query. If also can’t perform an upsert (like INSERT … ON CONFLICT), although from PostgreSQL 17, a text COPY can now simulate INSERT … ON CONFLICT DO NOTHING by ignoring errors with ON_ERROR IGNORE.

The Setup 

I created my own Rust CLI tool to run this benchmark. That might seem like overkill, but I did it for the following reasons:

  • I needed something that supported COPY FROM STDIN and COPY WITH (FORMAT BINARY) directly, ruling out Grafana K6 and the PostgreSQL native pgbench.
  • I needed something that would let me run parameterized and prepared queries using the binary protocol directly, not using PREPARE and EXECUTE, because this is how most frameworks operate.
  • I wanted to measure the timing from an application's viewpoint, including data wrangling, network round-trip latency, and database operations.
  • I start measuring time after I've read the CSV file from disk and loaded it into a Rust data structure. This is to avoid measuring the I/O limits of the benchmark client. Batch ingest normally takes place in a stream without data being read from files.
  • I love Rust (if you’re after more Rust x PostgreSQL content, check out my talk at PGConf.EU 2024)!

The tool tests the insertion of data from a CSV file into the database (the default file is one million records) with multiple batch sizes and ingest methods into a table with five metric columns (the actual schema isn’t important, I just love the fact a lot of power and renewables companies use Timescale ♺):

CREATE TABLE IF NOT EXISTS power_generation (
    generator_id INTEGER, 
    timestamp TIMESTAMP WITH TIME ZONE,
    power_output_kw DOUBLE PRECISION, 
    voltage DOUBLE PRECISION,
    current DOUBLE PRECISION,
    frequency DOUBLE PRECISION,
    temperature DOUBLE PRECISION
 );

It supports a combination of the following methods (or you can use the –all shortcut) for insertion over multiple batch sizes per run:

  • Batch insert (parameterized)
  • Prepared batch insert
  • UNNEST insert (parameterized)
  • Prepared UNNEST insert
  • COPY
  • Binary COPYcomma-separated

The tool supports a few options, the most important being the comma-separated list of batch sizes.

Usage: pgingester [OPTIONS] [METHODS]...

Arguments:
  [METHODS]...  [possible values: insert-values, prepared-insert-values, insert-unnest, prepared-insert-unnest, copy, binary-copy]

Options:
  -b, --batch-sizes <BATCH_SIZES>              [default: 1000]
  -t, --transactions                           
  -c, --csv-output                             
  -a, --all                                    
  -c, --connection-string <CONNECTION_STRING>  [env: CONNECTION_STRING=]
  -f, --input-file <INPUT_FILE>                [default: ingest.csv]
  -h, --help                                   Print help
  -V, --version                                Print version

I tested with a connection to a Timescale 8 CPU/32 GB memory server (although it was only using a single connection, so this is overkill). 

The Results

Running the CLI tool with the following arguments will output a bit list of ingest performance, including the relative speed for each tested method.

pgingester --all --batch-sizes 1000,5000,10000,100000,1000000

I ran all queries with multiple batch sizes with the default CSV input (one million lines). The Insert VALUES and Prepared Insert VALUES queries will only run for the 1,000 batch size as above there are too many parameters to bind (the warnings to standard error have been removed below).

We can make a number of interesting conclusions from this data:

  1. With a larger batch size (anything other than 1,000) binary, COPY is substantially faster (at least 3.6x)  than anything else (19x faster than a naive parameterized INSERT...VALUES). This is because it doesn’t have to do any data parsing on the server side. The more data you load in a single batch, the more pronounced the difference will become.
  2. Text COPY also performs well, but surprisingly it’s surpassed in speed by prepared statements for batches of 10,000 or less. 
  3. Both COPY variants perform poorly with batches of 1,000. Interestingly, I've seen a lot of batch ingest tools actually use this.
  4. When you’re using INSERT for batch ingest, prepared statements always outperform parameterized ones. If you want maximum speed, the same number of parameters regardless of the batch size, and to avoid the maximum number of parameters being hit on larger batches then use INSERT…UNEST.
  5. INSERT...UNNEST at a batch size of 100,000 does a lot better against any of the text COPY variants than I thought it would: there is actually only 3 ms in it 👀!
💡
I ran this with a larger dataset of 100 million rows as well. Performance is slightly worse, probably because PostgreSQL is checkpointing in the background. However, the general numbers and relative speeds remain very similar.

So, Which Should You Use?

If you're looking to optimize batch ingestion in PostgreSQL, the right method depends on your specific use case, batch size, and application requirements. Here’s how the options stack up:

  1. Small Batch Sizes (<= 10,000 rows): Prepared INSERT...UNNEST can be surprisingly competitive. Down at a batch size of 1,000, COPY is actually much slower. 
  2. Large Batch Sizes (> 10,000 rows): For maximum throughput with larger batches, binary COPY is unbeatable. Its ability to bypass server-side parsing and its use of a dedicated ring buffer make it the top choice for high-velocity data pipelines. If you need speed, you can have larger batches, and your application can support the binary protocol, this should be your default.
  3. Ease of Implementation: If you prioritize ease of implementation or need compatibility across a wide range of tools, text COPY is a great middle-ground. It doesn't require complex client-side libraries and is supported in nearly every language that interacts with PostgreSQL. You can also just throw your CSV data at it.
  4. Considerations Beyond Speed:
    • Upserts: If you need conflict handling (INSERT...ON CONFLICT), COPY isn't an option, and you'll need to stick with INSERT (unless you just want to ignore errors and you're happy with text COPY, in which case PostgreSQL 17 has your back with ON_ERROR).
    • Framework support: Ensure your preferred framework supports your chosen method; COPY usually requires a different API to be used and binary COPY may require an extension library or not be supported.
    • Batch size limits: Watch for the 32,767-parameter limit when using parameterized INSERT...VALUES.
    • Memory and disk write overheads: COPY is designed to have the least impact on your system, writing less data to disk and not polluting shared_buffers. This is actually a big consideration! In fact, both the COPY methods write 62 MB of WAL for the one million row test, while INSERT writes 109 MB. This ~1.7x rule seems to hold across any ingest size.

Final Thoughts for Developers

When it comes to PostgreSQL batch ingestion, there is no one-size-fits-all solution. Each method offers trade-offs between performance, complexity, and flexibility:

  • For maximum raw speed over larger batches, binary COPY is your best bet.
  • For flexibility and ease of use over larger batches, text COPY balances speed with broad support.
  • For smaller batches or compatibility-focused workflows, prepared INSERT...UNNEST statement can hold its own, offering competitive speeds with maximum flexibility (but remember, if you have a heavy ingest pipeline, you risk disrupting shared_buffers, and you will be writing more to WAL).

Remember, the “best” method isn’t just about ingest speed; it’s about what fits your workflow and scales with your application. Happy ingesting!

Originally posted

Nov 26, 2024

Last updated

Nov 28, 2024

Share

Subscribe to the Timescale Newsletter

By submitting you acknowledge Timescale's Privacy Policy.