Benchmarking PostgreSQL Batch Ingest
In a previous article in this series, I explored the magic of INSERT...UNNEST for improving PostgreSQL batch INSERT performance. While it’s a fantastic technique, I know it’s not the fastest option available (although it is very flexible). Originally, I hadn't intended to loop back and benchmark all the batch ingest methods, but I saw a lot of confusion out there, so I'm back, and this time I'm looking at COPY too. As usual for this series, it’s not going to be a long post, but it is going to be an informative one.
I flipped my approach for this post, comparing not just PostgreSQL database performance in isolation but the practical performance seen from an application. To do this, I built a custom benchmarking tool in Rust to measure the end-to-end performance of each method. In this article, I’ll walk you through the batch ingest options you’ve got and how they stack up (spoiler alert: the spread is over 19x!).
The Introduction: Batch Ingest in PostgreSQL
I’m defining batch ingest as writing a dataset to PostgreSQL in batches or chunks. You’d usually do this because the data is being collected in (near) real time (think a flow of IoT data from sensors) before being persisted into PostgreSQL (hopefully with TimescaleDB, although that's out of scope for this post).
Writing a single record at a time is incredibly inefficient, so writing batches makes sense (the size probably depends on how long you can delay writing). Just to be clear, this isn't about loading a very large dataset in one go; I’d call that bulk ingest, not batch ingest (and you'd usually do that from a file).
Broadly speaking, there are two methods for ingesting multiple values at once in PostgreSQL: INSERT and COPY. Each of these methods has a few variants, so let's look at the differences.
INSERT: VALUES and UNNEST
The most common method to ingest data in PostgreSQL is the standard INSERT statement using the VALUES clause. Everyone recognizes it, and every language and ORM (object-relational mapper) can make use of it. While you can insert a single row, we are interested in batch ingest here, passing multiple values using the following syntax (this example is a batch of three for a table with seven columns).
INSERT INTO sensors
VALUES
($1, $2, $3, $4, $5, $6, $7),
($8, $9, $10, $11, $12, $13, $14),
($15, $16, $17, $18, $19, $20, $21);
The other method is INSERT...UNNEST. Here, instead of passing a value per attribute (so batch size * columns total values), we pass an array of values per column.
INSERT INTO sensors
SELECT * FROM unnest(
$1::int[],
$2::timestamp[],
$3::float8[],
$4::float8[],
$5::float8[],
$6::float8[],
$7::float8[]
);
If you’re after a discussion on the difference between the two, then check out Boosting Postgres INSERT Performance by 2x With UNNEST.
Each of these queries can actually be sent to the database in a few ways:
- You could construct the query string manually with a literal value in place of the $ placeholders. I haven’t benchmarked this because it’s bad practice and can open you up to SQL injection attacks (never forget about Little Bobby Tables).
- You could use your framework to send a parameterized query (which looks like the ones above with $ placeholders), which sends the query body and the values to Postgres as separate items. This protects against SQL injection and speeds up query parsing.
- You could use a prepared statement (which would also be parameterized) to let the database know about your query ahead of time, then just send the values each time you want to run it. This provides the benefits of parameterization and also speeds up your queries by reducing the planning time.
Most frameworks implement prepared statements using the binary protocol directly, but you can use the PREPARE and EXECUTE SQL commands to do the same thing from SQL.
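As a minimal sketch of that last option in plain SQL, using the sensors table from above (the statement name sensor_batch is just an illustrative choice, and most drivers will do this for you behind the scenes):
PREPARE sensor_batch (int[], timestamp[], float8[], float8[], float8[], float8[], float8[]) AS
INSERT INTO sensors
SELECT * FROM unnest($1, $2, $3, $4, $5, $6, $7);

-- each execution only sends the seven arrays; parsing (and eventually planning) is reused
EXECUTE sensor_batch(
'{1,2}',
'{"2024-01-01 00:00:00","2024-01-01 00:00:01"}',
'{5.0,5.1}',
'{230.1,229.9}',
'{12.2,12.3}',
'{50.0,50.0}',
'{35.5,35.6}'
);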
Keep in mind that PostgreSQL has a limit of 32,767 parameterized variables in a query. So if you had seven columns, then your maximum batch size for INSERT…VALUES would be 4,681. When you’re using INSERT…UNNEST, you’re only sending one parameter per column. Because PostgreSQL can support at most 1,600 columns, you'll never hit the limit.
Note that prepared statements only provide this planning benefit when plan_cache_mode is not set to force_custom_plan. That setting is designed for queries that benefit from being re-planned for each execution, which isn’t the case for batch inserts. By default, plan_cache_mode is set to auto, meaning PostgreSQL will use custom plans for the first five executions before switching to a generic plan. To optimize performance, you could consider changing your session to force_generic_plan, ensuring the query is planned just once and reused for all subsequent executions.
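If you want to experiment with this, it’s a single session-level setting (a hedged sketch; measure the effect on your own workload before forcing it):
-- ask PostgreSQL to build one generic plan and reuse it for every execution
SET plan_cache_mode = force_generic_plan;

-- confirm the session setting
SHOW plan_cache_mode;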
COPY: Text and Binary
COPY is a PostgreSQL-specific extension to the SQL standard for bulk ingestion (strictly speaking, we are talking about COPY FROM here because you can also COPY TO, which moves data from a table to a file). COPY shortcuts the process of writing multiple records in a number of ways, with two of the most critical ones being:
- WAL MULTI-INSERT: PostgreSQL optimizes COPY write-ahead operations by writing MULTI_INSERT records to the WAL (write-ahead log) instead of logging each row individually. This results in less data being written to the WAL files, which means less I/O (input/output) on your database.
- COPY ring buffer: To avoid polluting shared buffers, COPY uses a dedicated ring buffer for its I/O operations. This minimizes the impact on the buffer cache used for regular queries, preserving performance for other database operations. So, less about raw speed and more about not being a noisy neighbor (you can check the buffer cache yourself; see the sketch just after this list).
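To see the ring buffer behavior for yourself, you can count the shared buffers a table occupies before and after an ingest run. This is a hedged sketch that assumes the pg_buffercache extension is installed; it isn’t part of the benchmark itself.
-- requires: CREATE EXTENSION pg_buffercache;
-- shared buffers currently holding pages of the target table
SELECT count(*) AS buffers_used
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('sensors');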
COPY can read data from multiple sources: local files, local commands, or standard input. For batch ingestion, standard input makes the most sense, as the data can be sent directly from the client without an intermediate step. I was actually surprised by the number of people who reached out on Reddit following my last post saying they couldn’t use COPY because they would need to write out their stream of data as a file; that’s 100 percent what the STDIN setting is for!
COPY can use two different protocols, text and binary.
- Text supports formats like CSV and involves sending text strings over the wire to be parsed by the server before ingestion. You can actually just dump raw CSV records into COPY.
- Binary supports writing data in the native PostgreSQL format from the client, removing the need for parsing on the server side. It’s much faster but also much less flexible, with limited support in many languages. To do this, you need to type your data in your client so you know the format to write it in.
The two variants of COPY we'll be testing are the text version using:
COPY sensors FROM STDIN;
And the binary version using:
COPY sensors FROM STDIN WITH (FORMAT BINARY);
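For instance, in psql the text variant reads rows straight from standard input until a terminating \. line, so there’s no intermediate file involved (a hedged sketch against the sensors table from earlier):
COPY sensors FROM STDIN WITH (FORMAT csv);
1,"2024-01-01 00:00:00",5.0,230.1,12.2,50.0,35.5
2,"2024-01-01 00:00:01",5.1,229.9,12.3,50.0,35.6
\.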
COPY isn't a normal SQL statement, so it can’t exist within a larger query. It also can’t perform an upsert (like INSERT … ON CONFLICT), although from PostgreSQL 17, a text COPY can now get close to INSERT … ON CONFLICT DO NOTHING by ignoring errors with ON_ERROR IGNORE (strictly speaking, this skips rows that fail data type conversion rather than rows that violate constraints).
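A hedged sketch of that option (assuming PostgreSQL 17 or later; it only applies to the text formats, not binary):
-- PostgreSQL 17+: skip rows whose values can't be converted to the column types
COPY sensors FROM STDIN WITH (FORMAT csv, ON_ERROR ignore);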
The Setup
I created my own Rust CLI tool to run this benchmark. That might seem like overkill, but I did it for the following reasons:
- I needed something that supported COPY FROM STDIN and COPY WITH (FORMAT BINARY) directly, ruling out Grafana K6 and the PostgreSQL native pgbench.
- I needed something that would let me run parameterized and prepared queries using the binary protocol directly, not using PREPARE and EXECUTE, because this is how most frameworks operate.
- I wanted to measure the timing from an application's viewpoint, including data wrangling, network round-trip latency, and database operations.
- I start measuring time after I've read the CSV file from disk and loaded it into a Rust data structure. This is to avoid measuring the I/O limits of the benchmark client. Batch ingest normally takes place in a stream without data being read from files.
- I love Rust (if you’re after more Rust x PostgreSQL content, check out my talk at PGConf.EU 2024)!
The tool tests inserting data from a CSV file (the default file is one million records) into the database, using multiple batch sizes and ingest methods, targeting a table with five metric columns (the actual schema isn’t important; I just love the fact that a lot of power and renewables companies use Timescale ♺):
CREATE TABLE IF NOT EXISTS power_generation (
generator_id INTEGER,
timestamp TIMESTAMP WITH TIME ZONE,
power_output_kw DOUBLE PRECISION,
voltage DOUBLE PRECISION,
current DOUBLE PRECISION,
frequency DOUBLE PRECISION,
temperature DOUBLE PRECISION
);
It supports a combination of the following methods (or you can use the --all shortcut) for insertion over multiple batch sizes per run:
- Batch insert (parameterized)
- Prepared batch insert
- UNNEST insert (parameterized)
- Prepared UNNEST insert
- COPY
- Binary COPY
The tool supports a few options, the most important being the comma-separated list of batch sizes.
Usage: pgingester [OPTIONS] [METHODS]...
Arguments:
[METHODS]... [possible values: insert-values, prepared-insert-values, insert-unnest, prepared-insert-unnest, copy, binary-copy]
Options:
-b, --batch-sizes <BATCH_SIZES> [default: 1000]
-t, --transactions
-c, --csv-output
-a, --all
-c, --connection-string <CONNECTION_STRING> [env: CONNECTION_STRING=]
-f, --input-file <INPUT_FILE> [default: ingest.csv]
-h, --help Print help
-V, --version Print version
I tested against a Timescale service with 8 CPUs/32 GB of memory (although the tool only uses a single connection, so this is overkill).
The Results
Running the CLI tool with the following arguments will output a full list of ingest performance results, including the relative speed of each tested method.
pgingester --all --batch-sizes 1000,5000,10000,100000,1000000
I ran all queries with multiple batch sizes against the default CSV input (one million lines). The Insert VALUES and Prepared Insert VALUES queries only run for the 1,000 batch size, as above that there are too many parameters to bind (the warnings to standard error have been removed below).
We can make a number of interesting conclusions from this data:
- With a larger batch size (anything other than 1,000), binary COPY is substantially faster (at least 3.6x) than anything else (19x faster than a naive parameterized INSERT...VALUES). This is because it doesn’t have to do any data parsing on the server side. The more data you load in a single batch, the more pronounced the difference will become.
- Text COPY also performs well, but surprisingly it’s surpassed in speed by prepared statements for batches of 10,000 or less.
- Both COPY variants perform poorly with batches of 1,000. Interestingly, I've seen a lot of batch ingest tools actually use this size.
- When you’re using INSERT for batch ingest, prepared statements always outperform parameterized ones. If you want maximum speed, the same number of parameters regardless of the batch size, and to avoid hitting the parameter limit on larger batches, then use INSERT…UNNEST.
- INSERT...UNNEST at a batch size of 100,000 does a lot better against any of the text COPY variants than I thought it would: there’s actually only 3 ms in it 👀!
So, Which Should You Use?
If you're looking to optimize batch ingestion in PostgreSQL, the right method depends on your specific use case, batch size, and application requirements. Here’s how the options stack up:
- Small Batch Sizes (<= 10,000 rows): Prepared INSERT...UNNEST can be surprisingly competitive. Down at a batch size of 1,000, COPY is actually much slower.
- Large Batch Sizes (> 10,000 rows): For maximum throughput with larger batches, binary COPY is unbeatable. Its ability to bypass server-side parsing and its use of a dedicated ring buffer make it the top choice for high-velocity data pipelines. If you need speed, you can have larger batches, and your application can support the binary protocol, this should be your default.
- Ease of Implementation: If you prioritize ease of implementation or need compatibility across a wide range of tools, text COPY is a great middle ground. It doesn't require complex client-side libraries and is supported in nearly every language that interacts with PostgreSQL. You can also just throw your CSV data at it.
- Considerations Beyond Speed:
  - Upserts: If you need conflict handling (INSERT...ON CONFLICT), COPY isn't an option, and you'll need to stick with INSERT (unless you just want to ignore errors and you're happy with text COPY, in which case PostgreSQL 17 has your back with ON_ERROR).
  - Framework support: Ensure your preferred framework supports your chosen method; COPY usually requires a different API, and binary COPY may require an extension library or not be supported at all.
  - Batch size limits: Watch for the 32,767-parameter limit when using parameterized INSERT...VALUES.
  - Memory and disk write overheads: COPY is designed to have the least impact on your system, writing less data to disk and not polluting shared_buffers. This is actually a big consideration! In fact, both COPY methods write 62 MB of WAL for the one-million-row test, while INSERT writes 109 MB. This ~1.7x rule seems to hold across any ingest size (there's a sketch for measuring WAL volume yourself just after this list).
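If you'd like to measure the WAL overhead of your own ingest path, here's a minimal psql sketch (the before_lsn variable name is just an illustrative choice, and it counts all WAL generated on the server, not only your session):
-- capture the current WAL position; \gset stores it in the psql variable :before_lsn
SELECT pg_current_wal_lsn() AS before_lsn \gset

-- ... run your batch ingest here ...

-- WAL generated since the captured position
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'before_lsn')) AS wal_written;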
Final Thoughts for Developers
When it comes to PostgreSQL batch ingestion, there is no one-size-fits-all solution. Each method offers trade-offs between performance, complexity, and flexibility:
- For maximum raw speed over larger batches, binary COPY is your best bet.
- For flexibility and ease of use over larger batches, text COPY balances speed with broad support.
- For smaller batches or compatibility-focused workflows, prepared INSERT...UNNEST statements can hold their own, offering competitive speeds with maximum flexibility (but remember, if you have a heavy ingest pipeline, you risk disrupting shared_buffers, and you will be writing more to WAL).
Remember, the “best” method isn’t just about ingest speed; it’s about what fits your workflow and scales with your application. Happy ingesting!