Nov 28, 2024 · Posted by James Blackwood-Sewell

Hi, I'm James. I work in Developer Advocacy at Timescale. Recently, I’ve been diving deep into performance topics and thinking about the kind of content I enjoy reading and writing. This is a new series that will focus on using data visualizations to spotlight single concepts in Timescale or PostgreSQL performance. No marketing fluff, no unrealistic benchmarks, just straightforward explorations of performance topics I find interesting, with some notes on why I think they matter.
In a previous article in this series, I explored the magic of INSERT...UNNEST for improving PostgreSQL batch INSERT performance. While it’s a fantastic technique, I know it’s not the fastest option available (although it is very flexible). Originally, I hadn't intended to loop back and benchmark all the batch ingest methods, but I saw a lot of confusion out there, so I'm back, and this time I'm looking at COPY too. As usual for this series, it’s not going to be a long post, but it is going to be an informative one.
I flipped my approach for this post, comparing not just the PostgreSQL database performance in isolation but the practical performance from an application. To do this, I built a custom benchmarking tool in Rust to measure the end-to-end performance of each method. In this article, I’ll walk you through the batch ingest options you’ve got, and how they stack up (spoiler alert, the spread is over 19x!).
I’m defining batch ingest as writing a dataset to PostgreSQL in batches or chunks. You’d usually do this because the data is being collected in (near) real time (think a flow of IoT data from sensors) before being persisted into PostgreSQL (hopefully with TimescaleDB, although that's out of scope for this post).
Writing a single record at a time is incredibly inefficient, so writing batches makes sense (the size probably depends on how long you can delay writing). Just to be clear, this isn't about loading a very large dataset in one go; I’d call that bulk ingest, not batch ingest (and you'd usually do that from a file).
Broadly speaking, there are two methods for ingesting multiple values at once in PostgreSQL: INSERT and COPY. Each of these methods has a few variants, so let's look at the differences.
The most common method to ingest data in PostgreSQL is the standard INSERT statement using the VALUES clause. Everyone recognizes it, and every language and ORM (object-relational mapper) can make use of it. While you can insert a single row, we are interested in batch ingest here, passing multiple values using the following syntax (this example is a batch of three for a table with seven columns).
INSERT INTO sensors
VALUES
($1, $2, $3, $4, $5, $6, $7),
($8, $9, $10, $11, $12, $13, $14),
($15, $16, $17, $18, $19, $20, $21);
The other method is INSERT...UNNEST. Here, instead of passing a value per attribute (so batch size * columns total values), we pass an array of values per column.
INSERT INTO sensors
SELECT * FROM unnest(
$1::int[],
$2::timestamp[],
$3::float8[],
$4::float8[],
$5::float8[],
$6::float8[],
$7::float8[]
);
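To make the array-per-column shape concrete, here's the same statement written with inline array literals for a batch of three (the values are just invented samples):

-- One array per column; unnest() zips them back into three rows
INSERT INTO sensors
SELECT * FROM unnest(
    ARRAY[1, 2, 3]::int[],
    ARRAY['2024-11-28 10:00:00', '2024-11-28 10:00:05', '2024-11-28 10:00:10']::timestamp[],
    ARRAY[1250.5, 980.0, 1100.2]::float8[],
    ARRAY[230.1, 229.8, 230.4]::float8[],
    ARRAY[5.4, 4.9, 5.1]::float8[],
    ARRAY[50.0, 50.0, 49.9]::float8[],
    ARRAY[41.3, 39.7, 40.8]::float8[]
);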
If you’re after a discussion on the difference between the two, then check out Boosting Postgres INSERT Performance by 2x With UNNEST.
Each of these queries can actually be sent to the database in a few ways:

- As raw SQL, with the values written directly into the query text instead of $ placeholders. I haven’t benchmarked this because it’s bad practice and can open you up to SQL injection attacks (never forget about Little Bobby Tables).
- As a parameterized query (with $ placeholders), which sends the query body and the values to Postgres as separate items. This protects against SQL injection and speeds up query parsing.
- As a prepared statement. Most frameworks implement prepared statements using the binary protocol directly, but you can use the PREPARE and EXECUTE SQL commands to do the same thing from SQL (see the sketch just after this list).
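Here's a minimal sketch of that last option in plain SQL, reusing the seven-column sensors table from the examples above (the values are invented):

-- Parse and plan the statement once per session
PREPARE sensor_insert (int, timestamp, float8, float8, float8, float8, float8) AS
    INSERT INTO sensors VALUES ($1, $2, $3, $4, $5, $6, $7);

-- Execute it as many times as needed, sending only the values
EXECUTE sensor_insert (1, '2024-11-28 10:00:00', 1250.5, 230.1, 5.4, 50.0, 41.3);
EXECUTE sensor_insert (2, '2024-11-28 10:00:05', 980.0, 229.8, 4.9, 50.0, 39.7);

-- Prepared statements live for the session; drop it when you're done
DEALLOCATE sensor_insert;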
Keep in mind that PostgreSQL has a limit of 32,767 parameterized variables in a query. So if you had seven columns, then your maximum batch size for INSERT…VALUES would be 4,681. When you’re using INSERT…UNNEST, you’re only sending one parameter per column. Because PostgreSQL can support at most 1,600 columns, you'll never hit the limit.
If you're using prepared statements, also check that plan_cache_mode is not set to force_custom_plan. This setting is designed for queries that benefit from being re-planned for each execution, which isn’t the case for batch inserts. By default, plan_cache_mode is set to auto, meaning PostgreSQL will use custom plans for the first five executions before switching to a generic plan. To optimize performance, you could consider changing your session to force_generic_plan, ensuring the query is planned just once and reused for all subsequent executions.
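If you want to try that, a minimal sketch for an ingest session looks like this:

-- Check what the session is currently using (auto is the default)
SHOW plan_cache_mode;

-- Plan the prepared insert once and reuse the generic plan for every execution
SET plan_cache_mode = force_generic_plan;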
COPY is a PostgreSQL-specific extension to the SQL standard for bulk ingestion (strictly speaking, we are talking about COPY FROM here, because you can also COPY TO, which moves data from a table to a file). COPY shortcuts the process of writing multiple records in a number of ways, with two of the most critical ones being:
- COPY reduces write-ahead overhead by writing MULTI_INSERT records to the WAL (write-ahead log) instead of logging each row individually. This results in less data being written to the WAL files, which means less I/O (input/output) on your database.
- COPY uses a dedicated ring buffer for its I/O operations. This minimizes the impact on the buffer cache used for regular queries, preserving performance for other database operations. So, less about raw speed and more about not being a noisy neighbor.

COPY can read data from multiple sources: local files, local commands, or standard input. For batch ingestion, standard input makes the most sense, as the data can be sent directly from the client without an intermediate step. I was actually surprised by the number of people who reached out on Reddit following my last post saying they couldn’t use COPY because they would need to write out their stream of data as a file. That’s 100 percent what the STDIN setting is for!
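To make the "no intermediate file" point concrete, here's a rough sketch of what a text-mode ingest looks like from a psql session, reusing the sensors table from earlier (the rows are invented):

COPY sensors FROM STDIN WITH (FORMAT csv);
-- The client now streams rows over the open connection (nothing touches disk), e.g.:
-- 1,2024-11-28 10:00:00,1250.5,230.1,5.4,50.0,41.3
-- 2,2024-11-28 10:00:05,980.0,229.8,4.9,50.0,39.7
-- In psql you finish the stream with \. on a line of its own.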
COPY can use two different protocols, text and binary. The binary protocol avoids text parsing on the server, but it isn't supported by every client library, so many applications stick with text COPY.

The two variants of COPY we'll be testing are the text version using:
COPY sensors FROM STDIN;
And the binary version using:
COPY sensors FROM STDIN WITH (FORMAT BINARY);
COPY isn't a normal SQL statement, so it can’t exist within a larger query. It also can’t perform an upsert (like INSERT … ON CONFLICT), although from PostgreSQL 17, a text COPY can simulate INSERT … ON CONFLICT DO NOTHING by ignoring errors with ON_ERROR IGNORE.
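If you're on PostgreSQL 17 and happy to simply drop the rows that fail, a minimal sketch of that option looks like this (again using the sensors table; the option isn't available for binary COPY):

-- PostgreSQL 17+: ignore rows that fail to convert instead of aborting the whole COPY
COPY sensors FROM STDIN WITH (FORMAT csv, ON_ERROR ignore);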
I created my own Rust CLI tool to run this benchmark. That might seem like overkill, but I did it for the following reasons:
- I needed a tool that could use COPY FROM STDIN and COPY WITH (FORMAT BINARY) directly, ruling out Grafana K6 and the PostgreSQL native pgbench.
- I wanted prepared statements to go through PREPARE and EXECUTE, because this is how most frameworks operate.

The tool tests the insertion of data from a CSV file into the database (the default file is one million records) with multiple batch sizes and ingest methods into a table with five metric columns (the actual schema isn’t important, I just love the fact a lot of power and renewables companies use Timescale ♺):
CREATE TABLE IF NOT EXISTS power_generation (
generator_id INTEGER,
timestamp TIMESTAMP WITH TIME ZONE,
power_output_kw DOUBLE PRECISION,
voltage DOUBLE PRECISION,
current DOUBLE PRECISION,
frequency DOUBLE PRECISION,
temperature DOUBLE PRECISION
);
It supports a combination of the following methods (or you can use the --all shortcut) for insertion over multiple batch sizes per run:

- INSERT...VALUES (parameterized)
- Prepared INSERT...VALUES
- INSERT...UNNEST (parameterized)
- Prepared INSERT...UNNEST
- COPY (text)
- COPY (binary)

The tool supports a few options, the most important being the comma-separated list of batch sizes.
Usage: pgingester [OPTIONS] [METHODS]...
Arguments:
[METHODS]... [possible values: insert-values, prepared-insert-values, insert-unnest, prepared-insert-unnest, copy, binary-copy]
Options:
-b, --batch-sizes <BATCH_SIZES> [default: 1000]
-t, --transactions
-c, --csv-output
-a, --all
-c, --connection-string <CONNECTION_STRING> [env: CONNECTION_STRING=]
-f, --input-file <INPUT_FILE> [default: ingest.csv]
-h, --help Print help
-V, --version Print version
I tested with a connection to a Timescale 8 CPU/32 GB memory server (although it was only using a single connection, so this is overkill).
Running the CLI tool with the following arguments will output a list of ingest performance results, including the relative speed for each tested method.
pgingester --all --batch-sizes 1000,5000,10000,100000,1000000
I ran all queries with multiple batch sizes with the default CSV input (one million lines). The Insert VALUES and Prepared Insert VALUES queries only run for the 1,000 batch size, as above that there are too many parameters to bind (the warnings to standard error have been removed below).
We can make a number of interesting conclusions from this data:
- Binary COPY is substantially faster (at least 3.6x) than anything else (19x faster than a naive parameterized INSERT...VALUES). This is because it doesn’t have to do any data parsing on the server side. The more data you load in a single batch, the more pronounced the difference will become.
- Text COPY also performs well, but surprisingly it’s surpassed in speed by prepared statements for batches of 10,000 or less.
- Both COPY variants perform poorly with batches of 1,000. Interestingly, I've seen a lot of batch ingest tools actually use this size.
- When using INSERT for batch ingest, prepared statements always outperform parameterized ones. If you want maximum speed, the same number of parameters regardless of the batch size, and to avoid the maximum number of parameters being hit on larger batches, then use INSERT…UNNEST.
- INSERT...UNNEST at a batch size of 100,000 does a lot better against any of the text COPY variants than I thought it would: there is actually only 3 ms in it 👀!

If you're looking to optimize batch ingestion in PostgreSQL, the right method depends on your specific use case, batch size, and application requirements. Here’s how the options stack up:
- For smaller batches, INSERT...UNNEST can be surprisingly competitive. Down at a batch size of 1,000, COPY is actually much slower.
- For large batches, binary COPY is unbeatable. Its ability to bypass server-side parsing and its use of a dedicated ring buffer make it the top choice for high-velocity data pipelines. If you need speed, can work with larger batches, and your application can support the binary protocol, this should be your default.
- Text COPY is a great middle-ground. It doesn't require complex client-side libraries and is supported in nearly every language that interacts with PostgreSQL. You can also just throw your CSV data at it.
- If you need upsert behavior (like INSERT...ON CONFLICT), COPY isn't an option, and you'll need to stick with INSERT (unless you just want to ignore errors and you're happy with text COPY, in which case PostgreSQL 17 has your back with ON_ERROR).
- Driver and ORM support varies: COPY usually requires a different API to be used, and binary COPY may require an extension library or not be supported. If that's the case, you may have to fall back to INSERT...VALUES.
- COPY is designed to have the least impact on your system, writing less data to disk and not polluting shared_buffers. This is actually a big consideration! In fact, both of the COPY methods write 62 MB of WAL for the one million row test, while INSERT writes 109 MB. This ~1.7x rule seems to hold across any ingest size (a quick way to check this on your own system is sketched just after this list).
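If you want to check the WAL numbers for your own workload, one rough way (assuming an otherwise idle database, so nothing else skews the result) is to diff the write-ahead log position before and after an ingest run:

-- Before the run: note the current WAL position
SELECT pg_current_wal_lsn();   -- e.g. 0/8A3F2D0 (yours will differ)

-- After the run: how much WAL was generated in between?
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/8A3F2D0'::pg_lsn));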
When it comes to PostgreSQL batch ingestion, there is no one-size-fits-all solution. Each method offers trade-offs between performance, complexity, and flexibility:

- For raw speed on large batches, binary COPY is your best bet.
- Text COPY balances speed with broad support.
- A prepared INSERT...UNNEST statement can hold its own, offering competitive speeds with maximum flexibility (but remember, if you have a heavy ingest pipeline, you risk disrupting shared_buffers, and you will be writing more to WAL).

Remember, the “best” method isn’t just about ingest speed; it’s about what fits your workflow and scales with your application. Happy ingesting!