Mar 21, 2024 · Posted by Carlota Soto
When we ask our community about the biggest challenges they face with their PostgreSQL production databases, we often hear about three pain points: query speed, optimizing large tables, and managing database backups. We’ve covered the first two topics in articles about partitioning and fine-tuning your database. We’ve also discussed how to reduce your database size to better manage large tables.
In this guide, we’ll answer some of the most frequently asked questions about database backup and recovery in PostgreSQL. We’ll also discuss how we handle things in the Timescale platform.
When we discuss backup and recovery, we’re referring to a set of processes and protocols established to safeguard your data from loss or corruption and restore it to a usable state.
Effective backup management is not only about creating copies of data. It’s also about ensuring those copies are healthy, accurate, and up-to-date.
To define a good backup strategy for your production PostgreSQL database, you need to consider several aspects. This includes how frequently you will back up your database, where these backups will be stored, and how often you will audit them.
But your job isn’t finished once you get up-to-date and healthy database backups. You must also establish an effective disaster recovery protocol. No matter how careful you are, it’s a fact of database management that failures will happen sooner or later. They can be caused by outages, failed upgrades, corrupted hardware, or human error—you name it.
Your disaster recovery plan must encompass all the steps to restore data as quickly as possible after an incident. This ensures that your database is not just backed up but also recoverable in a timely and efficient manner.
In PostgreSQL, there are two main types of database backups: physical backups and logical backups.
Logical backups can be highly granular, allowing for the backup of specific database objects like tables, schemas, or databases. They are also portable and can be used across different database systems or versions, making them popular for migrating small to medium databases. This is your common pg_dump/pg_restore.
But a main drawback of logical backups is speed. For large databases, the process of restoring from a logical backup is too slow to be useful as a sole disaster recovery mechanism (or migration mechanism, for that matter). Restoring from physical backups is faster than restoring from logical backups, and it’s exact. When putting together a disaster recovery strategy, you’ll be dealing with physical backups.
Let’s explore some essential concepts around physical backups and how they can help you recover your database in case of failure.
Physical backups are also referred to as file system backups in PostgreSQL: the process of directly copying the directories and files that PostgreSQL uses to store its data, resulting in a complete representation of the database at a specific moment in time.
Maintaining file system backups is an essential piece of every disaster recovery strategy and imperative in production databases. But putting together a solid disaster recovery plan requires other techniques beyond simply taking “physical” file system backups regularly. That’s especially true if you’re dealing with large production databases.
Taking physical backups of very large databases can be a slow and resource-intensive process that competes with other high-priority database tasks, affecting your overall performance. Physical backups alone are also not enough to ensure consistency in case of failure, as they only reflect the database state at the time they were taken. To restore a database after a failure, you’ll need another mechanism to replay all the transactions that occurred between the moment the last backup was taken and the failure.
WAL stands for Write-Ahead Logging. It’s a protocol that improves the reliability, consistency, and durability of a PostgreSQL database by logging changes before they are written to the actual database files.
WAL is key for assuring atomicity and durability in PostgreSQL transactions. By writing changes to a log before they're committed to the database, WAL ensures that either all the changes related to a transaction are made or none at all.
WAL is also essential for disaster recovery since, in the event of a failure, the WAL files can be replayed to bring the database back to a consistent state. The process of regularly saving and storing these WAL records in a secondary storage location, ensuring that they are preserved over the long term, is usually referred to as continuous archiving.
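To make this concrete, here’s a minimal sketch of how continuous archiving is typically enabled in vanilla PostgreSQL. The archive directory is a placeholder; in production, WAL is usually shipped to durable remote storage:

```bash
# Continuous-archiving sketch. These settings live in postgresql.conf; shown here
# via ALTER SYSTEM for convenience. /mnt/wal_archive is a placeholder path.
psql -c "ALTER SYSTEM SET wal_level = 'replica'"
psql -c "ALTER SYSTEM SET archive_mode = 'on'"
psql -c "ALTER SYSTEM SET archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'"
# wal_level and archive_mode only take effect after a restart.
pg_ctl restart -D /var/lib/postgresql/data
```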
Keeping WAL records alongside a recent, healthy physical database backup ensures that your PostgreSQL database can be successfully restored in case of failure. The physical backup brings PostgreSQL back to the state it was in when the backup was taken (which hopefully was not so long ago), and the WAL files are then rolled forward to the moment right before things started failing.
You might be wondering why it’s necessary to keep up-to-date backups if WAL can be replayed. The answer is speed. Replaying WAL during a recovery process is time-consuming, especially when dealing with large datasets with complex transactions. Backups provide a snapshot of the database at a specific point in time, enabling quick restoration up to that point.
In the optimal recovery scenario, you restore a recent backup (e.g., from the previous day). Then, you replay the WAL recorded post-backup to bring the database to its most recent state. You don’t want to rely on WAL to reproduce two weeks’ worth of transactions.
Point-in-time recovery (PITR) refers to restoring a PostgreSQL database to a specific moment in time at the user’s request. For example, if I perform an upgrade and, for whatever reason, decide to revert the change, I could restore the database to its state on any earlier day.
Behind the scenes, PITR in PostgreSQL is often anchored in WAL. By integrating a backup with the sequential replay of WAL, PostgreSQL can be restored to an exact moment.
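As a rough sketch of what PITR looks like in practice on PostgreSQL 12 and newer (the paths and the target timestamp below are placeholders):

```bash
# 1. Restore the most recent base backup into a fresh data directory.
tar -xf /backups/base.tar -C /var/lib/postgresql/data

# 2. Tell PostgreSQL how to fetch archived WAL and where to stop replaying.
cat >> /var/lib/postgresql/data/postgresql.conf <<'EOF'
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2024-03-20 14:30:00 UTC'
EOF

# 3. An empty recovery.signal file puts the server into targeted recovery mode.
touch /var/lib/postgresql/data/recovery.signal
pg_ctl start -D /var/lib/postgresql/data
```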
There are multiple tools that help with the creation of physical backups, two of the most popular being pg_basebackup and pgBackRest.
pg_basebackup is the native tool offered by PostgreSQL for taking physical backups. It’s straightforward and reliable, allowing you to efficiently copy the data directory and include the WAL files needed to ensure a consistent and complete backup.
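A minimal invocation looks something like this; the host, user, and backup directory are placeholders:

```bash
# -Ft -z : write a gzip-compressed tar archive
# -Xs    : stream the WAL generated while the backup runs, so it's self-consistent
# -P     : report progress
pg_basebackup -h primary.example.com -U replicator \
  -D "/backups/base_$(date +%F)" -Ft -z -Xs -P
```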
pg_basebackup has important limitations, however. Taking full backups of a large database can be a lengthy and resource-intensive process. A good way to mitigate this is to combine full backups with incremental backups: frequently copying only the data that has changed since the last backup (e.g., once a day) while creating full backups less often (e.g., once a week). However, incremental backups are not supported in pg_basebackup.
pg_basebackup also has limited parallelization capabilities, which can further slow down the creation of full backups. The process is mostly manual, requiring developers to closely monitor and manage the backup operations.
To address the constraints of pg_basebackup, the PostgreSQL community built tools like pgBackRest. pgBackRest introduces several important improvements, including parallel backup and restore, support for full, differential, and incremental backups, built-in compression, and backup integrity verification.
We use pgBackRest to manage our own backup and restore process in Timescale, although we’ve implemented some hacks to speed up the full backup process (pgBackRest can still be quite slow for creating full backups of large databases).
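For reference, a typical self-managed pgBackRest setup defines a stanza in its configuration file and then schedules full and incremental backups, e.g., via cron. This is a sketch; the stanza name and paths are placeholders:

```bash
# /etc/pgbackrest/pgbackrest.conf (sketch):
#   [global]
#   repo1-path=/var/lib/pgbackrest
#   process-max=4                  # parallelize backup/restore across 4 processes
#   [main]
#   pg1-path=/var/lib/postgresql/data

# One-time initialization of the backup repository for the stanza.
pgbackrest --stanza=main stanza-create

# Weekly full backup and daily incremental backups (e.g., scheduled via cron).
pgbackrest --stanza=main --type=full backup
pgbackrest --stanza=main --type=incr backup
```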
Logical backups involve exporting data into a human-readable format, such as SQL statements. This type of backup is generally more flexible and portable, making it handy to reproduce a database in another architecture (i.e., for migrations). However, recovering from a logical backup is quite a slow process. That makes them practical only for migrating small to medium PostgreSQL production databases.
The most common way to create logical backups and restore from them is by using pg_dump/pg_restore:
- pg_dump creates logical backups of a PostgreSQL database. It generates a script file (or other formats) containing the SQL statements needed to reconstruct the database to the state it was in at backup time. You can use pg_dump to back up an entire database or individual tables, schemas, or other database objects.
- pg_restore restores databases from backups created by pg_dump. Just as pg_dump offers granularity in creating backups, pg_restore allows for selective restoration of specific database objects, providing flexibility in the recovery process. While it is typically used with backups created by pg_dump, pg_restore is compatible with other SQL-compliant database systems, enhancing its utility as a migration tool.

Logical backups via pg_dump/pg_restore are mostly useful for creating testing databases or for database migrations. In terms of migrations, if you’re operating a production database, we only recommend going the pg_dump/pg_restore route if your database is small (<100 GB).
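Here’s a minimal sketch of the round trip; the database names, table name, and paths are placeholders:

```bash
# Custom format (-Fc) is compressed and lets pg_restore restore objects selectively.
pg_dump -Fc -d mydb -f /backups/mydb.dump

# Restore into a fresh database; -j 4 runs four parallel restore jobs.
createdb mydb_restored
pg_restore -d mydb_restored -j 4 /backups/mydb.dump

# Selective restore: pull just one table out of the same dump.
pg_restore -d mydb_restored -t metrics /backups/mydb.dump
```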
Migrating larger and more complex databases via pg_dump/pg_restore might take your production database offline for too long. Other migration strategies, like the dual-write and backfill method, can avoid this downtime.
Physical backups are mostly used for disaster recovery and data archiving. If you’re operating a production database, you’ll want to maintain up-to-date physical backups and WAL to recover your database when failure occurs. If your industry requires you to keep copies of your data for a certain period of time due to regulations, physical backups will be the way to go.
In production applications, you’ll most likely use a combination of logical and physical backups. For disaster recovery, physical backups will be your foundational line of defense, but logical backups can serve as additional assurance (redundancy is a good thing). For migrating large databases, you’ll most likely use a staged approach, combining logical backups with other tactics, and so on.
Replicas are continuously updated mirrors of the primary database, capturing every transaction and modification almost instantaneously. They’re not the same as backups, but their usefulness in disaster recovery is indisputable. In the event of a failure, you can promote a replica to serve as the primary database, ensuring minimal downtime while you restore the damaged database. Building a high-availability replica and failover mechanism generally involves taking a base backup of the primary, configuring a standby server to stream WAL from it, and setting up a mechanism to promote the standby when the primary fails, often alongside a backup tool like pgBackRest (see the sketch below). This setup can be considerably complex to maintain. Most providers of managed PostgreSQL databases, including Timescale, offer fully managed replicas as one of their services. This simplifies running highly available databases.
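As a rough illustration of that standby workflow using PostgreSQL’s built-in streaming replication (the hostname, user, and paths are placeholders):

```bash
# -R writes primary_conninfo and creates standby.signal, so this copy starts
# up as a streaming replica of the primary.
pg_basebackup -h primary.example.com -U replicator \
  -D /var/lib/postgresql/standby -R -Xs -P

# If the primary fails, promote the standby to read-write.
pg_ctl promote -D /var/lib/postgresql/standby
```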
The Timescale platform allows our customers to create fully managed PostgreSQL and TimescaleDB databases. That means we take care of the backup and disaster recovery process for them. Let’s run through how the platform handles backups, replication, upgrades, and restores.
Backups in Timescale are fully automated. Using pgBackRest under the hood, Timescale automatically creates one full backup every week and incremental backups every day.
Timescale also keeps WAL files of any changes made to the database. This WAL can be replayed in the event of a failure to reproduce any transactions not captured by the last daily backup. For example, it can replay the changes made to your database during the last few hours. Timescale stores the two most recent full backups and WAL in S3 volumes.
On top of the full and incremental backups taken by pgBackRest, Timescale also takes EBS snapshots daily. EBS snapshots create copies of the storage volume that can be restored, effectively making them backups. They are significantly faster than taking full backups via pgBackRest (about 100x faster).
By taking EBS snapshots daily (on top of the weekly full backups by pgBackRest), we introduce an extra layer of redundancy, ensuring that we always have a fresh snapshot we can quickly restore if the customer experiences a critical failure that requires recovery from a full backup.
Timescale is built on AWS with decoupled compute and storage, something that makes the platform especially resilient against failures. There are two classes of failures that Timescale handles distinctly: compute and storage failures.
Compute failures are more frequent than storage failures, as they can be caused by things like unoptimized queries or other issues that result in a maxed-out CPU. To improve uptime for the customer, Timescale has developed a methodology that makes the platform recover extremely quickly from compute failures. We call this technique rapid recovery.
Timescale decouples the compute and storage nodes, so if the compute node fails, Timescale automatically spins up a new compute node and attaches the undamaged storage volume to it. Any WAL that was still in memory is then replayed.
The length of this recovery process depends on how much WAL needs replaying. Typically, it completes in less than thirty seconds. Under the hood, this entire process is automated via Kubernetes.
Storage failures are much less common than compute failures, but when they happen, they’re more severe. Having a high-availability replica can be a lifesaver in this circumstance; while your storage is being restored, instead of experiencing downtime, your replica will automatically take over.
To automatically restore your damaged storage, Timescale makes use of the backups it has in storage, replaying the WAL recorded since the last incremental backup.
In Timescale, you can create two types of replicas: high-availability replicas, which are up-to-date copies of your database designed to take over automatically if the primary fails, and read replicas, which are designed to scale out read-only queries.
We’ve been talking about the importance of backups and disaster recovery, but there’s a related concept that’s also important to consider: high availability. In broad terms, a “highly available” database is one that stays running without significant interruption (perhaps no more than a few seconds), even in case of failure.
The process of recovering a large database from backup might take a while, even when you’ve done everything right. That’s why it’s handy to have a replica running. Instead of waiting for the backup and restore process to finish, when your primary database fails, your connection will automatically fail over to the replica. That saves your users any major downtime.
Failover also helps remove downtime for common operations that would normally cause a service to reset, like upgrades. In these cases, Timescale makes changes to each node sequentially so that there is always a node available. And speaking of upgrades…
In Timescale, you’re running PostgreSQL databases with the TimescaleDB extension enabled. Therefore, during your Timescale experience, you’ll most likely experience three different types of upgrades:
The first are TimescaleDB upgrades: upgrades between TimescaleDB versions, e.g., from TimescaleDB 2.11 to TimescaleDB 2.12. You don’t have to worry about these. They’re backward compatible, they require no downtime, and they happen automatically during your maintenance window. Your Timescale services always run the latest available TimescaleDB version, so you can enjoy all the new features we ship.
The second are minor PostgreSQL upgrades. We always run the latest available minor version of PostgreSQL in Timescale as well, mostly for security reasons: minor updates contain security patches and fixes for data corruption problems and frequent bugs.
Timescale automatically handles these upgrades during your maintenance window, and they are also backward compatible. However, they require a service restart, which could cause some downtime (30 seconds to a few minutes) if you don’t have a replica. We alert you ahead of time about these, so you can set your maintenance window to a low-traffic time (e.g., the middle of the night) to minimize consequences.
The third are major PostgreSQL upgrades: upgrading, for example, from PostgreSQL 15 to 16. These upgrades are different and more serious since they’re often not backward compatible.
We can’t run these upgrades for you, as they might cause issues in your application. Besides, the downtime associated with upgrading major versions of PostgreSQL can be more substantial (e.g., 20 minutes). Unfortunately, in this particular case, high-availability replicas can’t help you avoid downtime.
Major PostgreSQL upgrades are always a significant lift, but Timescale has some tools to make the transition smoother. For example, you can initiate the upgrade process in a particular database with the click of a button. Before doing so, you can test your upgrade on a copy of your database to make sure nothing will break and to get an accurate idea of how much downtime the upgrade will require. Read this article for more information.
Yes, you can! All Timescale services allow PITR to any point in the last three days. If you're using our Enterprise plan, this timespan expands up to 14 days.
Having a solid backup and recovery strategy is top of mind for every PostgreSQL user. We hope this introductory article answers some of your questions; if you’d like to see more articles diving deeper into this topic, tell us on Twitter/X.
If you prefer not to worry about maintaining your backups and taking care of recovering your database when things fail, try Timescale, our managed PostgreSQL platform. It takes care of all things backups so you can focus on what matters (building and running your application) while experiencing the performance boost of TimescaleDB. You can start a free trial here (no credit card required).