Replacing StatefulSets With a Custom K8s Operator in Our Postgres Cloud Platform
Hi, we’re Timescale! We build a faster PostgreSQL for demanding workloads like time series, vector, events, and analytics data. Check us out.
Over the last year, the platform team here at Timescale has been working hard on improving the stability, reliability and cost efficiency of our infrastructure. Our entire cloud is run on Kubernetes, and we have spent a lot of engineering time working out how best to orchestrate its various parts. We have written many different Kubernetes operators for this purpose, but until this year, we always used StatefulSets to manage customer database pods and their volumes.
StatefulSets are a native Kubernetes workload resource used to manage stateful applications. Unlike Deployments, StatefulSets provide unique, stable network identities and persistent storage for each pod, ensuring ordered and consistent scaling, rolling updates, and maintaining state across restarts, which is essential for stateful applications like databases or distributed systems.
However, working with StatefulSets was becoming increasingly painful and preventing us from innovating. In this blog post, we’re sharing how we replaced StatefulSets with our own Kubernetes custom resource and operator, which we called PatroniSets, without a single customer noticing the shift. This move has improved our stability considerably, minimized disruptions to the user, and helped us perform maintenance work that would have been impossible previously.
What Is a Kubernetes Operator?
Before delving into the details, it’s worth stepping back and ensuring that we understand what controllers and operators are.
A controller is any piece of software that manages Kubernetes resources according to a desired specification. Kubernetes contains many built-in controllers, for example:
- The Deployment Controller creates/updates/deletes ReplicaSets according to the specification in a Deployment.
- The ReplicaSet Controller creates/updates/deletes Pods according to the specification in a ReplicaSet.
All controllers work on a control-loop principle. They examine the desired state in the manifest, compare that with the actual state of the cluster and then perform actions necessary to move the cluster towards the desired state. We call this the reconciliation loop, and each execution a reconcile.
This is obviously very high-level; we’ll delve into the details a bit further when we discuss exactly how our operator works.
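To make the loop concrete anyway, here is a minimal sketch of a reconcile function in the style of controller-runtime. It is purely illustrative: the Widget resource, its Replicas field, and the newPodFor helper are placeholders rather than anything from our codebase.
import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	widgetv1 "example.com/widget/api/v1" // placeholder API package
)

// WidgetReconciler reconciles a hypothetical "Widget" resource that owns pods.
type WidgetReconciler struct {
	client.Client
}

func (r *WidgetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the desired state: the resource we were asked to reconcile.
	var widget widgetv1.Widget
	if err := r.Get(ctx, req.NamespacedName, &widget); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. Fetch the actual state: the pods that currently exist for it.
	var pods corev1.PodList
	if err := r.List(ctx, &pods, client.InNamespace(req.Namespace),
		client.MatchingLabels{"widget": widget.Name}); err != nil {
		return ctrl.Result{}, err
	}

	// 3. Compare, and take a step towards the desired state.
	if int32(len(pods.Items)) < widget.Spec.Replicas {
		// newPodFor is a placeholder helper that builds a pod from the spec.
		if err := r.Create(ctx, newPodFor(&widget)); err != nil {
			return ctrl.Result{}, err
		}
	}

	// 4. Requeue so the loop keeps converging.
	return ctrl.Result{RequeueAfter: time.Minute}, nil
}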
An Operator is essentially just a controller with some specificities:
- It uses a custom resource for specifying the desired state.
- It implements custom domain-specific logic.
If a Controller is your off-the-shelf piece of kit, an Operator is the artisanal, hand-crafted piece of kit that does just the job you need it to.
The Problems With StatefulSets
For many years, we have been using native StatefulSets and the StatefulSet controller to manage our database pods and volumes. Initially, it made great sense, as they provide a range of benefits: stable, persistent storage along with predictable deployments and updates. As our platform has grown and become increasingly sophisticated, some of those initial advantages have started to wane.
Volume management using StatefulSets initially looks pretty simple. You provide a VolumeClaimTemplate in the specification, and all replicas have a volume created using that template. However, this template is immutable, and our customers' storage demands are anything but.
Inevitably, the amount of storage a database needs will grow over time (and will often shrink again afterward as customers discover the benefits of hypercore, Timescale's hybrid row-columnar storage engine with impressive compression features). So volume management ended up happening elsewhere, in our main operator: resizing existing volumes and provisioning new volumes every time a replica was created.
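For reference, this is roughly where that template lives in the native API; a minimal fragment using the upstream Go types, not our code. Everything in volumeClaimTemplates is fixed once the StatefulSet exists:
import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Every replica of the StatefulSet gets a PVC stamped out from this template,
// and the template itself is immutable after creation.
var spec = appsv1.StatefulSetSpec{
	VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
		ObjectMeta: metav1.ObjectMeta{Name: "data"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			// The requested size is baked into the immutable template, so
			// resizing has to happen on the individual PVCs instead.
			// (Newer k8s.io/api versions use VolumeResourceRequirements here.)
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("100Gi"),
				},
			},
		},
	}},
}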
Predictable rolling updates are another useful feature that StatefulSets provide. StatefulSets number their pods, starting with pod zero, then pod one, and so on. When updating, they will always restart the highest-numbered pod first and work their way down to pod zero. Our high-availability (HA) customers have a pod zero and a pod one; the StatefulSet controller will always restart pod one first, wait for it to be healthy again, and then restart pod zero. Unfortunately, this doesn't take into consideration which pod is the primary.
The worst-case scenario is that pod one is the primary: we restart it first, a failover happens, and pod zero becomes the primary. We then restart pod zero, the new primary, and it fails back over to pod one, causing more disruption. Notice that we always end up back with pod one as the primary, ready for the next time we need to perform an update. The worst case is actually the usual case.
StatefulSets also enforce deterministic pod ordering. If you have a single pod, it must be pod zero; for two pods, you will have pods zero and one. If we consider the case where a customer wants to remove a replica, we’ve seen that there is a good chance that pod one is the primary. However, when we scale the StatefulSet down to one pod, we must be left with pod zero. Removing a replica then would often mean disrupting your service because the primary would be removed.
The deterministic pod ordering problem is even worse when you consider the scenario where we need to replace the volume. This replacement can happen for many reasons: we want to recreate it in a different availability zone, with a different filesystem, or just because we want to reduce its size. It should be relatively easy to do this: we can create a new replica with the right spec, and when it’s ready, we can switch over to it.
With StatefulSets, however, that pod wouldn't be pod zero, so there was no way we could remove the redundant replica and be left with just the one we wanted. We would have to perform an elaborate dance: spin up a new replica, then delete the original pod/volume and create yet another replica, this time with ordinal zero, which we could then finally use. In practice, we would also have to temporarily delete the StatefulSet itself (orphaning the pods and volumes) so that we could control the recreation of pod zero.
Ultimately, we reached the point where it felt like we were spending our time working around StatefulSets rather than having them actively help us. It was time to take a step back and consider replacing them with a Kubernetes operator that could handle all of our use cases and actively work to minimize customer disruption.
Designing an Alternative to StatefulSets
We considered several approaches to replacing StatefulSets:
1. Managing the pods and volumes directly in our main operator
2. Using StatefulSets but in “OnDelete” mode and controlling the deletion from our main operator
3. Building a separate operator to manage the pods/volumes
After much discussion, we came to the conclusion that taking option three and building a direct replacement for the StatefulSet operator was the best bet. We wanted to avoid having a single “god” operator that did everything. Our main Kubernetes operator is already a complex beast and adding more functionality to it seemed like it would lead to problems.
The requirements for the operators are also slightly different—managing the pods and volumes often involves workflows that take multiple hours (such as spinning up a new replica), and the sequencing requirements involved weren’t something our main operator was designed for.
Some of our goals for the new system included the following:
- It should be a direct replacement for StatefulSets, and migration should be seamless.
- It must be PostgreSQL/Patroni-aware; it needs to understand which pod is the primary, whether the replica is streaming or still recovering, etc.
- The Custom Resource (CR) should be entirely declarative, and the operator should handle all intermediate stages.
- Actions should be ordered to minimize downtime. For example, in an HA setup we should always update the replica first.
- Pod ordinals should be arbitrary—if we’re removing a replica and pod one is the primary, then we should be able to just keep pod one and remove pod zero.
- It should be maintenance window/event-aware and able to time its changes appropriately.
- It should manage all volume changes.
Implementation
The key part of our new Kubernetes operator was that it should be aware of the role and status of each instance. As mentioned earlier, we use a tool called Patroni for managing primary elections, replication, and failovers. This reliance on Patroni led to our CRD being called a PatroniSet and our operator being called the patroniset-operator, or popper for short (not too imaginative, we know; we rejected wngman, brb-dll and omega star).
Instance matching
The first part of any reconcile loop is fetching the cluster’s current state. This seems simple enough: we grab the PatroniSet CR, list all the pods and volumes for that cluster, and match them up.
However, one consequence of removing the deterministic pod ordering is that working out which instances we want to keep, which need updating, and which we want to get rid of suddenly becomes interesting.
Our PatroniSet CR contains a set of members, each of which has its own specification for the pod and volumes. The cluster contains a set of instances, each consisting of a volume and (usually) a pod.
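We haven't published the CRD, but a rough, hypothetical sketch of its shape looks something like this (the type and field names are illustrative, not our actual API):
import corev1 "k8s.io/api/core/v1"

// PatroniSetSpec is purely declarative: it describes the members we want,
// and the operator works out how to get the existing instances to match.
type PatroniSetSpec struct {
	Members []MemberSpec `json:"members"`
}

// MemberSpec describes the pod and volume we want for one instance.
type MemberSpec struct {
	PodTemplate corev1.PodTemplateSpec           `json:"podTemplate"`
	Volume      corev1.PersistentVolumeClaimSpec `json:"volume"`
}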
After fetching the PatroniSet and the list of pods/volumes, we next evaluate how closely each pod and volume matches the member spec. We score this based on how much disruption it would cause the user to make it match.
Obviously, a perfect match means we need to make no changes and so cause no disruption. For a volume, patching the size upward also causes no disruption, while making it smaller means replacing the whole volume, which cannot be done without disruption.
Once we know how each instance compares to the desired specifications, we can then calculate an optimal matching. When we do this, we also take into account the status of the pod, whether it has been successfully bootstrapped, and whether or not it is the primary. The matching is deterministic to ensure that successive reconciles will arrive at similar matching until there are significant changes to the state of the cluster.
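As a simplified sketch of the scoring idea (the comparison values mirror the ones surfaced in the status output below, but the real logic considers much more):
// Comparison describes how an existing pod or volume stacks up against a
// member spec, ordered from least to most disruptive to fix.
type Comparison int

const (
	ExactMatch Comparison = iota // nothing to do
	Patch                        // e.g. grow the volume in place
	Restart                      // delete/recreate the pod with the new spec
	Replace                      // rebuild the volume from scratch
	NoMatch
)

// disruption scores an instance against a member spec; the matching then
// picks the assignment of instances to members with the lowest total score.
func disruption(podCmp, pvcCmp Comparison) int {
	// Volume changes dominate because they are far more expensive to fix.
	return int(podCmp) + 10*int(pvcCmp)
}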
We also consider the idea of replacements when performing this matching process. For example, if we need to replace a volume for some reason (e.g., we want to make it smaller or change the filesystem), then we spin up an additional ephemeral replica.
While this new replica is catching up to the primary, we have two matches for the member in the spec: the current one and a replacement. This allows us to still perform any updates on the service that we need to while our new replica slowly catches up in the background.
For visibility purposes, we surface all of this matching data in the PatroniSet’s status. Here we can see what the outcome of this matching process looks like for a service that had a single instance but which has been updated and is in the process of replacing that instance with a new replica:
[
{
"index": 0,
"name": "vszax5rzmx-0",
"podCmp": "no-match",
"pvcCmp": "replace"
},
{
"index": 1,
"name": "vszax5rzmx-1",
"podCmp": "exact-match",
"pvcCmp": "exact-match",
"replacement": true
}
]
This tells us that the volume and pod with the name vszax5rzmx-1 are currently being created as a replacement for vszax5rzmx-0. When the instance catches up, the matching will transform to:
[
{
"index": 1,
"name": "vszax5rzmx-1",
"podCmp": "exact-match",
"pvcCmp": "exact-match"
}
]
At this point, we know that we can remove vszax5rzmx-0, because it is no longer relevant.
The instance matching process is the heart of popper; it helps us work out exactly what actions we need to perform to get our cluster into the desired state and is our main point of reference when making those decisions.
Taking actions
Now that we know how our cluster matches up to the desired state, we can work out what actions to perform.
There is a temptation when designing an operator of this sort to start thinking in terms of “workflows.” For example, to handle the situation where the replica count increases from one to two, we might think we need some kind of “AddReplica” workflow that creates the volume and then creates the pod.
The trouble with this way of thinking is that these are very stateful items we are dealing with, and it can take multiple hours to perform actions like spinning up a replica or taking a backup. If a single reconcile loop is expected to perform all of the actions needed to get the cluster to the desired state, then we need to start worrying about blocking, controller restarts, or just what will happen if part of the workflow fails.
Our approach instead was to consider each action that popper could take separately. This is actually quite a small list; popper's responsibilities are limited to creating/updating/deleting the pods and volumes and managing the service's backups (which we include because they are needed for creating replicas, forks, etc.). Here is what the code that sets up our list of possible actions currently looks like (with some actions removed for clarity):
// An Action is a single thing that we might wish to perform.
type Action interface {
Name() string
Execute(ctx context.Context, ps *pset.PSet) Result
}
func New(k8sClient kube.Client, patroniClient patroni.Client, backupClient backups.Client, cfg *config.Config) []Action {
return []Action{
// Provision pod creates a new pod when we have a PVC, and
// we have a matching member that we need to create.
// We always prioritize this first so that when we lose pods
// due to node failure, deletion etc., then we get the service
// up and running asap.
&provisionPodAction{client: k8sClient, backupClient: backupClient},
// For any volumes that are patchable, we can try and update them.
// We do this before trying to provision a pvc, so that we don't
// recreate pvcs that can be patched.
&updateVolumeAction{client: k8sClient, readinessPollInterval: cfg.BackupPollInterval},
// provisionPVC creates new volumes for any members that are
// missing them.
&provisionPVCAction{client: k8sClient, backupClient: backupClient, backupPollInterval: cfg.BackupPollInterval},
// Restarting pods happens when we have a spec change that
// warrants a pod being deleted/recreated.
// Note: restart pod only actually deals with the removal of the
// pod, we rely on provisionPod to recreate it.
&restartPodAction{client: k8sClient},
// If we have redundant pods/pvcs (those that don't match any
// members of the patroniset), and can be removed, we do that now.
&deleteRedundantPVCAction{client: k8sClient},
&deleteRedundantPodAction{client: k8sClient},
// Perform switchover will switchover to a replica if that replica
// has the right spec and the leader doesn't.
// We check this last to ensure that all replicas are in the
// correct state before we switch over, to ensure
// that the candidate that is chosen is appropriate.
&performSwitchoverAction{patroni: patroniClient},
}
}
During the reconcile process, each action is called in turn and executed. Each action has its own series of checks that we perform to see if it can actually do anything. For example, here are some of the checks that the deleteRedundantPod action makes:
- Have we got too many instances?
- Is the service paused or in the process of being deleted?
- Is the pod we’re considering for deletion the primary?
- If it’s not the primary, do we have an existing primary?
- Is this instance matched with a member in the spec?
- Are the other pods bootstrapped and ready?
This allows us to be very clear about the conditions under which we perform each action. We can be certain, for instance, that a pod will never be deleted unless we absolutely mean to.
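In code, those checks end up as a series of early returns at the top of the action. The sketch below captures the shape of it; the helper methods and the Result fields are stand-ins rather than our real implementation:
// Execute deletes a redundant pod only when every guard condition passes.
// This is a simplified illustration, not the real implementation.
func (a *deleteRedundantPodAction) Execute(ctx context.Context, ps *pset.PSet) Result {
	pod, ok := ps.RedundantPod() // stand-in: a pod with no matching member
	if !ok {
		return Result{} // nothing to do
	}
	if ps.Paused() || ps.Deleting() {
		return Result{} // never act on paused or terminating services
	}
	if ps.IsPrimary(pod) {
		return Result{} // never delete the primary; a switchover must happen first
	}
	if !ps.HasPrimary() {
		return Result{} // don't reduce capacity while the cluster has no primary
	}
	if !ps.OtherPodsBootstrappedAndReady(pod) {
		return Result{} // wait until the remaining pods are healthy
	}
	return a.deletePod(ctx, pod) // finally safe to act
}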
Getting the conditions for each action right causes the workflows to emerge. When adding a new replica, the provisionPod action will do nothing initially because there's no volume, while the provisionPVC action succeeds. The next time we reconcile, there will be a volume, so provisionPod succeeds, and we achieve exactly the workflow we wanted.
Each reconcile loop performs at most one action: it creates a single volume or restarts a single pod. We were very deliberate about this for a few reasons (there's a sketch of the driving loop after this list):
- It makes the state, and hence the decision-making, more reliable. We capture the state at the start of the reconcile loop and do all our matching work. If we then change the state by creating or deleting something, that work becomes stale and we would have to redo it, so we stop after one action and reconcile again. There can always be caching issues with getting the state, but our actions are built to be idempotent, so if the cache doesn't update quickly enough for the next loop, it's not a problem.
- We ensure that our set of actions can handle every possible state the cluster can get itself into. There are no hidden gotcha states caused by erroring halfway through a sequence of actions within a single reconcile.
- It makes testing very straightforward.
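The driving loop itself can therefore stay tiny. Here is a hedged sketch of what "at most one action per reconcile" looks like (the ActionTaken field is illustrative, not our real type):
// reconcile walks the action list in priority order, executes the first
// action that has something to do, and then stops so the next loop starts
// from a freshly fetched state. Simplified for illustration.
func (r *Reconciler) reconcile(ctx context.Context, ps *pset.PSet) Result {
	for _, action := range r.actions {
		result := action.Execute(ctx, ps)
		if result.ActionTaken {
			// We just changed the cluster, so our snapshot of its state is
			// stale; requeue and re-evaluate everything next time around.
			return result
		}
	}
	// Nothing to do: the cluster already matches the desired state.
	return Result{}
}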
Results
We started using PatroniSets for new services in our clusters at the end of February 2024. By the middle of March, we were confident enough to start migrating existing services.
We were able to delete all of the existing StatefulSets, orphaning the pods and volumes, and create new PatroniSets that adopted them seamlessly. We had zero customer downtime during the migration, and honestly, nobody noticed. This felt a bit anticlimactic at the time, but on reflection was a massive success and a testament to all the hours of testing we’d done.
From the moment they were introduced into our clusters, PatroniSets have been providing significant improvements in stability and availability guarantees, particularly for HA customers.
Popper has also allowed us to perform significant maintenance work for customers under the hood that we would never have been able to coordinate with StatefulSets. For example, we transitioned to using the xfs filesystem for all of our databases over a year ago because it's proven more performant for our use case and also provides significant improvements during the PostgreSQL upgrade process.
Unfortunately, we had lots of older services stuck on ext4, and no easy way to migrate them. However, with popper, we were able to just update the filesystem in the PatroniSet. We then sat back and watched as it spun up a new replica with an xfs filesystem, waited for it to catch up, switched over, and then removed the old pod, all coordinated so that everything happened in a timely manner and any disruptive actions happened within the service's maintenance window.
In May, we performed this across the fleet, updating every service that was not already using xfs, so that everyone can benefit. We managed this all with only a single connection reset during each service's maintenance window.
We have also continued to improve all of the small interactions: how we add/remove replicas, how/when we restart pods, etc. Recently, we upgraded all of our nodes to Kubernetes 1.28 during a maintenance event, which meant moving all pods to new nodes. Historically, we’ve seen issues when doing this because it can take time for the cluster autoscaler to spin up a new node and for it to join the cluster, ready for the pod.
This process has often resulted in a small amount of downtime for some customers while their pod is stuck in a Pending state, waiting for the new node. To minimize this downtime, when we want to restart a pod and we know it needs to go to a different node, popper now creates a "scout" pod and waits for it to be running before restarting the actual pod.
The scout pod can be easily evicted, guaranteeing space on a running node for the database pod and eliminating the downtime caused by waiting for nodes to become available. We do something very similar whenever a service is resized, to ensure that we have the capacity for it before restarting.
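The scout is essentially a placeholder: it copies the real pod's scheduling footprint but runs under a low priority so that anything important can evict it. A rough sketch of the idea (the priority class name and helper function are made up for illustration):
import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// scoutPodFor builds a low-priority placeholder pod that reserves the same
// capacity and node constraints as the database pod it is scouting for.
func scoutPodFor(db *corev1.Pod) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      db.Name + "-scout",
			Namespace: db.Namespace,
		},
		Spec: corev1.PodSpec{
			// A low-priority class (hypothetical name) lets the scheduler
			// preempt the scout as soon as anything more important arrives.
			PriorityClassName: "scout-placeholder",
			NodeSelector:      db.Spec.NodeSelector,
			Tolerations:       db.Spec.Tolerations,
			Containers: []corev1.Container{{
				Name:  "reserve",
				Image: "registry.k8s.io/pause:3.9",
				// Request the same resources as the database's main container
				// so the reserved space is actually big enough.
				Resources: db.Spec.Containers[0].Resources,
			}},
		},
	}
}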
Popper has enabled us to continually make both big and small improvements to how we orchestrate all of our customers’ services. The amount of downtime and interruptions they experience has been reduced considerably, while we have also been able to roll out significant changes under the hood that benefit the customer and our business.
Final Words
In the end, building PatroniSets was a game-changer. We freed ourselves from the limitations of StatefulSets and created a solution that works with our architecture instead of against it. Our custom operator has not only increased the resilience and flexibility of our Postgres cloud platform, Timescale Cloud, but it has also helped improve cost efficiency—without anyone even noticing the shift.
If we could pull this off, you can imagine the kind of engineering prowess—and hard work—that goes into everything we do. If you want to solve similar complex problems, we’re hiring!
To experience firsthand our very boring cloud platform (where boring is awesome!), which merges the worry-free nature of a managed solution (automatic Postgres updates, free ingress, egress, and backups, and one-click HA) with a supercharged Postgres via automatic data partitioning, hybrid-row columnar storage, and always up-to-date materialized views, create a free Timescale Cloud account today.
And you too can see (or more accurately, not see) PatroniSets doing their work behind the scenes.