Nov 28, 2024
Posted by
Kirk Laurence Roybal
As a PostgreSQL guy, it really makes you wonder why a built-in job scheduler is not a part of the core PostgreSQL project. It is one of the most requested features in the history of ever. Yet, somehow, it just isn’t there.
Essentially, a job scheduler is a process that kicks off in-database functions and procedures at specified times and runs them independently of user sessions. The benefits of having a scheduler built into the database are obvious: no dependencies, no inherent security leaks, fits in your existing high availability plan, and takes part in your data recovery plan, too.
The PostgreSQL Global Development Group has been debating for years about including a built-in job scheduler. Even after the addition of background processes that would support the feature (all the way back in 9.6), background job scheduling is unfortunately not a part of core PostgreSQL.
So being the PostgreSQL lovers we are at Timescale, we decided to build such a scheduler so that our users and customers can benefit from a job scheduler in PostgreSQL. In TimescaleDB 2.9.1, we extended it to allow you to schedule jobs with flexible intervals and provide you with better visibility of error logs.
The flexible intervals enable you to determine whether the next run of the job occurs based on the scheduled clock time or the end of the last job run. And by “better visibility” of the job logs, we mean that they are also being logged to a table where they can be queried internally. These were extended to prevent overlapping job executions, provide predictable job timing, and provide better forensics.
We extensively use the advantage of this internal scheduler for our core features, enabling us to defer compression, data retention, and refreshing of continuous aggregates to a background process (among other things).
This scheduler makes Timescale much more responsive to the caller and results in more efficient processing of these tasks. For our own benefit, the job scheduler needs to be internal to the database. It also needs to be efficient, controllable, and scale with the installation.
We made all this power available to you as a PostgreSQL end user. If you're running PostgreSQL in your own hardware, you can install the TimescaleDB extension. If you're running in AWS, you can try our platform for free.
But not so fast. Before you start rejoicing, let’s review the reasons that the PostgreSQL Global Development Group chose not to include a scheduler in the database—there'll be educational for you as a word of caution.
Rather than rehashing the discussion list on the subject, let's summarize the obstacles that came up in the mailing list:
PostgreSQL is multi-process, not multi-thread. This simple fact makes having a one-to-one relationship of processes to user-defined tasks a fairly heavy implementation issue. Under normal circumstances, PostgreSQL expects to lay a process onto a CPU (affinity), load the memory through the closest non-uniform memory access (NUMA) controller, and do some fairly heavy data processing.
This works great when the expectation is that the process will be very busy the majority of the time. Schedulers do not work like that. They sit around with some cheap threads waiting to do something for the majority of the life of the thread. Just the context switching alone would make using a full-blown process very expensive.
Background workers' processes are a relatively small pool by design. This has a lot to do with the previous paragraph, but also that each process allocates the prescribed memory at startup. So, these processes compete with SQL query workers for CPU and memory. And the background processes have priority over both resources since they are allocated at system startup.
The next issue is more semantic. There are quite a few external schedulers available. Each one of them has a different implementation of the time management system. That is, there is a question about just how exactly the job should be invoked. Should it be invoked again if it is still running from the last time? Should the job be started again based on clock time or relative to the previous job run? From the beginning or the end of the last run?
There are quite a few more questions of this nature, but you get the idea. No matter how the community answers these questions, somebody will complain that the implementation is the wrong answer because \<insert silly mathematician answer here\>
.
Timescale doesn't have the luxury of debating how many angels can dance on the head of a pin. As a database service working with large volumes of data in PostgreSQL, we face a hard requirement of background maintenance for the actions of archival, compression, and general storage. Timescale's core features, excluding hyperfunctions, depend on the job scheduler.
But, rather than create a bespoke scheduler for our own purposes we built a general-purpose scheduler with a public application programming interface.
This general-purpose scheduler is generally available as part of TimescaleDB. You may use it to set a schedule for anything you can express as a procedure or function. In PostgreSQL, that's a huge advantage because you have the full power of the PostgreSQL extension system at your disposal. This list includes plug-in languages, which allow you to do anything the operating system can do.
Timescale assumes that the developer/administrator is a sane and reasonable person who can deal with a balance of complexity. That is longhand for "we trust you to do the right thing."
So, let's talk first about a few best design practices for using the Timescale (PostgreSQL) built-in job scheduler.
Now, assuming we are using the product fairly and judiciously, we can move on to the features and benefits of having an internal scheduler.
Now that we've covered the things that demand caution, here's a list of some of the benefits of using this scheduler:
📝 Editor's note: Quick reminder that you can install the TimescaleDB extension if you're running your own PostgreSQL database, or sign up for the Timescale platform (free for 30 days).
There is a quick introductory article in the Timescale documentation. Click that link if you want more detailed information.
The TL;DR version is that you make a PostgreSQL function or procedure and then call the add_job()
function to schedule it. Of course, you can remove it from the schedule using… Wait for it... delete_job()
.
That's it. Really. All that power is at your fingertips, and all you need to know is two function signatures.
Something to be aware of while you're using the scheduler is that the job may be scheduled to repeat from the end of the last run or from the scheduled clock time (in TimescaleDB 2.9.1 and beyond). This allows you to ensure that the previous job has completed (by picking from the end of the run) or that the job executes at a prescribed time (making job completion your responsibility).
If you feel a bit homesick and just want to look at your adorable job, there's also:
SELECT * FROM timescaledb_information.jobs;
And, of course, for completeness, there's always alter_job()
for rescheduling, renaming, etc.
Once your job has been created, it becomes the responsibility of the job scheduler to invoke it at the proper time. The job scheduler is a PostgreSQL background process. It wakes up every 10 seconds and checks to see if any job is scheduled in the near future.
If such a job is queued up, it will request another background process from the PostgreSQL master process. The database system will provide one (provided there are any available). The provided process becomes responsible for the execution of your job.
This basic operation has some ramifications. We have already mentioned that we need to use these background processes sparingly for resource allocation reasons. Also, there are only a few of them available. The maximum parallel count of background processes is determined by max_worker_processes
. If you need help configuring TimescaleDB background workers, check out our documentation.
📝 You can also check out this blog post on tuning TimescaleDB parameters.
On my system (Kubuntu 22.04.1, PostgreSQL 14.6), the default is 43. That number is just an example, as the package manager for each distribution of PostgreSQL has discretion about the initial setting. Your mileage **will** vary.
Changing this parameter requires a restart, so you will need to make a judgment call about how many concurrent processes you expect to kick off. Add that to this base number and restart your system. Of course, a reasonable number has been added for you in Timescale. Remember the CPU and memory limitations while you are making this adjustment.
The original reasons for creating this scheduler involve building out-of-the-box features involving data management. That includes compression, continuous aggregates, retention policy implementation, downsampling, and backfilling.
You may want to use this for event notifications, sending an email, clustered index maintenance, partition creation, pruning, archiving, refreshing materialized views, or summarizing data somewhere to avoid the need for triggers. These are just a few of the obvious ideas that jump into my consciousness. You can literally do anything that the operating system allows.
This would be a bad place to gum up the locking tables. That is, be sure that whatever you do here is done in a concurrent manner.
REFRESH INDEX CONCURRENTLY
is better than DROP
/ CREATE INDEX
. REFRESH MATERIALIZED VIEW CONCURRENTLY
is better than REFRESH MATERIALIZED VIEW
. You get it. Use CONCURRENTLY
, or design concurrently. Better yet, do things in a tiny atomic way that takes little time anyway.
Long-running transactions that create a lot of locks will interfere with the background writer, the planner, and the vacuum processes. If you crank up too many concurrent processes, you may also run out of memory. Please try to schedule everything to run in series. You’ll thank me later.
Now you have the power to do anything your little heart desires in the background of PostgreSQL without having any external dependencies. We hope you feel empowered, awed, and a little bit special. We also hope you will use your new powers for good!
The job scheduler is available in TimescaleDB 2.9.1 and beyond. If you’re self-hosting TimescaleDB, follow the upgrade instructions in our documentation. If you are using the Timescale platform, upgrades are automatic, meaning that you already have the scheduler at your fingertips.
If this article has inspired you to keep going with your PostgreSQL hacking, check out our collection of articles on PostgreSQL fine tuning.