Written by Semab Tariq
PostgreSQL is highly regarded for its flexibility, making it an ideal choice for databases that need custom functionality. A key feature supporting this adaptability is user-defined functions (UDFs). With user-defined functions, developers can create custom routines to implement business logic in PostgreSQL, keeping processing close to the data for better efficiency.
In this blog post, we’ll cover the concepts of UDFs, explore the different types available, and provide clear examples for each. Additionally, we will look at how you can use UDFs effectively on our managed PostgreSQL platform.
User-defined functions (UDFs) are custom functions created by users to perform specific operations in PostgreSQL. Unlike built-in functions, UDFs are developed based on individual requirements and are particularly useful for repetitive tasks or operations unique to a given workload. UDFs can be written in various languages, with PL/pgSQL (PostgreSQL’s procedural language) being one of the most common, alongside PL/Python, PL/Perl, and others.
PostgreSQL offers four types of functions:
SQL functions: functions written directly in SQL
Procedural language functions: functions written in languages like PL/pgSQL or PL/Tcl
Internal functions: built-in functions provided by PostgreSQL
C-language functions: functions written in C for advanced performance and customization
Let’s examine them.
In PostgreSQL, SQL functions are user-defined functions written using standard SQL syntax. They perform operations such as calculations, data manipulation, and retrieval and are helpful for creating reusable code blocks within the database. SQL functions are typically used for relatively simple operations because they are interpreted directly by the PostgreSQL SQL engine, making them fast and efficient.
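For illustration, the sales examples in this post assume a minimal table along these lines (a hypothetical schema, not from the original dataset):

CREATE TABLE sales (
    product_id INT,
    amount     NUMERIC,
    sale_date  DATE
);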
Suppose you have a sales table and want to calculate the total sales amount for a specific year and month. To achieve this, you can create an SQL function like the one below:
CREATE FUNCTION monthly_sales(year INT, month INT)
RETURNS NUMERIC AS $$
    SELECT SUM(amount)
    FROM sales
    WHERE EXTRACT(YEAR FROM sale_date) = year
      AND EXTRACT(MONTH FROM sale_date) = month;
$$ LANGUAGE SQL;
To call the monthly_sales function with the year 2024 and month 01, you would use the following SQL statement:
SELECT monthly_sales(2024, 01);
PostgreSQL allows you to write user-defined functions in various languages beyond SQL.
These are known as procedural languages (PLs).
When you create a function in a procedural language, PostgreSQL doesn't directly understand it. Instead, it uses a special handler to interpret and execute the function's code. This handler acts as a translator, either managing all aspects of the function, such as parsing and execution, or serving as a bridge between PostgreSQL and an external programming language.
For example, if you create a function in PL/Python, PostgreSQL will use the PL/Python handler to interpret and execute the Python code. The handler acts as a bridge, loading the necessary components to run Python code within the database environment.
Currently, PostgreSQL offers four procedural languages as part of its standard distribution:
PL/pgSQL
PL/Tcl
PL/Perl
PL/Python
The handlers for these languages enable PostgreSQL to execute code written in these external languages by translating function calls and managing execution through the respective language's runtime environment.
Handlers are implemented as C functions that are compiled and loaded into the PostgreSQL server dynamically whenever a function in a procedural language is called. This setup enables PostgreSQL to expand its capabilities by supporting additional programming languages.
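You can check which procedural languages are installed in the current database by querying the pg_language system catalog:

SELECT lanname, lanpltrusted FROM pg_language;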
There are many other procedural languages available that are not part of PostgreSQL core and can be used on demand. For example:
PL/Java
PL/R
PL/Rust
PL/sh
Check out the complete list of available procedural languages.
PL/pgSQL is a powerful procedural language for the PostgreSQL database system that enables the creation of functions, procedures, and triggers with enhanced control structures, allowing for complex computations beyond what standard SQL functions can achieve. Unlike SQL functions, which execute individual SQL statements one at a time, PL/pgSQL enables you to group multiple queries and computations into a single block, significantly reducing interprocess communication and network overhead, thus improving performance.
The following PL/pgSQL function calculates the total sales for a given product in a specified year:
CREATE FUNCTION total_sales_for_product(prod_id INT, sales_year INT)
RETURNS NUMERIC AS $$
DECLARE
    total_sales NUMERIC := 0; -- Variable to store the total sales
BEGIN
    -- Calculate total sales for the specified product and year
    SELECT SUM(amount) INTO total_sales
    FROM sales
    WHERE product_id = prod_id
      AND EXTRACT(YEAR FROM sale_date) = sales_year;

    -- Return the total sales amount
    RETURN total_sales;
END;
$$ LANGUAGE plpgsql;
You can call this function to get the total sales for a specific product and year as follows:
SELECT total_sales_for_product(101, 2024);
PL/Tcl enables developers to create functions using Tcl, a flexible and powerful scripting language. It merges the strengths of C function writing with Tcl's rich string processing capabilities, allowing for the development of complex database logic within a secure environment. A key benefit of PL/Tcl is its safe execution context; it operates within a controlled Tcl interpreter that restricts available commands. This safety measure ensures that users without special privileges cannot access the database or the operating system inappropriately.
Before using PL/Tcl, we need to ensure that the PL/Tcl language is available in the current database. Once it is installed, you can create functions written in Tcl.
CREATE EXTENSION pltcl;
Create a PL/Tcl function that processes a list of integers, filters out even numbers, and returns the sum of the remaining odd numbers.
CREATE FUNCTION sum_odd_numbers_tcl(int[])
RETURNS int AS $$
    set total 0
    # The array argument arrives in its text form, e.g. {1,2,3}
    foreach num [split [string trim $1 "{}"] ","] {
        if {$num % 2 != 0} {
            set total [expr {$total + $num}]
        }
    }
    return $total
$$ LANGUAGE pltcl;
Execute the above function with:
SELECT sum_odd_numbers_tcl(ARRAY[1, 2, 3, 4, 5, 6, 7]) AS sum_of_odds;
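Given the odd values 1, 3, 5, and 7, this returns 16.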
PL/Perl allows developers to write functions and procedures using the Perl programming language. To use PL/Perl in a specific database, you can simply run the query:
CREATE EXTENSION plperl;
Additionally, if PL/Perl is installed in the template1 database, it will automatically be available in all newly created databases. Note that users who compile PostgreSQL from source must enable PL/Perl during the build, while those using binary packages may find it in a separate subpackage.
Suppose you have a table of customer reviews and want to analyze the sentiment of each review to determine whether it's positive, negative, or neutral. We'll create a simple PL/Perl function that uses a basic keyword-based approach to classify the sentiment of a review.
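For reproducibility, assume a simple hypothetical table of reviews:

CREATE TABLE customer_reviews (
    id          SERIAL PRIMARY KEY,
    review_text TEXT
);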
CREATE FUNCTION analyze_sentiment(review_text TEXT)
RETURNS TEXT AS $$
    my $text = lc $_[0]; # Convert text to lowercase for case-insensitive matching
    my @positive_keywords = ('good', 'great', 'excellent', 'happy', 'love');
    my @negative_keywords = ('bad', 'terrible', 'poor', 'sad', 'hate');
    my $positive_count = 0;
    my $negative_count = 0;

    # Count occurrences of each keyword
    foreach my $word (@positive_keywords) {
        $positive_count++ while $text =~ /\b\Q$word\E\b/g;
    }
    foreach my $word (@negative_keywords) {
        $negative_count++ while $text =~ /\b\Q$word\E\b/g;
    }

    return 'Positive' if $positive_count > $negative_count;
    return 'Negative' if $negative_count > $positive_count;
    return 'Neutral';
$$ LANGUAGE plperl;
You can run the sentiment analysis function to classify each review:
SELECT id, review_text, analyze_sentiment(review_text) AS sentiment FROM customer_reviews;
PL/Python enables developers to write PostgreSQL functions and procedures using the Python programming language. To install PL/Python in a specific database, you can execute the query:
CREATE EXTENSION plpython3u;
It’s important to note that PL/Python is an "untrusted" language, meaning it does not restrict user capabilities. Functions can perform any action that a database administrator could. Consequently, only superusers are permitted to create functions in PL/Python. If PL/Python is installed in the template1 database, it will automatically be available in any new databases created afterward.
We'll create a function that performs data normalization on a numeric column. Data normalization is a common preprocessing step in data analysis and machine learning, where you scale the data to a specific range, typically between zero (0) and one (1).
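For reproducibility, assume a hypothetical sales_data table with evenly spaced values (this is what produces the output shown below):

CREATE TABLE sales_data (id INT, sales_amount NUMERIC);
INSERT INTO sales_data VALUES (1, 100), (2, 200), (3, 300), (4, 400), (5, 500);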
CREATE FUNCTION normalize_sales()
RETURNS TABLE (id INT, normalized_amount NUMERIC) AS $$
    result = plpy.execute("SELECT MIN(sales_amount) AS min_val, MAX(sales_amount) AS max_val FROM sales_data")
    min_val = result[0]['min_val']
    max_val = result[0]['max_val']

    if min_val == max_val:
        raise Exception("Normalization is not possible with constant values")

    query = """
        SELECT id, (sales_amount - %s) / (%s - %s) AS normalized_amount
        FROM sales_data
    """ % (min_val, max_val, min_val)
    return plpy.execute(query)
$$ LANGUAGE plpython3u;
You can now call the function to normalize the sales data:
SELECT * FROM normalize_sales();
id | normalized_amount
----+------------------------
1 | 0.00000000000000000000
2 | 0.25000000000000000000
3 | 0.50000000000000000000
4 | 0.75000000000000000000
5 | 1.00000000000000000000
(5 rows)
In PostgreSQL, internal functions are built-in functions implemented in C that are part of the PostgreSQL server's core functionality. These functions are compiled into the PostgreSQL binary and available without any additional setup or installation.
They offer essential operations and tools commonly required for managing and manipulating the database. In other words, these functions provide a set of built-in features that make it easier for users to perform routine tasks within a PostgreSQL database.
Here are some key points about internal functions:
Since internal functions are implemented in C and run directly within the PostgreSQL server, they are highly optimized for performance.
These functions are always available in a PostgreSQL installation, as they are part of the core system.
Internal functions cover a wide range of operations, including mathematical calculations, string manipulations, date and time processing, and more.
Here are some internal functions in PostgreSQL that can simplify your daily tasks:
postgres=# SELECT
    abs(-100) AS absolute_value,                   -- Mathematical function
    length('Hello, PostgreSQL!') AS string_length, -- String function
    now() AS current_timestamp,                    -- Date/Time function
    round(123.456, 2) AS rounded_value,            -- Mathematical function
    upper('lowercase text') AS upper_case_text,    -- String function
    date_trunc('day', now()) AS truncated_date,    -- Date/Time function
    random() AS random_number;                     -- Mathematical function
-[ RECORD 1 ]-----+------------------------------
absolute_value | 100
string_length | 18
current_timestamp | 2024-11-11 13:34:37.155338+05
rounded_value | 123.46
upper_case_text | LOWERCASE TEXT
truncated_date | 2024-11-11 00:00:00+05
random_number | 0.8308769807656364
Typically, internal functions are declared automatically during the database cluster initialization, but users can create additional aliases for these functions using the CREATE FUNCTION command with the LANGUAGE internal specification. For example, a user can create an alias for the square root function as follows:
CREATE FUNCTION square_root(double precision) RETURNS double precision
AS 'dsqrt'
LANGUAGE internal
STRICT;
Calculating the square root of a number:
SELECT square_root(25);
User-defined functions in PostgreSQL can be written in C or compatible languages like C++. These functions are turned into shared libraries that the PostgreSQL server can load when needed. This feature sets them apart from internal functions, though the coding style is similar.
When creating a C function, you use the CREATE FUNCTION command and specify both the library file's name and the name of the C function inside it. PostgreSQL follows specific steps to find the library file, allowing for different file paths. Each C function must include a "magic block" to ensure it is loaded into a compatible version of PostgreSQL.
There's also an optional initialization function, _PG_init, that runs right after the library loads. Note that users must compile the C code into a shared library before using it in PostgreSQL, as the system won't compile it automatically.
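As a minimal sketch, that optional hook is a function named _PG_init; added to the source file below (which already has the needed includes), it would look like this:

/* Optional: called once when the shared library is first loaded */
void _PG_init(void);

void
_PG_init(void)
{
    /* One-time setup, e.g., registering custom GUCs */
}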
The following example shows how to create and use a user-defined function in PostgreSQL using C. Let's say we want to calculate the factorial of a number.
First, you need to create a C file, say factorial.c, with the following:
#include "postgres.h"
#include "fmgr.h"
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(factorial);
Datum
factorial(PG_FUNCTION_ARGS)
{
int32 n = PG_GETARG_INT32(0);
int32 result = 1;
for (int32 i = 1; i <= n; i++) {
result *= i;
}
PG_RETURN_INT32(result);
}
Next, you need to compile this code into a shared library. You can do this using the following command in your terminal (make sure the PostgreSQL development packages are installed):
gcc -shared -o factorial.so -fPIC \
    -I/usr/include/postgresql/17 -I/usr/include/postgresql/17/server \
    factorial.c -L/usr/lib/postgresql/17/lib $(pg_config --ldflags)
Copy factorial.so to the PostgreSQL library directory:
cp factorial.so /usr/lib/postgresql/17/lib/
Now, you can create the function in PostgreSQL using the CREATE FUNCTION command. Connect to your PostgreSQL database and run:
CREATE FUNCTION factorial(INT) RETURNS INT AS '$libdir/factorial', 'factorial' LANGUAGE C STRICT;
You can now call the factorial function in your SQL queries. For example:
SELECT factorial(5); -- This will return 120
User-defined functions are highly versatile tools for adding custom logic to PostgreSQL, allowing for a wide range of uses. One powerful way to leverage UDFs is by automating their execution, particularly for tasks that need to run on a schedule. In these cases, a job scheduler becomes an invaluable tool.
A job scheduler is a system component that automates the execution of tasks or jobs at specified intervals or in response to specific events, independent of user intervention. In the context of databases, a job scheduler facilitates the execution of database functions, procedures, and maintenance tasks at predetermined times, enhancing efficiency and reliability.
By allowing these operations to run in the background, a job scheduler reduces the need for manual oversight, ensures timely data processing, and optimizes resource management.
Moreover, since databases are typically I/O bound and the CPU is often underutilized, job schedulers help make better use of available resources by scheduling tasks during periods of low CPU demand, thus improving overall system performance and throughput.
PostgreSQL doesn't have a native job scheduler feature built into its core. However, we wrote an article explaining why the PostgreSQL community has been hesitant to incorporate job scheduling directly into the core.
TimescaleDB is built on top of PostgreSQL and designed to handle large volumes of data. Certain background maintenance tasks, such as data archival, compression, and general storage management, are necessary for this to work efficiently. These tasks need to run automatically in the background, which is why a job scheduler is crucial for TimescaleDB.
Rather than creating a custom scheduler just for internal use, the Timescale team built a general-purpose job scheduler. This scheduler can be used for a variety of tasks, not just Timescale's internal needs, and it comes with a public API (application programming interface), which means other users can also take advantage of it.
The scheduler is part of TimescaleDB and is built to work within the PostgreSQL ecosystem. It allows users to schedule functions and procedures that can be expressed using PostgreSQL.
Here are some benefits of using Timescale's built-in job scheduler:
When you use physical streaming replication, the job schedule is also replicated. This means that when you switch over to a replica, your scheduled jobs will already be set up and ready to run without needing additional configuration.
You don’t need a separate plan to make your job scheduler highly available (HA). If your PostgreSQL system is running, your scheduled jobs will also continue to run, as they are inherently tied to the system's availability.
The scheduler can report the success or failure of jobs by logging the results in internal tables and the PostgreSQL log file. This makes it easy to track and troubleshoot the execution of scheduled tasks.
The jobs can perform essential administrative tasks, such as dropping tables or altering table structures, based on the system's needs. This allows for more automated management of your database.
When you install TimescaleDB, the job scheduler is already included and ready to use. You don’t need to set up or configure a separate scheduler system.
Before scheduling a job, you need to create the UDF. If you're unsure about the type of UDF to create, refer to the previous section on UDF types. Once the UDF is created, you can register it with the job scheduler using the add_job function.
Provide essential configurations such as the UDF name, how often it should be executed, and any additional parameters needed to run the UDF.
Here’s the syntax to run a custom UDF every hour, where process_hourly_reports is the name of the user-defined action:
SELECT add_job('process_hourly_reports', '1h');
If your function requires an additional parameter for execution, you can specify it using the following query, where hypertable is the key and metrics is the value:
SELECT add_job('process_hourly_reports', '1h', config => '{"hypertable":"metrics"}');
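Note that a function or procedure registered with add_job is expected to accept two arguments: the job ID and the JSONB config. A minimal sketch of such an action (the body of process_hourly_reports here is illustrative):

CREATE OR REPLACE PROCEDURE process_hourly_reports(job_id INT, config JSONB)
LANGUAGE plpgsql AS $$
DECLARE
    ht TEXT := config ->> 'hypertable'; -- value supplied via add_job's config
BEGIN
    RAISE NOTICE 'Job % processing reports for table %', job_id, ht;
END;
$$;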
Integrating user-defined functions with the job scheduler enables automated and efficient handling of data management tasks. Below are some key data management tasks we can automate.
You can use a UDF to aggregate daily totals from main_table into daily_summary_table and schedule it to run once per day using TimescaleDB's job scheduler.
CREATE OR REPLACE FUNCTION daily_summary(job_id INT, config JSONB) -- add_job passes these two arguments
RETURNS VOID
LANGUAGE SQL
AS $$
    INSERT INTO daily_summary_table (day, total)
    SELECT time_bucket('1 day', time), SUM(value)
    FROM main_table
    WHERE time_bucket('1 day', time) NOT IN (SELECT day FROM daily_summary_table)
    GROUP BY 1;
$$;
SELECT add_job('daily_summary', '1d');
Similar to data aggregation, you can create a UDF to archive and delete records older than one year, moving them from main_table into archive_table. You can then schedule the function to run daily using TimescaleDB's job scheduler.
CREATE OR REPLACE FUNCTION archive_old_data(job_id INT, config JSONB)
RETURNS VOID LANGUAGE SQL AS $$
    INSERT INTO archive_table (SELECT * FROM main_table WHERE time < NOW() - INTERVAL '1 year');
    DELETE FROM main_table WHERE time < NOW() - INTERVAL '1 year';
$$;
SELECT add_job('archive_old_data', '1d');
Deleting rows in a large table creates table bloat and triggers costly autovacuum operations, so this approach doesn't scale for time-series data as your PostgreSQL table grows. With hypertables, data is automatically partitioned into chunks based on time. Instead of deleting rows, you can drop entire chunks using drop_chunks, which instantly frees up space and avoids the autovacuum overhead, as shown below.
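For example, assuming main_table is a hypertable, this removes every chunk containing only data older than one year:

SELECT drop_chunks('main_table', INTERVAL '1 year');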
Read how you can reduce table bloat in your large PostgreSQL tables.
You can automate this process by defining a retention policy with TimescaleDB, which periodically removes old data without manual intervention.
Learn more about how data retention policies can help you manage large datasets.
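A minimal sketch of such a policy, again assuming main_table is a hypertable:

SELECT add_retention_policy('main_table', INTERVAL '1 year');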
We can automate maintenance tasks, such as reindexing, using UDFs. This query creates the simple_maintenance_task function, which resets PostgreSQL's statistics counters and reindexes a table to improve performance. (REINDEX DATABASE cannot run inside a function, so the example targets a single table, here main_table.) The function is then scheduled to run once a week.
CREATE OR REPLACE FUNCTION simple_maintenance_task(job_id INT, config JSONB)
RETURNS VOID LANGUAGE plpgsql AS $$
BEGIN
    PERFORM pg_stat_reset();            -- Resets statistics to clear any accumulated data
    EXECUTE 'REINDEX TABLE main_table'; -- Rebuilds the table's indexes
END;
$$;
SELECT add_job('simple_maintenance_task', '1w');
Monitoring your scheduled jobs is essential because it helps you keep track of their performance and ensures they run smoothly. With proper monitoring, you can quickly identify and debug any issues. Close monitoring allows you to fix problems early and improve the efficiency of your jobs, leading to better overall system performance.
Change the minimum log level shown to the client by setting it to DEBUG1:
SET client_min_messages TO DEBUG1;
To check the current level for client_min_messages, you can query:
SHOW client_min_messages;
The available levels are typically:
debug5
debug4
debug3
debug2
debug1
log
notice
warning
error
In TimescaleDB, background workers are essential for performing maintenance tasks like compression, continuous aggregate refreshes, and data retention. Occasionally, you might need to restart these workers to address specific issues, including the following:
Resolving locks on hypertables caused by stalled jobs
Refreshing processes that might be in an unexpected state
To restart the background workers, run the following query:
SELECT _timescaledb_functions.start_background_workers();
For TimescaleDB versions earlier than 2.12, use this query:
SELECT _timescaledb_internal.start_background_workers();
When you execute this query, the workers are stopped and restarted gracefully: a SIGTERM signal ensures tasks are paused safely without disrupting normal operations.
And what if PostgreSQL stops? See how TimescaleDB solves the problem of PostgreSQL locks.
timescaledb_information.jobs is a system view in TimescaleDB that provides information about scheduled jobs.
You can see a list of all your currently registered jobs by querying the job scheduler like this:
tsdb=> SELECT * FROM timescaledb_information.jobs;
-[ RECORD 1 ]-----+----------------------------------------
job_id | 3
application_name | Job History Log Retention Policy [3]
schedule_interval | 1 mon
max_runtime | 01:00:00
max_retries | -1
retry_period | 01:00:00
proc_schema | _timescaledb_functions
proc_name | policy_job_stat_history_retention
owner | postgres
scheduled | t
fixed_schedule | t
config | {"drop_after": "1 month"}
next_start | 2024-12-01 00:00:00+00
initial_start | 2000-01-01 00:00:00+00
hypertable_schema |
hypertable_name |
check_schema | _timescaledb_functions
check_name | policy_job_stat_history_retention_check
First, get the job ID from the timescaledb_information.jobs view.
To turn off the automatic scheduling of a job where 1000 is your job ID, use this query:
SELECT alter_job(1000, scheduled => false);
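To resume the job later, re-enable it the same way:

SELECT alter_job(1000, scheduled => true);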
And to completely delete the job, you can use the following:
SELECT delete_job(1000);
If you want to see what jobs are triggered by user-defined actions, you can use the following query:
SELECT * FROM timescaledb_information.jobs WHERE application_name LIKE 'User-Define%';
Use this query to retrieve job statistics for a specific hypertable named test_table in TimescaleDB. It pulls details from the timescaledb_information.job_stats view, which stores metrics about scheduled jobs managed by TimescaleDB’s job scheduler.
SELECT job_id,
       total_runs,
       total_failures,
       total_successes
FROM timescaledb_information.job_stats
WHERE hypertable_name = 'test_table';
The following query retrieves error details for jobs managed by the TimescaleDB job scheduler from the timescaledb_information.job_errors view. This view contains information about errors encountered during job execution, providing valuable insights for debugging and maintaining the job scheduling system.
SELECT job_id, proc_schema, proc_name, pid, sqlerrcode, err_message
FROM timescaledb_information.job_errors;
User-defined functions (UDFs) in PostgreSQL are powerful tools that allow you to create custom operations to enhance data processing capabilities. Combined with Timescale’s job scheduler, you can automate the execution of these functions at specified intervals. Additionally, by monitoring your jobs, you can track various aspects of their execution, ensuring optimal performance and reliability.