Written by Dylan Paulus
Common Table Expressions (CTEs) are my go-to tool when writing complex SQL queries. They can be thought of as working like sub-queries or temporary tables but with their own superpowers.
In this article, we'll take a deep dive into what CTEs are and how they can simplify your queries, how recursive CTES can turn impossible PostgreSQL queries into possible, and how to use CTEs to create optimization fences to improve the performance of your queries.
Finally, at the end of the article, we'll discuss Timescale's hyperfunctions and how they improve and simplify our queries like CTEs do.
Common Table Expressions are auxiliary queries that provide a temporary result set to a primary query. In other words, think of CTEs as a temporary table that only exists for the duration of a query. To get started, let's break down the structure of CTEs in SQL:
WITH [cte name] AS (
[auxiliary query]
)
[primary query]
Breakdown:
WITH
clause: Starts the common table expression and comes before the primary query. (You may also encounter people calling CTEs "With statements.")
CTE name: Defines the name of the CTE that can be referenced in the primary query—the name must be unique within the query.
AS
: The As keyword starts the CTE definition.
Auxiliary query: This defines the CTE and "populates the temporary table."
Primary query: The primary query is our main SQL query—we can reference the CTE by name and use it like a regular table.
Common table expressions are not limited to just one auxiliary statement. We can separate auxiliary statements with a comma to provide any number of CTEs.
WITH [cte name] AS (
[query] -- auxiliary one statement
),
[cte 2 name] AS (
[query] -- auxiliary two statement
),
[cte 3 name] AS (
[query] -- auxiliary three statement
)
[primary query] -- primary statement
Now that we know the structure of CTEs, let's dive into an example to solidify our mental model of CTEs and learn why they're helpful!
For this example, we manage an employee management system. The database contains two tables: a departments
table and a employees
table. The database schema looks like this:
CREATE TABLE departments (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL
);
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
department_id INTEGER REFERENCES departments(id),
salary DECIMAL(10, 2),
hire_date DATE
);
Our manager has given us a task to determine how much money it costs to operate each department. Using CTEs, we can break the task up into two steps.
First, we need to sum up the salaries for every employee per department. The SQL to do this would look like this:
SELECT
department_id,
SUM(salary) AS total_salary
FROM
employees
GROUP BY
department_id
Second, we need to list and order the departments by the highest-costing departments. The query would be:
SELECT
d.name,
[department budgets].cost
FROM
[department budgets]
ORDER BY
[department budgets].cost DESC;
Since we have yet to learn each department's cost, we need to write placeholders for now. We can combine the two queries created into a single query using CTEs. The final report would look like this:
WITH department_budgets AS (
SELECT
department_id,
SUM(salary) AS cost
FROM
employees
GROUP BY
department_id
)
SELECT
d.name,
db.cost
FROM
department_budgets db
JOIN departments d ON d.id = db.department_id
ORDER BY
db.cost DESC;
With the queries combined, we can replace the placeholders with the name of the CTE.
Of course, the SQL statement could be written without using CTEs. That query would be messier. Also, the GROUP BY
clause is a frequent source of frustration and confusion for many of us—especially within complex queries. CTEs allow us to organize our thoughts and tackle problems step-by-step. The end result is an easy-to-read query. Not only can we use CTEs to query data, but they are powerful tools for modifying data.
We can use INSERT
, UPDATE
, and DELETE
statements within a common table expression's auxiliary or primary query. Paired with RETURNING
clauses, which returns the data being modified, CTEs provide a great way to replicate or log changes to another table. For example, let's say we have a table to track employees who have left the company.
CREATE TABLE deactivated_employees (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
department_id INTEGER REFERENCES departments(id),
deactivation_date DATE
);
When an employee leaves the company, we want to delete that employee from the employees
table and add them to the deactivated_employees
table. This can be done by deleting the row within the CTE and RETURNING
the information we need from that row. Then, have the primary query INSERT
the deleted data into deactivated_employees
by referencing the CTE. The CTE would look like this:
WITH deactivated AS (
DELETE FROM
employees
WHERE
id = 1
RETURNING
id,
name,
department_id
)
INSERT INTO deactivated_employees
SELECT
id,
name,
department_id,
now()
FROM
deactivated;
Combining data-modifying statements with read statements opens up many use cases. But this is just the beginning. Next, we'll look at recursive queries and how near-impossible queries become simple when using CTEs.
Recursive CTEs are incredibly powerful, especially when dealing with hierarchical or tree-structured data, such as organizational charts, file systems, or categories with subcategories. They allow us to query data related to a parent-child relationship, iterating through the hierarchy levels until a particular condition is met. Only through a recursive common table expression can a relational database like PostgreSQL be turned into a graph database.
WITH
can be given an optional modifier, RECURSIVE
, to turn CTEs into recursive loops. The SQL structure looks like this:
WITH RECURSIVE [cte name] AS (
[base case]
UNION ALL
[recursive term]
)
[primary query]
A recursive CTE consists of a base case and a recursive term. The base case defines the starting point of the recursive query (e.g., the first node in a graph or the parent in a hierarchy). The recursive term describes the "loop" that will run until some endpoint is determined.
This recursive term will reference the CTE itself to create the loop. The base case and recursive term are combined using a UNION ALL
(or sometimes just UNION
if you want to eliminate duplicates, but be cautious as it can be more expensive in terms of performance) to create the full recursive CTE.
Returning to our employee management system from before, the employees
table has a new manager_id
column that references another employee.
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
department_id INTEGER REFERENCES departments(id),
salary DECIMAL(10, 2),
hire_date DATE,
manager_id INTEGER REFERENCES employees(id)
);
Our task is to generate the company org chart starting with a given employee. Like the last example, breaking the task into small steps will help determine the recursive CTE query. The first step is to develop a base case, which would be to get the starting employee.
SELECT id, name, manager_id
FROM employees
WHERE id = [employee id];
Next, we need to figure out the recursive term or how to loop through the employee hierarchy. To do this, we can think of the employees
table performing a JOIN
against itself on current_employee.manager_id = manager.id
.
SELECT manager.id, manager.name, manager.manager_id
FROM employees manager
JOIN [current employee] employee ON employee.manager_id = manager.id
We cannot reference the current iteration in the (recursive) loop yet, so I'll leave a placeholder [current employee]
. Recursive terms will continue to execute until they reach a path when the recursive term returns no rows.
This is important to consider to prevent recursive queries from entering an infinite loop! In the recursive term, when gathering the employee hierarchy, the recursive query will stop executing when the manager_id
is null (e.g., the CEO at the top of the org chart).
With the base case and the recursive term written, let's combine them into a recursive CTE. Our recursive CTE will have a name we can reference in the recursive term so that we can replace the [current employee]
placeholder with the name of the CTE. The complete recursive query to gather the employee hierarchy would look like this:
WITH RECURSIVE employee_hierarchy AS (
-- Base case: select the starting employee
SELECT id, name, manager_id
FROM employees
WHERE id = 1 -- employee id
UNION ALL
-- Recursive term: The recursion will stop when `manager_id` = NULL
SELECT manager.id, manager.name, manager.manager_id
FROM employees manager
JOIN employee_hierarchy employee ON employee.manager_id = manager.id
)
SELECT * FROM employee_hierarchy;
A CTE will return all the columns specified in its SELECT
statement (in employee_hierarchy
, it is id,
name,
and manager_id
). But, we also have the option to explicitly define what columns CTEs return and their names—referred to as the column list. The names of columns can be changed by manually specifying each column after the CTE name in the WITH
expression:
WITH RECURSIVE employee_hierarchy(employee_id, name, manager_id) AS (
SELECT id, name, manager_id
... query
)
We can additionally add columns to column lists to provide generated data. For example, if we want to sort employee_hierarchy
by what level the employee is in the org chart, we can add a level
column to the column list. The base case would start with level
set to one. Then, each iteration of the recursive term would increment level
by one. This would look like:
WITH RECURSIVE employee_hierarchy(id, name, manager_id, level) AS (
SELECT id, name, manager_id, 1 -- level starting at one
FROM employees
WHERE id = 1
UNION ALL
SELECT manager.id, manager.name, manager.manager_id, level+1 -- increment level
FROM employees manager
JOIN employee_hierarchy employee ON employee.manager_id = manager.id
)
SELECT * FROM employee_hierarchy ORDER BY level DESC; -- order by level
Up to this point, we've explored using a common table expression to break down queries into simple steps and recursive CTEs to query data from graph or tree-like data structures. The last use case we'll examine is how CTEs can help us optimize our SQL queries. In Improving DISTINCT Query Performance Up to 8,000x on PostgreSQL, we discussed how recursive CTEs can improve DISTINCT
queries. This is one example of CTEs creatively optimizing queries and can be handy if you're running PostgreSQL yourself. However, TimescaleDB supports skip index scans without relying on CTEs.
Until PostgreSQL 12, CTEs materialize by default. In other words, a common table expression is only computed once and then cached (similar to materialized views). Any additional reference to the CTE would refer to the cached result. CTEs being materialized provide excellent options for optimizing expensive queries that are called multiple times in a primary query.
Starting in PostgreSQL 12, a query that doesn't have side effects isn't recursive and only referenced once and can be inlined by the query planner. This means the CTE gets turned into a subquery.
For example, a query that looks like this:
WITH my_cte AS (
SELECT * FROM my_table
)
SELECT * FROM other_table ot JOIN my_cte m ON ot.id = m.other_id;
It could be inlined to behave like this:
SELECT * FROM other_table ot
JOIN (SELECT * FROM my_table) m ON ot.id = m.other_id;
However, we can tell the query planner how we want it to behave. Materialization can be enabled or disabled by specifying MATERIALIZED
or NOT MATERIALIZED
after AS
in the WITH
expression.
WITH my_cte AS MATERIALIZED (
SELECT * FROM my_table
)
SELECT * FROM other_table ot JOIN my_cte m ON ot.id = m.other_id;
CTEs can be a tool to optimize queries. But with every optimization, always ANALYZE
queries before explicitly setting NOT MATERIALIZED
or MATERIALIZED
to ensure you're not accidentally making queries perform slower.
Much like how a common table expression streamlines complex SQL queries into more manageable units, Timescale's hyperfunctions enable us to condense verbose SQL statements into more straightforward, concise queries.
For example, if we're looking to calculate the time-weighted average price of a stock, taking into account the duration each price was adequate, the query might look somewhat complex:
SELECT
date_trunc('day', time) AS day,
SUM(price * duration) / SUM(duration) AS time_weighted_average_price
FROM (
SELECT
time,
price,
lead(time) OVER (ORDER BY time) - time AS duration
FROM
stock_prices
) sub
GROUP BY
day
ORDER BY
day;
This query calculates the duration each price was in effect by using the lead()
window function to get the next timestamp and subtracting the current timestamp from it. Then, it calculates the time-weighted average price by multiplying each price by its duration, summing those products, and dividing by the total duration.
Timescale offers the time_weight()
hyperfunction, combined with the average()
accessor, it makes calculating time-weighted averages straightforward:
SELECT
time_bucket('1 day', time) AS one_day_bucket,
average(time_weight('Linear', time, price)) AS time_weighted_average_price
FROM
stock_prices
GROUP BY
one_day_bucket
ORDER BY
one_day_bucket;
In this query, time_bucket()
groups the data into daily intervals. The time_weight('Linear', time, price)
function computes a linear time-weighted average, where time is the timestamp and price is the value to be averaged over time. The average()
accessor extracts the average value from the time-weighted calculations.
This method abstracts away the complex window functions and arithmetic operations, significantly simplifying the query and improving readability, much like how CTEs simplify complex query logic into more manageable parts.
Common table expressions are a powerful tool when writing queries. This article shows that CTEs allow us to break down complex queries into manageable chunks, traverse graph-like structures through recursive queries, and optimize queries through optimization fences. Timescale's hyperfunctions provide some ancillary to CTEs by simplifying complex queries and speeding time series analysis.
You can create a free Timescale account and utilize common table expressions and hyperfunctions.