SQL Queries

How to Use a Common Table Expression (CTE) in SQL

A tiger developer coding on his computing (and probably using a common table expression).

Written by Dylan Paulus

Common Table Expressions (CTEs) are my go-to tool when writing complex SQL queries. They can be thought of as working like sub-queries or temporary tables but with their own superpowers.

In this article, we'll take a deep dive into what CTEs are and how they can simplify your queries, how recursive CTES can turn impossible PostgreSQL queries into possible, and how to use CTEs to create optimization fences to improve the performance of your queries.

Finally, at the end of the article, we'll discuss Timescale's hyperfunctions and how they improve and simplify our queries like CTEs do.

What Is a Common Table Expression?

Common Table Expressions are auxiliary queries that provide a temporary result set to a primary query. In other words, think of CTEs as a temporary table that only exists for the duration of a query. To get started, let's break down the structure of CTEs in SQL:

WITH [cte name] AS ( [auxiliary query] ) [primary query]

Breakdown:

  • WITH clause: Starts the common table expression and comes before the primary query. (You may also encounter people calling CTEs "With statements.")

  • CTE name: Defines the name of the CTE that can be referenced in the primary query—the name must be unique within the query.

  • AS: The As keyword starts the CTE definition.

  • Auxiliary query: This defines the CTE and "populates the temporary table."

  • Primary query: The primary query is our main SQL query—we can reference the CTE by name and use it like a regular table.

Common table expressions are not limited to just one auxiliary statement. We can separate auxiliary statements with a comma to provide any number of CTEs.

WITH [cte name] AS ( [query] -- auxiliary one statement ), [cte 2 name] AS ( [query] -- auxiliary two statement ), [cte 3 name] AS ( [query] -- auxiliary three statement ) [primary query] -- primary statement

Now that we know the structure of CTEs, let's dive into an example to solidify our mental model of CTEs and learn why they're helpful!

Using CTEs

For this example, we manage an employee management system. The database contains two tables: a departments table and a employees table. The database schema looks like this:

CREATE TABLE departments ( id SERIAL PRIMARY KEY, name VARCHAR(255) NOT NULL ); CREATE TABLE employees ( id SERIAL PRIMARY KEY, name VARCHAR(255) NOT NULL, department_id INTEGER REFERENCES departments(id), salary DECIMAL(10, 2), hire_date DATE );

Our manager has given us a task to determine how much money it costs to operate each department. Using CTEs, we can break the task up into two steps.

First, we need to sum up the salaries for every employee per department. The SQL to do this would look like this:

SELECT department_id, SUM(salary) AS total_salary FROM employees GROUP BY department_id

Second, we need to list and order the departments by the highest-costing departments. The query would be:

SELECT d.name, [department budgets].cost FROM [department budgets] ORDER BY [department budgets].cost DESC;

Since we have yet to learn each department's cost, we need to write placeholders for now. We can combine the two queries created into a single query using CTEs. The final report would look like this:

WITH department_budgets AS ( SELECT department_id, SUM(salary) AS cost FROM employees GROUP BY department_id ) SELECT d.name, db.cost FROM department_budgets db JOIN departments d ON d.id = db.department_id ORDER BY db.cost DESC;

With the queries combined, we can replace the placeholders with the name of the CTE.

Of course, the SQL statement could be written without using CTEs. That query would be messier. Also, the GROUP BY clause is a frequent source of frustration and confusion for many of us—especially within complex queries. CTEs allow us to organize our thoughts and tackle problems step-by-step. The end result is an easy-to-read query. Not only can we use CTEs to query data, but they are powerful tools for modifying data.

Modifying Data With CTEs

We can use INSERT, UPDATE, and DELETE statements within a common table expression's auxiliary or primary query. Paired with RETURNING clauses, which returns the data being modified, CTEs provide a great way to replicate or log changes to another table. For example, let's say we have a table to track employees who have left the company.

CREATE TABLE deactivated_employees ( id SERIAL PRIMARY KEY, name VARCHAR(255) NOT NULL, department_id INTEGER REFERENCES departments(id), deactivation_date DATE );

When an employee leaves the company, we want to delete that employee from the employees table and add them to the deactivated_employees table. This can be done by deleting the row within the CTE and RETURNING the information we need from that row. Then, have the primary query INSERT the deleted data into deactivated_employees by referencing the CTE. The CTE would look like this:

WITH deactivated AS ( DELETE FROM employees WHERE id = 1 RETURNING id, name, department_id ) INSERT INTO deactivated_employees SELECT id, name, department_id, now() FROM deactivated;

Combining data-modifying statements with read statements opens up many use cases. But this is just the beginning. Next, we'll look at recursive queries and how near-impossible queries become simple when using CTEs.

Recursive Queries

Recursive CTEs are incredibly powerful, especially when dealing with hierarchical or tree-structured data, such as organizational charts, file systems, or categories with subcategories. They allow us to query data related to a parent-child relationship, iterating through the hierarchy levels until a particular condition is met. Only through a recursive common table expression can a relational database like PostgreSQL be turned into a graph database.

WITH can be given an optional modifier, RECURSIVE, to turn CTEs into recursive loops. The SQL structure looks like this:

WITH RECURSIVE [cte name] AS ( [base case] UNION ALL [recursive term] ) [primary query]

A recursive CTE consists of a base case and a recursive term. The base case defines the starting point of the recursive query (e.g., the first node in a graph or the parent in a hierarchy). The recursive term describes the "loop" that will run until some endpoint is determined.

This recursive term will reference the CTE itself to create the loop. The base case and recursive term are combined using a UNION ALL (or sometimes just UNION if you want to eliminate duplicates, but be cautious as it can be more expensive in terms of performance) to create the full recursive CTE.

Using Recursive CTEs

Returning to our employee management system from before, the employees table has a new manager_id column that references another employee.

CREATE TABLE employees ( id SERIAL PRIMARY KEY, name VARCHAR(255) NOT NULL, department_id INTEGER REFERENCES departments(id), salary DECIMAL(10, 2), hire_date DATE, manager_id INTEGER REFERENCES employees(id) );

Our task is to generate the company org chart starting with a given employee. Like the last example, breaking the task into small steps will help determine the recursive CTE query. The first step is to develop a base case, which would be to get the starting employee.

SELECT id, name, manager_id FROM employees WHERE id = [employee id];

Next, we need to figure out the recursive term or how to loop through the employee hierarchy. To do this, we can think of the employees table performing a JOIN against itself on current_employee.manager_id = manager.id.

SELECT manager.id, manager.name, manager.manager_id FROM employees manager JOIN [current employee] employee ON employee.manager_id = manager.id

We cannot reference the current iteration in the (recursive) loop yet, so I'll leave a placeholder [current employee]. Recursive terms will continue to execute until they reach a path when the recursive term returns no rows.

This is important to consider to prevent recursive queries from entering an infinite loop! In the recursive term, when gathering the employee hierarchy, the recursive query will stop executing when the manager_id is null (e.g., the CEO at the top of the org chart).

With the base case and the recursive term written, let's combine them into a recursive CTE. Our recursive CTE will have a name we can reference in the recursive term so that we can replace the [current employee] placeholder with the name of the CTE. The complete recursive query to gather the employee hierarchy would look like this:

WITH RECURSIVE employee_hierarchy AS ( -- Base case: select the starting employee SELECT id, name, manager_id FROM employees WHERE id = 1 -- employee id UNION ALL -- Recursive term: The recursion will stop when `manager_id` = NULL SELECT manager.id, manager.name, manager.manager_id FROM employees manager JOIN employee_hierarchy employee ON employee.manager_id = manager.id ) SELECT * FROM employee_hierarchy;

Column List

A CTE will return all the columns specified in its SELECT statement (in employee_hierarchy, it is id, name, and manager_id). But, we also have the option to explicitly define what columns CTEs return and their names—referred to as the column list. The names of columns can be changed by manually specifying each column after the CTE name in the WITH expression:

WITH RECURSIVE employee_hierarchy(employee_id, name, manager_id) AS ( SELECT id, name, manager_id ... query )

We can additionally add columns to column lists to provide generated data. For example, if we want to sort employee_hierarchy by what level the employee is in the org chart, we can add a level column to the column list. The base case would start with level set to one. Then, each iteration of the recursive term would increment level by one. This would look like:

WITH RECURSIVE employee_hierarchy(id, name, manager_id, level) AS ( SELECT id, name, manager_id, 1 -- level starting at one FROM employees WHERE id = 1 UNION ALL SELECT manager.id, manager.name, manager.manager_id, level+1 -- increment level FROM employees manager JOIN employee_hierarchy employee ON employee.manager_id = manager.id ) SELECT * FROM employee_hierarchy ORDER BY level DESC; -- order by level

Optimizations

Up to this point, we've explored using a common table expression to break down queries into simple steps and recursive CTEs to query data from graph or tree-like data structures. The last use case we'll examine is how CTEs can help us optimize our SQL queries. In Improving DISTINCT Query Performance Up to 8,000x on PostgreSQL, we discussed how recursive CTEs can improve DISTINCT queries. This is one example of CTEs creatively optimizing queries and can be handy if you're running PostgreSQL yourself. However, TimescaleDB supports skip index scans without relying on CTEs.

Until PostgreSQL 12, CTEs materialize by default. In other words, a common table expression is only computed once and then cached (similar to materialized views). Any additional reference to the CTE would refer to the cached result. CTEs being materialized provide excellent options for optimizing expensive queries that are called multiple times in a primary query.

Starting in PostgreSQL 12, a query that doesn't have side effects isn't recursive and only referenced once and can be inlined by the query planner. This means the CTE gets turned into a subquery.

For example, a query that looks like this:

WITH my_cte AS ( SELECT * FROM my_table ) SELECT * FROM other_table ot JOIN my_cte m ON ot.id = m.other_id; It could be inlined to behave like this:

SELECT * FROM other_table ot JOIN (SELECT * FROM my_table) m ON ot.id = m.other_id;

However, we can tell the query planner how we want it to behave. Materialization can be enabled or disabled by specifying MATERIALIZED or NOT MATERIALIZED after AS in the WITH expression.

WITH my_cte AS MATERIALIZED ( SELECT * FROM my_table ) SELECT * FROM other_table ot JOIN my_cte m ON ot.id = m.other_id;

CTEs can be a tool to optimize queries. But with every optimization, always ANALYZE queries before explicitly setting NOT MATERIALIZED or MATERIALIZED to ensure you're not accidentally making queries perform slower.

Hyperfunctions

Much like how a common table expression streamlines complex SQL queries into more manageable units, Timescale's hyperfunctions enable us to condense verbose SQL statements into more straightforward, concise queries.

For example, if we're looking to calculate the time-weighted average price of a stock, taking into account the duration each price was adequate, the query might look somewhat complex:

SELECT date_trunc('day', time) AS day, SUM(price * duration) / SUM(duration) AS time_weighted_average_price FROM ( SELECT time, price, lead(time) OVER (ORDER BY time) - time AS duration FROM stock_prices ) sub GROUP BY day ORDER BY day;

This query calculates the duration each price was in effect by using the lead() window function to get the next timestamp and subtracting the current timestamp from it. Then, it calculates the time-weighted average price by multiplying each price by its duration, summing those products, and dividing by the total duration.

Timescale offers the time_weight() hyperfunction, combined with the average() accessor, it makes calculating time-weighted averages straightforward:

SELECT time_bucket('1 day', time) AS one_day_bucket, average(time_weight('Linear', time, price)) AS time_weighted_average_price FROM stock_prices GROUP BY one_day_bucket ORDER BY one_day_bucket;

In this query, time_bucket() groups the data into daily intervals. The time_weight('Linear', time, price) function computes a linear time-weighted average, where time is the timestamp and price is the value to be averaged over time. The average() accessor extracts the average value from the time-weighted calculations.

This method abstracts away the complex window functions and arithmetic operations, significantly simplifying the query and improving readability, much like how CTEs simplify complex query logic into more manageable parts.

Conclusion

Common table expressions are a powerful tool when writing queries. This article shows that CTEs allow us to break down complex queries into manageable chunks, traverse graph-like structures through recursive queries, and optimize queries through optimization fences. Timescale's hyperfunctions provide some ancillary to CTEs by simplifying complex queries and speeding time series analysis.

You can create a free Timescale account and utilize common table expressions and hyperfunctions.