Understanding SQL Aggregate Functions

Written by Dylan Paulus

You may have heard that "data is the new oil." By itself, data is unrefined and not valuable, but given processing and refinement, it becomes precious. We gain insights into our products, applications, and customers by exploring our data. PostgreSQL exposes aggregate functions that give us the tools to transform and process our data to provide meaning.

In this article, we'll take a look at how to use SQL aggregate functions, the pitfalls, and how Timescale gives us advanced tooling to aggregate time-series data.

Aggregate Functions

PostgreSQL aggregate functions allow us to pull meaning from all the data we store in our database. Aggregate functions take in a set of rows and produce a single, meaningful output.

The best way to visualize aggregate functions is to work through an example. Let's look at the avg() or average function. The average function tells us our dataset's arithmetic mean.

Let's say we have a table of products in a hypothetical store:

-- create
CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL,
  price DECIMAL NOT NULL
);

-- insert
INSERT INTO products (name, price) VALUES ('pen', 2.50);
INSERT INTO products (name, price) VALUES ('paper', 1.25);
INSERT INTO products (name, price) VALUES ('hammer', 6.76);
INSERT INTO products (name, price) VALUES ('blanket', 12.45);
INSERT INTO products (name, price) VALUES ('chair', 59.99);

We can write a query using avg() to find out the average price of all our products by running:

SELECT avg(price) FROM products;

Of course, PostgreSQL provides many other aggregate functions. A few of the most commonly used are:

- SUM() : adds up all the input values

- MAX() : finds the largest of the input values

- MIN() : finds the smallest of the input values

- COUNT() : counts the number of input rows (not to be confused with SUM()!)
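As a quick illustration, the four functions above can be run together against the products table created earlier. This is only a minimal sketch; the column aliases are made up for readability and are not part of the original example.

```sql
-- Assumes the five-row products table from the example above.
SELECT
  sum(price) AS total_price,    -- 82.95
  max(price) AS highest_price,  -- 59.99
  min(price) AS lowest_price,   -- 1.25
  count(*)   AS product_count   -- 5
FROM products;
```

Because there is no GROUP BY, the whole table is treated as a single group and the query returns exactly one row.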

Grouping aggregates

One of the biggest sources of frustration around aggregates is intermixing aggregate functions with column data. Building on our previous product table, let's include a category column.

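-- the products table already exists from the first example, so drop it first
DROP TABLE IF EXISTS products;
CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL,
  price DECIMAL NOT NULL,
  category TEXT
);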

INSERT INTO products (name, price, category) VALUES ('pen', 2.50, 'office');
INSERT INTO products (name, price, category) VALUES ('paper', 1.25, 'office');
INSERT INTO products (name, price, category) VALUES ('hammer', 6.76, 'tools');
INSERT INTO products (name, price, category) VALUES ('blanket', 12.45, 'home');
INSERT INTO products (name, price, category) VALUES ('chair', 59.99, 'home');

Suppose that, when finding the average price of all the products, we want to include the category column in the result, like so:

SELECT avg(price), category FROM products;

Run the SQL command and boom! PostgreSQL returns an error:

ERROR:  column "products.category" must appear in the GROUP BY clause or be used in an aggregate function

This is because the price column gets reduced or "smushed down" into a single value, so category loses its meaning when we find the average price of all products. In this error, PostgreSQL is letting us know that we need to 1) use category inside an aggregate function or 2) group the average price by category. Since finding the average of a string value is impossible, our best bet is option 2. We can use SQL's GROUP BY to group the results by category, which gives us the average price per category.

SELECT avg(price), category
FROM products
GROUP BY category;

Taking advantage of PostgreSQL's GROUP BY, we can start to see the power of aggregate functions; in this example, we have insight into the average price of products in a given category.
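To make the grouping concrete, here is a sketch of a query that returns more than one aggregate per category. The aliases and the use of round() are illustrative choices, not something the original example requires.

```sql
-- Assumes the products table with the category column from above.
SELECT
  category,
  count(*)             AS product_count,
  round(avg(price), 2) AS avg_price
FROM products
GROUP BY category
ORDER BY category;

-- With the sample rows, this should return:
--   home   | 2 | 36.22
--   office | 2 |  1.88
--   tools  | 1 |  6.76
```

Every column in the SELECT list is either the grouping column or wrapped in an aggregate, which is the rule the earlier error was enforcing.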

HAVING vs. WHERE

You have probably run into the WHERE clause when filtering queries, but there is another way to filter results using the HAVING clause, which is generally less used. Though they appear to behave similarly, WHERE and HAVING clauses have unique and distinct effects on aggregate functions. Let's take a look at both.

Both HAVING and WHERE will filter the result set by some condition. If we don't want to include the average price of home items, we could write the query using either SQL clause:

-- where
SELECT avg(price), category
FROM products
WHERE category != 'home'
GROUP BY category;

-- having
SELECT avg(price), category
FROM products
GROUP BY category
HAVING category != 'home';

Though it's a slightly different syntax, the result is the same.

Now suppose that, instead of filtering by category, we only want the categories whose average price is over $2. Easy enough; let's modify both queries.

-- where
SELECT avg(price), category
FROM products
WHERE avg(price) > 2.0
GROUP BY category;

-- having
SELECT avg(price), category
FROM products
GROUP BY category
HAVING avg(price) > 2.0;

Run these two queries separately, and you'll find a problem. The query using WHERE fails with the error "aggregate functions are not allowed in WHERE", but the query using HAVING succeeds. What gives? The main distinction between the two is that WHERE filters rows before aggregation takes place, while HAVING filters groups after aggregation takes place. Since our example filters the result set on an aggregate, avg(price) > 2.0, we can only filter after aggregation occurs, which means using HAVING.
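The two clauses can also be used together, which makes the ordering easier to see. The following is a small sketch against the same products table; the price < 50 condition and the aliases are invented for illustration.

```sql
-- WHERE runs first and removes rows; HAVING runs last and removes groups.
SELECT
  category,
  round(avg(price), 2) AS avg_price
FROM products
WHERE price < 50           -- row filter: the chair (59.99) never reaches avg()
GROUP BY category
HAVING avg(price) > 2.0    -- group filter: office (average 1.875) is dropped
ORDER BY category;

-- With the sample rows, this should return:
--   home  | 12.45
--   tools |  6.76
```

Note that the home average is now 12.45 rather than 36.22, because the WHERE clause removed the chair before the average was computed.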

Filter

The FILTER clause gives us another way to limit the data aggregate functions operate on. Unlike WHERE or HAVING, which filter the result for the entire query, FILTER only applies to the aggregate function it is attached to. This means each aggregate function in a query can have its own filtering condition. First, let's look at an example of querying for products with a single FILTER clause:

SELECT
  avg(price) FILTER (WHERE category = 'home') AS avg_home_prices
FROM products;

Using FILTER, we can include multiple aggregate functions in a query with different filtering conditions.

SELECT
  avg(price) FILTER (WHERE category = 'home') AS avg_home_prices,
  sum(price) FILTER (WHERE category = 'office') AS sum_office_prices,
  count(*) FILTER (WHERE category = 'tools') AS total_tools
FROM products;
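One way to build intuition for FILTER is to compare it with the older CASE-based pattern, which relies on aggregate functions ignoring NULL inputs. The sketch below should return the same values as the FILTER query above for the sample data; it is offered as a comparison, not as the recommended form.

```sql
-- Rows that fail the condition produce NULL, which avg(), sum(), and
-- count(expression) skip.
SELECT
  avg(CASE WHEN category = 'home' THEN price END)   AS avg_home_prices,   -- about 36.22
  sum(CASE WHEN category = 'office' THEN price END) AS sum_office_prices, -- 3.75
  count(CASE WHEN category = 'tools' THEN 1 END)    AS total_tools        -- 1
FROM products;
```

FILTER states the same intent more directly and keeps the condition separate from the expression being aggregated.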

How Aggregates Work

On the surface, aggregate functions look similar to standard functions, but there is a critical difference between the two. Standard functions operate on one row at a time, whereas aggregate functions operate across a set of rows. For example, a standard function like CEIL() rounds each row's value up to the nearest integer. An aggregate function like SUM() takes a column's values from many rows and produces a single result.

Aggregation has three main components: an internal state value, a state transition function, and a final function. PostgreSQL loops through the input rows one at a time. The state transition function is called on each new row and updates the internal state value. Once all the rows have been looped through, the final function is called with the internal state value to produce the final result.

Let's take, for example, the AVG() aggregate function with our products table.

- The initial state is (0, 0) for total price = 0 and count = 0

- The state transition function is called for each row in the table

- For AVG(), the row's price is added to the total price, and count gets one added to it

- (total price + row price, count + 1)

- Finally, the final function calculates the average from the internal state

- total price / count

The same general process is followed for all aggregate functions.

The separation of state transition function and final function optimizes aggregate functions by keeping state transition functions small and deferring the heavy processing until all the rows have been looped through.
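The steps above can be sketched with PostgreSQL's CREATE AGGREGATE, which lets us supply our own state transition function and final function. The names my_avg, my_avg_sfunc, and my_avg_final are hypothetical, and the two-element array is just one simple way to hold the (total, count) state; this is not how the built-in avg() is implemented internally.

```sql
-- State transition function: called once per row with the current state
-- and the row's value; returns the new state (total + value, count + 1).
-- STRICT means rows with a NULL value are skipped, as with avg().
CREATE FUNCTION my_avg_sfunc(state numeric[], val numeric)
RETURNS numeric[] AS $$
  SELECT ARRAY[state[1] + val, state[2] + 1];
$$ LANGUAGE sql IMMUTABLE STRICT;

-- Final function: called once after the last row; returns total / count,
-- or NULL when no rows were aggregated.
CREATE FUNCTION my_avg_final(state numeric[])
RETURNS numeric AS $$
  SELECT CASE WHEN state[2] = 0 THEN NULL ELSE state[1] / state[2] END;
$$ LANGUAGE sql IMMUTABLE;

-- The aggregate ties the pieces together, starting from the state (0, 0).
CREATE AGGREGATE my_avg(numeric) (
  SFUNC     = my_avg_sfunc,
  STYPE     = numeric[],
  FINALFUNC = my_avg_final,
  INITCOND  = '{0,0}'
);

-- Should match avg(price) on the sample products table (16.59).
SELECT my_avg(price) FROM products;
```

The transition function only does an addition and an increment per row, while the single division is left to the final function.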

Aggregation With TimescaleDB

TimescaleDB expands on aggregation functions over hypertables using hyperfunction aggregates. Hyperfunction aggregates allow us to analyze time-series data. Some hyperfunction aggregates are provided out of the box, but others require the timescaledb_toolkit extension to be installed.

Similarly to PostgreSQL aggregate functions, hyperfunction aggregates split the work into separate steps. An aggregate function (like stats_agg) builds an intermediate state, an accessor (like average) plays the role of the final function and turns that state into a result, and a rollup function combines several intermediate states into one. Keeping these operations separate provides a more functional programming approach to data aggregation, and by combining different aggregates, accessors, and rollup functions, we can create powerful insights into our data. For example, to create a hyperfunction aggregation, we first create the aggregation (with an aggregation function like stats_agg), and then we pass the aggregation result to an accessor (like average).

To get a practical look at how this works, let's look at an example using stats_agg, average, and time_bucket to find an average.

First, create a conditions table with data:

CREATE TABLE conditions (
  time        TIMESTAMPTZ       NOT NULL,
  location    TEXT              NOT NULL,
  device      TEXT              NOT NULL,
  temperature DOUBLE PRECISION  NULL,
  humidity    DOUBLE PRECISION  NULL
);

SELECT create_hypertable('conditions', by_range('time'));

INSERT INTO conditions (time, location, device, temperature) VALUES (NOW(), 'home', 'omega', 72.3);
INSERT INTO conditions (time, location, device, temperature) VALUES (NOW() + interval '1 day', 'home', 'omega', 55);
INSERT INTO conditions (time, location, device, temperature) VALUES (NOW() + interval '2 day', 'home', 'omega', 65);
INSERT INTO conditions (time, location, device, temperature) VALUES (NOW() + interval '2 day', 'home', 'alpha', 82);
INSERT INTO conditions (time, location, device, temperature) VALUES (NOW(), 'home', 'alpha', 83);
INSERT INTO conditions (time, location, device, temperature) VALUES (NOW(), 'home', 'alpha', 83);
INSERT INTO conditions (time, location, device, temperature) VALUES (NOW() + interval '25 minutes', 'home', 'alpha', 90);

We want to find the average temperature by day. First, we need to group the time series data into buckets of one-day intervals. Then, by using stats_agg() to create an aggregate, we can pass that into average() to calculate the average temperature per day.

SELECT
  time_bucket('1 day'::interval, time),
  average(stats_agg(temperature))
FROM conditions
GROUP BY 1;

By combining different aggregates, accessors, and rollup functions provided by Timescale, we can gain even more power over our time-series data.
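As a sketch of how a rollup fits in, the query below first builds one stats_agg state per day and then combines those daily states with rollup() before applying the average() accessor. It assumes the timescaledb_toolkit extension is installed and reuses the conditions table from above; the CTE name daily and the column aliases are made up for this example.

```sql
-- Step 1: one intermediate stats_agg state per one-day bucket.
WITH daily AS (
  SELECT
    time_bucket('1 day'::interval, time) AS bucket,
    stats_agg(temperature)               AS stats
  FROM conditions
  GROUP BY bucket
)
-- Step 2: rollup() merges the daily states into a single state, and
-- average() reads the overall mean from it.
SELECT average(rollup(stats)) AS overall_avg_temperature
FROM daily;
```

Because the daily states keep enough information to be merged, the overall average can be derived from them without rescanning the raw rows.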

Conclusion

PostgreSQL's aggregate functions are powerful tools for extracting meaningful insights from datasets, aiding in data-driven decision-making. But why stop there? Timescale takes these capabilities to the next level with hyperfunctions that easily give insights into your time-series data. 

If you want to try aggregate functions and experiment with the extremely powerful hyperfunction aggregates, create a free Timescale account to get started today!