
Best Practices for PostgreSQL Aggregation


PostgreSQL aggregation is essential for deriving meaningful insight from data—it transforms rows and rows of raw data into useful information that’s crucial for decision-making. It takes in multiple rows of data and outputs a single row, allowing users to analyze and summarize data efficiently.

PostgreSQL supports numerous built-in aggregate functions. Common ones include SUM(), MAX(), MIN(), and AVG() for basic statistical operations, while others, like STDDEV() and VARIANCE(), are useful for more advanced statistical analysis.

In this article, we’ll discuss some best practices to help you leverage PostgreSQL aggregate functions and get the most out of them for enhanced data analysis. 

How Do PostgreSQL Aggregates Work?

While aggregates are a subset of functions, they’re fundamentally different from standard functions in the way they work. Aggregates take in a group of related rows to output a single result, while standard functions provide one result per row. To put it simply, functions work on rows, while aggregates work on columns. 

Let’s consider an example to understand this better. Suppose we have a sales table with three columns: product_id, quantity_sold, and price_per_unit, and we want to calculate the total revenue for each product and the total quantity sold for each. 

Let’s first create a table:

CREATE TABLE sales (
    product_id INTEGER,
    quantity_sold INTEGER,
    price_per_unit NUMERIC
);

Now, let’s add a few values to see the difference between functions and aggregates:

INSERT INTO sales VALUES (1, 10, 15.25);
INSERT INTO sales VALUES (2, 12, 22.50);
INSERT INTO sales VALUES (2, 10, 20.00);
INSERT INTO sales VALUES (1, 8, 15.00);

We can now create a function to calculate the revenue by taking in the price per unit and the quantity sold:

CREATE FUNCTION calculate_revenue(quantity_sold INT, price_per_unit NUMERIC)
RETURNS NUMERIC AS $$
BEGIN
    RETURN quantity_sold * price_per_unit;
END;
$$ LANGUAGE plpgsql;

Now, let’s use this function to calculate the revenue for each row:

SELECT product_id,
       quantity_sold,
       price_per_unit,
       calculate_revenue(quantity_sold, price_per_unit) AS total_revenue
FROM sales;

Here’s the result:

product_id | quantity_sold | price_per_unit | total_revenue
-----------+---------------+----------------+--------------
         1 |            10 |          15.25 |        152.50
         2 |            12 |          22.50 |        270.00
         2 |            10 |          20.00 |        200.00
         1 |             8 |          15.00 |        120.00

In this example, our calculate_revenue function returns the total revenue for each row. 

Now, to find the total quantity sold for each product, we can just use the built-in SUM() aggregate function:

SELECT product_id,
       SUM(quantity_sold) AS total_quantity_sold
FROM sales
GROUP BY product_id;

As you can see, it works on the quantity_sold column and calculates the total quantity sold for each product:

product_id | total_quantity_sold
-----------+--------------------
         1 |                  18
         2 |                  22

In other words, aggregates combine inputs from multiple rows into a single result (grouped by the product_id in this case). Under the hood, these PostgreSQL aggregates work row by row, which raises an important question: how do aggregates know the values stored in the previous rows? This is where state transition functions come in. 

State transition functions

The aggregate function stores the state of the rows it has already seen, and as each new row is processed, the internal state is updated. In our example, the internal state is just the running total of the quantities sold so far.

The function that processes all the incoming rows and updates the internal state is called a state transition function. It takes in two arguments, the current state and the value of the incoming row, and outputs a new state. As the aggregate function scans over the rows, the state transition function updates the internal state, allowing PostgreSQL to compute the aggregate in a single pass over the data.
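To make this concrete, here’s a minimal sketch of a hand-rolled sum aggregate built with PostgreSQL’s CREATE AGGREGATE command; the names my_sum and my_sum_step are hypothetical, not built-ins:

-- State transition function: takes the current state and the incoming
-- value, and returns the new state.
CREATE FUNCTION my_sum_step(state BIGINT, value INTEGER)
RETURNS BIGINT AS $$
BEGIN
    RETURN state + value;
END;
$$ LANGUAGE plpgsql STRICT;

-- Register the aggregate: SFUNC is the state transition function,
-- STYPE is the state's type, INITCOND is the initial state.
CREATE AGGREGATE my_sum(INTEGER) (
    SFUNC = my_sum_step,
    STYPE = BIGINT,
    INITCOND = '0'
);

With that in place, SELECT my_sum(quantity_sold) FROM sales; behaves like the built-in SUM(). Because the transition function is declared STRICT, rows with NULL values are simply skipped, which also matches SUM()’s behavior.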

However, that’s not all there is to it. Aggregate functions like SUM(), MAX(), and MIN() have a pretty straightforward state with just one value, but that’s not always the case. Some aggregates can have a composite state.

For instance, in the case of the AVG() aggregate, you need to store both the count and the sum as the internal state. But then, there’s another step we need to take to get the result, which is to divide the total sum by the total count. This calculation is performed by another function called the final function; it takes the state and does the calculations necessary to get the final result. 

So, the state transition function is called once for every row, while the final function is called only once, after the state transition function has processed the whole group. And although a single call to the state transition function costs no more than a call to the final function, the transition step still ends up being the most expensive part of an aggregate once you factor in the number of rows it has to process.
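Here’s a sketch of what a composite state and a final function look like for a hand-rolled average; avg_state, my_avg_step, my_avg_final, and my_avg are all hypothetical names:

-- Composite state: the average needs both a count and a running sum.
CREATE TYPE avg_state AS (count BIGINT, total NUMERIC);

-- State transition function: updates both fields for each incoming row.
CREATE FUNCTION my_avg_step(state avg_state, value NUMERIC)
RETURNS avg_state AS $$
BEGIN
    state.count := state.count + 1;
    state.total := state.total + value;
    RETURN state;
END;
$$ LANGUAGE plpgsql STRICT;

-- Final function: called once per group to turn the state into a result.
CREATE FUNCTION my_avg_final(state avg_state)
RETURNS NUMERIC AS $$
BEGIN
    IF state.count = 0 THEN
        RETURN NULL;
    END IF;
    RETURN state.total / state.count;
END;
$$ LANGUAGE plpgsql;

CREATE AGGREGATE my_avg(NUMERIC) (
    SFUNC = my_avg_step,
    STYPE = avg_state,
    FINALFUNC = my_avg_final,
    INITCOND = '(0,0)'
);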

When you have a large volume of time-series data continuously being ingested, pushing every new row through the state transition function can become a bottleneck. The good news is that PostgreSQL already has mechanisms for optimizing aggregates.

Parallelization and combine functions

Since the state transition function runs on each row, we can parallelize it to improve performance. We do this by initializing multiple instances of the state transition function and giving each one a subset of the rows as input.

Once these parallel aggregates run, we’ll end up with multiple partial states (one per parallel aggregate). However, since we need to aggregate the entire set of rows, we need an intermediate function that combines all the partial aggregates before running the final function. This is where we need another function, called the combine function, which we can run iteratively over all the partial states to get the combined state. Then, finally, we can run the final function to get the final result.

If this is unclear, consider the AVG() function again. By parallelizing the state transition functions, we can calculate the total sum and count of a subset of rows. Then, we can use the combine function to add up all the sums and counts of all subsets before running the final function.
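Continuing the hypothetical my_avg sketch from above, a combine function merges two partial states; declaring the aggregate (and, in practice, its support functions) PARALLEL SAFE lets the planner hand subsets of rows to parallel workers:

-- Combine function: merges two partial states produced by parallel workers.
CREATE FUNCTION my_avg_combine(s1 avg_state, s2 avg_state)
RETURNS avg_state AS $$
BEGIN
    s1.count := s1.count + s2.count;
    s1.total := s1.total + s2.total;
    RETURN s1;
END;
$$ LANGUAGE plpgsql STRICT PARALLEL SAFE;

CREATE AGGREGATE my_parallel_avg(NUMERIC) (
    SFUNC = my_avg_step,
    STYPE = avg_state,
    FINALFUNC = my_avg_final,
    COMBINEFUNC = my_avg_combine,
    INITCOND = '(0,0)',
    PARALLEL = SAFE
);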

Best Practices for Aggregate Design

Optimizing the design of these aggregates is essential if you want to get the most value from your data analytics. Here are some practices that can allow for effective aggregate design:

Two-step aggregation

One way to optimize data aggregation is to use a two-step aggregation process that basically emulates the way PostgreSQL implements the state transition and final functions for aggregates. The approach involves internal calls that return the internal state, exactly like the transition function we discussed above, and accessor calls that take in the internal state and return the result, exactly like the final function mentioned earlier. 

This is particularly useful for time-series data. By separating the aggregate call, which builds the internal state, from the accessor call, which turns that state into a result, we can reuse the same partial state across several calculations and structure our queries accordingly.
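As one example of this pattern, the TimescaleDB Toolkit (assuming the toolkit extension is installed) splits the two steps into stats_agg(), which builds the internal state, and accessors like average(), which read results out of it:

-- stats_agg() builds the internal state; average() is an accessor
-- that extracts the final result from that state.
SELECT product_id,
       average(stats_agg(price_per_unit)) AS avg_price
FROM sales
GROUP BY product_id;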

Caching results

Materialized views in PostgreSQL are a powerful way of optimizing performance by caching query results. They precompute frequently run queries and store the results in the database. So, every time the query runs, the database doesn’t need to execute it from scratch; the results are already available, which means you get a response quickly. This reduces repetitive computation, allowing for more efficient analytics.

However, you’ll need to refresh materialized views every time the data is updated, which can be quite resource-intensive.    
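Here’s a minimal sketch using the sales table from earlier; product_totals is a hypothetical name:

CREATE MATERIALIZED VIEW product_totals AS
SELECT product_id,
       SUM(quantity_sold) AS total_quantity_sold
FROM sales
GROUP BY product_id;

-- Must be re-run (manually or on a schedule) after the data changes;
-- it recomputes the entire view.
REFRESH MATERIALIZED VIEW product_totals;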

Pre-aggregation

Pre-aggregations refer to materialized query results that persist as tables. You can create a separate roll-up table to store the aggregated data and use triggers to manage updates to aggregates, allowing access to the table instead of repeatedly calling the aggregate function. This can save many re-computations and improve overall performance, especially when the same aggregate is computed multiple times.  
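A minimal sketch of this approach, with hypothetical names (sales_rollup, sales_rollup_fn), keeps per-product totals in sync on every insert:

-- Roll-up table holding one pre-aggregated row per product.
CREATE TABLE sales_rollup (
    product_id INTEGER PRIMARY KEY,
    total_quantity_sold BIGINT NOT NULL
);

-- Trigger function: folds each new sale into the running total.
CREATE FUNCTION sales_rollup_fn() RETURNS trigger AS $$
BEGIN
    INSERT INTO sales_rollup (product_id, total_quantity_sold)
    VALUES (NEW.product_id, NEW.quantity_sold)
    ON CONFLICT (product_id) DO UPDATE
        SET total_quantity_sold =
            sales_rollup.total_quantity_sold + EXCLUDED.total_quantity_sold;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sales_rollup_trg
AFTER INSERT ON sales
FOR EACH ROW EXECUTE FUNCTION sales_rollup_fn();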

Continuous Aggregates With Timescale

While materialized views are a powerful way of speeding up commonly run queries and avoiding expensive re-computations, there’s still a big problem—you need to manually refresh them to keep the view up to date, especially since it quickly becomes stale as new data comes in. 

And it’s not a one-time thing; you’ll have to keep refreshing the views as new data arrives if you want to keep avoiding recomputation, and each refresh re-executes the entire underlying query. To add to that, materialized views don’t have a built-in refresh mechanism that runs automatically. This means a materialized view won’t include data that was updated or added after the last refresh, which can be particularly problematic if you have real-time data.

We built continuous aggregates to overcome these limitations of materialized views and make real-time analytics possible. You can think of them as materialized views for real-time aggregates that are refreshed automatically via a refresh policy. Every time a refresh runs, it only uses the data changed since the last refresh (and not the entire dataset), making the process more efficient. Once you get up-to-date results, you can use these materialized views for use cases like live dashboards and real-time analytics.
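Here’s a sketch of a continuous aggregate, assuming a hypertable called sales_ts with a time column alongside the columns from our earlier example:

-- Continuous aggregate: daily totals per product over the hypothetical
-- hypertable sales_ts(time TIMESTAMPTZ, product_id INT, quantity_sold INT).
CREATE MATERIALIZED VIEW daily_sales
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', time) AS bucket,
       product_id,
       SUM(quantity_sold) AS total_quantity_sold
FROM sales_ts
GROUP BY bucket, product_id;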

time_bucket() with continuous aggregates

Continuous aggregates involve the use of the time_bucket() function, which allows you to group data over different time intervals. It’s quite similar to PostgreSQL’s date_bin() function but is more flexible in terms of the start time and bucket size.  
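For instance, to group the hypothetical sales_ts rows from above into 15-minute buckets:

-- time_bucket() truncates each timestamp to the start of its bucket.
SELECT time_bucket('15 minutes', time) AS bucket,
       AVG(quantity_sold) AS avg_quantity
FROM sales_ts
GROUP BY bucket
ORDER BY bucket;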

Creating a refresh policy is pretty straightforward; you just need to define the refresh interval so that your continuous aggregates are periodically and automatically updated. This whole process is much more efficient than a materialized view, minimizes computation, and enables real-time analysis. 
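Here’s a sketch of such a policy for the daily_sales view above, with illustrative interval values:

-- Refresh daily_sales every hour, covering data between
-- 3 days and 1 hour old.
SELECT add_continuous_aggregate_policy('daily_sales',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');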

Start Aggregating Your Data

Understanding the best practices to get the most out of PostgreSQL aggregation is crucial for improving data analytics and deriving more meaningful information from your data. We’ve talked about how PostgreSQL aggregates work and the best practices for their design.

And while built-in aggregates are great for minimizing computations, they come with a big limitation—they’re not always up-to-date, which can be a big problem, particularly if you have time-series data. If you want to improve your time-series aggregate performance, try Timescale today. You can experiment with continuous aggregates and see how they can improve your analytical capabilities.