Start supercharging your PostgreSQL today.
Written by Sarah Conway
Regardless of whether you’re using PostgreSQL or TimescaleDB, you can expect exactly the same behavior from using the GROUP BY
clause: it’ll allow you to return rows from SELECT
statements in groups of one or more columns. It groups all rows that have the same value in all the columns listed into a single group row. This will remove any redundancy and is particularly useful when used with aggregate functions such as SUM()
, AVG()
, COUNT()
, MIN()
, or MAX()
.
The behavior of GROUP BY
is very similar to window functions, yet there is a significant difference: when using GROUP BY
, rows are collapsed into groups, so each field can no longer be individually accessed. Window functions allow you to reference information for each specific record in addition to viewing the result of the window function.
In short, you'll want to use GROUP BY
when you're looking to eliminate duplicate rows, aggregate the result set, and squash rows prior to performing calculations.
The (simplified) syntax for this clause as part of the overall SELECT
statement is as follows:
SELECT <expressions>
FROM <tables>
GROUP BY <condition>;
The full syntax of the GROUP BY
clause itself is:
GROUP BY [ ALL | DISTINCT ] grouping_element [, ...]
The order in which columns are specified does not matter.
And, of course, the use of this clause can be part of a complex SELECT
statement as well. The following full syntax can be used as reference:
SELECT [ ALL | DISTINCT | DISTINCT ON (<distinct_expressions>) ]
<expressions>
FROM <tables>
[WHERE <conditions>]
[GROUP BY <expressions>]
[HAVING <condition>]
[ORDER BY <expression> [ ASC | DESC | USING <operator> ] [ NULLS FIRST FIRST | NULLS LASTLAST ]]
[LIMIT [ <number_rows> | ALL]
[OFFSET OFFSET <offset_value> [ ROW | ROWS ]]
[FETCH FETCH { FIRST FIRST | NEXT } [ <fetch_rows> ] { ROW | ROWS } ONLY]
[FOR { UPDATE | SHARE } OF <table> [ NOWAIT ]];
You’re able to use GROUP BY
with any combination of possible clauses within a SELECT
statement. It’s evaluated after the FROM
and WHERE
clauses, prior to HAVING
, DISTINCT
, ORDER BY
, and LIMIT
.
The following examples make use of a basic table called customers
that contains the following fictional data:
customer_id | firstname | lastname | state |
1 | Cheyenne | Holmes | Arizona |
2 | Alejandro | Ruiz | California |
3 | Martha | Parker | California |
4 | Robert | James | Florida |
5 | Danny | Angelo | North Carolina |
The simplest form of the GROUP BY
clause is to use it without an aggregate function on a single table, as follows:
SELECT customer_id FROM orders GROUP BY customer_id;
When used in this manner, it works similarly to SELECT
DISTINCT
in that it will remove duplicate rows from a returned result set. In our case, we’re choosing to return all orders and are grouping by the customer_id
; this means that we are essentially choosing to only return the customer_id
of customers that have placed an order.
Result:
customer_id |
3 |
2 |
1 |
Note: When using GROUP BY without pairing it with an aggregate function in the above manner, for readability, it is encouraged to instead use SELECT DISTINCT as it serves functionally the same purpose with more concise wording. For example, the above query would be greatly simplified and re-written as SELECT DISTINCT customer_id FROM orders;
Aggregate functions can be paired with GROUP BY
for quick and condensed result sets for analytical or informational purposes. Retrieving these summarized and calculated results is the primary use case for GROUP BY
.
A simple example that groups the results by a single column is to count the number of customers for each state:
SELECT COUNT(customer_id), state FROM customers GROUP BY state;
Results:
count | state |
2 | California |
1 | Arizona |
1 | North Carolina |
1 | Florida |
To list the number of orders made by each customer, you can join both tables and group the results by the customer name. For this example, a table was created named orders
that contains the following fictional data:
order_id | customer_id | order_total |
1 | 1 | 85 |
2 | 2 | 21 |
3 | 1 | 101 |
4 | 3 | 44 |
5 | 2 | 3 |
We’ll select the columns for each customer’s first and last name, along with the count of orders for each customer, and sort these results across multiple columns using GROUP BY
.
SELECT customers.firstname, customers.lastname, COUNT(orders.order_id)
FROM orders
LEFT JOIN customers ON orders.customer_id = customers.customer_id
GROUP BY firstname, lastname;
Results:
firstname | lastname | count |
Martha | Parker | 1 |
Cheyenne | Holmes | 2 |
Alejandro | Ruiz | 2 |
Continuing to build on the previous example, we can also sort the results in ascending or descending order by adding on ORDER BY
. Ascending (ASC
) order is the default. Else, you can specify DESC
to get customers who have placed the most orders first in the result set, as shown here:
SELECT customers.firstname, customers.lastname, COUNT(orders.order_id)
FROM orders
LEFT JOIN customers ON orders.customer_id = customers.customer_id
GROUP BY firstname, lastname
ORDER BY count DESC;
Note: If attempting to run a query similar to this for analytical purposes, using the LIMIT clause would also likely be helpful for more performant and efficient results.
Result:
firstname | lastname | count |
Cheyenne | Holmes | 2 |
Alejandro | Ruiz | 2 |
Martha | Parker | 1 |
To filter over grouped rows, you can specify the HAVING
clause to remove groups from the result set that do not satisfy a specified condition. This clause occurs after all grouping and aggregation of data, meaning after all rows are returned, records that do not meet the specified condition are removed from the resultset.
As an example referencing the two tables we’ve already created and queries we’ve already tried, we can combine prior queries to return the count of customers in each state that have placed more than one order:
SELECT customers.firstname, customers.lastname, customers.state, COUNT(orders.order_id) AS total_number_of_orders
FROM customers
LEFT JOIN orders
ON customers.customer_id = orders.customer_id
GROUP BY firstname, lastname, state
HAVING COUNT(orders.order_id) > 1;
Result:
firstname | lastname | state | total_number_of_orders |
Alejandro | Ruiz | California | 2 |
Cheyenne | Holmes | Arizona | 2 |
What about the WHERE
clause? That’ll need to be used before GROUP BY
in a SELECT
statement, as records are filtered with WHERE
before the execution of HAVING
. As a result, using WHERE
leads to a faster result over HAVING
. Both statements can be used within the same query; however, if the WHERE
clause can provide the desired result on its own, the use of this clause should be prioritized to ensure an efficiently designed query.
If we want to look for the count of orders for a customer whose name closely matches “Cheyenn,” we’d use the following SQL statement:
SELECT customers.firstname, customers.lastname, customers.state, COUNT(orders.order_id) AS total_number_of_orders
FROM customers
LEFT JOIN orders
ON customers.customer_id = orders.customer_id
WHERE firstname LIKE 'Cheyen%'
GROUP BY firstname, lastname, state;
Result:
firstname | lastname | state | total_number_of_orders |
Cheyenne | Holmes | Arizona | 2 |
To learn more about the GROUP BY
clause and how to use it in PostgreSQL, you can check out PostgreSQL’s documentation on SELECT (and the GROUP BY
clause) and aggregate expressions. The Table Expressions section of the documentation also contains information relevant to the GROUP BY
clause.
It’s possible to achieve even more complex operations by combining GROUP BY
with GROUPING SETS
, CUBE
, and ROLLUP
. More information about these three concepts can be found in the official PostgreSQL documentation.
If you want to experiment with GROUP BY
in a natively optimized and performant PostgreSQL database, all while enjoying the benefits of automatic data partitioning and columnar compression for further performance improvements and a reduced storage footprint, create a free Timescale account today.