Start supercharging your PostgreSQL today.
Written by Warda Bibi
Efficient querying is crucial when managing data in PostgreSQL. One way to ensure this is by using the right indexes on your tables. Indexes help queries retrieve results faster by skipping irrelevant parts of the table. Among the various index types PostgreSQL offers, one important index type is GIN (Generalized Inverted Index).
Although its name might sound complex, the concept is straightforward. GIN indexes excel at searching for elements within composite items such as arrays, text fields, or JSON. In this article, we'll walk you through on how to optimize array queries using GIN indexes.
This exploration of GIN indexes highlights why we've invested heavily in optimizing indexing capabilities at Timescale. Our work on PostgreSQL indexes for compressed columnar data represents a significant advancement, allowing developers to enjoy the benefits of columnar compression while maintaining the flexibility of PostgreSQL's rich indexing capabilities. Check it out!
Without indexes, PostgreSQL processes queries by performing sequential full-table scans, examining each row in the table to find matches. As the dataset grows, this approach becomes inefficient, leading to slower query execution due to increased I/O operations and higher CPU usage.
Consider a table with a column sensor_readings { Temperature , Humidity , Motion , BatteryLevel , Alert } containing arrays of values captured by IoT devices:
Row 1: {25.5, 60.2,Null,Null, 1} (Temperature: 25.5°C, Humidity: 60.2%, Alert: 1)
Row 2: {Null,Null,0, 15.3, 1} (Motion: 0, Battery Level: 15.3%, Alert: 1)
Row 3: {25.5, Null, 0, 10.5,Null} (Temperature: 25.5°C, Motion: 0, Battery Level: 10.5%)
Suppose you want to find all rows where the sensor_readings
array contains the values {25.5, 1} (Temperature = 25.5°C and Alert = 1). PostgreSQL will compare each value in the search array {25.5, 1} with the elements in the sensor_readings
array for every row.
For example:
In row 1 ({25.5, 60.2,Null,Null, 1}), PostgreSQL checks if both 25.5 and 1 are present. Since they are, this row is a match.
In row 2 ({Null,Null,0, 15.3, 1}), only 1 is present, so this row is not a match.
In row 3 ({25.5, Null, 0, 10.5,Null}), only 25.5 is present, so this row is not a match.
PostgreSQL goes through each row in the table and checks if all the values in the search array match those in the column's array. However, performing this comparison for each row can be very slow for large datasets, potentially resulting in thousands of operations, especially as the size of the arrays and the number of rows increase.
By using an index, PostgreSQL avoids these repeated, slow comparisons. Instead of comparing entire arrays row-by-row, the index can quickly identify rows that match the condition, dramatically improving performance, even for large datasets.
For faster queries, learn how to optimize your PostgreSQL database indexes.
The Generalized Inverted Index (GIN) is specifically designed to index composite values, such as arrays or JSON fields, in tables. It is particularly useful for queries that need to search for elements within these composite items efficiently.
Let's say you have transaction data in a database where:
Each row represents an account.
Each account has done multiple transactions, stored as an array of transaction values.
Acc_id | Transactions |
1 | {$100, $400, $600} |
2 | {$200, $500} |
Here:
Row 1: Account 1 (Acc1) has completed transactions of $100, $400, and $600.
Row 2: Account 2 (Acc2) has completed transactions of $200 and $500.
Now, you frequently run queries to find transactions that meet specific criteria, such as transactions exceeding a certain value (e.g., "$300").
Without an index, the database must scan each row and check if any transaction value in the array meets the condition (e.g., "greater than $300"). This requires evaluating every element in every array across all rows, which is slow and resource-intensive.
A GIN index works by creating a mapping between each unique transaction value (the key) and the corresponding row IDs (accounts) that contain that value. This efficient mapping allows PostgreSQL to quickly locate rows containing specific elements without scanning the entire table.
If a transaction value of $300 is found in the array of transactions for Account 1 (Row 1), the index records that $300 is associated with Account 1.
Similarly, if $300 is found in Account 2, the index maps $300 to Account 2 as well.
This mapping is stored in a structure called listOfAccounts
, which is a set of row IDs (accounts) linked to a specific transaction value.
For example, if $300 appears in the transaction arrays for Accounts 1 and 2, the listOfAccounts
for $300 would be {1, 2}.
This mapping allows the GIN index to quickly reference the accounts (rows) where $300 exists, eliminating the need to scan every row and every transaction.
The term "generalized" in GIN refers to its ability to handle different types of data, such as arrays, JSON, or full-text documents. However, GIN itself doesn’t understand these data types. Instead, it relies on operator classes to guide it on how to interact with and process these various data types.
An operator class can be thought of as a rulebook that provides GIN with instructions on how to manage specific data types. Each operator class defines the operations that can be performed on the data. For example:
For arrays, does one array contain another (@>
), or do two arrays overlap (&&
)?
For JSON, does a JSON object contain a specific key (?
), or does one JSON object contain another (@>
)?
These instructions ensure that GIN efficiently handles complex data types like arrays and JSON. Each operation, such as @>
or ?
, is assigned a strategy number within the operator class. For instance, the @>
operator for arrays might be strategy 1, and the &&
operator could be strategy 2. When you run a query like SELECT * FROM my_table WHERE my_array_column @> '{1, 2}';
, the GIN index doesn’t directly understand @>
.
Instead, it checks the operator class for arrays, finds strategy number 1, and uses the logic associated with it to execute the operation.
What makes GIN "generalized" is its ability to handle different data types using separate operator classes. GIN doesn’t need to be redesigned every time a new data type is introduced. You simply define a new operator class that tells it how to work with that type. For example:
One operator class teaches GIN how to work with arrays.
Another operator class teaches it how to handle JSON.
You can even define custom operator classes for your own data types. This flexibility makes GIN so powerful it can index a wide variety of data types without needing a complete overhaul. By separating the GIN index method (which is flexible) from the operator classes (which are data-type-specific), PostgreSQL can easily support multiple data types. This approach also allows new data types or operations to be added without having to modify the GIN index itself.
The "inverted" aspect of GIN refers to its index structure, which differs fundamentally from the traditional B-tree index. Instead of mapping a single value to a specific row (as in B-trees), GIN creates a reverse mapping where each indexed value points to a list of rows that contain that value. This design allows GIN to store multiple representations of a single row because a single row can contain multiple indexed values.
This "inverted" mapping structure allows GIN to handle complex queries exceptionally well, such as those involving multiple conditions (e.g., searching for rows where an array contains specific elements or where a document contains specific words). GIN combines these indexed entries to efficiently retrieve the relevant rows, even for queries that would be costly to execute with other types of indexes.
Another way of explaining GIN indexes comes from a presentation by Oleg Bartunov and Alexander Korotkov at a PGConf.EU in Prague. They describe a GIN index like the table of contents in a book, where the heap pointers (to the actual table) are the page numbers. Multiple entries can be combined to yield a specific result, like the search for “compensation accelerometers” in this example:
In the table with index, ‘compensation’ points to pages 30 and 68, indicating that these pages contain the keyword ‘compensation’. Similarly, ‘accelerometers’ point to pages 5, 10, 25, 28, 30, 36, 58, 59, 61, 73, 74, and 68, showing that these pages include the keyword ‘accelerometers’. When searching for ‘compensation accelerometers’, page 30 is common to both keywords. Therefore, page 30 would be returned as the result.
Before we move on, know that GIN indexes only support bitmap index scans (not index scan or index-only scan). This happens because a GIN index can list the same row pointer multiple times under different tokens (keys) that the field is broken into. To prevent returning duplicate rows, GIN forces queries through a bitmap index scan, which inherently deduplicates these pointers. Let's see an example.
Suppose we have a table ‘document’ where each document contains an array of keywords, and we have a GIN index on the ‘keywords’ column.
document_id | keywords |
1 | {apple, banana, cherry} |
2 | {banana, cherry, date} |
3 | {apple, cherry, date} |
4 | {banana, apple} |
Here, the GIN index stores a list of tokens (keywords) and the document IDs (row pointers) associated with those tokens. Each token (keyword) points to the rows where it appears. Here is the internal GIN index representation:
apple → {1, 3, 4} (appears in documents 1, 3, and 4)
banana → {1, 2, 4} (appears in documents 1, 2, and 4)
cherry → {1, 2, 3} (appears in documents 1, 2, and 3)
date → {2, 3} (appears in documents 2 and 3)
Now, when a query is run to find documents that contain both ‘apple’ and ‘banana’, PostgreSQL uses the bitmap index scan to process the GIN index and ensure deduplication. Let's say we run the following query:
SELECT * FROM documents WHERE keywords @> ARRAY ['apple', 'banana'];
The operator @>
checks if the array of keywords contains the specified values (in this case, 'apple' and 'banana'). PostgreSQL first retrieves the rows for the keywords apple and banana from the GIN index.
For ‘apple’, the GIN index tells us the rows are {1, 3, 4}.
For ‘banana’, the GIN index tells us the rows are {1, 2, 4}.
PostgreSQL then creates bitmaps for both sets of results. A bitmap is a data structure that represents rows using a series of bits, where each bit corresponds to a unique row and is set to 1 for a match and 0 otherwise. For apple ({1, 3, 4}), the bitmap might look like:
1 0 1 1
Where:
1 indicates the document is included (e.g., documents 1, 3, and 4).
0 indicates the document is not included (e.g., document 2 is not included for apple).
For banana ({1, 2, 4}), the bitmap might look like:
1 1 0 1
Where:
1 indicates the document is included (e.g., documents 1, 2, and 4).
0 indicates the document is not included (e.g., document 3 is not included for banana).
PostgreSQL now ANDs
the two bitmaps together to find the rows that match both conditions (contain both 'apple' and 'banana').
Bitmap for apple: 1 0 1 1
Bitmap for banana: 1 1 0 1
--------------------------------------------
Result bitmap: 1 0 0 1
The result bitmap shows 1 for documents 1 and 4, meaning these are the only documents that contain both 'apple' and 'banana'. Documents 2 and 3 are excluded because they do not meet both conditions.
The deduplication process occurs in the AND
operation by ensuring that each document is included only once in the result. The final query result will be:
document_id | keywords |
1 | {apple, banana, cherry} |
4 | {banana, apple} |
These are the documents where both apple and banana appear in the keywords array. Don’t be surprised when EXPLAIN
always shows bitmap index/heap scans for your GIN indexes. Bitmap scans are ideal for deduplication as they can easily identify and combine duplicate row pointers, resulting in a clean set of relevant rows.
The standard distribution of PostgreSQL includes a GIN operator class for arrays, which supports indexed queries using these operators:
<@
: If the left-hand array is contained within (or a subset of) the right-hand array)
@>
: If the left-hand array contains all the elements of the right-hand array
=
: If two arrays are equal
&&
: If two arrays have any elements in common
Let's say we have a table “inventory” with column “product” of type TEXT [ ]
(string array). We also have a GIN index on the product column. This index will help optimize queries that involve the array operators like <@
, @>
, =
, and &&
.
To ensure that PostgreSQL uses the GIN index instead of performing a sequential scan, you'll need a substantial number of rows. A sequential scan is more likely to occur if the dataset is small because PostgreSQL might decide that scanning the entire table is more efficient than using the index.
This will insert 1,000 rows with a repetitive set of string values in the array column.
The <@
operator is used to check if the left-hand array is fully contained within the right-hand array.
Left-hand array: This is the array stored in your database column (e.g., the product column in an inventory table). It acts as the "source" array being checked.
Right-hand array: This is the array you specify in your query for comparison. It contains the elements you want to verify.
PostgreSQL uses the <@
operator to compare the two arrays and determine if all the elements in the right-hand array exist within the left-hand array.
This operator is particularly useful for filtering rows based on whether their array values match a desired subset.
The @>
operator checks if the right-hand array is contained by the left-hand array, in other words, the right array must be a subset of the left array.
The results will include arrays that have [laptop, server] as a subset.
The =
operator checks if two arrays are exactly equal. It compares both the size and the content of the arrays, meaning the elements must be in the same order.
The &&
operator in PostgreSQL checks if two arrays overlap if they share at least one element. This is different from the equality operator because it doesn't require the arrays to be the same length or order; it only checks for common elements between the two arrays.
The ANY
operator in PostgreSQL is used to compare a value against any element of an array. For example, value = ANY
(array) checks if the value matches at least one element in the array. It's a shorthand for multiple OR
conditions, making queries involving arrays more concise. So instead of writing
value = element1 OR value = element2 OR value = element3
(where element1, element2, etc., are values in the array), you can simply write
value = ANY(array)
Note: PostgreSQL will not use the GIN index with ANY
because the =ANY
operator does not work with GIN indexes. The ANY
construct can be used in combination with various operators, but it is not an operator itself. When ANY
is used as:
constant = ANY (array_expression)
only indexes that support the =
operator on array elements would qualify. In this case, GIN indexes are not suitable. To utilize GIN, you can use the <@
operator with a one-element array instead.
Suppose we have a table test with a column of type int[]
. And we also have a GIN index in this column.
If you try to run a search query using ANY
, the query will likely result in a sequential scan. On the other hand, using @>
will yield the same result but with a bitmap heap scan.
Moreover, constant = ANY
(array_expression) is not completely equivalent to array_expression @> ARRAY[constant]
. Array operators return an error if any NULL
elements are involved, while the ANY
construct can handle NULL
on either side, leading to different results for data type mismatches.
In PostgreSQL, integer arrays (int [ ]
) are a specialized data type that allows you to store multiple integer values within a single column of a table. Each element in the array is of the integer type, and the array itself can be one-dimensional or multi-dimensional.
For example, a column of type int [ ]
can store data like:
{1, 2, 3} (a one-dimensional array)
{{1, 2}, {3, 4}} (a two-dimensional array)
Note: The intarray extension provides specialized functions and operators for integer arrays in PostgreSQL. However, these custom operators replace PostgreSQL's default array operators (like @>
, <@
). If you create a GIN index without accounting for this, PostgreSQL will ignore the index for queries using these operators and perform a sequential scan instead, significantly slowing down performance on large datasets.
For example, a query like that
SELECT * FROM my_table WHERE codes @> array[123];
will skip the GIN index and rely on a sequential scan.
To fix this, you need to specify the correct operator class for integer arrays when creating the GIN index. The intarray extension provides a dedicated operator class called gin__int_ops
, which ensures that the index works with the custom operators introduced by intarray.
Here’s how to create the index with the correct operator class:
Unlike regular indexes that create one entry per row, GIN indexes can create multiple entries for a single row, making them versatile but challenging to update. Modifying a GIN index can be expensive, potentially affecting tens or hundreds of index entries for just one row. This overhead slows down INSERT
and UPDATE
operations, especially with large datasets.
To address this, PostgreSQL uses fastupdate
(enabled by default) for GIN indexes. Rather than immediately updating the index, fastupdate
batches multiple updates into a temporary "pending list" that stores new or updated entries. These "deferred updates" improve initial write performance but must eventually be processed, adding overhead later.
This deferred data is flushed to the main index in one of three scenarios:
When the size of the pending list exceeds the value of gin_pending_list_limit
, PostgreSQL triggers a process to flush the accumulated entries from the pending list into the main GIN index. The gin_pending_list_limit
is a configuration parameter in PostgreSQL that determines the maximum size of the pending list for a GIN index. By default, this limit is set to 4 MB.
When an explicit call to gin_clean_pending_list
is made, it manually flushes the contents of the pending list to the main GIN index.
When autovacuum runs on the table with the GIN index, it automatically maintains tables and indexes in PostgreSQL by cleaning up dead tuples, analyzing table statistics, and flushing the pending list of GIN indexes as part of its routine maintenance at the end of the vacuum process.
While this deferred flushing helps optimize performance initially, it can cause noticeable slowdowns during write operations if the pending list becomes too large. For example, after every Nth INSERT
or UPDATE
, when the pending list exceeds its limit, PostgreSQL will flush the data to the main index, resulting in a significant delay for that operation.
GIN indexes provide powerful capabilities for specialized queries and complex data types in PostgreSQL, but they must be used strategically, especially on write-heavy tables.
This exploration of GIN indexes highlights why we've invested heavily in optimizing indexing capabilities at Timescale. Our work on PostgreSQL indexes for compressed columnar data represents a significant advancement, allowing developers to enjoy the benefits of columnar compression while maintaining the flexibility of PostgreSQL's rich indexing capabilities.
TimescaleDB leverages PostgreSQL's native capabilities, including its support for GIN indexes, while adding several performance enhancements for time-partitioned data. This makes your queries faster so you can power analytics in real time.
Sign up for a free and fully managed PostgreSQL cloud on Timescale Cloud (no credit card required), or install TimescaleDB and see firsthand how we're pushing the boundaries of PostgreSQL.