Authors: Umair Shahid and Carlota Soto
Efficient data retrieval is essential for achieving peak PostgreSQL application performance, particularly when dealing with vast datasets. Indexing is a robust solution that accelerates the retrieval of specific rows. This article will explore PostgreSQL indexing, starting with a general introduction and building up to the best practices for optimizing index performance.
This article is the third of our four-part series on PostgreSQL performance tuning. If you missed the first chapters of our guide, check out Part I (on database sizing) and Part II (on key PostgreSQL parameters you may want to fine-tune).
A PostgreSQL index is a data structure designed to speed up data retrieval operations within a database table. Think of it as the table of contents in a book. Instead of painstakingly searching through the entire book to pinpoint a particular topic, you can swiftly navigate to the relevant page using the table of contents.
Similarly, in PostgreSQL, indexes act like a GPS for the database engine, enabling it to efficiently locate and retrieve specific rows without scanning the entire table. Simply put, an index is a pointer to data within a table.
These database indexes are standalone data structures that reside on disk alongside data tables and other catalog objects. Sometimes, when all the data required by a query is contained in an index, Postgres can perform an index-only scan and never touch the table itself.
Using indexes gives PostgreSQL (and other relational databases) some crucial benefits:
Accelerated data retrieval. Indexes are pivotal in reducing the time required to fetch data, particularly when dealing with large-sized tables. Without indexes, the database would perform a full table scan, which can be sluggish.
Enhanced query performance. The PostgreSQL query planner/optimizer uses indexes to derive the most efficient way to execute a query. Queries involving WHERE clauses or joins between tables see substantial performance improvements when indexes are available.
Minimized disk I/O. Indexes store a subset of the table's data, reducing disk I/O operations. This reduction expedites query execution and reduces the burden on the storage system.
Data integrity enforcement. Unique indexes guard against duplicate values within specific columns. They play a pivotal role in upholding data integrity by ensuring that no two rows possess identical values in the designated columns. Foreign key constraints, another aspect of data integrity, are more efficient with indexes on the referenced columns, making operations like cascading updates or deletes faster.
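For example, a unique index both speeds up lookups and rejects duplicate values, and an ordinary index on a foreign key column speeds up cascading operations (table and column names below are illustrative):

```sql
-- Enforce that no two customers share an email address
CREATE UNIQUE INDEX idx_customers_email ON customers (email);

-- Indexing the referencing column of a foreign key speeds up
-- cascading updates and deletes from the parent table
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
```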
From our experience, everything related to databases has drawbacks if overused, including indexes. Well-designed indexes will give you optimal performance, but over-indexing is a common problem in large PostgreSQL deployments that can cause multiple problems.
Every index you add to a table requires its own separate storage allocation. For smaller datasets, the space used by indexes typically constitutes a small percentage of the total table size.
However, as your PostgreSQL tables grow, so do their associated indexes. Size can become an obstacle when multiple columns are indexed or when the indexed columns contain textual or string data. Unlike fixed-size data types like integers or dates, strings vary in length, meaning that the index has to accommodate not only the string values but also metadata indicating the length of each string, thereby increasing the space overhead.
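You can check how much space your indexes consume with PostgreSQL's built-in size functions (the table name below is illustrative):

```sql
-- Total size of all indexes on a table
SELECT pg_size_pretty(pg_indexes_size('products'));

-- Size of each index individually
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'products';
```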
PostgreSQL table partitioning adds another layer of complexity. Partitioning is a technique used to enhance the performance and maintenance of large tables by splitting them into smaller, more manageable pieces while still treating them as a single logical entity.
In Timescale, all hypertables are automatically partitioned by time. When a large table is partitioned, each partition is a table unto itself; if indexes are applied to the partitioned table, they are effectively applied to each partition separately. So, if you have hundreds or thousands of partitions, each with its own index, the cumulative storage used by these indexes can rise quickly.
An inefficient use of indexes can also slow down your writes. When you introduce an index, it has to be updated every time a row is inserted, updated, or deleted. PostgreSQL has an optimization called HOT (heap-only tuples) which can make in-place row modifications by avoiding index changes when specific columns are updated. However, if a table has many indexes or if updates frequently touch indexed columns or columns used in expression indexes, the ability to perform these efficient hot updates diminishes.
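You can gauge how often PostgreSQL manages HOT updates using the pg_stat_user_tables view (the table name is illustrative). A low share of HOT updates on a heavily updated table suggests that indexes are getting in the way:

```sql
-- Compare total updates against HOT updates for a table
SELECT relname,
       n_tup_upd     AS total_updates,
       n_tup_hot_upd AS hot_updates,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1)
           AS hot_update_pct
FROM pg_stat_user_tables
WHERE relname = 'products';
```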
Indexes also lead to more I/O operations due to necessary index adjustments. Each time data is added, deleted, or updated in a table or a partition, its associated indexes must also be updated. These updates can increase table bloat from dead tuples, degrading performance over time. Additionally, maintaining numerous indexes can make database backups and restores more time-consuming.
Postgres supports many index types, from the default B-tree index to more specialized types like hash, GIN, GiST, SP-GiST, or BRIN. They are well documented in the PostgreSQL docs, but here’s a quick summary:
CREATE INDEX index_product_id ON products (product_id);
Default index type in PostgreSQL.
Supports <, <=, =, >=, >, BETWEEN, IN, IS NULL, IS NOT NULL.
Organizes entries in ascending order.
CREATE INDEX index_product_id ON products USING HASH (product_id);
Ideal for equality checks, especially for integers.
Doesn't support range queries or sorting.
CREATE INDEX index_product_id_name ON products (product_id, product_name);
Multicolumn index defined on multiple table columns.
Optimize based on columns frequently used in WHERE clauses.
CREATE INDEX index_product_id ON products (product_id) WHERE product_available = true;
Built over a subset of a table, defined by a conditional expression.
Useful for filtering out frequently unqueried rows.
CREATE INDEX index_product_id_name_status ON products (product_id, product_name) INCLUDE (status);
Allows index-only scans when the select list matches index columns.
Additional columns are specified with the INCLUDE keyword.
CREATE INDEX brin_example_index ON logs USING BRIN(log_date);
Designed for large, sorted tables like time-series data.
Space-efficient, storing min and max values within logical blocks.
CREATE INDEX idx_gist ON my_table USING gist(my_column);
GiST (Generalized Search Tree) supports indexing of complex data types such as geometric shapes and full-text documents.
Allows for custom operators for querying, supporting advanced search functionalities like natural language processing.
Uses "lossy" indexing strategies to summarize complex items, potentially leading to false positives that require detailed row checks.
CREATE INDEX idx_spgist ON my_table USING spgist(my_column);
SP-GiST (Space Partitioned Generalized Search Tree) is ideal for data with natural partitioning, such as spatial and hierarchical data.
Enables different branching factors at different nodes, allowing tailored tree structures that reduce search space.
Particularly efficient for applications requiring complex spatial queries or managing large data hierarchies.
We’re mentioning EXPLAIN ANALYZE a lot in this series, and for good reason: it’s a fantastic tool. By prefixing a query with EXPLAIN ANALYZE, you can examine the execution plan generated by the planner (note that, unlike plain EXPLAIN, EXPLAIN ANALYZE actually runs the query). It details which indexes are being used, how data is retrieved, and which query node might be spending excessive time.
For the specific case of optimizing indexes, if you see PostgreSQL doing sequential scans (Seq Scan) on large tables where you expect index scans, it’s a sign that the system isn’t using your indexes.
Imagine a customers table with an email column and a million records. You frequently run queries to look up customers by their email addresses, so you’ve set up an index on the email column to optimize these lookups. However, over time, you’ve noticed that query performance seems to be deteriorating. To investigate, you examine the query plan using EXPLAIN ANALYZE:
EXPLAIN ANALYZE SELECT * FROM customers WHERE email = 'user@example.com';
You get this output:
Seq Scan on customers  (cost=0.00..20450.00 rows=1 width=1168) (actual time=500.032..600.021 rows=1 loops=1)
  Filter: (email = 'user@example.com'::text)
  Rows Removed by Filter: 999999
Total runtime: 600.040 ms
The output shows that PostgreSQL has chosen a Seq Scan over the customers table, even though there’s an index on the email column. This choice is surprising; you’d expect an index scan to be more efficient for such a specific lookup.
The output also gives you the time taken (Total runtime: 600.040 ms) and highlights that the filter removed nearly all rows in the table (Rows Removed by Filter: 999999), indicating that there was indeed a full scan.
This information tells you that the index on the email column isn’t being used effectively. Potential reasons include:
A misconfigured parameter makes the planner believe sequential scans are cheaper. random_page_cost and seq_page_cost may not be set accurately for your hardware. In particular, random_page_cost is often too high, which causes the planner to avoid index scans even when they would be faster.
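As a sketch of how you might test this (the parameter values shown are illustrative, not recommendations):

```sql
-- Check the current settings (PostgreSQL defaults: 4.0 and 1.0)
SHOW random_page_cost;
SHOW seq_page_cost;

-- On SSDs, random reads are nearly as cheap as sequential ones,
-- so a lower random_page_cost is often appropriate
SET random_page_cost = 1.1;

-- Re-run the plan to see whether the planner now picks the index
EXPLAIN ANALYZE SELECT * FROM customers WHERE email = 'user@example.com';

-- To persist the change beyond this session:
-- ALTER DATABASE mydb SET random_page_cost = 1.1;
```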
The index is damaged or missing. This is a rare scenario, but indexes can get corrupted due to disk problems, crashes, or hardware faults. Furthermore, if an expected index has been inadvertently dropped or was never created in the first place, PostgreSQL has no choice but to fall back to a sequential scan.
The statistics used by the query planner are out of date. PostgreSQL relies on statistics about the distribution of data in tables and indexes to generate efficient query plans. As data changes over time (due to inserts, updates, or deletes), these statistics can become outdated. Running the ANALYZE command helps the system gather fresh statistics.
Routine maintenance tasks like VACUUM and ANALYZE play a crucial role in keeping indexes performant in a PostgreSQL database. The ANALYZE operation deserves special attention: when you run ANALYZE, the system samples the tables and updates the statistics about the distribution of values within them.
These statistics are vital because the query planner relies on them to devise the most efficient strategy to execute a query, which often includes deciding whether to use an index.
Implementing a regular routine for these maintenance tasks (e.g., weekly) can be highly beneficial. Such a schedule ensures that the database is rid of dead rows—obsolete data no longer accessible due to updates or deletions—and has up-to-date statistics.
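As a minimal sketch (assuming a customers table), a manual maintenance pass might look like this; in production you would typically rely on autovacuum and tune it rather than running these by hand:

```sql
-- Reclaim space from dead rows and refresh planner
-- statistics in one pass
VACUUM (ANALYZE, VERBOSE) customers;

-- Or refresh statistics alone, which is much cheaper
ANALYZE customers;
```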
Nice! Now you have the tools to step up your indexing game as you scale. Before we wrap up, let’s run through a quick summary of indexing best practices. They are simple, but if you follow them, your indexes will perform optimally:
Only index what you need. Don't create indexes on every column. Assess your query patterns and create indexes that align with your most common queries. Over-indexing can negatively impact write operations, increase storage requirements, and introduce maintenance overhead.
Avoid indexes on low-cardinality columns. Indexing uniformly distributed columns with few distinct values (say, gender or a boolean flag) is usually an anti-pattern: the random I/O generated by using the index will still visit most pages and will be slower than sequentially scanning the table. Sometimes, partial indexes (which have a WHERE clause) or multicolumn indexes can solve this problem.
Regularly review and optimize indexes. As your application's data and usage patterns evolve, the effectiveness of existing indexes may diminish. Regular reviews and optimizations can help align indexes with current query requirements.
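One way to review index usage is through PostgreSQL's statistics views. The query below (a sketch; adjust the filters to your setup) lists indexes that have never been scanned since statistics were last reset, which makes them candidates for removal:

```sql
-- Find indexes with zero scans, largest first
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```

Keep in mind that these counters only reflect activity since the statistics were last reset, so check them over a representative period before dropping anything.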
Test and benchmark. Before implementing new indexes, assess their impact on query performance with tools like EXPLAIN ANALYZE to verify that the planner actually utilizes the index.
Update statistics after adding or removing an index. After adding or removing an index, run the ANALYZE command on the affected table. Accurate statistics about the distribution of values in indexed columns are vital for the query optimizer to make good decisions; stale statistics can result in suboptimal query plans and decreased performance.
Rebuild indexes concurrently. The standard REINDEX command locks the index exclusively and blocks writes to the table, which can disrupt database operations. In contrast, REINDEX CONCURRENTLY allows the table to remain accessible during the index rebuild, although it may take longer to complete. Setting a higher value of maintenance_work_mem can substantially speed up index creation, as we mentioned in our previous article.
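A sketch of the pattern (the index name and memory value are illustrative; REINDEX CONCURRENTLY requires PostgreSQL 12 or later):

```sql
-- Give index builds more memory for this session
SET maintenance_work_mem = '1GB';

-- Rebuild a single index without blocking reads or writes
REINDEX INDEX CONCURRENTLY index_product_id;

-- Or rebuild every index on a table
REINDEX TABLE CONCURRENTLY products;
```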
Lastly, when working with partitioned tables or with Timescale’s hypertables (which abstract partitioning, taking care of it automatically), it’s worth adding some extra considerations:
In PostgreSQL, while you can create indexes on the parent partitioned table (which apply to all partitions), you can also create indexes on individual partitions, known as local indexes. If you run CREATE INDEX on a hypertable in Timescale, an index is created on each chunk (instead of in a single transaction for the entire hypertable), enhancing performance for large databases.
When creating a hypertable, Timescale automatically creates indexes on the time column.
You can inspect your database indexes via the Explorer view in the Timescale console.
You can also use composite indexes (an index that includes two or more table columns) in vanilla PostgreSQL and hypertables. These indexes are useful if your queries often filter on the time column combined with one or more other columns. For example, you may want to keep the index on time to quickly filter for a given time range and add another index on another column (e.g., device_id).
When building composite indexes, order matters. Check out this blog post for more information.
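For instance (table and column names are hypothetical), a composite index for queries that filter on a device and a time range might look like this; placing the equality column first and the range column last usually serves such queries best:

```sql
-- Serves queries like:
--   SELECT * FROM conditions
--   WHERE device_id = 'd-123'
--     AND time > now() - interval '1 day';
CREATE INDEX idx_conditions_device_time
    ON conditions (device_id, time DESC);
```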
To define an index as a UNIQUE or PRIMARY KEY index in Timescale, the index must include the time and the partitioning columns (if you use one).
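For example (names hypothetical), if conditions is a hypertable partitioned on time, a valid unique index must include that column:

```sql
-- Works: includes the time partitioning column
CREATE UNIQUE INDEX idx_conditions_unique
    ON conditions (device_id, time);

-- Fails on a hypertable, because it omits the time column:
-- CREATE UNIQUE INDEX idx_conditions_bad ON conditions (device_id);
```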
When you create a hypertable in Timescale, set the datatype for the time column as timestamptz and not timestamp.
In partitioned tables, it is extra important to pay attention to over-indexing, as every index you add introduces extra overhead during data ingestion.
If you have sparse data (a.k.a. with columns that are often NULL), you can add a clause to the index saying WHERE column IS NOT NULL. This clause prevents the index from indexing NULL data, which can lead to a more compact and efficient index.
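A sketch, assuming a hypothetical sensor_data table where battery_level is often NULL:

```sql
-- Index only the rows where battery_level is present;
-- NULL rows are skipped, keeping the index small
CREATE INDEX idx_sensor_battery
    ON sensor_data (battery_level)
    WHERE battery_level IS NOT NULL;
```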
One of the benefits of partitioned tables is the ability to quickly drop partitions (for example, old data in a time-based partitioning setup). When a partition is dropped, its associated indexes are also dropped, which is much faster than deleting rows and cleaning up associated index entries. This operation is even faster in Timescale due to its data retention policies and continuous aggregates.
Indexing strategies should consider how data in partitioned tables is accessed. For example, if older partitions (e.g., in a time-based partitioning setup) are rarely accessed, it might not be worth maintaining certain indexes.
When transforming your PostgreSQL tables into hypertables, you’ll keep your indexes, as seen in this blog post.
Efficient PostgreSQL indexing is a potent ally for fine-tuning data retrieval within your applications. Indexing is a balancing act, but creating the appropriate indexes and maintaining them well will significantly enhance the overall performance of your PostgreSQL database.
On to the last article of this series: PostgreSQL Performance Tuning: Designing and Implementing Your Database Schema.