PostgreSQL Indexes

PostgreSQL Performance Tuning: Optimizing Database Indexes

An elephant being workshopped by a developer—a metaphor for fine-tuning PostgreSQL performance

Authors: Umair Shahid and Carlota Soto

Efficient data retrieval is essential for achieving peak PostgreSQL application performance, particularly when dealing with vast datasets. Databases offer a robust solution in the form of indexing, a mechanism that accelerates the retrieval of specific rows. This article will explore PostgreSQL indexing, starting with a general introduction and building up to the best practices for optimizing index performance.

This article is the third of our four-part series on PostgreSQL performance tuning. If you missed the first chapters of our guide, check out Part I (on database sizing) and Part II (on key PostgreSQL parameters you may want to fine-tune).

What Is an Index? Reviewing Postgres Basics

At its core, a PostgreSQL index serves as a data structure designed to speed up data retrieval operations within a database table. Think of it as the table of contents in a book. Instead of painstakingly searching through the entire book to pinpoint a particular topic, you can swiftly navigate to the relevant page using the table of contents.

Similarly, in PostgreSQL, indexes act like a GPS for the database engine, enabling it to efficiently locate and retrieve specific rows without scanning the entire table. Simply put, an index is a pointer to data within a table.

These database indexes are standalone data structures that reside on disk alongside data tables and other related catalog objects. Sometimes, when all the data required for a query is in an index, Postgres can employ an INDEX ONLY scan and never have to touch the table.

Common PostgreSQL Index Types

Postgres supports many index types, from the default b-tree indexes to more exotic types like hash, GIN, GiST, or BRIN. They are well documented in the PostgreSQL docs, but here’s a quick summary:

B-Tree indexes (default index type in PostgreSQL)

CREATE INDEX index_product_id ON products (product_id);

  • Default index type in PostgreSQL.

  • Supports <, <=, =, >=, >, BETWEEN, IN, IS NULL, IS NOT NULL.

  • Organizes entries in ascending order.

Hash indexes

CREATE INDEX index_product_id ON products USING HASH (product_id);

  • Ideal for equality checks, especially for integers.

  • Doesn't support range queries or sorting.

Composite indexes

CREATE INDEX index_product_id_name ON products (product_id, product_name);

  • Multicolumn index defined on multiple table columns.

  • Optimize based on columns frequently used in WHERE clauses.

GiST and SP-GiST indexes

CREATE INDEX idx_gist ON my_table USING gist(my_column);

  • GiST (Generalized Search Tree) supports indexing of complex data types such as geometric shapes and full-text documents.

  • Allows for custom operators for querying, supporting advanced search functionalities like natural language processing.

  • Uses "lossy" indexing strategies to summarize complex items, potentially leading to false positives that require detailed row checks.

CREATE INDEX idx_spgist ON my_table USING spgist(my_column);

  • SP-GiST (Space Partitioned Generalized Search Tree) is ideal for data with natural partitioning, such as spatial and hierarchical data.

  • Enables different branching factors at different nodes, allowing tailored tree structures that reduce search space.

  • Particularly efficient for applications requiring complex spatial queries or managing large data hierarchies.

Partial indexes

CREATE INDEX index_product_id ON products(product_id) where product_available = ‘true’;

  • Built over a subset of a table, defined by a conditional expression.

  • Useful for filtering out frequently unqueried rows.

Covering indexes

CREATE INDEX index_product_id_name_status ON products (product_id, product_name) include (status);

  • Allows index-only scans when the select list matches index columns.

  • Additional columns are specified with the INCLUDE keyword.

Block Range index (BRIN)

CREATE INDEX brin_example_index ON logs USING BRIN(log_date);

  • Designed for large, sorted tables like time-series data.

  • Space-efficient, storing min and max values within logical blocks.

Index Best Practices: The Pros

Using indexes gives PostgreSQL (and other relational databases) some crucial benefits:

  • Accelerated data retrieval. Indexes are pivotal in significantly reducing the time required to fetch data, particularly when dealing with large-sized tables. Without indexes, the database would need to perform a full table scan, which can be excessively sluggish. For example, without an index, data retrieval is based on the object's key. While this can be quick for direct lookups, it's inefficient for complex queries or partial matches, which PostgreSQL handles gracefully with indexes.

  • Enhanced query performance. PostgreSQL query planner/optimizer uses indexes to derive the most efficient way to execute a query. Queries that involve conditions specified in the WHERE clause or require joins between tables experience substantial performance improvements using indexes. These queries can tap into indexes to swiftly identify rows that meet the specified criteria.

  • Minimized disk I/O. Indexes store a subset of the table's data, resulting in a reduction of disk I/O operations. This reduction not only expedites query execution but also reduces the burden on the storage system.

  • Data integrity enforcement. Unique indexes serve as guardians against duplicate values within specific columns. They play a pivotal role in upholding data integrity by ensuring that no two rows possess identical values in the designated columns. Foreign key constraints, another aspect of data integrity, are more efficient with indexes on the referenced columns, making operations like cascading updates or deletes faster.

Index Best Practices: The Cons

From our experience, everything related to databases has drawbacks if overused, including indexes. Well-designed indexes will give you optimal performance, but over-indexing is a common problem in large PostgreSQL deployments that can cause multiple problems.

An apparent one is the consumption of extra storage space: every index you add to a table requires its own separate storage allocation. At a fundamental level, an index comprises the values of the indexed columns and a reference (or pointer) to the actual row in the table.

For smaller datasets, the space used by indexes typically constitutes a small percentage of the total table size. However, as the size of your PostgreSQL tables grows, so does the size of its associated indexes. Sizes can become an obstacle when multiple columns are indexed or the indexed columns contain textual or string data. Unlike fixed-size data types like integers or dates, strings can vary in length, meaning that the index has to accommodate not only the string values but also metadata indicating the length of each string, thereby increasing the space overhead.

PostgreSQL table partitioning adds another layer of complexity. Partitioning is a technique used to enhance the performance and maintenance of large tables by splitting them into smaller, more manageable pieces while still treating them as a single logical entity.

In Timescale, all hypertables are automatically partitioned by time. When a large table is partitioned, each partition is a table unto itself; if indexes are applied to the partitioned table, they are effectively applied to each partition separately. So, if you have hundreds or thousands of partitions, each with its own index, the cumulative storage used by these indexes can rise quickly.

An inefficient use of indexes can also slow down your writes. When you introduce an index, it has to be updated every time a row is inserted, updated, or deleted. PostgreSQL has an optimization called HOT (heap-only tuples) which can make in-place row modifications by avoiding index changes when specific columns are updated. However, if a table has many indexes or if updates frequently touch indexed columns or columns used in expression indexes, the ability to perform these efficient hot updates diminishes.

Indexes also lead to more I/O operations due to necessary index adjustments. Each time data is added, deleted, or updated in a table or a partition, its associated indexes must also be updated. These updates can increase table bloat from dead tuples, degrading performance over time. Additionally, maintaining numerous indexes can make database backups and restores more time-consuming.

Indexes are a crucial aspect of your PostgreSQL database, but it’s essential to build a practice of monitoring and optimization to ensure they’re working in your favor. Keep reading for tips on improving your use of indexes, starting with identifying the parameters you should be aware of in PostgreSQL due to their influence on index usage.

Server Parameters Best Practices for Indexes

Certain server parameters offer a degree of control over how the query planner makes decisions about index usage, and they are instrumental in optimizing query performance. Let’s cover the best practices for these parameters:

work_mem and shared_buffers

We discussed these parameters in the previous episode of this serieswork_mem dictates the amount of memory allocated to each database operation, such as sorting and hashing—its sizing directly impacts whether operations remain in memory or spill over to disk. If work_mem is set appropriately, it can efficiently accommodate bitmap scans and sorts in memory, reducing the need for PostgreSQL to rely heavily on indexes. If the allocated memory is too small, the system might have to use disk-based temporary storage, affecting performance.

shared_buffers represents a cache of the database's disk pages in the main memory and is critical in how frequently the database needs to access the underlying disk. A well-sized shared_buffers ensures that frequently accessed table and index pages are kept in fast, main memory, drastically reducing I/O operations and improving overall query response times.

Make sure to check out Part II for more detailed configuration recommendations for both these parameters.

effective_cache_size

effective_cache_size  provides the query planner with an estimate of how much memory is available for caching data and indexes. Although it doesn't allocate memory by itself, it influences the planner's decisions regarding the execution of queries.

This parameter is essential for query planning, as it guides the planner in selecting the most efficient query plan based on the expected cost of different strategies. For example, suppose the effective_cache_size is a high value and the planner believes, based on this setting, that much of the needed data is likely to be in the OS's cache. In that case, the query planner is more likely to favor index scans over sequential ones, assuming ample caching memory.

The optimal value will vary based on your setup, but a common starting point is setting it to 50-70 percent of your machine's total RAM.

random_page_cost and seq_page_cost

random_page_cost and seq_page_cost are configuration parameters that help the PostgreSQL query planner estimate the relative costs of random versus sequential page reads. By default, seq_page_cost is set to 1 (one), indicating the base cost of a sequential read, while random_page_cost is set to 4 (four), suggesting that random access is considered four times more expensive than sequential access.

Generally, index lookups tend to involve random disk I/O because of the nature of traversing index trees. Full table scans are more inclined to be sequential, although this can vary with frequent updates or deletions. Adjusting the random_page_cost downward makes the planner more favorable towards index scans by narrowing the perceived cost gap between random and sequential page reads.

It's worth noting that the default value of 4 is more aligned with the behavior of traditional spinning disks. However, if your PostgreSQL instance runs on an SSD, you could see benefits by adjusting random_page_cost to around 1.1, reflecting the significantly reduced random access penalty of SSDs.

cpu_index_tuple_cost

This parameter indicates the relative cost of CPU operations required to scan and process index entries during query execution.

In theory, adjusting cpu_index_tuple_cost allows you to influence the query planner's choices regarding whether to utilize an index or a sequential scan for a particular query. Lower values would make index scans more appealing from a cost perspective, while higher values would lead the planner to favor sequential scans.

In practice, it is not recommended to alter the default value for cpu_index_tuple, which is 0.005 in the latest versions of PostgreSQL—at least not without thorough testing. The potential performance gains from tweaking it are usually marginal compared to the risks, e.g., suboptimal query plans and unexpected performance regressions.

How Can I Know Whether my PostgreSQL Indexes Are Working Efficiently?

The million-dollar question!

Determining whether your PostgreSQL indexes need optimization requires ongoing monitoring via a multi-faceted approach. Indexes are not a "set and forget" in PostgreSQL; as your workload evolves, they will require maintenance to ensure your database runs optimally.

Thankfully, PostgreSQL comes with great tools and techniques you can use to monitor your indexes. Let’s cover some of the most rewarding strategies you can follow to identify if your indexes might not be functioning as intended.

Finding unused indexes via pg_stat_user_indexes

As we mentioned at the beginning of this post, over-indexing is a frequent issue in many large PostgreSQL deployments. Once your database starts growing, it's essential to identify and remove unused or underutilized indexes to remain efficient. Too many indexes can lead to several problems:

  • Every time a row is added, deleted, or modified, all indexes on that table need to be updated. With many indexes, this write overhead can become substantial, slowing down INSERT, UPDATE, and DELETE operations.

  • Indexes consume disk space. Excessive or unused indexes can waste valuable storage, especially on large tables.

  • Indexes need to be periodically vacuumed and analyzed, and more indexes lead to longer maintenance windows.

If you want to spot potentially unused or underutilized indexes, pg_stat_user_indexes is your friend. You can look at the idx_scan column, which counts the number of index scans initiated on each index. Here's an example query that could help identify unused or rarely used indexes:

SELECT relname AS table_name, indexrelname AS index_name, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size, idx_scan AS index_scan_count FROM pg_stat_user_indexes WHERE idx_scan < 50 -- Choose a threshold that makes sense for your application. ORDER BY index_scan_count ASC, pg_relation_size(indexrelid) DESC;

In the example above, we're looking for indexes with fewer than 50 scans (which is an arbitrary number for this example—you should adjust it based on your application's usage patterns and the duration since the last statistics reset; also, if you have daily or weekly batch jobs, ensure you're not prematurely dropping indexes that are used by those less frequent tasks).

Monitoring shared_blks_read and blk_read_time to identify if a new index may be needed

On the contrary, sometimes you may lack an index—this can happen as your workloads and query patterns evolve.

The pg_stat_statements view is once again your friend: it can also help you to identify slow queries that may benefit from additional indexing via the metrics shared_blks_read and blk_read_time.

shared_blks_read indicates the number of disk blocks read from shared memory; a high value for this parameter suggests that the system is reading a significant amount of data directly from the disk.

blk_read_time represents the total time spent reading these blocks, measured in milliseconds. A high value indicates that not only is the system reading a lot of blocks from the disk, but it's also spending a significant amount of time doing so.

If you notice higher values of shared_blks_read and blk_read_time associated with an important query, it signals that PostgreSQL is fetching data from disk rather than using indexes efficiently. Creating an index may significantly improve query performance.

Once again, let’s use a particle example to bring this home. Suppose you run a daily report listing all customers who made a purchase in the last 24 hours. Over time, as the number of customers and purchases grows, you notice that the query is taking longer and longer to run. To diagnose the issue, you turn to pg_stat_statements:

Query

shared_blks_read

blk_read_time

SELECT * FROM purchases WHERE purchase_date > NOW() - '1 day'::INTERVAL;

5,000

300

Within your use case, the high shared_blks_read value suggests that PostgreSQL is fetching a substantial amount of data directly from the disk when executing this query. Furthermore, the blk_read_time of 300 milliseconds indicates a significant delay in retrieving this data, which should be performing much better. (Once again, don’t take these values as generally applicable—you’ll have to determine your own normal ranges by monitoring overtime.)

Given this insight, you decide to investigate the indexes on the purchases table. You discover that there's no index on the purchase_date column. By adding an appropriate index, you might be able to significantly reduce the shared_blks_read and blk_read_time values, leading to a faster query execution:

CREATE INDEX idx_purchase_date ON purchases(purchase_date);

After creating the index and letting it populate, you could revisit pg_stat_statements to confirm the improvement.

Using EXPLAIN ANALYZE and looking for Seq Scan

We’re mentioning 'EXPLAIN ANALYZE’ a lot in this series, and for a good reason—it’s a fantastic tool.  By running EXPLAIN ANALYZE before executing a query, you can examine the query execution plan generated by the planner. It details which indexes are being used, how data is retrieved, and which query node might be spending excessive time. For the specific case of optimizing indexes, If you see that PostgreSQL is doing sequential scans (Seq Scan) on large tables when you expect it to be doing index scans, this is another sign that the system isn't using indexes.

Imagine a customers table with a column email containing a million records. You frequently run queries to look up customers by their email addresses, so you've set up an index on the email column to optimize these lookups. However, over time, you've noticed that the query performance seems to be deteriorating. To investigate this, you decide to examine the query plan using EXPLAIN ANALYZE:

EXPLAIN ANALYZE SELECT * FROM customers WHERE email = '[email protected]';

You get this output:

Seq Scan on customers (cost=0.00..20450.00 rows=1 width=1168) (actual time=500.032..600.021 rows=1 loops=1) Filter: (email = '[email protected]'::text) Rows Removed by Filter: 999999 Total runtime: 600.040 ms

The output shows that PostgreSQL has chosen a Seq Scan over the customers table, even though there's an index on the email column. This choice is surprising; you'd expect an index scan to be more efficient for such a specific lookup.

The output also gives you the time taken (Total runtime: 600.040 ms) and highlights that the filter removed nearly all rows in the table (Rows Removed by Filter: 999999), indicating that there was indeed a full scan.

This information tells you that the index on the email column can be more effective. Potential reasons might include:

  • There's some misconfiguration that makes the planner believe sequential scans are cheaper. The parameters we’ve covered earlier, such as random_page_cost and seq_page_cost, may not be set accurately according to your resources. As we’ve mentioned earlier, a common reason is that random_page_cost is too high, which causes the planner to avoid index scans even when they would be faster.

  • The index is damaged or missing. This is a rare scenario, but indexes might get corrupted due to disk issues, crashes, or other issues. Furthermore, if an expected index has been inadvertently dropped or never created in the first place, PostgreSQL would have no choice but to revert to a sequential scan.

  • The statistics used by the query planner are out of date. PostgreSQL relies on statistics about the distribution of data in tables and indexes to generate efficient query plans. As data changes over time (due to additions, updates, or deletions), these statistics can become outdated. Without accurate statistics, the query planner might make suboptimal choices, such as selecting a sequential scan over an index scan. Running the ANALYZE command helps the system gather fresh statistics.

Regularly running VACUUM and ANALYZE

Routine maintenance tasks, like VACUUM and ANALYZE, play a crucial role in optimizing indexes in a PostgreSQL database, so with best practices, you can get serious results. The ANALYZE operation deserves special attention. When you run ANALYZE, the system examines the tables and updates the statistics related to the distribution of values within the tables.

These statistics are vital because the query planner relies on them to devise the most efficient strategy to execute a query, which often includes deciding whether to use an index. Without updated statistics, the planner might make suboptimal choices, as mentioned previously.

Implementing a regular routine for these maintenance tasks (e.g., weekly) can be highly beneficial. Such a schedule ensures that the database is rid of dead rows—obsolete data no longer accessible due to updates or deletions—and has up-to-date statistics.

Indexing Best Practices: Summary

Nice! Now, you have the tools to step up your indexing game as you scale. Before we wrap up, Let’s run through a quick summary of indexing best practices. These are simple, but if you follow them, your indexes will perform optimally:

  • Only index what you need. Don't create indexes on every column. Assess your query patterns and create indexes that align with your most common queries. Over-indexing can negatively impact write operations, increase storage requirements, and introduce maintenance overhead.

  • Avoid indexes on low cardinality columns. Indexing uniformly distributed columns with few distinct values (say gender or a boolean value) is usually an anti-pattern, as the random I/O generated by using the index will still visit most pages and will be slower than sequentially scanning the table. Sometimes, using partial indexes (which have a WHERE clause), multicolumn indexes, or writing queries with char can solve this problem.

  • Regularly review and optimize indexes. As your application's data and usage patterns evolve, the effectiveness of existing indexes may diminish. Regular reviews and optimizations can help align indexes with current query requirements.

  • Test and benchmark. Before implementing new indexes, assess their impact on query performance using tools like EXPLAIN ANALYZE to verify whether the planner utilizes the index.

  • Update statistics after adding or removing an index. After adding or removing an index, update statistics using the ANALYZE command on the affected table. Accurate statistics about the distribution of values in indexed columns are vital for the query optimizer to make better decisions. Failing to update statistics can result in suboptimal query plans and decreased performance.

  • Rebuild indexes concurrently. The standard REINDEX command requires an exclusive table lock, which can disrupt database operations. In contrast, REINDEX CONCURRENTLY allows the table to remain accessible during the index rebuilding process, although it may take longer to complete. Setting a higher value of maintenance_work_mem can substantially speed up the index creation process, as we mentioned in our previous article.

Special Considerations for Partitioned Tables and Hypertables

Lastly, when working with partitioned tables or with Timescale’s hypertables (which abstract partitioning, taking care of it automatically), it’s worth adding some extra considerations:

  • In PostgreSQL, while you can create indexes on the parent partitioned table (which apply to all partitions), you can also create indexes on individual partitions, known as local indexes. If you’re using CREATE_INDEX in Timescale, an index will be created on each chunk (instead of a single transaction for the entire hypertable), enhancing performance for large databases.

  • When creating a hypertable, Timescale automatically creates indexes on the time column.

image

You can inspect your database indexes via the Explorer view in the Timescale console

  • You can also use composite indexes (an index that includes two or more table columns) in vanilla PostgreSQL and hypertables. These indexes are useful if your queries often filter on the time column combined with one or more other columns. For example, you may want to keep the index on time to quickly filter for a given time range and add another index on another column (e.g., device_id).

  • When building composite indexes, order matters. Check out this blog post for more information.

  • To define an index as a UNIQUE or PRIMARY KEY index in Timescale, the index must include the time and the partitioning columns (if you use one).

  • When you create a hypertable in Timescale, set the datatype for the time column as timestamptz and not timestamp.

  • In partitioned tables, it is extra important to pay attention to over-indexing, as every index you add introduces extra overhead during data ingestion.

  • If you have sparse data (a.k.a. with columns that are often NULL), you can add a clause to the index saying WHERE column IS NOT NULL. This clause prevents the index from indexing NULL data, which can lead to a more compact and efficient index.

  • One of the benefits of partitioned tables is the ability to quickly drop partitions (for example, old data in a time-based partitioning setup). When a partition is dropped, its associated indexes are also dropped, which is much faster than deleting rows and cleaning up associated index entries. This operation is even faster in Timescale due to its data retention policies and continuous aggregates.

  • Indexing strategies should consider how data in partitioned tables is accessed. For example, if older partitions (e.g., in a time-based partitioning setup) are rarely accessed, it might not be worth maintaining certain indexes.

When transforming your PostgreSQL tables into hypertables, you’ll keep your indexes, as seen in this blog post.

Wrap-Up

An efficient PostgreSQL indexing is a potent ally for fine-tuning data retrieval within your applications. Indexing is a balancing act, but creating the appropriate indexes and maintaining them will significantly enhance the overall performance of your PostgreSQL database.

On to the last article of this series: PostgreSQL Performance Tuning: Designing and Implementing Your Database Schema.