How to Index JSON Columns in PostgreSQL

Try for free

Start supercharging your PostgreSQL today.

Written by Warda Bibi

JSON (JavaScript Object Notation) is an open standard format that structures data using key/value pairs and arrays, as defined in RFC 7159. It is widely used for data exchange, document storage, and handling unstructured data in web services.

In this article, we will explore various techniques and best practices for efficiently indexing JSON data in PostgreSQL. If you want to learn more, be sure to check out our dedicated articles on JSON and JSONB in PostgreSQL.

Why Use JSON in PostgreSQL?

At first glance, the idea of storing unstructured data like JSON in a relational database like PostgreSQL might seem counterintuitive. After all, relational databases are designed for structured, tabular data. However, there are compelling reasons why JSON support in PostgreSQL is not just useful but often essential for today's data-intensive applications.

Let’s explore some key scenarios where JSON is helpful:

Adapting to changing data structures

One of the biggest challenges in data management is dealing with evolving schemas. Traditional relational databases require a fixed structure, but real-world data often involves dynamic structures. JSON offers flexibility, allowing us to store data without rigidly defining its structure upfront. This is especially valuable for use cases like event logging, analytics, or tagging systems, where the data format can change frequently. Instead of constantly altering the database schema (which can be resource-intensive), JSON lets us adapt seamlessly. 

Simplifying nested data

When handling deeply nested or hierarchical data, JSON provides a more natural way to represent it. Flattening such structures into multiple tables or columns can lead to unnecessary complexity. With JSON, we can store and query nested objects directly, making our database design cleaner and more intuitive.

Streamlining data integration

In many cases, data comes from external systems in JSON format, such as APIs, third-party services, or modern applications. Instead of immediately transforming this data to fit a relational model, PostgreSQL allows us to store it as-is. This can serve as a temporary staging area before further processing or integration into other parts of our system.

JSON support was first introduced in PostgreSQL 9.2  in 2012. Over the years, PostgreSQL has evolved with powerful JSON features that allow it to natively handle arbitrary data structures. As of now, PostgreSQL provides two data types for working with JSON data: JSON and JSONB.

JSON vs. JSONB

The JSON and JSONB data types in PostgreSQL accept almost identical input. Their main difference is their efficiency.

  • The JSON data type stores the raw JSON text exactly as it is, including any spaces, line breaks, duplicates, and the order of keys in JSON objects. 

  • On the other hand, JSONB (here, B stands for binary) stores JSON data in a binary format. It converts the input into a binary representation before storing it.

  • This JSONB format eliminates the need to reparse the data each time it is accessed, leading to significantly better performance for querying and processing. However, the trade-off is that writing data to a JSONB column incurs some overhead due to the conversion from text to binary. This makes JSONB slightly slower to insert than JSON, but much faster for querying and processing.

  • JSONB also normalizes data by removing unnecessary whitespace and formatting, and it does not preserve the order of keys in objects. Additionally, it does not allow duplicate keys in JSON objects; only the last occurrence of a key is stored, and earlier duplicates are discarded.

  • One of JSONB's most significant advantages is its advanced support for indexing, which can drastically improve query performance, especially on large datasets.

When it comes to use cases, you should choose JSON if you don’t plan on performing frequent queries or need to preserve the exact format of the input data, such as for logging purposes. On the other hand, if you need to query or index the data, or if performance is a priority, then JSONB is the better choice.

Why Index JSON/JSONB?

Without indexes, PostgreSQL processes queries by performing sequential full-table scans, examining each row in the table to find matches. As the dataset grows, this approach becomes inefficient, leading to slower query execution due to increased I/O operations and higher CPU usage. By using an index, PostgreSQL can avoid making these repeated, slow comparisons. Instead of comparing entire columns row-by-row, the index can be used to quickly identify rows that match the condition, dramatically improving performance, even for large datasets. 

As JSON and JSONB are designed to store hierarchical structures, this can make querying even more challenging compared to flat, tabular data. Without proper indexing, PostgreSQL has to traverse the entire JSON structure for each row to find matches, which can be inefficient, even for smaller datasets.

PostgreSQL indexes solve this problem by allowing us to quickly pinpoint the relevant parts of the JSONB data, bypassing the need to scan and parse the entire structure repeatedly. We've written extensively about fine-tuning index performance in PostgreSQL, so check it out to learn more.

JSONB Indexes

PostgreSQL provides several options to index JSONB data, such as:

  • GIN

  • BTREE

  • HASH

  • TRIGRAM with GIN

It’s important to note that not all index types support every operator or query pattern. Choosing the right index depends on the types of queries you run most frequently. Let’s dive into each index type to understand its strengths and use cases.

GIN (Generalized Inverted Index) 

GIN stands for “Generalized Inverted indexes.” GIN is designed to index composite values, such as arrays or JSON fields, in tables. It is particularly useful for queries that need to efficiently search for elements within these composite items. For example, the items could be documents, and the queries could be searches for documents containing specific words. To understand GIN indexes, let's say you are managing a large e-commerce platform that processes millions of transactions daily. Here’s what the data might look like:

id

user_id

amount

details (JSONB)

1

101

500

{"tags": ["electronics", "laptop", "discount"], "gift_points": false}

2

103

1500

{"tags": ["electronics", "phone"], "gift_points": true, "express_shipping": true}

3

104

200

{"tags": ["home_appliances", "discount"], "gift_points": false}

4

101

100

{"tags": ["beauty", "skincare"], "gift_points": true, "first_time_buyer": true}

Now, suppose your marketing team frequently searches for discount-related transactions to analyze the effectiveness of promotions. Without an index, PostgreSQL would have to:

  • Scan each row in the table.

  • Parse the JSONB column.

  • Check if the "discount" tag exists in the data.

This approach becomes extremely slow when dealing with millions of transactions. A GIN index optimizes these queries by mapping each unique tag to the corresponding transaction IDs that contain it.

For example, If the "tags" field contains "discount" in transaction 1, the GIN index creates an entry mapping:                      

 "discount" → {1}.

If "discount" also  appears in transaction 3, the index updates the mapping to: 

"discount" → {1, 3}.

This mapping allows the GIN index to quickly reference the transactions (rows) where the discount tag exists, eliminating the need to scan every transaction. 

GIN structure

Instead of searching every transaction manually for specific tags, the GIN index has a list of all the tags found in the transactions. Each tag in the list points to a smaller list of rows that contain that tag.

Entry Page (Keywords Level)
  • Think of each entry as a unique tag from our JSONB column, such as "discount," "electronics," "laptop," etc.

  • Each keyword (like "discount") acts as an index entry, helping us find transactions containing that tag.

Entry Page ( Leaf Nodes)
  • If a tag (like "discount") appears in a few transactions, PostgreSQL stores a sorted list of those transaction IDs inside a list called the posting list.

  • If the tag appears too many times, it doesn’t fit into a simple list, so PostgreSQL uses a posting tree (B-Tree structure) instead. This makes searching even faster!

Posting Pages ( Item Pointers)
  • These store references to actual rows in the transactions table.

  • If "discount" appears in transaction IDs 1 and 3, the posting list would contain {1, 3}.

  • If "electronics" appears in many transactions (1, 2, ...1000+), PostgreSQL builds a tree structure for better search performance which further uses these posting pages for exact item pointers.

Operator classes:

An operator class can be thought of as a rulebook that provides GIN with instructions on how to manage specific data types. Each operator class defines the operations that can be performed on the data. GIN supports two operator classes:

  • jsonb_ops (default): ?, ?|, ?&, @>, @@, @?

    •  Index each key and value in the JSONB element

  • jsonb_pathops: @>, @@, @?

    • This uses an optimized GIN index structure that stores the key/value pairs as hash values: hashfn(field1, value1), hashfn(field2, value2), and hashfn(other_field, value2). 

Syntax:

CREATE INDEX index_name ON table_name USING gin (jsonb_column_name);

Existence Operators (?, ?|, ?& )

These operators can be used to check for the existence of top-level keys in the JSONB. Let’s create a GIN index on the details JSONB column. 

CREATE INDEX details_gin_idx ON transactions USING GIN (details);

To find all the transactions that have an express shipping tag.

​​postgres=# EXPLAIN ANALYZE SELECT * FROM transactions                                                                                                                                                       WHERE details ? 'express_shipping';

                            QUERY PLAN                                                              ------------------------------------------------------------------------------  Bitmap Heap Scan on transactions  (cost=2017.04..5660.14 rows=1166 width=56) (actual time=89.180..445.504 rows=1000001 loops=1)    Recheck Cond: (details ? 'express_shipping'::text)    Heap Blocks: exact=16656    ->  Bitmap Index Scan on details_gin_idx  (cost=0.00..2016.74 rows=1166 width=0) (actual time=87.190..87.191 rows=1000001 loops=1)          Index Cond: (details ? 'express_shipping'::text)  Planning Time: 0.184 ms  Execution Time: 463.929 ms (7 rows)

As you can see from the EXPLAIN output, the GIN index we created is being used for the search. Let's say we want to find users who are first-time buyers and have tags:

​​postgres=# EXPLAIN ANALYZE SELECT * FROM transactions WHERE details ?&  ARRAY['first_time_buyer', 'tags'];

                              QUERY PLAN                                                         ------------------------------------------------------------------------------  Bitmap Heap Scan on transactions  (cost=444.00..448.01 rows=1 width=102) (actual time=1.408..1.409 rows=1 loops=1)    Recheck Cond: (details ?& '{first_time_buyer,tags}'::text[])    Heap Blocks: exact=1    ->  Bitmap Index Scan on details_gin_idx  (cost=0.00..444.00 rows=1 width=0) (actual time=1.383..1.384 rows=1 loops=1)          Index Cond: (details ?& '{first_time_buyer,tags}'::text[])  Planning Time: 0.173 ms  Execution Time: 1.475 ms

The GIN index only supports the “existence” operators on “top-level” keys. If the key is not at the top level, then the index will not be used. It will result in a sequential scan:

postgres=# EXPLAIN ANALYZE SELECT * FROM transactions                           WHERE details->'tags' ? 'discount';

                          QUERY PLAN                                                        ------------------------------------------------------------------------------  Seq Scan on transactions  (cost=0.00..31656.10 rows=10000 width=102) (actual time=1.036..163.573 rows=299887 loops=1)    Filter: ((details -> 'tags'::text) ? 'discount'::text)    Rows Removed by Filter: 700120  Planning Time: 0.289 ms  Execution Time: 170.906 ms (5 rows)

The GIN index was not used because "tags" is a nested key.

Path Operators @>, <@:

The path operator tests whether one document contains another. This operator can compare partial JSON strings against a JSONB column. Let’s use it similar to the ? operator above:

postgres=# EXPLAIN ANALYZE SELECT * FROM transactions                           WHERE details @> '{"express_shipping" :true}';

                       QUERY PLAN                                                                ------------------------------------------------------------------------------  Bitmap Heap Scan on transactions  (cost=2007.19..21184.45 rows=201701 width=102) (actual time=90.695..230.534 rows=200056 loops=1)    Recheck Cond: (details @> '{"express_shipping": true}'::jsonb)    Rows Removed by Index Recheck: 399708    Heap Blocks: exact=16656    ->  Bitmap Index Scan on details_gin_idx  (cost=0.00..1956.76 rows=201701 width=0) (actual time=88.512..88.512 rows=599764 loops=1)          Index Cond: (details @> '{"express_shipping": true}'::jsonb)  Planning Time: 0.425 ms  Execution Time: 234.453 ms (8 rows)

The path operators support queries on both nested and top-level objects:

postgres=# explain analyze SELECT * FROM transactions                           WHERE details @> '{"tags" :"discount"}';

                              QUERY PLAN                                                         ------------------------------------------------------------------------------  Bitmap Heap Scan on transactions  (cost=444.00..448.01 rows=1 width=102) (actual time=1.484..1.486 rows=0 loops=1)    Recheck Cond: (details @> '{"tags": "discount"}'::jsonb)    ->  Bitmap Index Scan on details_gin_idx  (cost=0.00..444.00 rows=1 width=0) (actual time=1.480..1.480 rows=0 loops=1)          Index Cond: (details @> '{"tags": "discount"}'::jsonb)  Planning Time: 0.274 ms  Execution Time: 1.563 ms (6 rows)

B-Tree indexes

B-tree indexes are the most commonly used index type in relational databases. They are highly efficient for range queries and equality checks, making them ideal for structured data. However, when applied to an entire JSONB column, B-tree indexes only support whole-object comparisons using the operators =, <, <=, >, and >=. This is rarely useful in practice because JSONB objects are typically complex and contain nested values.

A more practical approach is to use B-tree expression indexes, which allow indexing specific keys or nested values inside a JSONB column. These indexes support the common comparison operators (=, <, >, <=, >=), which GIN indexes do not support.

Let's say we want to find all transactions where the first_time_buyer field is true.

postgres=# explain analyze SELECT * FROM transactions WHERE (details->'first_time_buyer')::boolean = true;

                                QUERY PLAN                                                      ------------------------------------------------------------------------------  Seq Scan on transactions  (cost=0.00..31656.10 rows=333336 width=102) (actual time=0.047..113.492 rows=1 loops=1)    Filter: ((details -> 'first_time_buyer'::text))::boolean    Rows Removed by Filter: 1000006  Planning Time: 0.190 ms  Execution Time: 113.520 ms (5 rows) Even though there is a GIN index in the details column, it does not support the = operator. As a result, PostgreSQL performs a full sequential scan, making the query inefficient. To optimize this query, we can create a B-tree expression index on first_time_buyer inside the details JSONB column:

postgres=# CREATE INDEX idx_btree_first_time_buyer ON transactions USING BTREE (((details->'first_time_buyer')::boolean));

postgres=# explain analyze SELECT * FROM transactions WHERE (details->'first_time_buyer')::boolean = true;

                             QUERY PLAN                                                              -----------------------------------------------------------------------------  Bitmap Heap Scan on transactions  (cost=5533.79..29689.85 rows=333336 width=102) (actual time=0.073..0.075 rows=1 loops=1)    Filter: ((details -> 'first_time_buyer'::text))::boolean    Heap Blocks: exact=1    ->  Bitmap Index Scan on idx_btree_first_time_buyer  (cost=0.00..5450.45 rows=500004 width=0) (actual time=0.051..0.052 rows=1 loops=1)          Index Cond: (((details -> 'first_time_buyer'::text))::boolean = true)  Planning Time: 0.782 ms  Execution Time: 0.115 ms (7 rows)

Here, adding a B-tree index on first_time_buyer improved query performance, reducing execution time from 113 ms to just 0.1 ms.

Hash Indexes 

Hash indexes are a specialized type of index in PostgreSQL that are optimized for equality checks (i.e., the = operator). They are particularly useful when we need to perform exact matches on specific JSONB fields, whether those fields are top-level or deeply nested.

Hash indexes tend to be smaller than B-tree or GIN indexes, which can be beneficial for storage.

CREATE INDEX idx_hash_express ON transactions USING HASH (((details->'express_shipping')::boolean)); This index extracts the publisher field from the JSONB data column and creates a hash index on the extracted values.

postgres=# explain analyze SELECT * FROM transactions WHERE (details->'express_shipping')::boolean = true;

       QUERY PLAN                                                                 ------------------------------------------------------------------------------  Bitmap Heap Scan on transactions:  (cost=24615.32..60866.85 rows=334201 width=102) (actual time=22.495..362.902 rows=203054 loops=1)    Filter: ((details -> 'express_shipping'::text))::boolean    Heap Blocks: exact=16707    ->  Bitmap Index Scan on idx_hash_express  (cost=0.00..24531.76 rows=501302 width=0) (actual time=18.872..18.872 rows=203054 loops=1)          Index Cond: (((details -> 'express_shipping'::text))::boolean = true)  Planning Time: 0.247 ms  Execution Time: 367.889 ms (7 rows)

This shows that the query uses the hash index to quickly locate the row where express_shipping equals true.

  • fields: Indexing every field can lead to index bloat and increased storage usage.

GIN Trigram Indexes

PostgreSQL supports a powerful feature called trigram indexes for efficient string matching, especially when dealing with partial or fuzzy searches. When combined with GIN indexes, trigram indexes can be used to index and query JSONB fields for arbitrary substring matches.

Trigram indexes work by breaking up text into trigrams, which are sequences of three consecutive characters. For example, the word "hello" is broken into the trigrams: "hel", "ell", "llo".

These trigrams are then indexed, allowing PostgreSQL to efficiently search for patterns within strings, even if the pattern appears in the middle of the text.

When working with JSONB data, we might need to search for partial matches within string fields, such as finding a tag field that contains a specific substring ("dis").

Before creating a trigram index, we need to enable the pg_trgm extension, which provides the necessary functions and operators for trigram indexing.

CREATE EXTENSION pg_trgm; Now we can create a GIN trigram index on a specific JSONB field using the gin_trgm_ops operator class.

CREATE INDEX tags_trgms_idx ON transactions USING GIN ((details->>'tags') gin_trgm_ops);

We can now perform efficient substring searches on the tags field.

Explain analyze SELECT * FROM transactions WHERE (details->>'tags')::text LIKE '%dis%';                            QUERY PLAN                                                             ------------------------------------------------------------------------------  Bitmap Heap Scan on transactions  (cost=414.93..31248.46 rows=40120 width=102) (actual time=42.538..157.725 rows=300796 loops=1)    Recheck Cond: ((details ->> 'tags'::text) ~~ '%dis%'::text)    Heap Blocks: exact=16707    ->  Bitmap Index Scan on tags_trgms_idx  (cost=0.00..404.90 rows=40120 width=0) (actual time=38.784..38.784 rows=300796 loops=1)          Index Cond: ((details ->> 'tags'::text) ~~ '%dis%'::text)  Planning Time: 0.336 ms  Execution Time: 164.067 ms (7 rows) This shows that the query uses the GIN trigram index (tags_trgms_idx) to efficiently locate rows where the tags field contains the substring "dis".

Note: Regular B-tree indexes are left-anchored, meaning they only work efficiently for patterns that start at the beginning of the string ( LIKE 'DIS%'). However, GIN trigram indexes allow us to search for arbitrary substrings (LIKE '%DIS%').

Conclusion

Indexes are powerful tools and are often essential for optimizing performance. However, they must be used thoughtfully, especially on tables with frequent write operations, as improper indexing can lead to overhead and inefficiencies. By carefully selecting the right indexing strategies based on your query patterns, JSONB's advanced indexing capabilities enhance query performance and maintain a well-balanced, high-performing database.

If you want to get even more performance out of your PostgreSQL database, try TimescaleDB. Available as a standalone on-prem solution or as part of Timescale Cloud's fully managed PostgreSQL platform, which suits the needs of mission-critical applications, TimescaleDB extends PostgreSQL for real-time analytics, events, vector data, and time-series data.

If you think PostgreSQL can't scale, you're probably doing it wrong—Timescale can fix it. Install TimescaleDB on your machine today or skip the extra steps and create a 30-day free Timescale Cloud account.

Learn more