Jun 17, 2024
Posted by Matvey Arye
Embedding data stored in a PostgreSQL table is undoubtedly useful, with applications ranging from semantic search and recommendation systems to generative AI and retrieval-augmented generation (RAG). But creating and managing embeddings for data in PostgreSQL tables can be tricky, with many considerations and edge cases to take into account: keeping embeddings up to date with table updates and deletes, ensuring resilience against failures, and limiting the impact on existing systems that depend on the table.
In a previous blog post, we gave a step-by-step guide to creating and managing embeddings for data residing in PostgreSQL using PgVectorizer, our simple and resilient embedding-creation system. Using a blog application with data stored in a PostgreSQL database as an example, we covered how to create vector embeddings and keep them up to date using Python, LangChain, and pgai on Timescale.
In this blog post, we’ll discuss the technical design decisions and the trade-offs we made while building PgVectorizer to ensure simplicity, resilience, and high performance. We’ll also discuss alternative designs if you want to roll your own.
Let’s jump into it.
First, let’s describe how the system we are building will work. Feel free to skip this section if you already read the PgVectorizer post.
As an illustrative example, we’ll use a simple blog application storing data in PostgreSQL using a table defined as follows:
CREATE TABLE blog (
    id SERIAL PRIMARY KEY NOT NULL,
    title TEXT NOT NULL,
    author TEXT NOT NULL,
    contents TEXT NOT NULL,
    category TEXT NOT NULL,
    published_time TIMESTAMPTZ NULL --NULL if not yet published
);
We want to create embeddings of each blog post's contents so we can later use them for semantic search and to power retrieval-augmented generation. Embeddings should only exist, and be searchable, for blogs that have been published (where published_time IS NOT NULL).
While building this embeddings system, we were able to identify a number of goals that any straightforward and resilient system that creates embeddings should have:
These guidelines were the basis of a robust architecture that we implemented using the Python vector library, a library for working with vector data in PostgreSQL. To complete the job, we added new functionality to this library, PgVectorizer, to make embedding PostgreSQL data as simple as possible.
Here’s the architecture we settled on:
In this design, we first add a trigger to the blog table that monitors for changes and, upon seeing a modification, inserts a job into the blog_embedding_work_queue table indicating that a row in the blog table is out of date with its embedding.
On a fixed schedule, an embeddings-creator job will poll the blog_embedding_work_queue table and, if it finds work to do, will do the following in a loop:
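The shape of that polling loop can be sketched in a few lines of Python. Here, `process_queue_batch` is a stand-in for whatever function claims one batch of queued IDs, embeds them, and returns the number of rows it handled; the name is ours for illustration, not PgVectorizer's:

```python
def drain_queue(process_queue_batch):
    """Repeatedly process batches until the work queue reports empty.

    `process_queue_batch` is assumed to drain one batch and return how
    many rows it handled (0 when the queue is empty). Returns the total
    number of rows processed during this run.
    """
    total = 0
    while True:
        handled = process_queue_batch()
        if handled == 0:
            return total
        total += handled
```

A scheduler (cron, a background worker, etc.) would invoke something like `drain_queue` on a fixed interval; because each batch is claimed atomically, several of these loops can safely run side by side.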
To see this system in action, check out this blog post for an example of creating and maintaining embeddings in a PostgreSQL table using OpenAI, LangChain, and pgai on Timescale.
Going back to the example of our blog application table, on a high level, PgVectorizer has to do two things:
Both of these have to be highly concurrent and performant. Let’s see how it works.
You can create a simple work queue table with the following:
CREATE TABLE blog_embedding_work_queue (
    id INT
);
CREATE INDEX ON blog_embedding_work_queue(id);
This is a very simple table, but there is one item of note: this table has no unique key. This was done to avoid locking issues when processing the queue, but it does mean that we may have duplicates. We discuss the trade-off later in Alternative 1 below.
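Since duplicates are allowed, the processing side must treat them as harmless, which it does by collapsing them with SELECT DISTINCT when claiming a batch. The effect, expressed in plain Python terms, is simply:

```python
def collapse_duplicates(queued_ids):
    """Collapse duplicate queue entries within a batch.

    A duplicate entry only means a row's embedding would be recreated
    twice; dropping the repeats avoids that extra work without needing
    a unique key on the queue table.
    """
    return sorted(set(queued_ids))
```

For example, `collapse_duplicates([5, 3, 5, 1])` yields `[1, 3, 5]`: the duplicate costs nothing beyond the extra row in the queue table.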
Then you create a trigger to track any changes made to blog:
CREATE OR REPLACE FUNCTION blog_wq_for_embedding() RETURNS TRIGGER LANGUAGE PLPGSQL AS $$
BEGIN
    IF (TG_OP = 'DELETE') THEN
        INSERT INTO blog_embedding_work_queue VALUES (OLD.id);
    ELSE
        INSERT INTO blog_embedding_work_queue VALUES (NEW.id);
    END IF;
    RETURN NULL;
END;
$$;

CREATE TRIGGER track_changes_for_embedding
AFTER INSERT OR UPDATE OR DELETE
ON blog
FOR EACH ROW EXECUTE PROCEDURE blog_wq_for_embedding();
INSERT INTO blog_embedding_work_queue
SELECT id FROM blog WHERE published_time is NOT NULL;
The trigger inserts the ID of the changed blog row into blog_embedding_work_queue. We install the trigger first and then insert any existing published blogs into the work queue. This ordering is important to make sure that no IDs get dropped.
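To make the trigger's branching concrete, here is a small Python analogue of blog_wq_for_embedding() (illustrative only; the real logic is the PL/pgSQL function above): DELETE enqueues the old row's ID, while INSERT and UPDATE enqueue the new one.

```python
def enqueue_for_embedding(work_queue, tg_op, old_id=None, new_id=None):
    """Mimic the trigger: record which row's embedding is now stale."""
    if tg_op == 'DELETE':
        work_queue.append(old_id)  # row is gone: its embedding must be removed
    else:
        work_queue.append(new_id)  # INSERT/UPDATE: embedding must be (re)built

queue = []
enqueue_for_embedding(queue, 'INSERT', new_id=1)
enqueue_for_embedding(queue, 'UPDATE', old_id=1, new_id=1)
enqueue_for_embedding(queue, 'DELETE', old_id=2)
# queue now holds [1, 1, 2] -- note the duplicate, which is fine by design
```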
Now, let’s describe some alternative designs and why we rejected them.
Introducing a unique or primary key on the work-queue table would eliminate the problem of duplicate entries. However, it's not without its challenges, particularly because such a key would force us to use an INSERT … ON CONFLICT DO NOTHING clause to insert new IDs into the table, and that clause takes a lock on the ID in the B-tree.
Here's the dilemma: during the processing phase, it's necessary to delete the rows being worked on to prevent simultaneous processing. Yet, committing this deletion can only be done after the corresponding embedding has been placed into blog_embeddings. This ensures no IDs are lost if there's a disruption midway—say, if the embedding creation crashes post-deletion but before the embedding is written.
Now, if we create a unique or primary key, the transaction overseeing the deletion stays open. Consequently, this acts as a lock on those specific IDs, preventing their insertion back into blog_embedding_work_queue for the entire duration of the embedding-creation job. Given that creating embeddings takes longer than a typical database transaction, this spells trouble. The lock would stall the trigger on the main blog table, degrading the primary application's performance. Making things worse, when processing multiple rows in a batch, deadlocks become a potential problem as well.
However, the potential issues arising from occasional duplicate entries can be managed during the processing stage, as illustrated later. A sporadic duplicate here and there isn't a problem as it only marginally increases the amount of work the embedding job performs. This is certainly more palatable than grappling with the above-mentioned locking challenges.
Another option is to add a column to the blog table to track whether an up-to-date embedding exists. For example, we could add an embedded boolean column set to false on modification and flipped to true when the embedding is created. There are three reasons to reject this design, chief among them that flipping the flag requires writing back to the blog table, for the reasons we already mentioned above. A separate work-queue table solves these issues.
This approach has several issues:
Now that we have a list of blogs that need to be embedded, let's process it!
There are many ways to create embeddings. We recommend using an external Python script. This script will scan the work queue and the related blog posts, invoke an external service to craft the embeddings, and then store these embeddings back into the database. Our reasoning for this strategy is as follows:
The jobs must be both performant and concurrency-safe. Concurrency guarantees that if jobs start running behind, the schedulers can start more jobs to help the system catch up and handle the load.
We’ll go through how to set up each of those methods later, but first, let’s see what the Python script would look like. Fundamentally, the script has three parts:
Steps 2 and 3 are performed by an embed_and_write callback that we define in the PgVectorizer blog post. So, let's look more deeply at how we process the work queue.
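We won't reproduce embed_and_write here (it's defined in the PgVectorizer post), but its overall shape is roughly the following sketch. The chunker and the embed/write/delete hooks are hypothetical stand-ins for the real LangChain/pgai pieces:

```python
def chunk_text(text, size=1000):
    """Naive fixed-size chunker; a real system would use a smarter splitter."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed_and_write(rows, embed_fn, write_fn, delete_fn):
    """Process one batch of queued rows.

    For each row: drop any stale embeddings first, then, if the blog is
    still published, chunk its contents, embed each chunk, and write the
    results back. A NULL published_time marks a deleted or unpublished
    blog, whose embeddings must stay deleted.
    """
    for row in rows:
        delete_fn(row["locked_id"])          # remove stale embeddings
        if row.get("published_time") is None:
            continue                          # deleted/unpublished: stay deleted
        for chunk in chunk_text(row["contents"]):
            write_fn(row["locked_id"], chunk, embed_fn(chunk))
```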
We’ll first show you the code and then highlight the key elements at play:
import psycopg2
import psycopg2.extras

def process_queue(embed_and_write_cb, batch_size: int = 10):
    with psycopg2.connect(TIMESCALE_SERVICE_URL) as conn:
        with conn.cursor(cursor_factory=psycopg2.extras.DictCursor) as cursor:
            cursor.execute("""
                SELECT to_regclass('blog_embedding_work_queue')::oid;
            """)
            table_oid = cursor.fetchone()[0]
            cursor.execute(f"""
                WITH selected_rows AS (
                    SELECT id
                    FROM blog_embedding_work_queue
                    LIMIT {int(batch_size)}
                    FOR UPDATE SKIP LOCKED
                ),
                locked_items AS (
                    SELECT id,
                           pg_try_advisory_xact_lock({int(table_oid)}, id) AS locked
                    FROM (
                        SELECT DISTINCT id
                        FROM selected_rows
                        ORDER BY id
                    ) AS ids
                ),
                deleted_rows AS (
                    DELETE FROM blog_embedding_work_queue
                    WHERE id IN (
                        SELECT id
                        FROM locked_items
                        WHERE locked = true
                        ORDER BY id
                    )
                )
                SELECT locked_items.id AS locked_id, blog.*
                FROM locked_items
                LEFT JOIN blog ON blog.id = locked_items.id
                WHERE locked = true
                ORDER BY locked_items.id
            """)
            res = cursor.fetchall()
            if len(res) > 0:
                embed_and_write_cb(res)
            return len(res)

process_queue(embed_and_write)
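Because batches are claimed with SKIP LOCKED plus advisory locks, several copies of process_queue can safely run at once. One way to fan out workers is with threads; the wiring below is our own sketch (any scheduler that runs multiple processes works equally well):

```python
from concurrent.futures import ThreadPoolExecutor

def drain_concurrently(process_fn, workers=4):
    """Run several queue drainers in parallel.

    Each worker calls `process_fn` (e.g., process_queue bound to its own
    connection) until it sees an empty queue. Safe because each batch is
    claimed with SKIP LOCKED + advisory locks, so no two workers process
    the same row. Returns the total number of rows processed.
    """
    def worker():
        handled = 0
        while True:
            got = process_fn()
            if got == 0:
                return handled
            handled += got

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker) for _ in range(workers)]
        return sum(f.result() for f in futures)
```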
The SQL code in the above snippet is subtle because it is designed to be both performant and concurrency-safe, so let’s go through it:
While one could counter this issue by introducing explicit version tracking, that adds considerable complexity without any performance benefit. The strategy we opted for not only mitigates this issue but also prevents redundant operations and wasted work across concurrently executing scripts.
A Postgres advisory lock, prefixed with the table's OID to avoid potential overlaps with other such locks, is employed. The try variant, analogous to the earlier use of SKIP LOCKED, ensures the system avoids waiting on locks. The inclusion of the ORDER BY id clause helps prevent potential deadlocks. We'll cover some alternatives below.
In the embed_and_write callback, we use a NULL published_time as a sentinel for the blog being deleted (or unpublished, in which case we also want to delete the embedding).

If the system already uses advisory locks and you are worried about collisions, it's possible to use a table with the blog ID as the primary key and lock its rows. In fact, this can be the blog table itself if you are sure these locks won't slow down any other system (remember, these locks have to be held throughout the embedding process, which can take a while).
Alternatively, you can have a blog_embedding_locks table just for this purpose. We didn’t suggest creating that table because we think it can get quite wasteful in terms of space, and using advisory locks avoids this overhead.
We introduced PgVectorizer and outlined a system adept at generating vector embeddings from data stored in PostgreSQL and automatically keeping them up to date. This architecture ensures the embeddings remain synchronized with the perpetually evolving data, responding seamlessly to insertions, modifications, and deletions.
In this blog post, we gave you a behind-the-scenes look at how we created a system that boasts resilience, effectively handling potential downtimes of the embedding-generation service. Its design is adept at managing a high rate of data modifications and can seamlessly use concurrent embedding-generation processes to accommodate heightened loads.
Moreover, the paradigm of committing data to PostgreSQL and using the database to manage embedding generation in the background emerges as an easy mechanism for supervising embedding upkeep amidst data modifications. Many demos and tutorials in the AI space focus solely on the initial creation of embeddings from documents, overlooking the intricate nuances of keeping them synchronized as the data evolves.
However, in real production environments, data invariably changes, and grappling with the complexities of tracking and synchronizing these shifts is no trivial endeavor. But that’s what a database is designed to do! Why not just use it?
Here are some resources to continue your learning journey: