Oct 29, 2024
"Your embeddings are out of sync again."
It's a message that haunts engineering teams building AI applications. What starts as a simple vector search implementation inevitably evolves into a complex web of monitoring, synchronization, and firefighting.
We've spent the past year talking to dozens of engineering teams building AI systems with vector databases, whether for semantic search, retrieval-augmented generation (RAG), or AI agents. We learned that while everything works smoothly for simple applications and PoCs, taking these AI systems into production exposes flaws in the abstractions vector databases offer and in the way we use them today.
Here’s one common thread we heard:
"You're building a RAG system, and your team uses Pinecone as a vector database to store and search embeddings. But you can't just use Pinecone—your text data doesn't fit well into Pinecone's metadata, so you're also using DynamoDB to handle those blobs and application data. And for lexical search, you needed OpenSearch. Now you're juggling three systems, and syncing them is a nightmare."
It’s easy to see why. Say you delete a source document in your database because it's outdated. Now you have to:
- Delete its embeddings from the Pinecone index
- Delete the corresponding blob and application data from DynamoDB
- Delete the document from the OpenSearch lexical index
And you need to do this for every update, addition, or delete to your source documents!
Managing these configurations isn’t just messy—it’s risky. One missed step, and you're stuck paying for an index you haven’t used in months. One team recently shared how they ended up paying $2,000 a month for an index that should've been deleted four months ago! And that’s not even mentioning the risk of returning incorrect or stale data to your users.
Sound familiar? After hearing this pattern repeatedly, we realized something crucial:
Vector databases are built on the wrong abstraction.
Vector databases treat embeddings as independent data, divorced from the source data they are created from, rather than what they truly are: derived data. By treating embeddings as independent data, we’ve created unnecessary complexity for ourselves.
In this post, we'll propose a better way: treating embeddings more like database indexes through what we call the "vectorizer" abstraction. This approach automatically keeps embeddings in sync with their source data, eliminating the maintenance costs that plague current implementations.
A vectorizer’s equivalent of the CREATE INDEX command is:
-- Create a vectorizer to embed data in the blogs table
-- Use the OpenAI text-embedding-3-small model
SELECT ai.create_vectorizer(
    'public.blogs'::regclass,
    embedding => ai.embedding_openai('text-embedding-3-small', 1536),
    chunking => ai.chunking_recursive_character_text_splitter('content')
);
This one command creates embeddings for every entry in the blogs table and keeps those embeddings updated as the data in the blogs table changes. In the immortal words of Ina Garten: how easy is that!
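Once the vectorizer exists, search is just SQL. Here's a minimal sketch of a semantic search query, assuming the vectorizer's default destination view (blogs_embedding) and using pgai's ai.openai_embed helper to embed the query text at search time:

-- Semantic search over the vectorizer-managed embeddings
-- (assumes the default destination view 'blogs_embedding')
SELECT
    title,
    chunk,
    embedding <=> ai.openai_embed('text-embedding-3-small', 'postgres performance tips') AS distance
FROM blogs_embedding
ORDER BY distance
LIMIT 5;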
We’ll detail how we implemented this vectorizer abstraction in PostgreSQL in a new open-source tool called pgai Vectorizer, making it work with the open-source pgvector and pgvectorscale extensions for vector search.
We built a vectorizer for PostgreSQL because many developers regard PostgreSQL as the “Swiss army knife” of databases, as it can handle everything from vectors and text data to JSON documents. We think an “everything database” like PostgreSQL is the solution to eliminate the nightmare of managing multiple databases, making it the ideal home for vectorizers and the foundation for AI applications.
If you've ever cursed at stale embeddings or questioned why managing vector search feels needlessly complex, this post is for you.
Vector databases were developed as specialized systems to handle large volumes of vector embeddings for text, image, and multi-modal data. And, as the utility of vector embeddings became clearer, many general-purpose databases, like PostgreSQL (with pgvector), MySQL, MongoDB, and even Oracle, added vector search support, either natively or via extensions.
However, the abstraction of vector search capabilities, either in a standalone system or added to an existing database, suffers from a fatal flaw: once embeddings are inserted into the database, we lose the connection between the unstructured data that is being embedded and the vector embeddings themselves.
Without this connection, embeddings are mistakenly treated as standalone data atoms that developers must manage rather than what they truly are: derived data. After all, vector embeddings are high-dimensional representations of their source data, whether the source data be text, images, or something else. There is a fundamental connection between the vector embedding and the source data it is generated from.
When we reconceptualize embeddings as derived data, the absurdity of the current vector database abstraction, with embeddings disconnected from their source data, becomes evident. To handle the demands of a production AI system, even the best teams find themselves juggling multiple databases and managing the following:
- Tracking every insert, update, and delete on the source data
- Queueing, batching, and triggering calls to the embedding service as data changes
- Retrying failed embedding requests and handling rate limits
- Cleaning up stale or orphaned embeddings when source rows are deleted or re-embedded
And what if you want to upgrade to a newer embedding model or try a different chunking method? Now you're writing custom code and coordinating changes across multiple data services and databases. For example, we’ve heard from teams that have put off migrating from OpenAI’s text-embedding-ada-002 to the newer text-embedding-3 model family for this very reason.
These tasks place the burden on the development team to ensure that embeddings are created promptly as source data changes. Otherwise, embeddings frequently become stale, leading to a worse application experience for users.
Thus, even in the best case, teams spend countless hours writing and debugging synchronization logic, setting up infrastructure to handle embedding updates at scale, and firefighting when these systems inevitably break down.
When we reconceptualize embeddings as derived data, the responsibility for generating and updating them as the underlying data changes can be handed over to the database management system. This change frees developers from the burden of manually keeping embeddings in sync with the source data. (*Cue collective developer rejoicing*)
This distinction may not seem critical for simple applications that perform a one-time data import for RAG. But for most real-world applications, data is constantly changing. Consider an e-commerce platform that uses embedding-based semantic search across its product catalog, always evolving with new products and updated descriptions. Or a product assistant RAG app that must be kept up to date with the latest product information to give accurate answers to prospective and current customers.
Manually tracking these changes and regenerating embeddings is not only labor-intensive and error-prone but also distracts developers from focusing on core business objectives. Why waste valuable development time managing this manually when the database system could handle it for you automatically?
A more effective abstraction is conceptualizing vector embeddings not as independent tables or data types but as a specialized index on the embedded data. This is not to say that vector embeddings are literally indexes in the traditional sense, like those in PostgreSQL or MySQL, which retrieve entire data rows from indexed tables. Instead, vector embeddings function as an indexing mechanism that retrieves the most relevant parts of the data based on its embeddings.
Rather than indexes, we can call this new index-like abstraction a “vectorizer,” as it creates vectors from the underlying source data it is connected to (in other words, vectorizes them). The vectorizer abstraction offers several key advantages:
One of the primary benefits of indexing in databases is that it automatically keeps the index in sync with the underlying data. When the data in a column changes, the index is updated accordingly. By treating vector embeddings as a form of indexing, we can leverage this same automatic synchronization. The system would ensure that vector embeddings are always up-to-date with the latest data, eliminating the need for manual updates and reducing the risk of errors.
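To make this advantage concrete, here's what it looks like in practice: ordinary DML against the source table, with no embedding-specific steps (the id column below is just illustrative):

-- Ordinary writes; the vectorizer re-embeds affected rows asynchronously
UPDATE public.blogs SET content = 'Revised post body ...' WHERE id = 42;
-- Deleting a row also removes its chunks and embeddings downstream
DELETE FROM public.blogs WHERE id = 7;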
When vectors are stored independently, it is easy to lose track of their relationship with the original data. Was this vector generated from a recent update to the data? Or is it an outdated vector from a previous embedding model? These questions are critical, and confusion here can lead to significant errors. By tying vector embeddings directly to the data as an index, the relationship stays clear and is maintained automatically.
Developers often face challenges when managing data synchronization manually. For instance, forgetting to delete embeddings when the underlying data is deleted can cause inconsistencies. The vectorizer abstraction simplifies this process by making it the system’s responsibility to manage these relationships, thus reducing the cognitive load on developers and minimizing the potential for mistakes.
The vectorizer concept is a natural evolution of modern database management systems (DBMS) capabilities. Today’s DBMSs are already adept at managing data transformations and synchronizations through declarative constructs like indexes, triggers, and materialized views. The vectorizer abstraction fits neatly into this paradigm, providing a new tool for handling the increasingly important task of managing vector embeddings.
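For comparison, these are the kinds of declarative constructs databases already keep in sync on your behalf; a vectorizer applies the same idea to embeddings (column names here are illustrative):

-- The DBMS maintains this index automatically on every write
CREATE INDEX blogs_title_idx ON public.blogs (title);
-- Materialized views are another form of derived data (refreshed on demand)
CREATE MATERIALIZED VIEW blogs_per_author AS
SELECT author, count(*) AS posts FROM public.blogs GROUP BY author;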
By embedding this functionality directly into the DBMS, we move closer to fulfilling the ultimate promise of database systems: to manage data in a way that abstracts away the complexities, allowing users to focus on what they do best—building applications, analyzing data, and driving innovation.
The good news is that we don’t have to wait long for the vectorizer dream described above to become a reality. Motivated by the promise of making developers’ burdens lighter, our AI Engineering team at Timescale implemented a vectorizer for PostgreSQL. It’s called pgai Vectorizer (currently in Early Access) and is part of the PGAI project to make PostgreSQL better for AI systems and make AI development accessible to developers familiar with PostgreSQL.
If you’d like to see pgai Vectorizer in action, check out this demo video on how it automatically creates and updates vector embeddings for data in PostgreSQL.
For more advanced use cases, here's how you can use pgai Vectorizer to easily test different embedding models, or test which chunking and formatting strategy fits your data best.
Let’s move on to some background on how pgai Vectorizer works and how it brings the discussed advantages to life.
Developers define and create a vectorizer in SQL. The following query creates a vectorizer, specifying the table it acts on, the column to vectorize, the embedding model to use, and formatting that folds information from other columns into the text to be embedded.
-- Create a vectorizer to automatically embed data in the blogs table
SELECT ai.create_vectorizer(
    'public.blogs'::regclass
    -- Use the OpenAI text-embedding-3-small model
    , embedding => ai.embedding_openai('text-embedding-3-small', 1536, api_key_name => 'OPENAI_API_KEY')
    -- Automatically create a StreamingDiskANN index once the table has 100k rows
    , indexing => ai.indexing_diskann(min_rows => 100000, storage_layout => 'memory_optimized')
    -- Apply recursive chunking on the content column
    , chunking => ai.chunking_recursive_character_text_splitter('content')
    -- Add metadata from other columns to the embedding for better retrieval
    , formatting => ai.formatting_python_template('Blog title: $title url: $url blog chunk: $chunk')
);
-- The vectorizer will update the embeddings as the source table changes
-- No other user actions are required
We also define some default chunking functions since long pieces of text need to be split up into multiple smaller chunks to fit inside embedding model token limits.
Under the hood, pgai Vectorizer checks for modifications to the source table (inserts, updates, and deletes) and asynchronously creates and updates vector embeddings. Our team built pgai Vectorizer for two deployment types: self-hosted and fully managed in Timescale Cloud.
In the cloud-hosted implementation of pgai Vectorizer, we use cloud functions in the Timescale Cloud platform to create embeddings. In the open-source version of pgai Vectorizer, embeddings are created by running an external worker.
Pgai Vectorizer stores the configuration and catalog information along with key internal bookkeeping data inside the database.
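Because that bookkeeping lives in ordinary tables, you can inspect it with SQL. As a sketch based on the pgai documentation (names may evolve during Early Access):

-- See each vectorizer and how many items are waiting to be embedded
SELECT * FROM ai.vectorizer_status;
-- Count pending items for a specific vectorizer by its id
SELECT ai.vectorizer_queue_pending(1);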
Here’s an overview of the architecture of the pgai Vectorizer system:
An important note is that the actual embedding process occurs outside the database in an external process. This helps reduce the load on the database server and means that vectorizers don’t affect the database’s ability to service application queries. It also makes it easier to scale the embedding task independently of other database operations.
This process first reads the database to see if there is work to be done. If there is, it reads the data from the database, performs chunking and formatting, makes a call to an embedding model provider, such as OpenAI, to generate the embedding, and writes the result back to the database.
Pgai Vectorizer is flexible: You can specify the chunking and formatting rules used to create embeddings. In particular, you can configure which columns of the source table are to be vectorized and the chunking and formatting rules to ensure the source data fits within embedding token limits and that relevant data is contained within each embedding.
In the Early Access release of pgai Vectorizer, you can customize the choice of OpenAI embedding model (e.g., text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002), chunking strategies to split text into smaller chunks, formatting options to inject additional context into each chunk, and custom indexing configurations for automated index creation and performance tuning. We plan to make this even more flexible soon by allowing users to submit their own Python code to fully customize chunking, embedding, and formatting.
For example, here’s a vectorizer configured to split HTML source files recursively and create OpenAI embeddings from the source data. You can configure the chunking and formatting to suit your specific application data, whether it be code, documents, markdown, or anything else.
-- Advanced vectorizer configuration
SELECT ai.create_vectorizer(
    'public.blogs'::regclass,
    destination => 'blogs_embedding_recursive',
    embedding => ai.embedding_openai('text-embedding-3-small', 1536),
    -- Apply recursive chunking with settings tuned for HTML content
    chunking => ai.chunking_recursive_character_text_splitter(
        'content',
        chunk_size => 800,
        chunk_overlap => 400,
        -- HTML-aware separators, ordered from highest to lowest precedence
        separators => array[
            E'</article>', -- Split on major document sections
            E'</div>',     -- Split on div boundaries
            E'</section>',
            E'</p>',       -- Split on paragraphs
            E'<br>',       -- Split on line breaks
            E'</li>',      -- Split on list items
            '. ',          -- Fall back to sentence boundaries
            ' '            -- Last resort: split on spaces
        ]
    ),
    formatting => ai.formatting_python_template('title: $title url: $url $chunk')
);
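Because each vectorizer writes to its own destination, you can run several side by side to compare configurations. As a sketch (the destination name blogs_embedding_large is illustrative), here's how you might test a larger model against the configuration above:

-- A second vectorizer over the same table with a larger model,
-- writing to its own destination for side-by-side evaluation
SELECT ai.create_vectorizer(
    'public.blogs'::regclass,
    destination => 'blogs_embedding_large',
    embedding => ai.embedding_openai('text-embedding-3-large', 3072),
    chunking => ai.chunking_recursive_character_text_splitter('content')
);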
We plan to detail the design decisions behind pgai Vectorizer in a future blog post.
Try pgai Vectorizer now in Early Access. Open source, built for application devs, and ready to simplify your AI workflows. See how much easier embedding management can be—jump in and let us know what you think. Pick your favorite deployment option: fully managed on Timescale Cloud or self-hosted in your PostgreSQL environment of choice (leave us a GitHub ⭐ along the way).
Why waste your precious development time manually managing embeddings when you could simply hand this over to your database?
With pgai Vectorizer, you can. It enables you to keep your vector embeddings always up to date, reinforcing the embedding-data relationship and simplifying your data management. PostgreSQL becomes your AI development platform of choice, allowing you to store all your data in one place and automating embedding management. In addition, pgai Vectorizer builds upon other tools in the PGAI suite, like pgvectorscale, which complements pgvector for high-performance vector search, and the pgai extension, which brings AI models closer to your data in PostgreSQL.
Questions or feedback? Join our Discord community, where our team is ready to help you make the most of pgai Vectorizer and the rest of the tools in the PGAI suite.
Learn more about pgai Vectorizer and how it can help you build better AI systems with PostgreSQL in our companion blogs and videos.