Oct 29, 2024 · Posted by Avthar Sewrathan
When OpenAI released their text-embedding-3 model family earlier this year, this conversation happened in countless teams building AI systems:
Engineer: "Should we upgrade to the latest OpenAI model? It's cheaper and performs better on benchmarks."
Tech lead: "We should explore it, but re-embedding all our data is too much work right now. Let's revisit next month."
Fast-forward six months: the team is still using older embedding models and paying more than needed for embeddings.
It doesn't have to be this way, especially if you're using pgvector and PostgreSQL. That's where pgai Vectorizer comes in. It’s an open-source tool that automatically creates embeddings from your data and keeps them synchronized as the source data changes. It lets you create different versions of your embeddings using different models with a single SQL command and keeps them up to date for evaluation, A/B testing, and smooth switchover. Forget manual reprocessing or juggling multiple embedding and synchronization pipelines.
Want to know if text-embedding-3-small is good enough for your use case? Or whether text-embedding-3-large's improved accuracy justifies its higher cost? Let's explore how to answer these questions systematically.
Pgai Vectorizer is an open-source tool that automates embedding creation in PostgreSQL. Forget about ETL (extract-transform-load) pipelines or custom scripts—you manage embedding models like you'd adjust a database index using a simple, declarative SQL command. What used to require complex workflows and infrastructure is now another simple database operation.
Here's an example of creating a vectorizer to embed data in a table called blogs using the OpenAI text-embedding-3-small model:
-- Create a vectorizer to automatically embed data in the blogs table
SELECT ai.create_vectorizer(
'public.blogs'::regclass,
-- Specify the OpenAI model to use and the embedding dimension
embedding => ai.embedding_openai('text-embedding-3-small', 768),
-- Recursively split the content column into chunks for embedding
chunking => ai.chunking_recursive_character_text_splitter('content')
);
When you create a vectorizer, pgai automatically sets up the necessary database objects for storing and querying embeddings. For our vectorizer on the blogs table above, you query embeddings through the automatically created view blogs_embedding. This view joins your original blog data with its embeddings, making it straightforward to query embeddings and associated metadata together. Pgai Vectorizer also creates an embedding storage table, which stores only the embeddings and chunks; in our example above, it is called blogs_embedding_store.
-- Query the view for the vectorizer above (recommended)
SELECT id, title, url, chunk, embedding
FROM public.blogs_embedding
LIMIT 5;
-- Query the embedding storage table for the vectorizer above
SELECT embedding_uuid, chunk, embedding
FROM public.blogs_embedding_store
LIMIT 5;
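With the view in place, semantic search is a plain SQL query. Here's a minimal sketch, assuming pgvector's <=> cosine-distance operator and pgai's ai.openai_embed helper; the search phrase must be embedded with the same model and dimensions the vectorizer uses (text-embedding-3-small at 768 dimensions in our example):
-- Find the five chunks most similar to a search phrase (hypothetical query text)
SELECT title, chunk
FROM public.blogs_embedding
ORDER BY embedding <=> ai.openai_embed('text-embedding-3-small', 'postgres performance tuning', dimensions => 768)
LIMIT 5;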
This approach is powerful because pgai Vectorizer treats embeddings as a declarative feature of your data. Just as PostgreSQL automatically maintains an index when you update data, pgai Vectorizer automatically handles the creation and synchronization of embeddings as your source data changes.
The automation brought by pgai Vectorizer means you can swap models, run several models side by side, and migrate between them gradually. For example, you might start with the cost-effective text-embedding-3-small model and then test whether text-embedding-3-large provides enough accuracy improvement to justify its higher cost. Alternatively, you could maintain compatibility with existing ada-002 embeddings while gradually transitioning to newer models.
Let's look at how this works in practice.
You can change the embedding model with a simple SQL command without needing to modify the application code or reprocess data manually.
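For instance, here's a sketch of what swapping models might look like, assuming pgai's ai.drop_vectorizer function and that our vectorizer's id is 1 (you can look the id up in ai.vectorizer):
-- Drop the old vectorizer (drop_all => true also removes its embedding table and view)
SELECT ai.drop_vectorizer(1, drop_all => true);
-- Recreate it with the new model; pgai re-embeds the source data automatically
SELECT ai.create_vectorizer(
'public.blogs'::regclass,
embedding => ai.embedding_openai('text-embedding-3-large', 1536),
chunking => ai.chunking_recursive_character_text_splitter('content')
);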
You can create multiple vectorizers for the same data source, each using a different embedding model. This allows for side-by-side comparisons of model performance.
Let's look at an example of creating embeddings with three different OpenAI embedding models for the same source data. Say we have a table called blogs containing various blog posts for a “Chat with blog” RAG app we're building:
-- Create a table with the following columns: id, title, content, url
CREATE TABLE blogs (
id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
title TEXT,
content TEXT,
url TEXT
);
We can automatically embed the blogs' content using the older OpenAI text-embedding-ada-002 model, as well as the newer text-embedding-3-small and text-embedding-3-large models, as follows:
-- create a vectorizer to embed the blogs table using the text-embedding-ada-002 model
SELECT ai.create_vectorizer(
'public.blogs'::regclass,
destination => 'blogs_ada_002',
embedding => ai.embedding_openai('text-embedding-ada-002', 1536),
chunking => ai.chunking_recursive_character_text_splitter('content'),
formatting => ai.formatting_python_template('Title: $title\nURL: $url\nContent: $chunk')
);
-- create a vectorizer to embed the blogs table using the text-embedding-3-small model
SELECT ai.create_vectorizer(
'public.blogs'::regclass,
destination => 'blogs_3_small',
embedding => ai.embedding_openai('text-embedding-3-small', 768),
chunking => ai.chunking_recursive_character_text_splitter('content'),
formatting => ai.formatting_python_template('Title: $title\nURL: $url\nContent: $chunk')
);
-- create a vectorizer to embed the blogs table using the text-embedding-3-large model
SELECT ai.create_vectorizer(
'public.blogs'::regclass,
destination => 'blogs_3_large',
embedding => ai.embedding_openai('text-embedding-3-large', 1536),
chunking => ai.chunking_recursive_character_text_splitter('content'),
formatting => ai.formatting_python_template('Title: $title\nURL: $url\nContent: $chunk')
);
This will create three different sets of embeddings for the underlying source data and automatically update the embeddings as changes to the source data occur. No developer intervention required!
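For example, inserting a new post is all it takes; each of the three vectorizers picks the row up and embeds it automatically (hypothetical sample data):
-- Insert a post; all three vectorizers will chunk and embed it on their next run
INSERT INTO blogs (title, content, url)
VALUES (
'Sam Altman on Generative AI',
'In a recent interview, Sam Altman shared his views on generative AI models...',
'https://example.com/sam-altman-generative-ai'
);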
Once created, we can compare search and RAG responses across embedding models simply by querying the view associated with each vectorizer. For example, here's a direct comparison of the responses each embedding model produces for a RAG query:
-- Compare RAG responses from all three vectorizer approaches above
SELECT
'ada_002' as strategy,
generate_rag_response(
'What does Sam Altman think about Generative AI models?',
'public.blogs_ada_002_embedding'
) as response
UNION ALL
SELECT
'text_embedding_3_small' as strategy,
generate_rag_response(
'What does Sam Altman think about Generative AI models?',
'public.blogs_3_small_embedding'
) as response
UNION ALL
SELECT
'text_embedding_3_large' as strategy,
generate_rag_response(
'What does Sam Altman think about Generative AI models?',
'public.blogs_3_large_embedding'
) as response;
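Note that generate_rag_response is not built into pgai; it's a helper you define yourself. Here's a minimal sketch, assuming pgai's ai.openai_embed and ai.openai_chat_complete functions; the defaults match our text-embedding-3-small vectorizer, so for the ada-002 and 3-large views you would pass the matching model and dimensions:
-- Hypothetical helper: retrieve the nearest chunks from the given embedding view,
-- then ask a chat model to answer from that context
CREATE OR REPLACE FUNCTION generate_rag_response(
query_text TEXT,
view_name TEXT,
embed_model TEXT DEFAULT 'text-embedding-3-small',
embed_dims INT DEFAULT 768
) RETURNS TEXT AS $func$
DECLARE
query_embedding VECTOR;
context_chunks TEXT;
response JSONB;
BEGIN
-- Embed the question with the same model/dimensions the target view uses
query_embedding := ai.openai_embed(embed_model, query_text, dimensions => embed_dims);
-- Pull the five nearest chunks (<=> is pgvector's cosine-distance operator)
EXECUTE format(
'SELECT string_agg(chunk, E''\n\n'')
FROM (SELECT chunk FROM %s ORDER BY embedding <=> $1 LIMIT 5) AS nearest',
view_name)
INTO context_chunks
USING query_embedding;
-- Ask a chat model to answer from the retrieved context only
response := ai.openai_chat_complete(
'gpt-4o-mini',
jsonb_build_array(
jsonb_build_object('role', 'system',
'content', 'Answer the question using only the provided context.'),
jsonb_build_object('role', 'user',
'content', format(E'Context:\n%s\n\nQuestion: %s', context_chunks, query_text))
)
);
RETURN response->'choices'->0->'message'->>'content';
END;
$func$ LANGUAGE plpgsql;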
Rather than comparing responses with the “vibe test” above, in practice you would run evaluation queries against each embedding dataset. This is easy: just point your evaluation scripts at the view for each embedding set.
Once satisfied, you can A/B test by pointing some vector search or RAG queries at the view backed by the new embedding model and the rest at the existing one.
By querying the vectorizer information views, developers can compare metrics like processing time and embedding quality across different models.
The system tracks which model was used for each embedding, facilitating A/B testing and gradual rollouts of new models.
This is a much better source of truth for tracking embedding model rollouts and deployment dates, as well as modifications to chunking and formatting strategies, than what many teams use today: usually a combination of Google Docs, team wikis, and “inside an engineer's brain,” all of which are prone to human error.
To view the model tracking information, pgai provides several useful tables and views in the ai schema:
-- View all configured vectorizers and their settings
SELECT * FROM ai.vectorizer;
-- See detailed configuration for each vectorizer, including embedding model used
SELECT
id,
source_table,
configuration->>'embedding' as embedding_config,
configuration->>'chunking' as chunking_config,
configuration->>'formatting' as formatting_config,
created_at,
updated_at
FROM ai.vectorizer;
-- View current status and pending items for each vectorizer
SELECT * FROM ai.vectorizer_status;
If you’re using the cloud-hosted version of pgai Vectorizer on Timescale Cloud, you can also view this information in the Vectorizers tab for a specific cloud database.
When switching models, pgai Vectorizer can maintain the old embeddings alongside the new ones, allowing for graceful transitions and fallback options. This flexibility empowers developers to continually optimize their vector search capabilities as new embedding models become available or their specific use cases evolve.
Gone are the days of postponing embedding model upgrades due to implementation complexity. With pgai Vectorizer, testing and deploying new OpenAI embedding models is as simple as writing a single SQL command. You can experiment with confidence while maintaining production stability.
Pgai Vectorizer is available for you to get started with today in Early Access. Pick your favorite deployment option: fully managed on Timescale Cloud or self-hosted in the PostgreSQL environment of your choice (and leave us a GitHub ⭐ along the way).
Questions or feedback? Join our Discord community, where our team is ready to help you make the most of your embedding models.
Learn more about pgai Vectorizer and how it can help you build better AI systems with PostgreSQL in these companion blogs and videos: