Oct 29, 2024
Posted by
Avthar Sewrathan
Good retrieval makes good RAG (retrieval-augmented generation). And good chunking is essential for good retrieval.
Chunking and formatting can make or break your search and RAG systems. Too-large chunks overwhelm, while too-small chunks lose context. Proper formatting allows you to reintroduce much-needed context into individual chunks and ensures your large language model (LLM) can process the data effectively.
If you’re building search engines, chatbots, or summarizers, testing these strategies is critical—but tedious. That’s where pgai Vectorizer comes in. Forget manual reprocessing pipelines. Run RAG chunking and formatting tests in PostgreSQL with SQL. Compare, tweak, and refine—all in one place.
Pgai Vectorizer is an open-source tool that automates embedding creation in PostgreSQL. Forget about ETL (extract-load-transform) pipelines or custom scripts—you can modify your RAG chunking and formatting strategies like you'd adjust a database index, using simple SQL. What used to be complex is now just another database operation.
Here’s an example using a recursive text splitter to chunk text from the content
column in a table called blogs
while adding the blog title and URL to each chunk:
-- Vectorizer with recursive chunking settings and metadata formatting
SELECT ai.create_vectorizer(
'public.blogs'::regclass,
embedding => ai.embedding_openai('text-embedding-3-small', 1536),
chunking => ai.chunking_recursive_character_text_splitter(
-- specify column to chunk and chunking settings
'content',
chunk_size => 800,
chunk_overlap => 200,
-- default separator is array[E'\n\n', E'\n', '.', '?', '!', ' ', '']
),
-- specify metadata from other columns to include in the embedding
formatting => ai.formatting_python_template('title: $title url: $url $chunk')
);
What makes this approach powerful is that pgai Vectorizer treats embeddings as a declarative feature of your data. Just as PostgreSQL automatically maintains an index when you update data, pgai Vectorizer automatically handles the creation and synchronization of embeddings as your source data changes.
The automation brought by pgai Vectorizer means you can do the following:
For example, you might start with a basic RAG chunking strategy that splits text by paragraphs, then experiment with a more sophisticated approach that considers sentence boundaries and semantic units. Or you could test how adding different types of metadata to your chunks—like document titles, timestamps, or category information—affects your RAG system's accuracy.
Let's look at how this works in practice.
The pgai Vectorizer API reference contains all the information for each configuration we’ll discuss.
You can experiment with RAG chunking and formatting strategies with simple SQL commands without needing to change application code or manually reprocess data. This declarative approach means that once you create a new vectorizer configuration, pgai Vectorizer automatically reprocesses your data in the background.
You can continue using your application while the new embeddings are being generated: under the hood, pgai Vectorizer maintains both old and new embeddings while you test which settings work best—allowing for easy experimentation on real data and ensuring zero downtime for your AI features.
We’ll now explain our formatting configuration option and show you how you can experiment with different formatting settings.
Use the ai.formatting_python_template()
function to specify how data should be formatted before embedding it. This function allows you to combine multiple fields or add context to the text being embedded.
For example, while the default formatting is simply the chunk of the source text, denoted by $chunk
, the formatting specification below prepends the values from the title
and url
columns of the blog
table that correspond to the chunk in question to every chunk created.
formatting => ai.formatting_python_template('title: $title, URL: $url, $chunk')
This adds additional context, which can help LLMs distinguish between the chunks added to the LLM context window and better reason about them, leading to better LLM responses for RAG.
What’s more, if any of the source content or associated metadata in the source table changes, the vectorizer will pick up modifications to the source table and update the embeddings to reflect the changed source data. This tracking method not only prevents stale embeddings, but gives you a single source of truth to update (the source table on which the vectorizer is created) to update your embeddings vs. updating them separately.
You can create multiple vectorizers for the same data source, each using different RAG chunking or formatting strategies. This enables side-by-side comparisons of different approaches.
For example, here’s how to create two vectorizers for the same source data.
-- Vectorizer with custom formatting added
SELECT ai.create_vectorizer(
'public.blogs'::regclass,
destination => 'blogs_extra_formatting',
embedding => ai.embedding_openai('text-embedding-3-small', 1536),
chunking => ai.chunking_recursive_character_text_splitter('content', 800, 400),
formatting => ai.formatting_python_template('Title: $title\nURL: $url\nContent: $chunk')
);
-- Vectorizer without custom formatting
SELECT ai.create_vectorizer(
'public.blogs'::regclass,
destination => 'blog_no_formatting',
embedding => ai.embedding_openai('text-embedding-3-small', 1536),
chunking => ai.chunking_recursive_character_text_splitter('content', 800, 400),
formatting => ai.formatting_python_template('$chunk')
);
Note that we only specify the destination
parameter if we have more than one vectorizer created on the same source table in order to distinguish between the destination tables and associated views for each vectorizer running.
Once the vectorizers have been created, you can query them by querying their associated views as follows:
-- Query the view for the first vectorizer (with custom formatting)
SELECT id, title, url, chunk, embedding
FROM public.blogs_extra_formatting_embedding
LIMIT 5;
-- Query the view for the second vectorizer (without custom formatting)
SELECT id, title, url, chunk, embedding
FROM public.blog_no_formatting_embedding
LIMIT 5;
To give a more real-world example and see which is better, here’s how you can compare the RAG responses generated by searching against the two embedding tables generated by the vectorizers we defined above. While this will give you a “vibe check,” in practice, we recommend pointing your evaluation set of queries to each vectorizer view and comparing the results on your evaluation set side by side.
Here’s how you could compare the LLM response generated by searching against each embedding view associated with the two vectorizers with different formatting settings:
-- Compare RAG responses from both vectorizer approaches
SELECT
'with_formatting' as strategy,
generate_rag_response(
'What does Sam Altman think about Generative AI models?',
'public.blogs_extra_formatting_embedding'
) as response
UNION ALL
SELECT
'no_formatting' as strategy,
generate_rag_response(
'What does Sam Altman think about Generative AI models?',
'public.blog_no_formatting_embedding'
) as response;
In the example above, we do RAG via a custom function called generate_rag_response()
, defined as follows:
-- RAG function to summarize retrieved context using gpt-4o-mini
CREATE OR REPLACE FUNCTION generate_rag_response(
query_text TEXT,
embedding_view TEXT -- Parameter for the view name
) RETURNS TEXT AS $$
DECLARE
context_chunks TEXT[];
response TEXT;
query TEXT;
BEGIN
-- Dynamically construct and execute the similarity search query
query := format(
'SELECT chunk AS context
FROM %I
ORDER BY embedding <=> ai.openai_embed(''text-embedding-3-small'', $1)
LIMIT 5',
embedding_view -- Use the provided view name
);
EXECUTE query INTO context_chunks USING query_text;
-- Generate a summary using gpt-4o-mini
SELECT ai.openai_chat_complete(
'gpt-4o-mini',
jsonb_build_array(
jsonb_build_object(
'role', 'system',
'content', 'You are a helpful assistant. Use only the context provided to answer the question. Also mention the titles of the blog posts you use to answer the question.'
),
jsonb_build_object(
'role', 'user',
'content', format('Context: %s\n\nUser Question: %s\n\nAssistant:', array_to_string(context_chunks, E'\n'), query_text)
)
)
)->'choices'->0->'message'->>'content' INTO response;
RETURN response;
END;
$$ LANGUAGE plpgsql;
While this example implements search and RAG in SQL, you can use the PostgreSQL ecosystem of drivers and connectors to implement RAG in the language of your choice. The key idea is that to compare the outputs, all you need to do is change the name of the view in the semantic search query, which will reflect the configuration of the vectorizer you want to test.
Pgai Vectorizer supports multiple chunking functions, allowing you the flexibility to easily experiment and test to find the chunking strategy that works best for your RAG application. Let’s take a look at the two chunking functions supported in pgai Vectorizer:
The recommended text splitter is ai.chunking_recursive_character_text_splitter()
, a sophisticated function that recursively splits text using multiple separators and provides more fine-grained control over the chunking process.
-- Example 1: Using recursive character text splitter
-- More sophisticated approach that tries multiple separators in sequence
SELECT ai.create_vectorizer(
'public.blogs'::regclass,
destination => 'blogs_embeddings_recursive',
embedding => ai.embedding_openai('text-embedding-3-small', 1536),
chunking => ai.chunking_recursive_character_text_splitter(
'content', -- column to chunk
chunk_size => 800, -- max characters per chunk
chunk_overlap => 400, -- overlap between chunks
separator => array[ -- tries separators in order
E'\n\n', -- paragraphs
E'\n', -- line breaks
'.', -- sentences
' ' -- words (last resort)
]
),
formatting => ai.formatting_python_template('Title: $title\nURL: $url\nContent: $chunk')
);
Pgai Vectorizer also supports a simple character text-splitting function called ai.chunking_character_text_splitter()
, which splits text into chunks based on a specified separator.
-- Example 2: Using simple character text splitter
-- Straightforward approach that splits on a single separator
SELECT ai.create_vectorizer(
'public.blogs'::regclass,
destination => 'blogs_embeddings_simple',
embedding => ai.embedding_openai('text-embedding-3-small', 1536),
chunking => ai.chunking_character_text_splitter(
'content', -- column to chunk
chunk_size => 800, -- max characters per chunk
chunk_overlap => 400, -- overlap between chunks
separator => E'\n' -- split only on newlines
),
formatting => ai.formatting_python_template('Title: $title\nURL: $url\nContent: $chunk')
);
While both functions allow control over chunk size and overlap, the critical difference between these functions lies in their splitting strategies. The recursive version employs a more sophisticated approach aimed at preserving semantic meaning, which is what we recommend for most use cases. It splits the text using user-defined boundaries, provided as a hierarchical list of separators (e.g., '\n\n'
for paragraphs, '.'
for sentences).
If the resulting chunk is still too large, it recursively applies the next separator in the list until the desired size is reached. In contrast, the non-recursive version splits the text strictly at a single user-defined boundary or separator.
For example, the default separator for ai.chunking_recursive_character_text_splitter()
is:
array[
E'\n\n', -- Double newline (paragraph breaks)
E'\n', -- Single newline (line breaks)
'.', -- Period (end of sentences)
'?', -- Question mark (end of questions)
'!', -- Exclamation mark (end of exclamatory sentences)
' ', -- Space (word boundaries)
'' -- Empty string (character by character, absolute last resort)
]
And with the default settings, this is how the recursive splitter works:
This approach preserves the semantic structure of the text better than using a single separator because it respects natural document boundaries first (paragraphs), falls back to natural language boundaries (sentences), and only splits words as a last resort, with each level providing finer granularity than the previous one. In practice, most chunks will be created using the higher-level separators (paragraphs and sentences), with the lower-level ones serving as fallbacks for unusually long text without natural breaks.
You can also easily customize ai.chunking_recursive_character_text_splitter()
for other kinds of documents like HTML, Markdown, or even code files. Here’s an example separator for HTML:
-- Example separator configuration for HTML content:
-- Recursively tries each separator in order, preserving HTML document structure.
-- Starts with major structural elements, then falls back to text-level splits.
separator => array[
E'</article>', -- Split on major document sections
E'</div>', -- Split on div boundaries
E'</section>',
E'</p>', -- Split on paragraphs
E'<br>', -- Split on line breaks
E'</li>', -- Split on list items
E'. ', -- Fall back to sentence boundaries
' ' -- Last resort: split on spaces
]
The pgai Vectorizer system tracks which chunking and formatting strategies were used for each embedding. This monitoring facilitates A/B testing and gradual rollouts of new strategies.
This is a much better source of truth for reconciling changes to and deployment dates of different RAG chunking and formatting strategies, and even embedding model rollouts, than what many teams use today, which is usually a combination of various Google Docs, team wikis, and “inside an engineer’s brain,” all of which can be prone to human error.
To view the version tracking information, we provide several useful tables in the ai
schema:
-- View all configured vectorizers and their settings
SELECT * FROM ai.vectorizer;
-- See detailed configuration for each vectorizer, including chunking and formatting settings
SELECT
id,
source_table,
configuration->>'embedding' as embedding_config,
configuration->>'chunking' as chunking_config,
configuration->>'formatting' as formatting_config,
created_at,
updated_at
FROM ai.vectorizer;
-- View current status and pending items for each vectorizer
SELECT * FROM ai.vectorizer_status;
If you’re using the cloud-hosted version of pgai Vectorizer on Timescale Cloud, you can also view this information in Vectorizers tab for a specific cloud database:
When switching strategies, pgai Vectorizer can maintain old embeddings alongside new ones. This allows for graceful transitions and fallback options.
You can also customize the choice of embedding model used to create the embeddings. As of the Early Access release, pgai Vectorizer supports embedding models from OpenAI, including text-embedding-3-small
, text-embedding-3-large
, text-embedding-ada-002
, amongst others. See this blog post for how to use pgai Vectorizer to test and experiment with different OpenAI models for your search or RAG app using pgvector.
The right chunking and formatting strategy can be the difference between a mediocre and exceptional RAG application. With pgai Vectorizer, you don’t have to choose between experimentation and production stability. The pgai development platform offloads the task of generating and syncing embeddings to your database, allowing you to test and refine your embeddings with simple SQL commands, all while your app keeps running.
No matter the content—long-form articles, HTML, or structured data—pgai Vectorizer helps you find the optimal strategy for your use case, all within the PostgreSQL tools you already know. PostgreSQL is now an AI development platform accessible to all developers, regardless of their initial AI expertise.
Pgai Vectorizer is available for you to get started with today in Early Access. Pick your favorite deployment option: fully managed on Timescale Cloud or self-hosted in the PostgreSQL environment of your choice (and leave us a GitHub ⭐ along the way.)
Questions or feedback? Join our Discord community, where our team is ready to help you make the most of your embedding models.
Learn more about pgai Vectorizer and how it can help you build better AI systems with PostgreSQL in these companion blogs and videos: