Oct 29, 2024
Posted by
Hervé Ishimwe
Creating vector embeddings is straightforward. Updating and keeping them in sync for production usage is anything but simple.
Any engineer who’s taken an AI system from MVP to production-ready can recount challenges like handling stale embeddings, synchronizing embeddings with new or changed source data, and building increasingly complex glue services to manage it all across multiple databases. All this complexity makes it unnecessarily hard for teams to integrate AI capabilities like search, recommendation systems, RAG (retrieval-augmented generation), and agentic behavior into their applications.
Pgai Vectorizer provides a simpler way forward. It automates embedding creation, keeps embeddings up to date as the source data changes, and manages it all within PostgreSQL, without the complexity of building custom data workflows and processing systems. Pgai Vectorizer lets you set up and manage vector embeddings with a single SQL command, just like creating a database index.
If you already use extensions like pgvector and pgvectorscale to utilize PostgreSQL as your vector database, pgai Vectorizer helps make PostgreSQL even more powerful for AI application development, allowing your teams to focus on building innovative applications faster rather than managing infrastructure.
We’ve already made the case for managing everything—from vectors to relational data—inside PostgreSQL with just SQL. In this blog post, we’ll demonstrate how pgai Vectorizer simplifies embedding creation within PostgreSQL and streamlines the process of building RAG systems, using a product catalog dataset with OpenAI’s text-embedding-3-small as the embedding model and gpt-4o-mini as our large language model.
Pgai Vectorizer is a powerful developer tool that automates the creation and management of vector embeddings with PostgreSQL. It abstracts embeddings into a declarative feature, allowing developers to manage embeddings with just a single SQL command.
With pgai Vectorizer, you only need to configure the vectorizer once and let it handle the rest automatically. This “set it and forget it” approach eliminates the need for ongoing management, streamlining your workflow.
Without pgai Vectorizer, you would need to manually call the embedding function for each entry and handle chunking and formatting independently. That requires a lot of surrounding infrastructure, such as a queuing system to handle the embedding endpoint being down, monitoring, alerting, proper retry handling, etc.—adding significant time and complexity.
On the other hand, pgai Vectorizer allows you to generate embeddings for the entire dataset with just one SQL command. You don’t need specialized tools: PostgreSQL is now a platform that helps you with your AI development workflow.
Without pgai Vectorizer, you would have to build a custom workflow involving functions, background workers, and state management to handle new or modified documents. With pgai Vectorizer, no additional code is needed. Just insert/delete/change data into the source table, and the vectorizer will automatically update the embeddings in the background.
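For example, once a vectorizer is attached to a table, an ordinary SQL statement is all it takes. Here’s a minimal sketch against the products table used later in this post; the new description text is invented for illustration:
-- No embedding-specific code: the vectorizer detects the change and
-- re-embeds the affected row in the background.
UPDATE products
SET description = 'A breathable, quick-drying everyday tee made from organic cotton.'
WHERE product = 'Versatiliti tee';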
Using pgai Vectorizer not only saves you development time but also simplifies your system architecture. All you need is to configure the vectorizer and an API key, reducing the complexity of managing multiple systems. With fewer components to monitor, you can focus on building the best AI application possible.
A single SQL command lets you define:
- the source table whose data you want to embed
- the embedding model and its configuration (model name, dimensions, and API key)
- the chunking strategy used to split the source text
- the formatting template that adds context from other columns to each chunk
Pgai Vectorizer encourages developers to focus on building their next innovative product without needing to maintain infrastructure. It also accelerates AI development and reduces the complexity of working with vector data, making PostgreSQL an even more attractive platform for developers building AI applications.
This new PostgreSQL tool is available on Timescale Cloud or as a self-hosted option, offering flexibility for various deployment scenarios.
In this section, we build a simple RAG system to illustrate how pgai Vectorizer works. We use the cloud version, which is accessible via Timescale Cloud. (If you want to try it, you can create a free account and get a 30-day free trial!) You can find the code in this GitHub repository or this Google Colab notebook.
Database connection string: After creating your Timescale account, define a database service by clicking New service, choosing AI and Vector, and configuring your database specifications. For this dataset, I suggest 0-100k vectors, or 0.5 CPU/2 GiB of memory, for the compute size. Make sure to save the connection string provided once the creation process is complete.
As a best practice, create a .env file to store both the OpenAI API key and the database connection string in your Python environment.
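For reference, a minimal .env file matching the variable names used in the code below looks like this (the values are placeholders):
OPENAI_API_KEY=your-openai-api-key
DATABASE_CONNECTION_STRING=postgres://user:password@host:port/dbname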
Now, let’s install the necessary Python libraries: psycopg2, the PostgreSQL adapter for Python; pandas, for data loading and preparation; and python-dotenv, for reading the values stored in your environment.
%pip install psycopg2 pandas python-dotenv
We use a product catalog dataset containing 500 entries and split it into two subsets:
- subset_1: inserted before the pgai Vectorizer is created.
- subset_2: inserted afterward to demonstrate how the vectorizer updates embeddings after a data change.

import io  # used by the insert helper below, which streams rows via COPY
import os

import pandas as pd
import psycopg2
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
DATABASE_CONNECTION_STRING = os.environ["DATABASE_CONNECTION_STRING"]
data = pd.read_csv('./product_catalog.csv')
products = [
{"name": row["name"], "description": row["description"]}
for _, row in data.iterrows()
]
subset_1 = products[:250]
subset_2 = products[250:]
After defining and loading the variables in your .env file, connect to your database service using the psycopg2 adapter. Ensure that pgai (0.4.0) is installed before creating the source table.
def connect_db():
    return psycopg2.connect(DATABASE_CONNECTION_STRING)

with connect_db() as conn:
    with conn.cursor() as cur:
        # Install the pgai extension (CASCADE also installs its dependencies, e.g., pgvector)
        cur.execute("CREATE EXTENSION IF NOT EXISTS ai CASCADE;")
    with conn.cursor() as cur:
        # Create the source table that the vectorizer will watch
        cur.execute("""
            CREATE TABLE IF NOT EXISTS products (
                id bigint primary key generated by default as identity,
                product TEXT NOT NULL,
                description TEXT NOT NULL
            );
        """)
Next, let’s insert subset_1 into the source table. We do multiple inserts with our subsets, so we define insert_subset_data, a helper function for inserting data into the source table.
def insert_subset_data(subset):
    # Write the rows to an in-memory buffer and bulk-load them with COPY
    with connect_db() as conn:
        with conn.cursor() as cur:
            output = io.StringIO()
            for product in subset:
                output.write(f"{product['name']}\t{product['description']}\n")
            output.seek(0)
            cur.copy_from(output, 'products', columns=('product', 'description'), sep='\t')
Then, we insert the data from subset_1 with the following call:
insert_subset_data(subset_1)
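If you want to verify the insert, a quick count from the SQL editor (or any Postgres client) should show the 250 rows:
SELECT count(*) FROM products;  -- expected: 250 rows from subset_1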
First, set up your OpenAI API key for the database service: go to Project settings and input your key.
Then, navigate to the AI tab of your service overview and create a vectorizer, following the checklist to customize the SQL command template provided.
Vectorizers are very customizable, and we encourage you to play around with various options. Here is the SQL command we used to create a vectorizer:
SELECT ai.create_vectorizer(
'public.products'::regclass,
embedding=>ai.embedding_openai('text-embedding-3-small', 1536, api_key_name=>'OPENAI_API_KEY'),
chunking=>ai.chunking_recursive_character_text_splitter('description'),
formatting=>ai.formatting_python_template('product: $product description: $chunk')
);
Let’s break down this SQL command and understand its parameters:

- The source table (products) whose data you want to create embeddings for.
- The embedding configuration, ai.embedding_openai. This function accesses the embedding model given the model name (text-embedding-3-small), the vector dimensions (1536, matching the command above), and the name of the API key.
- The chunking strategy, ai.chunking_recursive_character_text_splitter. This function prepares the data by splitting it into smaller chunks that fit the embedding model’s input size limits and improve processing time. You can fine-tune its parameters for more control over chunk formation (see the sketch after this list); we use the default values here: chunk_size = 800, chunk_overlap = 400, separators = array[E'\n\n', E'\n', '.', '?', '!', ' ', ''].
- The formatting template, ai.formatting_python_template. It lets you define a pattern that adds context from other columns to each chunk before it is sent to the embedding model. In this case, we prepend the product name to each chunk of the description.

There are more advanced options for configuring indexing and other parameters; we use the defaults in this example. Check out the pgai Vectorizer API reference for more information.
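For instance, if we had wanted smaller, more overlapping chunks, the vectorizer could have been created like this (a sketch; the parameter names follow the pgai Vectorizer API reference, so verify them against your installed version):
SELECT ai.create_vectorizer(
    'public.products'::regclass,
    embedding=>ai.embedding_openai('text-embedding-3-small', 1536, api_key_name=>'OPENAI_API_KEY'),
    chunking=>ai.chunking_recursive_character_text_splitter(
        'description',
        chunk_size=>256,   -- smaller chunks give finer-grained retrieval
        chunk_overlap=>64, -- overlap preserves context across chunk boundaries
        separators=>array[E'\n\n', E'\n', '.', ' ']
    ),
    formatting=>ai.formatting_python_template('product: $product description: $chunk')
);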
Pgai Vectorizer also creates a view that joins the source data with the corresponding embedding information. The default name for this view is <source_table>_embedding. The view contains all the columns from the source table (in this case, products) as well as the following embedding-related columns:

- embedding_uuid: a unique identifier for the embedding
- chunk_seq: the sequence number of the chunk within the source text (e.g., 1 indicates the first chunk)
- chunk: the actual text chunk sent to the embedding model
- embedding: the vector representation generated from the chunk

A RAG application relies on vector embeddings to find information relevant to the model query in a knowledge base. This supplemental information is appended to the query as context to reduce hallucinations.
Using the embeddings generated by pgai Vectorizer, we retrieve the top three chunks relevant to the provided query. Since every chunk was formatted to include the product name, we only need to select the chunk column from the embedding view.
query = "Tell me about different types of t-shirts."
with connect_db() as conn:
    with conn.cursor() as cur:
        # Embed the query and retrieve the three nearest chunks by cosine distance
        cur.execute("""
            SELECT chunk
            FROM products_embedding
            ORDER BY embedding <=> (SELECT ai.openai_embed('text-embedding-3-small', %s))
            LIMIT 3;
        """, (query,))
        rows = cur.fetchall()

# Prepare the context for generating the response
context = "\n\n".join([f"Chunk: {row[0]}" for row in rows])
print(context)
Here are the chunks retrieved:
Chunk: product: Versatiliti tee description: A free-moving, breathable and famously resilient tee-shirt, the Versatiliti is designed for the perfect blend of town and country. Its soft 5.5-oz 62% organic cotton is combined with durable yet silky 38% Tencel plaited jersey. For play that's a little bit like work, and vice versa. Recyclable through the Common Threads Recycling Program.<br><br><b>Details:</b><ul> <li>"Environmentally conscious, travel-friendly lightweight organic cotton and Tencel is soft and comfy"</li> <li>Crew neck</li></ul><br><br><b>Fabric: </b>5.5-oz 62% organic cotton/38% Tencel plaited jersey. Recyclable through the Common Threads Recycling Program<br><br><b>Weight: </b>(190 g 6.6 oz)<br><br>Made in Thailand.
Chunk: product: Versatili-tee description: At the top of the T-shirt food chain, the Versatili-Tee has satin-stitch finishing and a sensuous drape that humbles the competition. The plaited blend of organic cotton and Tencel has a soft hand and natural stretch. Recyclable through the Common Threads Recycling Program.<br><br><b>Details:</b><ul> <li>Plaited organic cotton and Tencel fabric is soft and comfy</li> <li>Crew neck</li></ul><br><br><b>Fabric: </b>5.5-oz 62% organic cotton/38% Tencel plaited jersey. Recyclable through the Common Threads Recycling Program<br><br><b>Weight: </b>(126 g 4.4 oz)<br><br>Made in Thailand.
Chunk: product: Text logo t-shirt description: The Text Logo T-Shirt tells the classic story with one word. Printed on pure, 100% organic cotton with PVC- and phthalate-free inks, the shirt is ringspun for softness. Durable, taped shoulder seams minimize chafing. Recyclable through the Common Threads Recycling Program.<br><br><b>Details:</b><ul> <li>Ringspun yarns for a softer hand</li> <li>Taped shoulder seams</li> <li>Screen-print inks are PVC- and phthalate-free</li></ul><br><br><b>Fabric: </b>5.4-oz 100% organic cotton. Recyclable through the Common Threads Recycling Program<br><br><b>Weight: </b>(199 g 6.9 oz)<br><br>Made in USA.
Let’s generate a response using the retrieved context alongside the model query. The retrieval and generation workflow is straightforward thanks to pgai, which integrates seamlessly with AI model providers such as OpenAI and Anthropic. Pgai provides embedding functions (used in the previous sections of this article) as well as chat completion functions, which further simplify the development of AI applications within PostgreSQL. We use one of the latest generative models, gpt-4o-mini, to generate the response.
# Send the instructions, the query, and the retrieved context to the generative model
with connect_db() as conn:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT ai.openai_chat_complete(
                'gpt-4o-mini',
                jsonb_build_array(
                    jsonb_build_object('role', 'system', 'content', 'you are a helpful assistant'),
                    jsonb_build_object('role', 'user', 'content', %s)
                )
            )->'choices'->0->'message'->>'content';
            """,
            (f"Query: {query}\nContext: {context}",)
        )
        model_response = cur.fetchone()[0]
        print(model_response)
There are various types of t-shirts designed with different materials and purposes. Here are some notable examples:
1. **Versatiliti Tee**:
- **Material**: Made from a blend of 62% organic cotton and 38% Tencel plaited jersey, it weighs 5.5 oz.
- **Features**: This t-shirt is known for its breathability and resilience, making it suitable for both casual outings and light work. It has a crew neck design and is environmentally friendly, being recyclable through the Common Threads Recycling Program. It aims to provide comfort during active use.
- **Weight**: 190 g (6.6 oz)
- **Production**: Made in Thailand.
2. **Versatili-Tee**:
- **Material**: Also made from a 62% organic cotton and 38% Tencel blend, weighing 5.5 oz.
- **Features**: This t-shirt offers satin-stitch finishing and has a soft hand with a natural stretch, providing a luxurious drape. It is also designed for comfort and is recyclable through the Common Threads Recycling Program.
- **Weight**: 126 g (4.4 oz)
- **Production**: Made in Thailand.
3. **Text Logo T-Shirt**:
- **Material**: This t-shirt is made from 100% organic cotton, weighing 5.4 oz.
- **Features**: It features a simple text logo printed with PVC- and phthalate-free inks. The ringspun yarns ensure a soft feel, and the taped shoulder seams help minimize chafing. It is also recyclable through the Common Threads Recycling Program, making it an environmentally conscious choice.
- **Weight**: 199 g (6.9 oz)
- **Production**: Made in the USA.
These t-shirts highlight the range of styles and purposes available in the t-shirt market, emphasizing comfort, environmental responsibility, and style.
Besides creating embeddings, pgai Vectorizer also manages embeddings whenever the source data changes. In this section, we’ll walk you through an embedding management use case by inserting new data into the source table.
We only used half of the dataset when creating the vectorizer. Let’s insert the other half, subset_2, to showcase how pgai Vectorizer handles new data.
insert_subset_data(subset_2)
After inserting the second half of the dataset into the source table, products, using the helper method, we can go to the AI tab in the service console and observe that the vectorizer has already scheduled the processing of 250 items (the new entries).
Since the Timescale job scheduler runs every five minutes, we can expect the vectorizer to start processing the new items within five minutes. In practice, less than a minute after the insert, the vectorizer ran and generated 604 embeddings from the new data, all without any additional code. Just insert new data, and the vectorizer handles the rest.
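You can also monitor progress from SQL instead of the console. Here’s a sketch assuming the monitoring helpers described in the pgai documentation (verify the names against your installed version):
-- High-level status of all vectorizers, including queued items
SELECT * FROM ai.vectorizer_status;

-- Number of items still waiting to be embedded for vectorizer 1
SELECT ai.vectorizer_queue_pending(1);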
This example illustrates not only how pgai Vectorizer simplifies the embedding management workflow but also how quickly the extension processes data changes. We can inspect the newly generated embeddings with the following query:
SELECT * FROM products_embedding LIMIT 20;
In this blog post, we walked you through using pgai Vectorizer, an open-source tool that streamlines embedding management by automating embedding creation and updates directly within PostgreSQL, all with a single SQL command.
Forget external pipelines, complex workflows, and custom infrastructure. With this tool and your regular SQL, you can significantly reduce operational overhead while ensuring data and embeddings stay aligned as your dataset evolves.
Features like flexible chunking, data synchronization, and easy model experimentation allow pgai Vectorizer to replace custom solutions and infrastructure without losing the ability to customize your embeddings the way you want. This empowers developers to focus on building AI solutions rather than maintaining infrastructure.
Its seamless integration with the PostgreSQL ecosystem, especially with extensions like pgvector and pgvectorscale, makes AI application development more efficient—you don’t have to leave your PostgreSQL database. Better yet, you don’t have to be an AI expert. Pgai Vectorizer ensures scalable and reliable embedding management, making PostgreSQL an ideal platform for AI innovation.
Pgai Vectorizer is available for you to get started with today in Early Access. Pick your favorite deployment option: fully managed on Timescale Cloud or self-hosted in the PostgreSQL environment of your choice (and leave us a GitHub ⭐ along the way.)
Questions or feedback? Join our Discord community, where our team is ready to help you make the most of your embedding models.
Learn more about pgai Vectorizer and how it can help you build better AI systems with PostgreSQL in our companion blogs and videos.