Feb 20, 2025
Posted by Hervé Ishimwe
The launch of pgai Vectorizer has transformed how developers incorporate vector embeddings into their applications. With one SQL command, they can automate embedding creation and management, ensuring synchronization with evolving source data—gone are the traditional manual and time-consuming processes.
However, not all developers interact with PostgreSQL directly through SQL. Many rely on integration libraries and frameworks that abstract SQL away, offering a more familiar way to manage databases. This blog post will show you how to seamlessly integrate pgai Vectorizer with Python, one of the most widely used programming languages. We’ll use SQLAlchemy, the Python object-relational mapper (ORM) and SQL toolkit, and Alembic, its database migration tool.
We’ll walk you through the example of a blog application with different blog posts whose titles and content are stored in a table. The goal is to set up a vectorizer that generates and manages embeddings of the blog posts’ content.
Pgai Vectorizer simplifies embedding workflows by automating embedding creation, chunking, and synchronization with the evolving source data.
Let’s dive into the steps to add it to our example setup using SQLAlchemy and Alembic!
To begin, install pgai’s Python library with SQLAlchemy support:
pip install "pgai[sqlalchemy]"
To integrate pgai Vectorizer, define a migration script using Alembic. First, register pgai’s Alembic operations in your env.py:
from pgai.alembic import register_operations
register_operations()
If you are using Alembic’s autogenerate feature to generate migrations, also add the following code to your env.py to ensure that the tables generated and managed by pgai’s Alembic operations are excluded from the autogenerate process:
def include_object(object, name, type_, reflected, compare_to):
    if type_ == "table" and name in target_metadata.info.get("pgai_managed_tables", set()):
        return False
    return True

context.configure(
    connection=connection,
    target_metadata=target_metadata,
    include_object=include_object
)
Assuming the blog posts are stored in a table named blog_posts, we create a migration script that generates embeddings using OpenAI’s text-embedding-3-small as the embedding model:
from alembic import op
from pgai.vectorizer.configuration import (
    EmbeddingOpenaiConfig,
    ChunkingRecursiveCharacterTextSplitterConfig,
    FormattingPythonTemplateConfig
)

def upgrade() -> None:
    op.create_vectorizer(
        source="blog_posts",
        target_table='blog_posts_embedding_store',
        view_table='blog_posts_embedding',
        embedding=EmbeddingOpenaiConfig(
            model='text-embedding-3-small',
            dimensions=1536
        ),
        chunking=ChunkingRecursiveCharacterTextSplitterConfig(
            chunk_column='content',
            chunk_size=800,
            chunk_overlap=400,
            separators=['.', ' ']
        ),
        formatting=FormattingPythonTemplateConfig(template='$title - $chunk')
    )

def downgrade() -> None:
    op.drop_vectorizer(target_table="blog_posts_embedding_store", drop_all=True)
A few things to note from the code above:

- target_table and view_table define the names of the embedding table and view, respectively.
- ChunkingRecursiveCharacterTextSplitterConfig indicates the column from the source table whose content will be used to generate the embeddings. It also lets us set parameters for splitting that content into chunks that fit the context window of the embedding model.
- Each chunk is padded with extra information (here, the blog post title) following the template defined in FormattingPythonTemplateConfig, which is one way to enrich the text provided to the embedding model and the embeddings it generates.
You can find more parameters in Vectorizer’s SQL API reference.
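To build intuition for what the chunking and formatting configuration produces, here is a rough standalone sketch. The naive sliding-window splitter below is an illustration only, not pgai’s actual recursive splitter (which splits on the configured separators); the template line mirrors the `$title - $chunk` pattern above:

```python
from string import Template

# Naive overlapping character chunker (illustration only; pgai's recursive
# splitter prefers to cut at the configured separators)
def chunk(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk("a" * 20, chunk_size=8, chunk_overlap=4)
print(len(chunks))  # 5 overlapping chunks for a 20-character text

# The formatting template prepends the post title to every chunk
template = Template('$title - $chunk')
formatted = template.substitute(title="My Post", chunk=chunks[0])
print(formatted)  # My Post - aaaaaaaa
```

Note how consecutive chunks share `chunk_overlap` characters, so sentences straddling a chunk boundary still appear intact in at least one embedding.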
Pgai’s Python library provides a preconfigured SQLAlchemy relationship, vectorizer_relationship, through which all ORM interactions with vectorizers are done.
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
from pgai.sqlalchemy import vectorizer_relationship

class Base(DeclarativeBase):
    pass

class BlogPost(Base):
    __tablename__ = "blog_posts"
    id: Mapped[int] = mapped_column(primary_key=True)
    title: Mapped[str]
    content: Mapped[str]

    # Add vector embeddings for the content field
    content_embeddings = vectorizer_relationship(
        dimensions=1536
    )
Besides the dimensions parameter, vectorizer_relationship also supports more parameters, including:

- target_schema: the schema of the embedding table; if not specified, the embedding table inherits the schema of its parent (source) table.
- target_table: the name of the embedding table.

You can treat this relationship like any other SQLAlchemy relationship and configure it by setting parent_kwargs parameters.
This relationship allows you to access different embedding properties defined in the embedding table as well as join embedding queries with regular SQL queries as follows:
# Filtering over the blog posts and embeddings at the same time
results = (
    session.query(BlogPost, BlogPost.content_embeddings)
    .join(BlogPost.content_embeddings)
    .filter(BlogPost.title.ilike("%search term%"))
    .all()
)

# Accessing embedding properties
for post, embedding in results:
    print(post.title)
    print(embedding.embedding)       # The vector embedding
    print(embedding.chunk)           # The text chunk
    print(embedding.chunk_seq)       # The sequence number of the chunk
    print(embedding.embedding_uuid)  # The UUID of the embedding
Through this relationship, we can perform semantic search using cosine_distance, one of the distance comparators from the pgvector-python library:
from sqlalchemy import func, text

similar_posts = (
    session.query(BlogPost.content_embeddings)
    .order_by(
        BlogPost.content_embeddings.embedding.cosine_distance(
            func.ai.openai_embed(  # using pgai's embedding functions
                "text-embedding-3-small",
                "search query",
                text("dimensions => 1536")
            )
        )
    )
    .limit(5)
    .all()
)
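For intuition, cosine distance is one minus the cosine similarity of the two vectors, so identical directions score 0 and orthogonal directions score 1. A minimal pure-Python sketch of the computation pgvector performs when ordering rows:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    # 1 - cosine similarity: 0.0 for identical directions, up to 2.0 for opposite
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

Ordering by this distance ascending, as the query above does, therefore returns the posts whose embeddings point most nearly in the same direction as the query embedding.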
Let’s break it down: func.ai.openai_embed, pgai’s OpenAI embedding function, generates the embedding for the search query, which is then compared against the embeddings generated by the vectorizer. You can also provide your own query embedding instead of using pgai’s embedding functionalities:
similar_posts = (
    session.query(BlogPost.content_embeddings)
    .order_by(
        BlogPost.content_embeddings.embedding.cosine_distance(
            [3, 1, 2,...]
        )
    )
    .limit(5)
    .all()
)
In this blog post, we saw how pgai Vectorizer streamlines embedding generation and management: we created a vectorizer through Alembic, queried embeddings, and performed semantic search! This Python integration allows developers to work with familiar tools while enabling powerful AI-driven features with minimal effort.
If you’re building AI applications, explore Timescale’s complete open-source AI stack or head to pgai’s GitHub repository to start bringing AI workflows into PostgreSQL—no need to leave your database. If you find it helpful, we would love your support—leave us a ⭐!
To learn more about our mission of simplifying your AI stack and other integrations, check out these resources: