Category: All posts
Jun 27, 2024
Introducing pgai, a new open-source extension that brings embedding and generation models closer to the database, making PostgreSQL an even better database for AI applications.
There’s an industry-wide push in companies to explore how they can leverage AI and large language models (LLMs) in their products and processes. And with that push, many professional developers find themselves in the category of “experienced developer but a beginner in AI.” Thanks to their engineering experience and expertise, these developers attempt to navigate the uncharted waters of applying AI and LLMs to build products, but without a traditional machine learning background.
In his seminal blog post “The Rise of the AI Engineer,” Shawn Wang (better known as Swyx) talks about this class of developers as “AI Engineers”—developers who specialize in applying AI models, tools, and APIs to build AI products. AI engineers stand in contrast to traditional AI practitioners who are primarily researchers and scientists.
Case in point: we spoke to a Timescale customer who was tasked with building a RAG application at their company. This engineer had decades of database and C experience but complained to us about fumbling around trying to learn Python to quickly build a proof-of-concept (PoC) to validate the core idea. “I wish I could do more of this in SQL” was a common refrain we heard from this and many other engineers we spoke to, ultimately sparking the idea for pgai.
Pgai is a PostgreSQL extension that brings more AI workflows to PostgreSQL, like embedding creation and model completion. Licensed under the Open Source PostgreSQL License, pgai further enriches the PostgreSQL AI ecosystem, making it easier for developers to build search and retrieval-augmented generation (RAG) applications.
We built the pgai extension to give application developers familiar with PostgreSQL more tools to simplify their workflows and face fewer bumps in the road as they travel from idea to PoC to building production-ready AI applications. Simply put, we built pgai to help make more PostgreSQL developers AI engineers.
The initial release of pgai supports creating OpenAI embeddings and getting OpenAI chat completions from models like GPT4o directly from your PostgreSQL database. Pgai complements pgvector by storing embeddings in the pgvector data type and using Python and PL/Python to interact with model APIs from within a PostgreSQL database.
Next, we plan to add support for more models like Claude and Cohere, as well as open-source embedding and generation models via Ollama and HuggingFace endpoints. We welcome community contributions for models and functionality you want to see. To get involved, file an issue in the pgai GitHub repository or contribute a feature today.
We gave a sneak peek of pgai to a select group of developers who are building AI applications with PostgreSQL. Here’s what Web Begole, CTO of Market Reader, a company using PostgreSQL to build an AI-enabled financial information platform, had to say:
“Pgai is incredibly exciting for building AI applications with PostgreSQL. Having embedding functions directly within the database is a huge bonus. Previously, updating our saved embeddings was a tedious task, but now, with everything integrated, it promises to be much simpler and more efficient. This will save us a significant amount of time and effort.”
To give you a taste of what’s possible with the pgai extension, let’s take a look at two common tasks developers building RAG applications face and how pgai can help streamline them:
We’ll use the example of git commit messages as the text we’ll embed and reason over. The code snippet below shows installing the pgai extension and ingesting a CSV file of git commits into a table named commit_history
. We’ll reference this table in the examples below:
-- install pgai (also installs pgvector via cascade)
create extension if not exists ai cascade;
-- a table for storing git commit history
create table commit_history
( id int not null primary key
, author text
, "date" timestamptz
, "commit" text
, summary text
, detail text
);
-- use psql's copy metacommand to load the csv into the table
\copy commit_history from 'commit_history.csv' with (format csv)
Pgai simplifies creating embeddings from existing relational data in PostgreSQL. Let’s look at an example of using the pgai extension to generate an embedding from a table of git commits.
The code snippet below generates embeddings for each row in the commit_history
table using OpenAI’s text-embedding-3-small model. The embeddings are stored in a new table called commit_history_embed
. The code selects each row from the commit_history
table, creates a single text string representation of the commit, and then embeds this text. The resulting embeddings are then inserted into the commit_history_embed
table.
-- we want to embed each row of commit history and put the embedding in this table
create table commit_history_embed
( id int not null primary key
, embedding vector(1536) -- the vector type comes from the pgvector extension
);
-- select from the first table, embed the content, and insert in the second table
insert into commit_history_embed (id, embedding)
select
id
, openai_embed
( 'text-embedding-3-small'
-- create a single text string representation of the commit
, format('author: %s date: %s commit: %s summary: %s detail: %s', author, "date", "commit", summary, detail)
) as embedding
from commit_history
;
In this example, the embeddings are inserted into a new table, but they can also be stored in a new column in the table where the text data is housed.
Pgai enables classification, summarization, and data enrichment tasks on existing relational data. Let’s look at an example of summarization where we use pgai to output release notes in markdown format from a table of git commit messages.
In the code snippet below, we use pgai to perform a summarization task on the rows in thecommit_history
table. The code uses pgai to call OpenAI’s GPT4o model to perform the summarization task, which can be stored in another column or output as a response. We show the generated release notes in the image below.
-- summarize and categorize git commits to produce a release notes document
select openai_chat_complete
( 'gpt-4o'
, jsonb_build_array
( jsonb_build_object
( 'role', 'system'
, 'content', 'You are a software release engineer who summarizes git commits to produce release notes.'
)
, jsonb_build_object
( 'role', 'user'
, 'content'
, -- build up a list of the commit details to append to the prompt
concat
( E'Summarize the following list of commits from the timescaledb git repo from August 2023 in a release notes document in markdown format.\n\n'
, string_agg(x.commit_desc, E'\n\n')
)
)
)
)->'choices'->0->'message'->>'content'
from
(
-- convert each to a text format
select format
( E'%s %s\n\tcommit: %s\n\tauthor: %s\n\tdate: %s\n\tdetail: %s'
, row_number() over (order by "date")
, summary
, "commit"
, author
, "date"
, detail
) as commit_desc
from commit_history
-- just look at commits from August 2023
where date_trunc('month', "date") = '2023-08-01 00:00:00+00'::timestamptz
order by "date"
) x
;
Here’s the resulting release notes generated by the above query:
Pgai is open source under the PostgreSQL License and is available for you to use in your AI projects today. You can find installation instructions on the pgai GitHub repository. You can also access pgai on any database service on Timescale’s cloud PostgreSQL platform.
Pgai is an effort to enrich the PostgreSQL ecosystem for AI. If you’d like to help, here’s how you can get involved:
Let's make Postgres a better database for AI, together!