Using Pgvector With Python


Written by Haziqa Sajid

Have you ever wondered how artificial intelligence (AI) systems can understand our language? The key concept is embeddings, where words and phrases are converted into high-dimensional vectors that capture their meanings and relationships. These vectors allow computers to perform mathematical operations on language data. The challenge then becomes storing these high-dimensional vectors efficiently.
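To make that concrete, here is a minimal sketch, using NumPy and made-up three-dimensional vectors (real embedding models produce hundreds or thousands of dimensions), of the kind of operation this enables: computing the cosine similarity between two embeddings.

import numpy as np

# Toy "embeddings": real models produce far higher-dimensional vectors
cat = np.array([0.9, 0.1, 0.3])
kitten = np.array([0.85, 0.15, 0.35])

# Cosine similarity: a value close to 1.0 means the vectors point the same way
similarity = cat @ kitten / (np.linalg.norm(cat) * np.linalg.norm(kitten))
print(similarity)  # ~0.99, i.e., semantically close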

The good news is that you can use an old friend, PostgreSQL, and transform it into a full-fledged vector database with the pgvector extension. Adding Python to the mix allows you to build Python applications with machine learning elements. 

For those who aren’t sure about how to achieve this, here’s what you’ll learn in this guide:

  • What are vector databases and pgvector

  • How to install and use pgvector in Python

  • How to leverage PostgreSQL for your AI applications

Pgvector and Python

Let’s start by explaining the concepts required to understand pgvector. Then, we will provide the technical details to get up and running with pgvector in Python.

What is pgvector (and how does it make PostgreSQL a vector database)

A vector database stores data as high-dimensional vectors, mathematical representations of features or attributes. Vectors, with dimensions ranging from tens to thousands, are derived from raw data like text, images, and even videos through embedding algorithms. These databases are designed for fast and precise similarity search and enhanced search based on semantic meaning.

Editor’s Note: Check this article to learn the difference between vector search and semantic search.

PostgreSQL does not have native vector capabilities (as of PostgreSQL 16), and pgvector is designed to fill this gap. Pgvector is an open-source vector similarity search extension for PostgreSQL. Here's how it enhances PostgreSQL, making it a proficient vector database:

  1. Vector and traditional store combined: With pgvector, you can store both vector and traditional data, ensuring compatibility with traditional database features while providing newer capabilities.

  2. Advanced search: Pgvector stores data as vectors, enabling exact and approximate nearest-neighbor search (the latter is used when the search space is extensive) with several distance metrics, such as L2 distance, inner product, and cosine distance. This makes it efficient to find the most similar content for a given query. The full set of supported distance operators is shown below:

Supported operators in pgvector
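For reference, and assuming pgvector v0.7.0 (the version built later in this guide), the distance operators are:

  • <-> : L2 (Euclidean) distance

  • <#> : negative inner product

  • <=> : cosine distance

  • <+> : L1 (taxicab) distance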

  3. Integration with PostgreSQL features: Pgvector seamlessly integrates with standard PostgreSQL features, enhancing its capabilities, which include the following:

  • ACID compliance: ensuring transactional integrity

  • Point-in-time recovery: enabling database restoration to specific moments

  • JOIN support: facilitating data combination from multiple tables (see the sketch after this list)
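For instance, here is a sketch of such a combined query, joining a hypothetical articles table to the embeddings table used later in this guide and ordering by cosine distance (all table and column names are illustrative):

SELECT a.title
FROM articles a
JOIN embeddings e ON e.url = a.url
ORDER BY e.embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 5;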

Enough theory; let’s get into practicality. 

Setting Up Pgvector for Python

In the previous section, we built an understanding of pgvector. This section will set the foundation for seeing it in action using Python. 

Set up a PostgreSQL database

If you haven't already, start by installing a PostgreSQL database. Then, make sure psql is available on your OS so you can connect to it easily. psql, PostgreSQL's terminal-based interactive client, lets you execute SQL statements and manage database objects directly against the server.

If you have installed PostgreSQL correctly, running the following command in the terminal will print version information:

> psql --version

PostgreSQL installation verification

Note: On Windows, you must manually add PostgreSQL's bin directory to the PATH environment variable.

Now, we are ready to connect to the database.

Connect to the database

To connect to the database using psql, we will run the following command in the terminal:

> psql -U postgres

It will prompt for the password, and upon successful authentication, you will be connected to the database.

Connecting to the database using psql

Adding the pgvector extension

Let’s add the pgvector extension and see the relevant requirements. Later, we will walk you through how to use pgvector with Python.

Installing pgvector

On Windows, ensure that C++ support in Visual Studio is installed. If not, install it from here and check the Desktop development with C++ workload.

C++ support in Visual Studio

After this, run the following command in the Command prompt. Make sure it is open in administrator mode.

> call "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat

Note: The specific path will vary based on the version and edition of Visual Studio.

In the next steps, we will set PGROOT to the PostgreSQL installation directory, clone the pgvector repository, and build it with nmake:

> set "PGROOT=C:\Program Files\PostgreSQL\16" > cd %TEMP% > git clone --branch v0.7.0 https://github.com/pgvector/pgvector.git > cd pgvector > nmake /F Makefile.win > nmake /F Makefile.win install

Note: nmake is Microsoft's implementation of the make tool. It is included with the Visual Studio C++ build tools. If the command is not recognized in the Command Prompt, ensure its path is added to the environment variables.

After a successful build, enable the extension by running the following query in psql:

postgres=# CREATE EXTENSION vector;

To confirm that pgvector is installed, list the installed extensions; vector should appear in the output:

postgres=# \dx

Importing pgvector to Python

To use the pgvector extension in your Python projects, follow these steps:

1. Install the pgvector package in your Python environment. This can be done using pip:

%pip install pgvector

2. Once installed, import the pgvector package in your Python script:

import pgvector

3. Additional packages are required to connect to a PostgreSQL database. A commonly used package is psycopg2 (installable with %pip install psycopg2-binary); pgvector also ships an adapter for it, shown in the sketch after this list.

import psycopg2

4. Depending on your specific requirements, you may need to install and import other packages. For example:

  •    For data manipulation and analysis, you might use pandas:

%pip install pandas

import pandas as pd

  •    For object-relational mapping (ORM), you might use SQLAlchemy:

%pip install SQLAlchemy

from sqlalchemy import create_engine

These steps will give you the tools and packages to work with pgvector and PostgreSQL in your Python environment.
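As an optional extra, the pgvector package provides adapters for common drivers. Here is a minimal sketch for psycopg2 (assuming an open connection conn and the vector extension already created), which lets queries send and receive embeddings as NumPy arrays:

from pgvector.psycopg2 import register_vector

# Register the vector type with this psycopg2 connection
register_vector(conn)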

Using Pgvector in Python

This section will explore using the pgvector extension in PostgreSQL with Python, from connecting to the database to querying. We set up all the installations and imports, so let’s get started. 

Setting up the connection 

First, we must connect to our PostgreSQL database using psycopg2. Here's how you can do it:

import psycopg2

conn = psycopg2.connect(
    host="localhost",
    database="your_database_name",
    user="your_username",
    password="your_password"
)

In this snippet, we import the psycopg2 library and use it to connect to the PostgreSQL server. Replace your_database_name, your_username, and your_password with your actual database credentials.

Creating a cursor 

Next, create a cursor to execute SQL commands. 

cur = conn.cursor()

The cursor is used to execute SQL commands and fetch results. We can also verify the connection with the following “hello world” example.

cur.execute("SELECT 'hello world'")
print(cur.fetchone())

Creating a pgvector table 

  • Basic example: Start by creating a simple table with an embedding vector. In the following snippet, we define and execute an SQL command to create a table named vectors with an embedding column of dimension three (3).

# Define a SQL command to create a table
create_table_command = """
CREATE TABLE vectors (
    id bigserial primary key,
    embedding vector(3)  -- specify the dimension of the vector
);
"""

# Execute the SQL command
cur.execute(create_table_command)

# Commit the transaction
conn.commit()

We can confirm the creation of the table by running \dt in psql:

Psql service to ensure the table creation

  • Example with more features: Let's create a table with additional metadata, closer to what a real embedding workload needs. This command creates a table named embeddings with extra columns such as label, url, content, and tokens, plus an embedding vector of dimension three (3).

# Define a SQL command to create a more complex table
create_table_command = """
CREATE TABLE embeddings (
    id bigserial primary key,
    label text,
    url text,
    content text,
    tokens integer,
    embedding vector(3)  -- specify the dimension of the vector
);
"""

# Execute the SQL command
cur.execute(create_table_command)

# Commit the transaction
conn.commit()
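Though not required for this small example, pgvector can also index the embedding column to speed up approximate searches on larger tables. A minimal sketch, reusing the cursor from above to create an HNSW index with cosine distance:

# Optional: index the embedding column for faster approximate search
cur.execute("CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops);")
conn.commit()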

Data insertion 

Let's insert some data in the newly created table. In the real world, the embeddings are generated using a model trained on a vast corpus of data. These models understand the semantics behind the words. 

For this example, we resort to a random array. We use the NumPy library to generate and insert random vectors into the embeddings table. Each entry includes a label, URL, content, token count, and a vector.

import numpy as np

# Define the SQL command for inserting data
insert_command = """
INSERT INTO embeddings (label, url, content, tokens, embedding)
VALUES (%s, %s, %s, %s, %s);
"""

# Create sample data
data = [
    ("label1", "http://example.com/1", "The article on dogs", 100, np.random.rand(3).tolist()),
    ("label2", "http://example.com/2", "The article on cats", 150, np.random.rand(3).tolist()),
    ("label3", "http://example.com/3", "The article on cars", 200, np.random.rand(3).tolist()),
    ("label4", "http://example.com/4", "The article on books", 250, np.random.rand(3).tolist()),
    ("label5", "http://example.com/5", "The article on embeddings", 300, np.random.rand(3).tolist())
]

# Insert data into the table
for record in data:
    cur.execute(insert_command, record)

# Commit the transaction
conn.commit()

Let's check the data stored in the embeddings table. We will use fetchall() to retrieve the data from the table.

fetch_contents = """
SELECT * FROM embeddings;
"""

# Execute the SQL command
cur.execute(fetch_contents)

# Fetch and print all rows
print(cur.fetchall())

Querying similar objects 

Retrieve the top five most similar objects from the database using a nearest-neighbor query. In this example, we generate a random query vector and find the five entries in the embeddings table closest to it by cosine distance (the <=> operator). Note the explicit type cast ::vector, which ensures PostgreSQL interprets the parameter as a vector.

# Generate a random query vector
query_vector = np.random.rand(3).tolist()  # Example query vector
print(query_vector)

# Define the SQL command to retrieve similar objects
retrieve_command = """
SELECT content
FROM embeddings
ORDER BY embedding <=> %s::vector
LIMIT 5;
"""

# Execute the command with the query vector
cur.execute(retrieve_command, (query_vector,))

# Fetch the results
similar_objects = cur.fetchall()

# Print the similar objects
for obj in similar_objects:
    print(obj[0])
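If you also want to see how close each match is, you can select the distance expression directly; here is a small variation on the query above:

# Also return the cosine distance for each match (smaller is more similar)
retrieve_with_distance = """
SELECT content, embedding <=> %s::vector AS cosine_distance
FROM embeddings
ORDER BY cosine_distance
LIMIT 5;
"""
cur.execute(retrieve_with_distance, (query_vector,))
for content, distance in cur.fetchall():
    print(content, distance)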

Closing the connection 

Finally, close the cursor and connection. Closing the cursor and connection ensures that all resources are properly released.

cur.close()
conn.close()
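As a usage note, psycopg2 cursors can also be managed with a with block, so they close automatically; a minimal sketch of the earlier “hello world” check in that style:

with conn.cursor() as cur:
    # The cursor is closed automatically when the block exits
    cur.execute("SELECT 'hello world'")
    print(cur.fetchone())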

Supercharged Pgvector With Python

So far, we've explored pgvector for vector storage and similarity search in PostgreSQL. But you can experience a supercharged version of pgvector: with Timescale Cloud, developers can use pgvector alongside pgvectorscale and pgai, two open-source extensions that turn PostgreSQL into an easy-to-use and high-performance vector database, plus a fully managed cloud database experience.

  • Timescale Cloud enhances PostgreSQL for AI applications by packaging pgvector as part of its offering. It retains all the capabilities of pgvector, such as the vector data type and indexing methods like HNSW and IVFFlat, making it a complement rather than a replacement for pgvector.

  • Timescale Cloud introduces new features on top of pgvector, such as time-based vector search. This makes it easy to migrate your existing pgvector deployment and take advantage of additional features for scale. You also have the flexibility to create different index types suited to your needs.

  • With pgvectorscale, Timescale Cloud accelerates approximate nearest-neighbor (ANN) searches on large-scale vector datasets by incorporating a cutting-edge ANN index inspired by the DiskANN algorithm (StreamingDiskANN). Also included is pgai, which brings more AI workflows to PostgreSQL, making it easier for developers to build search and retrieval-augmented generation (RAG) applications. This complete stack will make the development of AI applications faster, more efficient, and scalable. 

Check out this blog post to learn more about pgai and how it brings embedding and generation models closer to the database. We also explained how pgvectorscale makes PostgreSQL faster and cheaper than other specialized vector databases, like Pinecone.

Let’s set up Timescale Cloud with Python and see it in action.

Set up Timescale Cloud with Python

  1. Installations: Ensure you have installed the required Python packages, timescale_vector and python-dotenv, using the following command:

%pip install timescale_vector python-dotenv

2. Import required libraries: Import the necessary libraries in your Python script as shown below:

from dotenv import load_dotenv, find_dotenv
import os
from timescale_vector import client
import uuid
from datetime import datetime, timedelta

3. Setting up Timescale: To start, sign up, create a new database, and follow the provided instructions. For more information, refer to the Getting Started with Timescale guide.

After signing up, connect to the Timescale database by providing the service URI, which can be found under the service section on the dashboard. The URI will look something like this:

postgres://tsdbadmin:@.tsdb.cloud.timescale.com:/tsdb?sslmode=require

Configuration dashboard for connecting to the service

The password can be created by going to project settings and clicking Create credentials.

Project settings page for creating credentials

4. Set Up .env File: Create a .env file in your project directory and include your PostgreSQL credentials in the following format:

TIMESCALE_SERVICE_URL=your_service_url_here 

5. Load PostgreSQL Credentials: Load your PostgreSQL credentials from the .env file:

_ = load_dotenv(find_dotenv(), override=True)
service_url = os.environ['TIMESCALE_SERVICE_URL']
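As an optional sanity check, the service URI doubles as a standard PostgreSQL connection string, so you can reuse psycopg2 from earlier to confirm the credentials work:

import psycopg2

# Open and immediately close a connection to verify the credentials
conn = psycopg2.connect(service_url)
conn.close()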

Using Timescale Cloud With Python

A vector store client can be created through the client module; it takes the following arguments:

  • service_url: Timescale service URL / connection string

  • table_name: name of the table to use for storing the embeddings

  • num_dimensions: number of dimensions in the vector

Initialize the vector client with your service URL, table name, and the number of dimensions in the vector:

vec = client.Sync(service_url, "embeddings", 3)

Create tables and insert data 

Now, we will create tables for the collection and insert data. The data is represented as follows: 

  • UUID serves as a unique identifier for the embedding. 

  • Metadata about the embedding is stored in JSON format.

  • Text representing the embedding.

  • Embedding itself is also included.

vec.create_tables()

vec.upsert([
    (uuid.uuid1(), {"service": "pgvectors"}, "Pgvector is an open-source vector similarity search extension of PostgreSQL.", [1.0, 1.3, 2.0]),
    (uuid.uuid1(), {"service": "timescale_vector"}, "Timescale Vector amplifies Pgvector's capabilities by integrating it into PostgreSQL for AI applications, adding new functionalities like the Timescale Vector index and time-based vector search, while also enhancing scalability with advanced ANN indexing inspired by the DiskANN algorithm.", [1.0, 10.8, 3.0]),
])

Let's ensure that the data is updated in the table as well. We can see it in the Explorer tab in the Timescale console:

Ensuring data insertion in the Timescale console

Query similar objects 

Timescale Cloud provides a very simple API for searching similar vectors. As mentioned before, in a real application these vectors would come from an embedding model; in this example, we query with an arbitrary hardcoded array.

vec.search([2.0, 9.4, 3.0])

Here’s the result:

>> [[UUID('5773dc0f-20d6-11ef-8777-dae2664cc367'),
  {'service': 'timescale_vector'},
  "Timescale Vector amplifies Pgvector's capabilities by integrating it into PostgreSQL for AI applications, adding new functionalities like the Timescale Vector index and time-based vector search, while also enhancing scalability with advanced ANN indexing inspired by the DiskANN algorithm.",
  array([ 1. , 10.8,  3. ], dtype=float32),
  0.006858884829482381],
 [UUID('5773dc0e-20d6-11ef-a334-dae2664cc367'),
  {'service': 'pgvectors'},
  'Pgvector is an open-source vector similarity search extension of PostgreSQL.',
  array([1. , 1.3, 2. ], dtype=float32),
  0.2235118982678087]]
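The search call also accepts optional parameters; here is a sketch assuming the limit and filter arguments described in the timescale_vector documentation, which cap the number of results and filter on the JSON metadata:

# Return at most one result whose metadata matches the filter
vec.search([2.0, 9.4, 3.0], limit=1, filter={"service": "pgvectors"})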

Conclusion

In this article, we briefly covered the concept of vector databases and took a closer look at pgvector. We walked through installing the extension and used Python to experiment with it. Finally, we used Timescale Cloud (which includes pgvector, pgvectorscale, and pgai) to leverage its faster search capabilities.

Utilizing pgvector with Python empowers the development of robust machine-learning applications. With pgai and pgvectorscale, Timescale Cloud elevates PostgreSQL vector databases to new heights, offering enhanced capabilities and performance at scale. 

Pgai and pgvectorscale are both open source under the PostgreSQL License and available for you to use in your AI projects today. You can find installation instructions on the pgai and pgvectorscale GitHub repositories (Git ⭐s welcome!).  You can also access them on any database service on Timescale’s cloud PostgreSQL platform.