Vector Data

PostgreSQL Extensions: Turning PostgreSQL Into a Vector Database With pgvector

A visual representation of a vector, represented in a graph.

Written by Matvey Arye and Avthar Sewrathan

pgvector is a PostgreSQL extension that provides powerful functionalities for working with high-dimensional vectors. It introduces a dedicated data type, operators, and functions that enable efficient storage, manipulation, and analysis of vector data directly within the PostgreSQL database. If you're looking for a vector database, know that PostgreSQL is all you need. 

The Timescale Cloud platform includes two new open-source PostgreSQL extensions developed by the Timescale Team that complement pgvector: pgai, which brings more AI workflows to PostgreSQL; and pgvectorscale, which enables developers to build more scalable AI applications, with higher performance embedding search and cost-efficient storage. You can learn more about these two extensions in this article.

Let’s walk through the main features and use cases. We’ll also cover a few examples of analyzing relational and time-series data in TimescaleDB.

Understanding pgvector: Enabling Vector Operations

The pgvector extension is a powerful tool for storing, modifying, and querying vectors. This functionality enables applications such as similarity and semantic search, retrieval augmented generation, image search, recommendation systems, natural language processing (NLP), and computer vision. To kick off our pgvector exploration, let's take a look at vectors themselves.

What are vectors?

Vectors refer to mathematical representations of data points in multidimensional space. They are typically arrays or lists of numerical values that capture an entity's essential features or attributes. In machine learning and data science, vectors encode information in a way that enables efficient computation and analysis.

Vectors can represent words, sentences, or documents, allowing for semantic search where similar meanings are identified even if the exact words differ. This can be used to find documents with similar content or answer queries based on the context of the text. By converting visual data into vectors, vector databases (such as PostgreSQL with the pgvector extension) enable searching for similar images or videos. This can be useful in applications like facial recognition, object detection, and content-based image retrieval.

In time series and other types of data, vectors can represent normal behavior patterns, allowing the detection of anomalies by identifying vectors that significantly deviate from the norm.

Here are some of the key features and use cases of pgvector:

  • Vector storage: The pgvector extension lets you store high-dimensional vectors directly in PostgreSQL tables. It provides a dedicated data type for vector representation, allowing efficient storage and retrieval of vector data.

  • Similarity search: With pgvector, you can perform similarity searches based on vector similarity metrics such as cosine similarity or Euclidean distance. Searching for similar vectors facilitates applications like content-based recommendation systems, k nearest-neighbor search, and clustering.

  • Semantic search: Large language model (LLM) embeddings are a way to create vectors from other types of data (e.g., text, images, etc). These embeddings represent the meaning of the underlying data. Therefore, using similarity search to find similar vectors returns data with similar semantic meaning or content, effectively identifying data points that share common themes or topics regardless of their original format (text, images, etc).

  • Natural Language Processing (NLP) and text analysis: Vector embeddings are powerful as they capture the underlying meaning of text. Techniques like word or document embeddings enable you to capture the ideas expressed in the text. You can then perform vector-based operations like similarity search, clustering, or classification on textual data.

  • Computer vision: The pgvector extension can handle vector representations of images and enable similarity-based image search. You can convert images to vector representations using convolutional neural networks (CNN) or image embeddings. These capabilities allow you to perform content-based image retrieval, image similarity matching, object identification, and image clustering within the database.

image

Pgvector stores vectors that represent semantic meaning. Thus things that are meaningfully similar are “closer” together in the vector space. This allows people searching for SUVs to find cars and trucks instead of watermelons and apples, even though SUVs and trucks are different words.

Using Pgvector for Vector Data

By utilizing the pgvector extension, developers can efficiently store and query vector data within PostgreSQL, making it easier to build generative AI applications with LLMs, as well as AI applications that require similarity search, recommendation systems, NLP, and other tasks that involve working with vectors.

  • Installation: You can use pgvector on a cloud database service in Timescale or a self-hosted PostgreSQL instance:

  • Vector data type: pgvector introduces a new data type called vector representing a high-dimensional vector. You can define vector-type columns in your database tables to store vector data. For example, you can have a table called documents with an embedding column of type vector to store LLM embeddings as vectors for each document. CREATE TABLE documents (  id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,  content TEXT,  author_id BIGINT, embedding VECTOR(1538) );

  • Indexing and vector search: pgvector provides indexing mechanisms optimized for approximate nearest neighbor search over vector data. You can create an index on a vector column using the CREATE INDEX command, specifying the hnsw index type and a vector_cosine_ops operator class. This enables fast similarity searches on vector data using the cosine distance similarity metric. You can then use the <=> operator to perform similarity searches in your queries: CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops); --find the closest 5 elements to a given query SELECT * FROM documents ORDER BY embedding <=> '[10.5, 11.0,...]' LIMIT 5;.

  • Vector functions: pgvector has a set of built-in functions to manipulate and perform operations on vector data. These functions allow you to calculate vector similarities, perform vector arithmetic, and more. For example, you can use the cosine_distance function to calculate the cosine similarity between two vectors.  Other useful functions are vector_norm() to get the Euclidean norm and vector_dims() to determine how many dimensions a vector contains.

  • Vector aggregates: pgvector has the avg(vector) and sum(vector) aggregate functions for calculating analytics on vectors.

  • Integration with other PostgreSQL features: pgvector seamlessly integrates with other features of PostgreSQL, such as transaction management, query optimization, and security. This integration enables you to leverage the power of PostgreSQL for complex data processing tasks. Joining pgvector data with other data types, like relational, time-series, and geospatial data, is a powerful way to enrich your vector queries’ data. As a simple example, you could return author information for documents found via similarity search:

WITH matching_docs as (   --find 5 closest matches   SELECT *    FROM documents    ORDER BY embedding <=> '[10.5, 11.0,...]'    LIMIT 5 ) SELECT d.content, a.first_name, a.last_name FROM matching_docs d  INNER JOIN author a ON (a.id = d.author_id).

Why Use Pgvector With Timescale?

Timescale is PostgreSQL++ engineered for time-series data, events, and analytics. It provides a robust foundation for storing, retrieving, and analyzing large volumes of time-series data thanks to its time-based aggregations, data retention policies, and features like hypertables and continuous aggregates.

Many real-world AI applications have retrieval or analytical requirements that include both vector data and a temporal aspect to similarity searches: embeddings of news, legal documents, and financial statements.

In such cases, users want to find similar documents within a specific time frame. Another common use is time-weighting, where users may not use a strict time filter but still prefer more recent or older data. For such workloads, Timescale can markedly enhance data ingestion and query speeds, thanks to hypertables—automatic time-based partitioning of tables. Timescale is especially good at optimizing queries that filter data based on time, enabling those queries to run more efficiently. 

How to Use Pgvector With Timescale

Step 1. Set up pgvector and timescaledb on a PostgreSQL instance. You can do this either with a cloud-hosted database on Timescale (recommended) or a self-hosted instance.

You then need to execute the following:

CREATE EXTENSION vector; CREATE EXTENSION timescaledb;

Step 2. Create a Timescale hypertable with a vector column: Create a regular table in your database that will serve as the basis for the hypertable. Convert the table into a hypertable using the create_hypertable function provided by TimescaleDB. Specify the time column and other relevant options. 

CREATE TABLE documents (   id BIGINT GENERATED ALWAYS AS IDENTITY,   created_at TIMESTAMPTZ,   content TEXT,   embedding VECTOR(1538) ); SELECT create_hypertable('documents', 'created_at');

Step 3. Insert vector data into the hypertable: Use regular SQL INSERT statements to insert data into the hypertable, including the vector data in the designated vector column. Ensure that the vector data is in the correct format expected by pgvector. You will need to get the vector representation from an embedding model (e.g., OpenAI text-embedding-3 models or, if you prefer, open-source sentence transformers). 

INSERT INTO documents (created_at, content,embedding) VALUES ('2023-06-01 00:00:00', 'the quick', '[1,2,3]'),       ('2023-06-02 00:00:00', 'brown fox', '[4,5,6]'),       ('2023-06-03 00:00:00', 'jumped over' '[7,8,9]');

Step 4. Querying and analyzing vector data. Use SQL queries to perform various operations on the vector data stored in the hypertable. You can combine TimescaleDB's time-series functions with pgvector's vector functions for analysis and querying.

Examples of queries you can perform include:

  • Similarity search: Find vectors similar to a given query vector using the <=> operator provided by pgvector. Combine that with filters on other columns, such as time.

  • Aggregation and grouping: Use TimescaleDB's time-series aggregation functions to aggregate vector data over time intervals or other dimensions.

  • Filtering and selection: Use regular SQL filters and conditions to select specific vector data based on certain criteria.

-- Similarity search SELECT * FROM documents ORDER BY data <=> '[1,2,3]'::vector LIMIT 5; -- Similarity search filtered by time SELECT * FROM documents WHERE time >= '2023-06-02 00:00:00' AND time < '2023-06-03 00:00:00' ORDER BY data <=> '[1,2,3]'::vector LIMIT 5; -- Aggregation and grouping, SELECT time_bucket('30 day', created_at) AS day, length(data) AS avg_data_length FROM sensor_data GROUP BY day ORDER BY day; -- Filtering and selection SELECT * FROM sensor_data WHERE time >= '2023-06-02 00:00:00' AND time < '2023-06-03 00:00:00';

Step 6. Optimization and performance. Depending on the scale and complexity of your data, consider optimizing the performance of your queries by creating indexes on the vector column using the CREATE INDEX command with the HNSW index type and the vector_cosine_ops operator class. Tune the configuration parameters of TimescaleDB and PostgreSQL based on your workload and resource requirements to achieve optimal performance.

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

Next Steps

Want to learn more about pgvector and how to use it? Read the following resources:

Get started with pgvector on a mature, production-ready cloud PostgreSQL platform: sign up for Timescale Cloud today. With Timescale Cloud, developers can access pgvector, pgvectorscale, and pgai—extensions that turn PostgreSQL into an easy-to-use and high-performance vector database, plus a fully managed cloud database experience.