AI

Jun 11, 2024

How We Made PostgreSQL as Fast as Pinecone for Vector Data

Hi, we’re Timescale! We build a faster PostgreSQL for demanding workloads like time series, vector, events, and analytics data. Check us out.

How We Made PostgreSQL as Fast as Pinecone for Vector Data

Posted by

Matvey Arye

Matvey Arye

We’ve recently announced the open-sourcing of pgvectorscale, a new PostgreSQL extension that provides advanced indexing techniques for vector data. Pgvectorscale provides a new index method for pgvector data, significantly improving the search performance of approximate nearest neighbor (ANN) queries. These queries are key for leveraging modern vector embedding techniques to facilitate semantic search, which allows for finding things similar to a query's meaning. That, in turn, enables applications like retrieval-augmented generation (RAG), summarization, clustering, or general search.

In our announcement post, we described how our new StreamingDiskANN vector index allows us to perform vector search faster than bespoke purpose-built databases created for this purpose—like Pinecone. We also observed that if bespoke databases aren’t faster, then there is no reason to use them because they can’t possibly compete with the rich feature set and ecosystem of general-purpose databases like PostgreSQL.

In this article we’ll go into the technical contributions that allowed us to “break the speed barrier” and create a fast vector index in PostgreSQL. We’ll cover three technical improvements we made:

  • Implementing the DiskANN algorithm to allow the index to be stored on SSDs instead of having to reside in memory. This vastly decreases the cost of storing large amounts of vectors since SSDs are much cheaper than RAM. 
  • Supporting streaming post-filtering, which allows for accurate retrieval even when secondary filters are applied. In contrast, the HNSW (hierarchical navigable small world) index fails to accurately retrieve data if the filters exclude the first ef_search vectors. Pinecone had previously complained about this problem when comparing itself to pgvector. Guess what; through the power of open source, this issue has been resolved. 
  • Developing a completely new vector quantization algorithm we call SBQ (statistical binary quantization). This algorithm provides a better accuracy vs. performance trade-off compared to existing ones like BQ (binary quantization) and PQ (product quantization).

Enhancing PostgreSQL for Vector Data


Implementing the DiskANN algorithm to optimize for SSD storage

The DiskANN algorithm was developed by work coming out of Microsoft. Its goal was to store a very large number of vectors (think Microsoft scale). At that scale, it was simply uneconomical to store everything in RAM. Thus, the algorithm is geared towards enabling storing vectors on SSDs and using less RAM. Its details are described very well in the paper, so I’ll only give a bit of intuition below. 

The DiskANN algorithm is a graph-based search algorithm like HNSW. Graph-based algorithms in this space have a well-known problem: finding an item that’s “very far” from the starting position is expensive because it requires a lot of hops. 

HNSW solves this problem by introducing a system of layers where the first (top) layer only has “long-range” edges that quickly get you into the right vicinity and have pointers to nodes into lower levels that allow you to traverse the graph in a more fine-grained way. This solves the long-range problem but introduces more indirection through the layering system, which requires more random-access that forces the graph into RAM for good performance. 

In contrast, DiskANN uses a single-layer graph and solves the long-range problem during graph construction by allowing for neighbor edges that refer to far-away nodes. The single-layer construction simplifies the algorithm and decreases the random access necessary during search, allowing SSDs to be used effectively.

Support for streaming retrieval for accurate metadata filtering

Oftentimes, when searching for semantically similar items, you want to constrain your search with additional filters. For example, documents are often associated with a set of tags and you may want to constrain your search by requiring a match of the tags as well as vector similarity.

A diagram representing two-stage filtering
Figure 1: The problem with two-stage post-filtering is that if the matching records aren’t located in the set before the cutoff of the first stage, the final answer will be incorrect.

This is challenging for many HNSW-based indexes (including pgvector’s implementation) because the index retrieves a pre-set number of records from the index (set by the hnsw.ef_search parameter, often set to 1,000 or less) before applying secondary filters. If not enough items in the retrieved set (e.g., first 1,000 items) match the secondary filters, you will miss those results. 

Figure 1 illustrates this problem when you use hnsw.ef_search=5 to find the top two vectors closest to a given query and matching the tag “department=engineering”. In this scenario, the first item with the correct tag is the seventh vector closest to the query. 

Since the vector search returns only the closest five items and none matches the tag filter, no results will be returned! This is an extreme example where no results are left over, but there will be some accuracy loss any time the retrieved set has less than k items matching the filter.

A diagram representing streaming filtering
Figure 2: Streaming filtering produces the correct result by exposing a get_next() function that can be called continuously until the right number of records are found.

In contrast, our StreamingDiskANN index has no “ef_search” type cutoff. Instead, as shown in Figure 2, it uses a streaming model that allows the index to continuously retrieve the “next closest” item for a given query, potentially even traversing the entire graph! The Postgres execution system will continuously ask for the “next closet” item until it has matched the LIMIT N items that satisfy the additional filters. This is a form of post-filtering that suffers absolutely no accuracy degradation.

As a side note, Pinecone made a big deal of the “ef_search” type limitation to deposition pgvector in their comparison. But, with the introduction of StreamingDiskANN, this criticism no longer applies. This just shows the power of open-source projects to move quickly to mitigate limitations.

Statistical binary quantization (SBQ): A new quantization algorithm

Many vector indexes use compression to reduce the space needed for vector storage and make index traversal faster at the cost of some loss in accuracy. The common algorithms are product quantization (PQ) and binary quantization (BQ). In fact, pgvector’s HNSW index just added BQ in their latest 0.7.0 release (hooray!). 

The way most vector databases work to retrieve K results is as follows. The system first retrieves N results (N>K) using the approximate quantized differences, then “corrects” for the error by rescoring. It calculates the full distance for the N results, sorts the list by the full distance, and returns the K items with the smallest distance. Yet, even with rescoring, accuracy is important because it allows you to decrease N (and thus query faster) and improve the chances that the accurate result will be in the set of N pre-fetched results.

We took a look at the BQ algorithm and were unhappy with the amount of accuracy loss it produced. We also immediately saw some low-hanging fruit to improve it. In tinkering with the algorithm, we developed a new compression algorithm we are calling statistical binary quantization (SBQ). 

The BQ compression algorithm transforms a floating-point vector into a binary vector in a surprisingly simple way: for each element in the vector, if the value is greater than 0.0, make the binary value 1; otherwise, set the binary value to 0. Then, the distance function simply becomes the XOR function. Why XOR? Well, you’ll find many mathematical explanations (none of which we quite like) but the intuition we use is that the binary vector divides the space into quadrants as seen in Figure 3, and the XOR function is simply a count of how many planes you have to cross to get from one quadrant to another. 

A diagram representing BQ
Figure 3: BQ for three dimensions. Quadrant 1 is represented by the binary vector [1,1,1] and any vector falling into that quadrant will have a distance of 0. The distance with vectors in other quadrants increases with the number of dimensions that are different.

One of the immediate things that struck us as odd is that the cutoff for each dimension is always 0.0. This was odd because in analyzing real embeddings we’ve previously found that the mean for each dimension is not even approximately 0.0. That means the quadrants we are defining in BQ are not dividing the space of points in half and thus missing out on opportunities for differentiation. 

Intuitively, you want the “origin” of your cutting plane in the middle of all the action, but in BQ, it’s off to the side. The solution was quite simple: we used a learning pass to derive the mean value for each dimension and then set the float-value cutoff to the mean instead of 0.0. Thus, we set the binary value for an element to 1 if and only if the float value is greater than the mean for the dimension.

But then we noticed yet another odd thing: the compression algorithm worked better for 1,536 dimensions than for 768 dimensions. This made little sense to use because the literature strongly implies that problems with higher dimensions are harder than lower dimensions (the so-called “curse of dimensionality”). But here, the opposite is true. 

However, thinking about the quadrant analogy, this kind of made sense—we’d have fewer quadrants with 768 dimensions, and each quadrant would be bigger and thus less differentiated. So we asked ourselves, could we create more quadrants with 768 dimensions? 

Our approach was to convert each floating-point dimension into two bits (which we later generalized). The idea was to use the mean and standard deviations to derive a z-score (a value’s distance from the mean normalized by standard deviation) and then divide the z-score into three regions. We then encode the three regions into two bits so that adjacent regions have a XOR distance of 1, and the distance increases with the z-score distance. In the two-bit case with three regions the encoding is 00, 01, 11. 

Experimentally, we found that two-bit encoding really helps accuracy with the 768-dimension case. Thus, by default, we use two-bit encoding for any data with less than about 900 dimensions and one-bit encoding otherwise. In one representative example on a dataset with 768 dimensions, the recall improved from 96.5 % to 98.6 % when switching from the one-bit to two-bit encoding, a significant improvement at such high recall levels.

In sum, these techniques help us achieve a better accuracy/performance trade-off.

A Better PostgreSQL for Vector Data

The three techniques we covered in this post allow us to develop a best-in-class index for vector data in PostgreSQL that rivals the performance of bespoke databases like Pinecone. We were able to achieve this with a small team by harnessing much of the infrastructure that PostgreSQL provides, including caching, WAL (write-ahead logging), and the associated recovery infrastructure, and a rock-solid disk writing system. 

We wrote this in Rust using the PGRX framework for writing Rust extensions for PostgreSQL. This further sped up development because we could rely on some of the safety guarantees that Rust and PGRX provide while developing our own safe wrappers for tricky parts of the code (like disk I/O). We think that this combination of tools is really useful and powerful for developing database features and extending the reach of PostgreSQL. 

Next steps

Our team has been working tirelessly in the last few months to equip PostgreSQL with these new advanced indexing techniques for vector data. Our goal is to help PostgreSQL developers become AI developers. But for that, we need your feedback.

Here’s how you can get involved: 

  • Share the news with your friends and colleagues: Share our posts announcing pgai and pgvectorscale on X/Twitter, LinkedIn, and Threads. We promise to RT back.
  • Submit issues and feature requests: We encourage you to submit issues and feature requests for functionality you’d like to see, bugs you find, and suggestions you think would improve both projects.
  • Make a contribution: We welcome community contributions for both pgvectorscale and pgai. Pgvectorscale is written in Rust, while pgai uses Python and PL/Python. For pgai specifically, let us know which models you want to see supported, particularly for open-source embedding and generation models. See the pgai GitHub for more.
  • Offer pgvectorscale and pgai extensions on your PostgreSQL cloud: Pgvectorscale and pgai are open-source projects under the PostgreSQL License. We encourage you to offer pgvectorscale and pgai on your managed PostgreSQL database-as-a-service platform, and we can even help you spread the word. Get in touch via our Contact Us form and mention pgai and pgvectorscale to discuss further.
  • Use pgai and pgvectorscale today: You can find installation instructions on the pgai GitHub and pgvectorscale GitHub repositories, respectively. You can also access both pgai and pgvectorscale on any database service on Timescale’s cloud PostgreSQL platform. For production vector workloads, we’re offering private beta access to vector-optimized databases with pgvector and pgvectorscale on Timescale. Sign up here for priority access.

Originally posted

Jun 11, 2024

Share

pgai

3.3k

pgvectorscale

1.5k

Subscribe to the Timescale Newsletter

By submitting you acknowledge Timescale's Privacy Policy.