Nov 19, 2024
Posted by Junaid Ahmed
Many businesses worldwide rely heavily on vector search technologies to power their tools and systems. As the volume and complexity of data involved in AI processes continue to grow, businesses are being driven to invest more in vector databases and real-time search capabilities.
However, managing and searching enormous amounts of vector data comes with a price tag, making efficient vector search integral to keeping operations competitive and scalable. Enter DiskANN.
Developed by Microsoft, DiskANN is a powerful, cost-effective solution for large-scale vector search. It is optimized for use cases where billions of vectors need to be stored and queried. This functionality gives businesses quicker results with lower costs, making it a prime choice for AI-driven systems.
To understand DiskANN, we have to understand the concept of nearest neighbor (NN) search. Let’s say you have a database that consists of vectors representing points in a multidimensional space. Given a query vector, the goal of a nearest neighbor search would be to report the vectors stored in the database that are closest to this query vector based on some distance metric.
This closeness is normally measured with metrics such as cosine similarity, which is widely used in AI systems to calculate the similarity of two vectors based on the angle between them.
Cosine similarity becomes particularly useful in natural language processing, where vectors are word embeddings or document embeddings representing words or documents. Other distance metrics, such as the Euclidean distance, are also used for different applications and types of data.
NN search generally returns the top k results, often the 10 closest vectors, ordered from nearest to furthest. However, standard methods for exact NN search struggle at scale: with millions or billions of high-dimensional vectors, they become very computationally intensive.
The results are exact, but only if you can afford the compute. For instance, consider a recommendation system for online streaming. It can store vector representations of user preferences and use them to find other users with similar tastes or recommend new items. However, a brute-force NN search across millions of users would be computationally expensive and inefficient. This hefty price tag prompted research into more efficient search methods.
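To make this concrete, here is a minimal NumPy sketch of brute-force nearest-neighbor search with cosine similarity. The dataset, dimensionality, and sizes are random and purely illustrative.

```python
# Brute-force nearest-neighbor search with cosine similarity (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(100_000, 128))   # 100k vectors, 128 dimensions
query = rng.normal(size=128)

# Cosine similarity = dot product of L2-normalized vectors.
db_norm = database / np.linalg.norm(database, axis=1, keepdims=True)
q_norm = query / np.linalg.norm(query)
similarities = db_norm @ q_norm

# Exact top 10: every one of the 100k vectors is compared to the query,
# which is why brute force becomes expensive at millions or billions of vectors.
top10 = np.argsort(-similarities)[:10]
print(top10, similarities[top10])
```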
Approximate nearest neighbor, or ANN, search makes the process more efficient by trading a little accuracy for increased speed. ANN techniques use probabilistic approaches to return the closest vectors quickly. They're not necessarily 100% accurate, but they exceed 90% accuracy in many cases.
In many real-world scenarios, ANN methods work well because they deliver relevant results, even if not perfectly precise. This makes them particularly effective where speed matters more than exact accuracy.
As in the recommendation system example, ANN search can efficiently find a set of similar users or items. Even without perfect precision, the recommendations can still be highly relevant while keeping the system responsive.
ANN algorithms achieve this approximation through a variety of approaches that make queries fast even in high-dimensional space. Let's look at some common algorithms and techniques:
The key idea of ANNOY (Approximate Nearest Neighbors Oh Yeah) is to split the vector space randomly and organize those splits into a tree structure. The algorithm constructs multiple trees, each generated by a sequence of splits of the vector space along random hyperplanes.
These splits induce partitions, each of which can be thought of as a neighborhood in high-dimensional space. During the search, the query point traverses these trees, and the algorithm eventually locates the relevant neighborhoods that identify the approximate nearest neighbors.
Because ANNOY builds multiple trees, the search becomes more accurate as the number of trees increases. On the other hand, more trees make the search slower and use more memory. Fortunately, this is tunable: users can adjust the number of trees to balance accuracy against performance, which makes ANNOY highly flexible for large-scale ANN searches. The following image shows the ANNOY algorithm's trade-off between the number of trees, accuracy, and search time.
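As a concrete illustration in code, here is a minimal sketch of how the Annoy library is typically used; the random vectors and the parameter values are made up for the example.

```python
# Minimal Annoy usage sketch (pip install annoy); data is random and illustrative.
import random
from annoy import AnnoyIndex

dim = 64                              # vector dimensionality
index = AnnoyIndex(dim, "angular")    # "angular" corresponds to cosine distance

# Add 10,000 random vectors to the index.
for i in range(10_000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

# More trees -> better accuracy, but slower builds and more memory.
index.build(50)

# Retrieve the 10 approximate nearest neighbors of item 0.
print(index.get_nns_by_item(0, 10))
```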
Locality-sensitive hashing (LSH) is another well-known ANN search algorithm that relies on effective hashing and indexing techniques. Its hash functions are designed so that points close together in vector space are likely to land in the same hash bucket. By grouping and indexing points this way, a search only has to consider the portions of the dataset relevant to the query.
This drastically cuts computation time, making LSH very efficient for high-dimensional searches. Its efficiency depends on the quality of the hash functions, which can be tuned to balance search accuracy and speed.
LSH works well when the dataset is very large since it reduces the number of points that must be compared during the search. However, efficiency can suffer depending on the data's dimensionality and distribution, so tuning its parameters is important for good performance.
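To illustrate the idea, here is a minimal random-hyperplane LSH sketch in NumPy. The dataset, number of hash bits, and helper names are assumptions made for the example, not a production implementation.

```python
# Random-hyperplane LSH sketch: points with matching bit signatures are likely
# to be close in cosine distance, so a query only scans its own bucket.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_points, n_bits = 64, 10_000, 16

data = rng.normal(size=(n_points, dim))
planes = rng.normal(size=(n_bits, dim))   # one random hyperplane per hash bit

def signature(v):
    # Each bit records which side of a hyperplane the vector falls on.
    return tuple((planes @ v) > 0)

# Index: bucket points by their signature.
buckets = defaultdict(list)
for i, v in enumerate(data):
    buckets[signature(v)].append(i)

# Query: only compare against points in the same bucket.
query = rng.normal(size=dim)
candidates = buckets.get(signature(query), [])
if candidates:
    cand = data[candidates]
    sims = cand @ query / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
    print([candidates[i] for i in np.argsort(-sims)[:10]])
```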
Many AI applications operate over high-dimensional vector data. Quantization compresses vectors into more compact representations that support faster distance computations with limited loss of accuracy.
Quantization is often combined with other ANN techniques that further reduce search time. For example, in NLP (natural language processing), word embeddings can be quantized to save storage and accelerate similarity calculations, thus enabling large-scale search to be done with very modest hardware.
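As a rough illustration, here is a minimal product-quantization sketch using NumPy and scikit-learn's KMeans. The dataset, sub-vector count, and codebook size are arbitrary choices for the example, not recommended settings.

```python
# Product-quantization sketch: split each vector into sub-vectors and replace
# each sub-vector with the index of its nearest centroid (1 byte per sub-vector).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, dim, n_sub, n_centroids = 2_000, 64, 8, 256   # 8 sub-vectors of 8 dims each
data = rng.normal(size=(n, dim)).astype(np.float32)

sub_dim = dim // n_sub
codebooks, codes = [], np.empty((n, n_sub), dtype=np.uint8)

for s in range(n_sub):
    sub = data[:, s * sub_dim:(s + 1) * sub_dim]
    km = KMeans(n_clusters=n_centroids, n_init=2, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_)
    codes[:, s] = km.labels_              # each sub-vector stored as a 1-byte code

# Approximate distances from a query to every encoded vector via lookup tables.
query = rng.normal(size=dim).astype(np.float32)
tables = [np.linalg.norm(codebooks[s] - query[s * sub_dim:(s + 1) * sub_dim], axis=1) ** 2
          for s in range(n_sub)]
approx_dist = sum(tables[s][codes[:, s]] for s in range(n_sub))
print(np.argsort(approx_dist)[:10])       # 10 approximate nearest neighbors
```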
One of the key strengths of ANN algorithms is the ability to trade off accuracy with compute time. As in the case of ANNOY, the more random partitions used, the better the accuracy of the search, but computations increase. Similarly, by changing the number of hash functions and hash table sizes, LSH can fine-tune the trade-off between accuracy and performance.
Optimizing the number of partitions or hash functions can yield significant computational savings with minimal loss of accuracy. DiskANN, for example, can achieve higher than 95% accuracy while keeping searches fast, which makes it extremely useful for large-scale AI systems requiring both speed and accuracy.
While ANN algorithms considerably outperform exact nearest neighbor search in terms of efficiency, they do contain some limitations:
While ANN is more efficient than exact NN search, it can still be computationally burdensome at scale. For very large, high-dimensional datasets, its computational requirements can become prohibitive. Storing and processing that many data points demands substantial resources, which is difficult to handle efficiently without the proper infrastructure.
Most ANN algorithms rely on large amounts of RAM to execute searches quickly. This dependency on RAM makes scaling ANN solutions to extensive datasets quite costly. While methods like DiskANN reduce this dependency by using SSDs, many ANN algorithms require high RAM capacity, placing a budget constraint on organizations with limited resources.
ANN can be computationally expensive and may require particular infrastructure to perform searches, which becomes a bottleneck when an organization tries to apply these techniques affordably. ANN is resource-intensive in terms of both computation and storage, even with improved hardware.
Finding inexpensive ways to scale up ANN can be quite challenging. For small companies, where large hardware investments may be beyond the budget, the benefits of ANN might simply not be worth the cost of implementation.
High-dimensional data is often sparse, making it hard to partition or cluster effectively. This issue is known as the "curse of dimensionality," and many ANN techniques need careful tuning and optimization to overcome it. The accuracy of ANN searches also depends on the data distribution and the algorithm used: because of the approximate nature of the search, suboptimal results can occur, especially when data points are clustered closely together.
While these limitations highlight several challenges, advanced techniques like DiskANN offer promising solutions by using innovative indexing and storage methods to overcome some of these issues.
DiskANN is an approach to approximate nearest neighbor search designed to work from SSD storage. As a result, it offers fast, accurate ANN searches without relying heavily on RAM or special hardware. DiskANN's novelty comes from using Vamana graphs to build efficient indexing structures that enhance both speed and precision when searching vast datasets.
It first pre-processes a graph index using the Vamana algorithm. Vamana organizes data points into a graph-like structure where nodes represent data points and edges connect similar nodes. Construction starts from a randomly initialized graph.
Vamana then iterates over the graph, repeatedly updating the connectivity of nodes toward an efficient structure that supports fast traversal during nearest-neighbor searches.
Most graph-based ANN algorithms start from an initial point, usually chosen at random or by some other heuristic, and iteratively step toward the query in a greedy fashion, getting closer to the nearest neighbor with each hop.
GreedySearch is extremely efficient for the Vamana graph structure: every step brings the search process closer to the goal neighbor. This yields a very efficient search process capable of returning approximate nearest neighbors in real time without consuming too much RAM.
GreedySearch combined with Vamana graphs is powerful when used well: the graph structure is optimized for ANN search, enabling fast queries while minimizing how many data points must be accessed. This is a very practical trade-off between resource usage and performance, as the graph structure allows quick convergence to the nearest neighbor.
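As a rough illustration of the greedy idea, here is a toy Python sketch of a greedy walk over a neighbor graph. It is not the actual DiskANN or Vamana implementation, which maintains a bounded candidate list and reads graph neighborhoods from SSD; the graph here is a simple k-nearest-neighbor graph built for the example.

```python
# Toy greedy graph search: always hop to the neighbor closest to the query.
import numpy as np

def greedy_search(vectors, neighbors, entry, query, max_hops=100):
    """Walk the graph, moving to a closer neighbor until none exists."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    for _ in range(max_hops):
        best, best_dist = current, current_dist
        for nbr in neighbors[current]:
            d = np.linalg.norm(vectors[nbr] - query)
            if d < best_dist:
                best, best_dist = nbr, d
        if best == current:          # no neighbor is closer: local optimum
            break
        current, current_dist = best, best_dist
    return current, current_dist

# Toy dataset: 500 random points with 8 nearest-neighbor edges per node.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 16))
dists = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
neighbors = {i: list(np.argsort(dists[i])[1:9]) for i in range(500)}

query = rng.normal(size=16)
print(greedy_search(vectors, neighbors, entry=0, query=query))
```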
DiskANN's applications are vast. Active research and development are currently focused on scalability, reducing resource usage, and optimizing for specific use cases such as time-series data. The most important recent advancements to the DiskANN ecosystem are FreshDiskANN and StreamingDiskANN.
Keeping indices updated in real time without sacrificing fast, accurate search is difficult. Most graph-based ANN systems perform efficiently on a static dataset, but when the data is updated in real time, through insertions or deletions of data points, most graph-based techniques handle the changes inefficiently.
FreshDiskANN solves this problem with a graph-based index that supports real-time updates without expensive rebuilds. It supports both insertions and deletions while maintaining more than 95% recall on dynamic datasets in real time.
It handles billions of points on a single machine with a hybrid memory system combining DRAM and SSD. By doing so, FreshDiskANN provides significant hardware cost savings in maintaining "fresh" indices in very large-scale dynamic datasets. This progress is helpful in domains where data is regularly updated, like recommendation engines, document indexing, and search engines.
Another application of DiskANN is StreamingDiskANN, which was developed by the Timescale team as part of the pgvectorscale PostgreSQL extension. We immediately recognized DiskANN's potential but also saw an opportunity to enhance its capabilities to meet the unique needs of time-series and streaming data applications. This recognition led us to develop an extension called StreamingDiskANN, designed explicitly for dynamically changing datasets that update continuously.
StreamingDiskANN keeps one of DiskANN's desirable features: disk-based indexing. It stores the ANN index on disk rather than in memory. Thus, StreamingDiskANN can represent a very cost-effective way to handle high-dimensional problems for PostgreSQL users. Using SSDs, StreamingDiskANN can store terabytes of vectors without the prohibitive costs of large in-memory databases.
More importantly, StreamingDiskANN integrates with pgvector, so PostgreSQL users can easily create vector indexes and perform high-performance searches. The integration is designed to be seamless, so developers do not have to make considerable changes to their data schema or query logic.
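As a rough sketch of what that integration can look like in practice, the following Python snippet creates a table with a pgvector column and a diskann index via psycopg2. The connection string, table and column names, dimensionality, and operator class are illustrative assumptions; check the pgvectorscale documentation for the current syntax and options.

```python
# Illustrative sketch: creating a StreamingDiskANN (diskann) index with pgvectorscale.
import psycopg2

conn = psycopg2.connect("postgresql://postgres:password@localhost:5432/postgres")
conn.autocommit = True
cur = conn.cursor()

# pgvectorscale builds on pgvector; CASCADE installs pgvector as well.
cur.execute("CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;")

cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id BIGSERIAL PRIMARY KEY,
        embedding VECTOR(768)
    );
""")

# The diskann index access method is provided by pgvectorscale.
cur.execute(
    "CREATE INDEX IF NOT EXISTS documents_embedding_idx "
    "ON documents USING diskann (embedding vector_cosine_ops);"
)

# Query the 10 nearest rows by cosine distance (pgvector's <=> operator).
query_vec = "[" + ",".join("0.1" for _ in range(768)) + "]"
cur.execute(
    "SELECT id FROM documents ORDER BY embedding <=> %s::vector LIMIT 10;",
    (query_vec,),
)
print(cur.fetchall())
```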
Regarding streaming filtering, StreamingDiskANN significantly outperforms traditional methods. In many real-world applications, extra filters are needed on top of the vector similarity search, for example, to filter results based on metadata or certain tags.
For example, in HNSW, a traditional in-memory index, searches are limited by the ef_search parameter, which controls how many records are fetched before filtering is applied. If none of the initially retrieved vectors match the filtering criteria, the results may be incomplete or inaccurate.
StreamingDiskANN instead uses a get_next() function to fetch the next nearest vector continuously until a sufficient number of records pass the filter condition. Searching in this streaming fashion guarantees accurate results even when secondary filters are applied to the query, so relevant results are not missed because of a preset limit on the number of returned results, such as ef_search.
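The following toy Python sketch captures that streaming pattern conceptually; the nearest_first and filtered_search helpers are illustrative stand-ins, not pgvectorscale's actual implementation.

```python
# Conceptual sketch of streaming filtered search: keep pulling the next-nearest
# vector until enough results pass the metadata filter.
import heapq
import numpy as np

def nearest_first(vectors, query):
    """Yield (index, distance) pairs from nearest to furthest."""
    dists = np.linalg.norm(vectors - query, axis=1)
    heap = [(d, i) for i, d in enumerate(dists)]
    heapq.heapify(heap)
    while heap:
        d, i = heapq.heappop(heap)
        yield i, d   # conceptually what repeated get_next() calls provide

def filtered_search(vectors, query, predicate, k):
    """Collect results until k vectors satisfy the filter predicate."""
    results = []
    for i, d in nearest_first(vectors, query):
        if predicate(i):
            results.append((i, d))
            if len(results) == k:
                break
    return results

# Toy usage: only even-numbered items satisfy the metadata filter.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 32))
print(filtered_search(data, rng.normal(size=32), lambda i: i % 2 == 0, k=10))
```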
DiskANN provides a robust nearest-neighbor search that strikes an effective balance between performance and cost. By leveraging SSD storage, DiskANN can achieve high-accuracy results for large-scale datasets. This makes it highly suitable for modern data systems requiring both speed and scalability.
With the technology still evolving, integrating StreamingDiskANN with advanced AI capabilities from Timescale will enable new possibilities around vector data for a wide array of applications, improving their performance and scalability. StreamingDiskANN is available by installing the pgvectorscale extension, which is open source and ready to use in your AI projects today. You can find installation instructions in the pgvectorscale GitHub repository.
To realize the power of vector search, find out more about Timescale Cloud's AI stack. With pgvector, pgai, pgai Vectorizer, and, of course, pgvectorscale, vector search capabilities are easily integrated into your AI workflows without the overhead of managing multiple systems. Build smarter, more efficient AI solutions today using a free PostgreSQL database on Timescale Cloud.