Written by Haziqa Sajid
Embeddings bridge the gap between human language and AI comprehension by representing words or sentences as points in a multidimensional space that captures their meanings. Various metrics, such as cosine similarity, are then used to measure how closely related these points are in that space.
Cosine similarity is a core computation that powers many systems in natural language processing (NLP). It is one of the key components that gives computers the ability to compare text by meaning. In this article, we discuss cosine similarity, explore its connection with NLP, and review tools that enable cosine similarity computations.
Cosine similarity calculates the cosine of the angle between two vectors, revealing how closely the vectors are aligned. For instance, the embeddings of words like "cat" and "dog" will have a higher cosine similarity than those of "cat" and "car."
The inner product (or dot product) of two vectors, X and Y, involves multiplying their corresponding components and summing the results. For example, if X and Y are 100-dimensional vectors, the inner product ⟨X, Y⟩ is calculated as:

\[
\langle X, Y \rangle = X_1 Y_1 + X_2 Y_2 + \cdots + X_{100} Y_{100} = \sum_{i=1}^{100} X_i Y_i
\]
Geometrically, this inner product relates to the cosine of the angle θ between X and Y:

\[
\langle X, Y \rangle = |X| \, |Y| \cos(\theta)
\]
Here, |X| and |Y| represent the vectors' Euclidean lengths (or magnitudes). The length |X| is given by:

\[
|X| = \sqrt{X_1^2 + X_2^2 + \cdots + X_{100}^2} = \sqrt{\langle X, X \rangle}
\]
Thus, cosine similarity is computed as:

\[
\text{cosine\_sim}(X, Y) = \cos(\theta) = \frac{\langle X, Y \rangle}{|X| \, |Y|}
\]
This computation uses basic operations like multiplication, addition, and square roots.
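For a quick worked example, take two small, made-up 3-dimensional vectors, X = (1, 2, 2) and Y = (2, 1, 2):

\[
\langle X, Y \rangle = 1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2 = 8, \qquad |X| = \sqrt{1^2 + 2^2 + 2^2} = 3, \qquad |Y| = \sqrt{2^2 + 1^2 + 2^2} = 3
\]

\[
\text{cosine\_sim}(X, Y) = \frac{8}{3 \cdot 3} \approx 0.89
\]

A value this close to 1 means the two vectors point in nearly the same direction.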
Cosine similarity is always in the interval [-1, 1]:
Cosine_sim = 1: The angle θ is 0, meaning the vectors point in the same direction (one is a positive scalar multiple of the other).
Cosine_sim = 0: The angle θ is 90 degrees, meaning the vectors are orthogonal, which often indicates they are unrelated in context.
Cosine_sim = -1: The angle θ is 180 degrees, meaning the vectors lie on the same line but point in opposite directions.
While cosine similarity effectively compares the direction of vectors, it does not define a proper distance between them: it ignores vector magnitudes, and the associated cosine distance (1 - cosine similarity) does not satisfy the triangle inequality, so it is not a true metric.
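To make the three cases above concrete, here is a minimal NumPy sketch; the vectors are made-up examples rather than real embeddings:

import numpy as np
from numpy import dot
from numpy.linalg import norm

a = np.array([1.0, 2.0, 3.0])

# Same direction: b is a positive scalar multiple of a, so the similarity is 1
b = 2 * a
print(dot(a, b) / (norm(a) * norm(b)))  # ≈ 1.0

# Orthogonal vectors: the similarity is 0
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(dot(u, v) / (norm(u) * norm(v)))  # 0.0

# Opposite directions: the similarity is -1
print(dot(a, -a) / (norm(a) * norm(-a)))  # ≈ -1.0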
In natural language processing, embeddings transform text data into high-dimensional vectors, capturing the meaning of words and phrases. These vectors encode the frequency and context of words within a text corpus. The vector for a single word is designed to be cosine-similar to the vectors of phrases or contexts where it frequently appears. This is achieved by training neural networks, such as word2vec, to create these embedding functions.
When applying semantic embedding to a body of text, cosine similarity effectively measures the relatedness of meanings between vectors. Vectors with cosine similarity values close to one correspond to texts that have similar meanings.
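For illustration, the sketch below loads a pretrained word2vec model through the gensim library and compares a few word pairs. The model name and example words are just one possible choice, and the first call triggers a large one-time download:

import gensim.downloader as api

# Load pretrained word2vec embeddings (downloaded and cached on first use)
word_vectors = api.load("word2vec-google-news-300")

# similarity() returns the cosine similarity of the two word vectors
print(word_vectors.similarity("cat", "dog"))  # relatively high: the words appear in similar contexts
print(word_vectors.similarity("cat", "car"))  # noticeably lower

The exact scores depend on the model and its training corpus, but the ordering, with "cat"/"dog" scoring above "cat"/"car", is precisely what cosine similarity over embeddings is designed to capture.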
While other vector comparison metrics, like cosine distance or Euclidean distance, exist, cosine similarity is particularly popular for embedded vectors because of its simplicity and scalability.
Scalability: Cosine similarity involves basic computations like multiplication and addition, making it efficient even when dealing with vectors with millions of dimensions.
Sparse vectors: Embedding vectors are often sparse, meaning many entries are zero. The inner product and vector length computations can be heavily optimized in such cases (see the sketch after this list).
Simplicity: Cosine similarity is frequently chosen for its straightforwardness. Unlike Euclidean distance, which can grow without bound, cosine similarity is confined to the range between -1 and 1. This bounded behavior helps avoid overflow issues, making it more practical for most models.
Alignment with embedding algorithms: Embedding algorithms are often designed with cosine-similarity-based comparisons in mind, making cosine similarity a natural and efficient choice for relating the meaning of vectors.
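To illustrate the sparse-vector point above, here is a minimal pure-Python sketch that represents sparse vectors as index-to-value dictionaries and skips zero entries entirely when computing the inner product and lengths:

import math

def sparse_cosine_similarity(x, y):
    # x and y map dimension index -> nonzero value; zero entries are simply absent
    if len(x) > len(y):
        x, y = y, x  # iterate over the smaller vector
    dot = sum(value * y[i] for i, value in x.items() if i in y)
    norm_x = math.sqrt(sum(value * value for value in x.values()))
    norm_y = math.sqrt(sum(value * value for value in y.values()))
    return dot / (norm_x * norm_y)

# Vectors with millions of possible dimensions but only a handful of nonzero entries
doc_a = {3: 1.0, 42: 0.5, 1_000_000: 2.0}
doc_b = {3: 2.0, 7: 4.0, 42: 1.5}
print(sparse_cosine_similarity(doc_a, doc_b))

The cost depends only on the number of nonzero entries, not on the nominal dimensionality of the vectors.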
Cosine similarity in AI systems measures the similarity between text data. This powers applications like search engines and recommendation systems, helping them identify related content. Here are the typical steps involved:
Gather a corpus of text documents, including anything from an organization’s internal records to public datasets. These documents are stored in a database like PostgreSQL for easy access and management.
The next step involves transforming the text data into a numerical format using semantic embeddings. Embeddings are high-dimensional vectors that represent the semantic meaning of text. For instance, models like OpenAI's text-embedding-ada-002 can convert a piece of text into a 1,536-dimensional vector. This is just one approach to converting text into embeddings; many others exist.
For instance, you can use word embeddings like Word2Vec, which captures semantic relationships between individual words, or sentence embeddings like SBERT (Sentence-BERT), which represent entire sentences as vectors. There are also proprietary, closed-source models like OpenAI's text-embedding-ada-002 and Cohere's embedding models.
To enable semantic search, a function converts the user's search query into a vector using the same embedding model. This vector then serves as the query for searching through the stored document vectors.
Cosine similarity compares the query vector against the document vectors. It computes the cosine of the angle between two vectors, providing a measure of similarity that ranges from -1 (opposite in direction) to 1 (identical in direction). Documents with a similarity score closer to 1 are considered more relevant to the query.
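Putting these steps together, here is a minimal sketch using the sentence-transformers (SBERT) library. The model name and documents are illustrative, and in a real system the document embeddings would be stored in a database rather than held in memory:

from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model works here; this one produces 384-dimensional vectors
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to reset your account password",
    "Quarterly revenue grew 12% year over year",
    "Feeding guidelines for adult cats",
]
doc_embeddings = model.encode(documents)

query = "I forgot my login credentials"
query_embedding = model.encode(query)

# util.cos_sim returns a matrix of cosine similarities (here 1 x 3)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]

# Rank documents from most to least relevant to the query
for score, doc in sorted(zip(scores.tolist(), documents), reverse=True):
    print(f"{score:.3f}  {doc}")

Because the same model encodes both the documents and the query, the document whose vector points in the direction closest to the query vector (the password-reset text should rank first here) comes out on top.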
Check out this blog post on creating, storing, and querying OpenAI embeddings for a step-by-step guide with code snippets.
Several tools and methods are available for computing cosine similarity, each suited to different use cases:
Manual implementation: You can implement cosine similarity directly in Python or other programming languages. You just need to replicate the formula discussed above in a function. Here's an implementation:
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(v1, v2):
    # inner product of v1 and v2, divided by the product of their Euclidean lengths
    return dot(v1, v2) / (norm(v1) * norm(v2))
dot and norm are numpy functions for the dot product and the Euclidean norm (vector length), respectively.
Vector databases: Implementing these optimizations yourself makes it easy to overlook important details. Consequently, specialized databases have been developed specifically for handling vectors, with operations such as cosine similarity built in. Traditional databases like PostgreSQL use extensions like pgvector, which includes functions and operators for cosine similarity, making it easier to work with high-dimensional data.
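For example, with pgvector installed, a cosine similarity search can be issued directly from Python. In the sketch below, the connection string, the documents table, and its embedding column are hypothetical; pgvector's <=> operator returns cosine distance, so subtracting it from 1 yields cosine similarity:

import psycopg2

# Hypothetical schema: documents(id bigserial, content text, embedding vector(1536))
conn = psycopg2.connect("postgresql://user:password@localhost:5432/mydb")
cur = conn.cursor()

# In practice, produced by the same embedding model used for the documents
query_embedding = [0.01] * 1536
embedding_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

cur.execute(
    """
    SELECT id,
           content,
           1 - (embedding <=> %s::vector) AS cosine_similarity
    FROM documents
    ORDER BY embedding <=> %s::vector
    LIMIT 5;
    """,
    (embedding_literal, embedding_literal),
)
for row in cur.fetchall():
    print(row)

The same pattern works on any PostgreSQL service with pgvector enabled, including Timescale Cloud, discussed below.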
Nearest neighbor search algorithms: Searching for the closest vectors, or “nearest neighbors,” can be computationally expensive, especially with high-dimensional vectors. Various algorithms optimize this process by instead finding approximate nearest neighbors (ANNs):
Hierarchical Navigable Small World (HNSW) algorithms improve search efficiency by structuring data in a network-like format. In simple terms, HNSW builds a multi-layered graph structure that resembles a small-world network. Think of it like a digital map of a road network: when you zoom out, you see major highways connecting cities; zooming in shows connections within a town; and at the closest level, you can see how neighborhoods and communities are connected. A minimal usage sketch appears after this list.
Centroid indexing partitions the stored vectors into clusters, each represented by its centroid. At query time, only the clusters whose centroids are closest to the query vector are searched, drastically reducing the number of comparisons. This is the idea behind IVF-style indexes such as pgvector's ivfflat.
DiskANN Algorithm from Microsoft is a graph-based approximate nearest neighbor search algorithm that scales to large amounts of data while retaining high recall.
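Here is the minimal HNSW sketch mentioned above, using the hnswlib library; the dimensionality, index parameters, and random data are purely illustrative:

import hnswlib
import numpy as np

dim = 128
num_elements = 10_000

# Random vectors stand in for real embeddings
data = np.random.rand(num_elements, dim).astype(np.float32)

# With the "cosine" space, the index reports 1 - cosine similarity as the distance
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))
index.set_ef(50)  # higher ef means better recall but slower queries

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels)         # ids of the 5 approximate nearest neighbors
print(1 - distances)  # their cosine similarities to the query

Approximate indexes like this trade a small amount of recall for dramatically faster queries than an exact scan over every stored vector.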
Timescale Cloud extends PostgreSQL's capabilities by offering a built-in cosine similarity search function through the pgvector extension. This inclusion allows comparisons between high-dimensional vectors for semantic search. Timescale also supports other comparison metrics if needed.
Timescale uses a state-of-the-art nearest neighbor algorithm based on DiskANN—StreamingDiskANN—ensuring top-tier performance for ANN queries. This new index method for pgvector data is part of a PostgreSQL open-source extension developed by the Timescale team called pgvectorscale.
Pgvectorscale's StreamingDiskANN index has no "ef_search"-type cutoff. Instead, it uses a streaming model that allows the index to continuously retrieve the "next closest" item for a given query, potentially traversing the entire graph!
Benchmark tests reveal that, compared to Pinecone's storage-optimized index (s1), Timescale's supercharged PostgreSQL with pgvector and pgvectorscale achieves 28x lower p95 latency and 16x higher query throughput for approximate nearest neighbor queries at 99% recall, all at 75% lower monthly cost when self-hosted on AWS EC2.
Additionally, there are several other benefits:
Rich support for backups: Timescale Cloud supports consistent backups, streaming backups, and incremental and full backups. In contrast, Pinecone only supports a manual operation to take a non-consistent copy of its data called “Collections.”
Point-in-time recovery: Timescale Cloud offers PITR to recover from operator errors.
High availability: This feature is designed for applications that need high-uptime guarantees.
Security: Timescale Cloud offers enterprise-grade security, including SOC2 Type II and GDPR compliance, with encryption and multi-factor authentication.
Flexibility: It supports various index types, allowing tailored solutions for AI needs.
Familiarity: Timescale Cloud complements, rather than replaces, pgvector, ensuring a low barrier to adoption for users already familiar with PostgreSQL.
Embeddings serve as points in space, and cosine similarity gives us a numerical score that quantifies their relationship. With this approach, you can group content with similar meanings, allowing systems to deliver results beyond simple keyword matching and offer deeper, context-aware responses.
To learn more about cosine similarity and embeddings, check out the related posts on our blog.
Developers can leverage powerful tools like Timescale Cloud to unlock the full potential of cosine similarity. Its open-source AI stack includes pgvector and pgvectorscale, a PostgreSQL extension that builds on pgvector for more performance and scale. Pgvectorscale adds a StreamingDiskANN index to pgvector and statistical binary quantization, unlocking large-scale, high-performance AI use cases previously achievable only with specialized vector databases like Pinecone.
Pgvectorscale is open source under the PostgreSQL License and is available for you to use in your AI projects today. You can find installation instructions on the pgvectorscale GitHub repository. You can also access pgvectorscale on any database service on Timescale’s cloud PostgreSQL platform.