Oct 16, 2024
Posted by Team Timescale
Vector embeddings are essential for AI applications like retrieval-augmented generation (RAG), agents, natural language processing (NLP), semantic search, and image search. If you’ve ever used ChatGPT, language translators, or voice assistants, there’s a high chance you’ve come across systems that use embeddings.
An embedding is a compact representation of raw data, such as an image or text, transformed into a vector comprising floating-point numbers. It’s a powerful way of representing data according to its underlying meaning. High-dimensional data is mapped into a lower-dimensional space (think of it as a form of “lossy compression”) that captures structural or semantic relationships within data, making it possible for embeddings to preserve important information while reducing the computational burden that comes with processing large datasets. It also helps uncover patterns and relationships in data that might not have been apparent in the original space.
Since an embedding model represents semantically similar things close together (the more similar the items, the closer their embeddings sit in the vector space), computers can now search for, recommend, and cluster semantically similar things with improved accuracy and efficiency.
In this article, we’ll examine vector embeddings in depth, including the types of vector embeddings, how neural networks create them, how vector embeddings work, and how you can create embeddings for your data.
While the terms "vector" and "embedding" are often used interchangeably, and both refer to numerical data representations where data points are expressed as vectors in high-dimensional space, they're not the same thing. A vector is simply an array of numbers, each corresponding to a specific dimension or feature, while an embedding is a vector that represents data in a structured, meaningful way in a continuous space.
Embeddings are represented as vectors, but not all vectors are embeddings: embedding models are one way to produce vectors, but vectors can be produced in many other ways, too.
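A tiny illustration of the difference, with made-up numbers: a one-hot vector merely indexes a word and carries no semantics, while an embedding for the same word places it in a continuous space where distance reflects similarity.

# "cat" as a plain vector: a one-hot encoding over a 6-word vocabulary (no meaning encoded)
one_hot_cat = [0, 0, 1, 0, 0, 0]

# "cat" as an embedding: dense floats whose position encodes meaning (values made up)
embedding_cat = [0.21, -0.07, 0.55, 0.02]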
Embedding is the process of turning raw data into vectors, which can then be indexed and searched over. Meanwhile, indexing is the process of creating and maintaining an index over vector embeddings, a data structure that allows for efficient search and information retrieval from a dataset of embeddings.
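To make the distinction concrete, here's a minimal NumPy sketch (with made-up vectors) of searching embeddings without an index: every query has to scan the full dataset. A vector index (for example, the HNSW or IVFFlat indexes pgvector offers) exists precisely to avoid this full scan once the dataset grows large.

import numpy as np

# A tiny "dataset" of embeddings, one row per item (values are made up)
embeddings = np.array([
    [0.9, 0.1, 0.0],   # item 0
    [0.8, 0.2, 0.1],   # item 1
    [0.0, 0.1, 0.9],   # item 2
])

query = np.array([0.85, 0.15, 0.05])

# Brute-force search: compute cosine similarity against every stored vector
scores = embeddings @ query / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query))
print(scores.argsort()[::-1])  # indices ranked from most to least similar

# An index (e.g., HNSW) maintains a data structure over these vectors so that
# nearest-neighbor queries don't have to visit every row.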
There are many different kinds of vector embeddings, each representing a different kind of data: word, sentence, and document embeddings for text, plus image embeddings, audio embeddings, and graph embeddings, among others. In practice, almost any data type or object can be represented as a vector embedding, including text, images, audio, video, products, and user profiles.
Neural networks, including large language models like GPT-4, Llama-2, and Mistral-7B, create embeddings through a process called representation learning. In this process, the network learns to map high-dimensional data into lower-dimensional spaces while preserving important properties of the data. They take raw input data, like images and texts, and represent them as numerical vectors.
During the training process, the neural network learns to transform these representations into meaningful embeddings. This is usually done through layers of neurons (like recurrent layers and convolutional layers) that adjust their weights and biases based on the training data.
The process looks something like this:
Neural networks often include embedding layers within the network architecture. These receive processed data from preceding layers and have a set number of neurons that define the dimensionality of the embedding space. The weights within the embedding layer are initialized randomly and then updated through techniques like backpropagation. These weights serve as the embeddings themselves, gradually evolving during training to encode meaningful relationships between input data points. As the network continues to learn, the embeddings become increasingly refined representations of the data.
Through iterative training, the neural network refines its parameters, including the weights in the embedding layer, to better represent the meaning of a particular input and how it relates to another piece of input (like how one word relates to another). Backpropagation is used to adjust these weights along with other weights depending on whether the overall task involves image classification, language translation, or something else.
The training task is essential in shaping the learned embeddings. Optimizing the network for the task at hand forces it to learn embeddings that capture the underlying semantic relationships within the input data.
Let’s take an example to understand this better. Imagine you’re building a neural network for text classification that determines whether a movie review is positive or negative. The network first maps each word of the review to a vector through an embedding layer, then passes those vectors through further layers to predict a positive or negative label. As training minimizes the classification error, backpropagation adjusts the embedding weights so that words carrying similar sentiment end up close together in the embedding space.
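As a rough sketch of that setup, here is a minimal PyTorch model; the vocabulary size, embedding dimension, and layer choices below are illustrative assumptions, not a prescription. The nn.Embedding layer holds the trainable embedding weights, and backpropagation on the classification loss is what shapes them.

import torch
import torch.nn as nn

class ReviewClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=64):
        super().__init__()
        # Embedding layer: one trainable vector per word in the vocabulary
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Simple classifier head on top of the averaged word embeddings
        self.classifier = nn.Linear(embed_dim, 1)

    def forward(self, token_ids):
        vectors = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        pooled = vectors.mean(dim=1)          # average the word vectors per review
        return self.classifier(pooled)        # positive/negative logit

model = ReviewClassifier()
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters())

# One training step on a fake batch of tokenized reviews and labels
token_ids = torch.randint(0, 10_000, (8, 20))   # 8 reviews, 20 tokens each
labels = torch.randint(0, 2, (8, 1)).float()    # 1 = positive, 0 = negative
loss = loss_fn(model(token_ids), labels)
loss.backward()                                 # backprop updates the embedding weights too
optimizer.step()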
Vector embeddings work by representing features or objects as points in a multidimensional vector space, where the relative positions of these points represent meaningful relationships between the features or objects. As mentioned, they capture semantic relationships between features or objects by placing similar items closer together in the vector space.
Then, distances between vectors are used to quantify relationships between features or objects. Common metrics include Euclidean distance, cosine similarity, and Manhattan distance, each measuring how “close” or “far” vectors are from each other in the multidimensional space.
Euclidean distance measures the straight-line distance between points. Meanwhile, cosine similarity measures the cosine of the angle between two vectors. The latter is often used to quantify how similar two vectors are, regardless of their magnitudes. The higher the cosine similarity value, the more similar the vectors.
Consider a word embedding space where words are represented as vectors in a two-dimensional space. In this space, words like “cat,” “dog,” and “car” each occupy a point, and the geometry between those points encodes how related the words are.
In this example, the Euclidean distance between the words “cat” and “dog” is less than the distance between the words “car” and “cat”, indicating that “cat” is more similar to “dog” than to “car.” Meanwhile, the cosine similarity between “cat” and “dog” is also higher than the similarity between “cat” and “car”, which further indicates their semantic similarity.
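A quick NumPy sketch makes this concrete. The 2-D coordinates below are invented purely for illustration; real embeddings have hundreds or thousands of dimensions.

import numpy as np

cat = np.array([2.0, 3.0])
dog = np.array([2.5, 3.2])
car = np.array([8.0, 1.0])

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean(cat, dog), euclidean(cat, car))                   # cat is much closer to dog
print(cosine_similarity(cat, dog), cosine_similarity(cat, car))   # and more similar to dog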
You can also use embeddings for tasks like language translation, sentiment analysis, text normalization, and entity recognition. Plus, embeddings allow GenAI models to generate more realistic content, whether that’s images, music, or text.
Similarly, you can use embeddings to power retrieval-augmented generation (RAG) in your applications.
Embedding-based RAG systems combine the benefits of both large language model (LLM) generation-based and retrieval-based approaches. For instance, in a support assistant application, you can use embeddings to retrieve relevant context related to the customer and then have an LLM generate responses based on the retrieved context to give the customer a more personalized and useful support response.
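Here’s a deliberately simplified sketch of that retrieve-then-generate flow, using the OpenAI Python client covered later in this article. The documents, helper function, and model names are illustrative assumptions; a real system would pull context from a vector database rather than an in-memory list.

import numpy as np
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

# Hypothetical support knowledge base
docs = [
    "To reset your password, use the 'Forgot password' link on the login page.",
    "Invoices are emailed on the first business day of each month.",
]

def embed(text):
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

doc_vectors = [embed(d) for d in docs]

# 1. Retrieve: find the document most similar to the customer's question
question = "How do I change my password?"
q = embed(question)
best_doc = docs[int(np.argmax([v @ q for v in doc_vectors]))]

# 2. Generate: ask the LLM to answer using the retrieved context
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer using this context: {best_doc}"},
        {"role": "user", "content": question},
    ],
)
print(answer.choices[0].message.content)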
Creating vector embeddings of your data is also commonly called vectorization. Here’s a general overview of the vectorization process:
While different techniques are used to embed different kinds of data, you usually call an embedding model via an API to create vector representations.
If your data is already in PostgreSQL, you can simply use pgvectorizer, which is part of Timescale’s Python library and makes it simple to manage embeddings. In addition to creating embeddings from data, pgvectorizer keeps the embedding and relational data in sync as data changes, allowing you to leverage hybrid and semantic search within your applications.
Before you start, you first need to head over to the OpenAI website and create your API key.
Then, to create your own embeddings, make sure you install the OpenAI Python package.
pip install openai
Next, import the OpenAI class from the module:
from openai import OpenAI
Next, create an instance of the OpenAI class (we’ll call it client) and initialize it with your API key to authenticate and authorize access to the API:
client = OpenAI(api_key="YOUR_API_KEY")
Then, as defined in the API docs, we use the client.embeddings.create() method to generate embeddings for a given text. We also need to specify the model, provide the input text, and choose the encoding format before sending the request:
# Generate an embedding for a given input sentence
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How to create vector embeddings using OpenAI",
    encoding_format="float"
)
# Extract the embedding vector from the response object
embedding = response.data[0].embedding
The result you get is the vector embedding for the provided text: a long list of floating-point numbers (1,536 of them by default for text-embedding-3-small).
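For a quick sanity check, you can print the embedding's length and a few of its values; the float values shown in the comment below are made up, but the default dimensionality of text-embedding-3-small is 1,536.

print(len(embedding))   # 1536
print(embedding[:4])    # e.g., [0.0023, -0.0118, 0.0164, 0.0072]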
Traditional databases aren’t built to handle high-dimensional vector data efficiently, making it difficult to search it and extract meaningful insights. This is where vector databases come in: specialized databases designed to efficiently store, index, and retrieve vector embeddings.
A great way to store vector embeddings is using PostgreSQL since it allows you to combine vector data with other kinds of relational and time-series data, like business data, metrics, and analytics. While PostgreSQL wasn’t originally designed for vector search, extensions like pgvector make it possible to store vectors in a regular PostgreSQL table in a vector column, along with metadata. It’s a powerful extension for storing and querying vectors and enables applications like RAG, NLP, computer vision, semantic search, and image search.
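As a rough sketch of what that looks like in practice, here’s how you might store and query the embedding from the example above using pgvector from Python. The connection string, table name, and use of psycopg2 are assumptions for illustration.

import psycopg2

# Hypothetical connection string; replace with your own database credentials
conn = psycopg2.connect("postgresql://user:password@localhost:5432/mydb")
cur = conn.cursor()

# Enable pgvector and create a table with a vector column (1,536 dims for text-embedding-3-small)
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("CREATE TABLE IF NOT EXISTS documents (id bigserial PRIMARY KEY, content text, embedding vector(1536))")

# Format the embedding as a pgvector literal, e.g. '[0.0023,-0.0118,...]'
vec_literal = "[" + ",".join(str(x) for x in embedding) + "]"

# Store the text alongside its embedding
cur.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
    ("How to create vector embeddings using OpenAI", vec_literal),
)
conn.commit()

# Find the five most similar documents to a query embedding (<=> is pgvector's cosine distance)
cur.execute(
    "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT 5",
    (vec_literal,),
)
print(cur.fetchall())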
You now know everything you need to know to get started with creating vector embeddings. In this article, we’ve talked about what vector embeddings really are, their types, how they work, and how neural networks create embeddings.
You now also know that applications involving LLMs, semantic search, and GenAI all rely on vector embeddings to power them, making it important to learn how to create, store, and query vector embeddings.
If you’re ready to get started, store vector embeddings on Timescale Cloud today. Sign up for a free trial.