Dec 19, 2024
Posted by Hervé Ishimwe
Proprietary embedding models like OpenAI's text-embedding-3-large and text-embedding-3-small are popular for retrieval-augmented generation (RAG) applications, but they come with added costs, third-party API dependencies, and potential data privacy concerns.
On the other hand, open-source embedding models provide a cost-effective and customizable alternative. By running these models locally, you can stop paying the OpenAI tax and regain complete control over the embedding creation process, enhance data privacy, and tailor the models to your needs.
However, evaluating open-source embedding models can be complex, time-consuming, and resource-intensive, causing many engineers to default to proprietary solutions.
This blog post will walk you through an easy-to-replicate workflow for comparing open-source embedding models using Ollama, an open-source platform for running large language models (LLMs) locally, and pgai Vectorizer, a PostgreSQL-based tool for automating embedding generation and management with a single SQL command. Paul Graham’s essays will be our evaluation dataset to demonstrate this workflow.
An evaluation workflow for comparing open-source embedding models typically includes the following steps:

- Prepare an evaluation dataset.
- Generate embeddings for it with each candidate model.
- Build a set of test queries.
- Measure how accurately each model retrieves the right content for those queries.
While this workflow may sound straightforward, implementing it can quickly become complex and resource-intensive: you have to download and manage multiple models with their own dependencies, build infrastructure to generate, store, and track embeddings for each of them, and keep everything in sync with your data throughout the evaluation.
Fear not—we can make this easier!
While the specifics of a robust evaluation pipeline may vary depending on your RAG application, you can significantly reduce the complexity with just two tools: Ollama for accessing and managing the embedding models and pgai Vectorizer for automating embedding generation and management across multiple models (we shared how to automate embedding generation in a previous article).
Want to follow along? Check out this GitHub repository for all the code used in this post.
Ollama makes running open-source models effortless by eliminating dependency and compatibility headaches. Simply download and run the model—no complex setup required. It works seamlessly across macOS, Linux, Windows, and Docker environments. In this evaluation, we are running Ollama within a Docker container.
Ollama simplifies model management by bundling a model’s configuration, data, and weights. This bundle makes cleanup and experimentation straightforward while ensuring full data ownership—you retain complete control over how your data is handled and where it flows.
Ollama provides access to state-of-the-art large language models. In this evaluation, we compared three of the most popular embedding models available on Ollama:
Embedding Model | Parameters | Dimensions | Size |
nomic-embed-text | 137 M | 768 | 274 MB |
mxbai-embed-large | 334 M | 1,024 | 670 MB |
bge-m3 | 567 M | 1,024 | 1.2 GB |
These open-source embedding models rival industry-standard proprietary embedding models like OpenAI's text-embedding-3-small and text-embedding-3-large.
Pgai Vectorizer eliminates the need to build complex automation infrastructure to generate and manage embeddings across multiple models. It is a powerful open-source tool designed to automate embedding creation and management directly in PostgreSQL, a widely adopted and robust database with vector capabilities via extensions like pgvector and pgai.
In this evaluation, we use PostgreSQL as our database to store the evaluation dataset and its corresponding embeddings.
What sets pgai Vectorizer apart for this use case is its integration with Ollama, allowing you to generate embeddings using any open-source model supported by Ollama.
To configure a vectorizer for each embedding model, just use one SQL command with all the configurations needed for your embeddings, as demonstrated below in the create_vectorizer
function. You can find more about these configurations in pgai Vectorizer’s API reference.
def create_vectorizer(embedding_model, embeddings_dimensions):
    # Name the embeddings view after the source table and model, e.g. essays_bge_m3_embeddings
    embeddings_view_name = f"essays_{embedding_model.replace('-', '_')}_embeddings"
    with connect_db() as conn:
        with conn.cursor() as cur:
            # One SQL call configures the destination view, the Ollama embedding model,
            # chunking, and chunk formatting for this vectorizer
            cur.execute("""
                SELECT ai.create_vectorizer(
                    'essays'::regclass,
                    destination => %s,
                    embedding => ai.embedding_ollama(%s, %s),
                    chunking => ai.chunking_recursive_character_text_splitter('text', 512, 50),
                    formatting => ai.formatting_python_template('title: $title $chunk')
                );
            """, (embeddings_view_name, embedding_model, embeddings_dimensions,))
From there, pgai Vectorizer handles all the heavy lifting: it chunks each essay, formats the chunks, generates embeddings with the configured Ollama model, and keeps them up to date as the source data changes.
Using this Docker compose file, you can quickly configure PostgreSQL, the pgai Vectorizer worker, and Ollama services in your Docker environment.
To get started, check out this quick start guide for more on pgai Vectorizer’s integration with Ollama.
After running your PostgreSQL service, install the pgai extension. Then, you can insert the evaluation dataset, Paul Graham's essays, using the pgai function load_dataset, which loads datasets from Hugging Face directly into your database!
with connect_db() as conn:
    with conn.cursor() as cur:
        # Load Paul Graham's essays dataset from Hugging Face into the 'essays' table
        cur.execute("""
            SELECT ai.load_dataset(
                'sgoel9/paul_graham_essays',
                table_name => 'essays',
                if_table_exists => 'append');
        """)
Let’s configure a vectorizer for each embedding model using the create_vectorizer
function!
EMBEDDING_MODELS = [
    {'name': 'mxbai-embed-large', 'dimensions': 1024},
    {'name': 'nomic-embed-text', 'dimensions': 768},
    {'name': 'bge-m3', 'dimensions': 1024},
]

for model in EMBEDDING_MODELS:
    create_vectorizer(model['name'], model['dimensions'])
Vectorizers enter the embedding generation queue in the order they were created. You can view the queue using the vectorizer_status function like this:
with connect_db() as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM ai.vectorizer_status;")
        for row in cur.fetchall():
            print(f"Vectorizer ID: {row[0]}, Embedding Table: {row[2]}, Pending Items: {row[4]}")
Rich embeddings (dense vector representations that capture a text's underlying meaning, relationships, and context) are essential for a RAG application to deliver accurate and relevant results. Our evaluation process focuses on two key aspects of embeddings: semantic understanding and contextual retrieval.
The evaluation pipeline consists of two main stages: test data generation and model evaluation.
We create a testing dataset by leveraging the text chunks created by the vectorizers during the embedding process. Here's the step-by-step breakdown of our approach:

- Sample text chunks from the chunks the vectorizers created.
- For each sampled chunk, generate questions of different types with a generative model (llama3.2 running on Ollama).
- Store each question alongside its parent chunk so retrieval can be scored later.
The questions were evenly distributed across the following five categories to mimic how humans ask questions. These questions allow us to simulate potential queries our RAG application would receive:
Question Type | Description | Evaluation Focus |
Short Questions | Simple, direct questions under 10 words | Semantic understanding (tests basic comprehension) |
Long Questions | Detailed, comprehensive questions with specific details | Contextual retrieval (tests handling of deeper, more detailed queries) |
Direct Questions | Questions that refer to explicit content in the text | Semantic understanding (tests retrieving exact matches) |
Implied Questions | Context-based questions requiring inference | Contextual retrieval (tests understanding meaning beyond the obvious) |
Unclear Questions | Vague or ambiguous questions | Tests both evaluation criteria (the model's ability to handle uncertainty and ambiguity) |
This function generates questions of a specific type for each text chunk.
def generate_questions_by_question_type(chunk, question_type, num_questions):
    # Prompt templates for each of the five question types
    prompts = {
        'short': "Generate {count} short, simple questions about this text. Questions should be direct, under 10 words",
        'long': "Generate {count} detailed, comprehensive questions about this text. Include specific details:",
        'direct': "Generate {count} questions that directly ask about explicit information in this text",
        'implied': "Generate {count} questions that require understanding context and implications of the text:",
        'unclear': "Generate {count} vague, ambiguous questions about the general topic of this text:"
    }
    prompt = prompts[question_type].format(count=num_questions) + f"\n\nText: {chunk}"
    system_instructions = """
    Generate different types of questions about the given text following the prompt provided.
    Each question must be on a new line. Do not include empty lines or blank questions.
    """
    with connect_db() as conn:
        with conn.cursor() as cur:
            # Generate the questions with llama3.2 through pgai's Ollama integration
            cur.execute("""
                SELECT ai.ollama_generate(
                    'llama3.2',
                    %s,
                    system_prompt=>%s,
                    host=>%s
                )->>'response';
            """, (prompt, system_instructions, OLLAMA_HOST))
            generated_questions = [q.strip() for q in cur.fetchone()[0].split("\n") if q.strip()]
            print(f"Number of questions generated for {question_type}: {len(generated_questions)}")
            return generated_questions
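For reference, this is roughly how the function could be driven to build the testing dataset. The sampled_chunks variable and the per-type question count are illustrative assumptions, not part of the original code:

QUESTION_TYPES = ['short', 'long', 'direct', 'implied', 'unclear']

testing_dataset = []
for chunk in sampled_chunks:  # text chunks sampled from the vectorizer-generated chunks (assumed)
    for question_type in QUESTION_TYPES:
        for question in generate_questions_by_question_type(chunk, question_type, num_questions=2):
            # Keep the parent chunk with each question so retrieval can be scored later
            testing_dataset.append({'question': question, 'chunk': chunk, 'type': question_type})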
Here are some of the key insights:
Here is an example:
A text chunk selected from Paul Graham’s How to Start a Startup? (March 2005):
I worried about how small and obscure we were. But in fact we were doing exactly the right thing. Once you get big (in users or employees) it gets hard to change your product. That year was effectively a laboratory for improving our software. By the end of it, we were so far ahead of our competitors that they never had a hope of catching up. And since all the hackers had spent many hours talking to users, we understood online commerce way better than anyone else.
Questions generated from the text chunk and used to test it:
We use vector similarity search to evaluate each embedding model's ability to retrieve the correct parent text chunk. The goal is to check whether the original text chunk appears among the top_K closest matches (the retrieval window) for each question in the testing dataset. Here are the steps involved:

- Embed each question using the embedding model under evaluation.
- Run a vector similarity search against that model's embeddings to find the top_K closest embeddings.
- If the question's parent text chunk appears in the top_K results, a score of 1 is recorded. Otherwise, a score of 0 is recorded.

Selecting the size of the retrieval window is often a balance between precision and recall. A smaller window can miss correct results ranked slightly lower, while a larger one can skew the overall accuracy. We chose 10 as our top_K because it strikes a balance: it's large enough to account for semantic overlap in textual data, where many chunks may have similar embeddings, yet small enough to maintain meaningful evaluation results.
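To make the scoring step concrete, here is a minimal sketch of how a single question could be scored, assuming pgai's ai.ollama_embed function and pgvector's <=> cosine-distance operator. The is_chunk_retrieved helper and the 'id' chunk-identifier column are illustrative assumptions; the view name is the one produced by create_vectorizer, and you should adapt the column names to your schema:

def is_chunk_retrieved(question, parent_chunk_id, embeddings_view, embedding_model, top_k=10):
    # Returns 1 if the question's parent chunk appears in the retrieval window, else 0.
    # Assumes the embeddings view exposes a chunk identifier ('id' here) and an 'embedding' column.
    with connect_db() as conn:
        with conn.cursor() as cur:
            cur.execute(f"""
                SELECT id
                FROM {embeddings_view}
                ORDER BY embedding <=> (SELECT ai.ollama_embed(%s, %s, host=>%s))
                LIMIT %s;
            """, (embedding_model, question, OLLAMA_HOST, top_k))
            retrieved_ids = [row[0] for row in cur.fetchall()]
    return 1 if parent_chunk_id in retrieved_ids else 0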
Our evaluation dataset, Paul Graham's essays, offered a diverse mix of short, direct, and contextually rich text. The data was split into 6,257 text chunks of 512 characters each, with a 50-character overlap. We completed this workflow using only open-source models (embedding and generative), completely cost-free, from generation to evaluation!
bge-m3 achieved the highest overall retrieval accuracy at 72%, significantly outperforming the other models. mxbai-embed-large followed with 59.25%, while nomic-embed-text ranked last with 57.25%.
While embedding models with higher dimensions (1,024) performed the best overall, the gap between bge-m3 and the other models across all question types is notable. This superior performance is likely due to bge-m3's multi-functionality, allowing it to efficiently handle diverse retrieval types such as dense, multi-vector, and sparse retrieval. This versatility enables better context comprehension, especially for long and implied questions.
bge-m3 particularly excelled at long questions, achieving its highest retrieval accuracy of 92.5%, showcasing its strong contextual understanding. Similarly, mxbai-embed-large performed well in this category, with an accuracy of 82.5%, further supporting the correlation between higher embedding dimensions and improved contextual capabilities. Interestingly, nomic-embed-text also achieved its best performance on long questions, suggesting that embedding models, like humans, handle detailed and context-rich queries more effectively.
On the other hand, despite the difference in embedding dimensions between mxbai-embed-large and nomic-embed-text, their performances were comparable across all question types. nomic-embed-text outperformed mxbai-embed-large on short and direct questions, achieving retrieval accuracies of 57.5% and 63.75%, respectively, showcasing its strength in handling simpler semantic queries.
While mxbai-embed-large performed better on context-heavy questions, such as long and implied ones, the gap in accuracy was not significant. This suggests that while embedding dimensions contribute to performance, they are not the sole determining factor when selecting the best embedding model for your RAG application.
Finally, all three models performed poorly on unclear and vague questions, recording their lowest accuracies, from 51.25% for bge-m3 down to 37.5% for nomic-embed-text.
Now that we have explored the evaluation results, how do you select the best open-source embedding model for your RAG application?
Fortunately, cost is not part of the conversation here, as all these models are free to use. Instead, your choice should depend on the following key considerations:
Will your queries be short and direct, or will they involve context-heavy and detailed questions?
This distinction helps determine the embedding dimensions you need. For example, models like bge-m3 excel at handling context-rich queries due to their higher embedding dimensions, while models like nomic-embed-text are better suited for short semantic queries.
While embedding dimensions are critical for performance, you must also consider the model size and whether it fits your available resources (e.g., storage, computing).
For instance, if you're constrained on storage but still need strong performance, mxbai-embed-large is a good option. It balances size and sophistication, outperforming smaller models like nomic-embed-text thanks to its higher dimensions.
Another factor to consider is how fast a model generates embeddings. While embedding generation is often done asynchronously, creating the impression of instant processing for users, working with models locally can introduce challenges: limited computational power on local machines can make it harder to deliver a quick and seamless user experience.
For instance:
- bge-m3 and mxbai-embed-large take longer to generate embeddings due to their higher dimensions and complexity. However, they produce richer, more context-aware embeddings.
- Models like nomic-embed-text generate embeddings much faster, but at the cost of reduced richness and depth.

When selecting an open-source embedding model, it is essential to evaluate whether embedding generation speed is critical for your use case and to balance it against the quality of embeddings you need.
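If you want a rough sense of that trade-off on your own hardware, a quick timing loop is easy to run. The sketch below times a single ai.ollama_embed call per model and is only meant as an illustration; the sample text is arbitrary:

import time

sample_text = "How do startups decide what to build first?"
for model in EMBEDDING_MODELS:
    start = time.perf_counter()
    with connect_db() as conn:
        with conn.cursor() as cur:
            # Time one embedding call per model through pgai's Ollama integration
            cur.execute("SELECT ai.ollama_embed(%s, %s, host=>%s);",
                        (model['name'], sample_text, OLLAMA_HOST))
            cur.fetchone()
    print(f"{model['name']}: {time.perf_counter() - start:.2f}s for one embedding")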
This blog post explored an evaluation workflow demonstrating that choosing the best open-source embedding model for your RAG application requires balancing performance, resources, and query types.
Thanks to Ollama and pgai Vectorizer, implementing this workflow was simple and efficient. Ollama simplified model access and management, while pgai Vectorizer automated embedding generation and storage in PostgreSQL, removing the need for complex infrastructure.
These tools make evaluating and comparing open-source models more straightforward than ever, empowering you to find the best open-source solution for your needs—cost-free and with complete control over your data.
To try this workflow with your own data and models, check out pgai Vectorizer’s documentation, the quick start guide with Ollama, and pgai’s GitHub repository!
To learn more, check out these blog posts on open-source LLMs and RAG applications with PostgreSQL: