Dec 18, 2024
Posted by Jacky Liang
When building a search or RAG (retrieval-augmented generation) application, you face a common challenge: which embedding model should you use? The most common choice is between proprietary models, like those from OpenAI, and open-source models, of which there are dozens to choose from. Once you've narrowed down candidates for testing, the actual testing process can also be very time-consuming: you need to generate embeddings for a representative sample of your data with each candidate model, stand up infrastructure to store and query those embeddings, write test queries that reflect how your users actually search, measure retrieval accuracy for every model, and keep track of the costs involved.
Who’s got time for all of this?
Oh, and there’s another problem!
If you already have a working search application, testing new models often means disrupting your existing setup or building a separate testing environment. Many times, this leads to testing and experimentation taking a backseat in priority, leaving potential accuracy gains and cost savings on the table.
We wanted to find a simpler way to evaluate different embedding models and understand their real-world performance. In this guide, we'll show you how to use pgai Vectorizer, an open-source tool for embedding creation and sync, to test different embedding models on your own data. We'll walk through our evaluation of four popular models (both open-source and OpenAI) using Paul Graham's essays as test data and share a reusable process you can adapt for your own model evaluations.
Here's what we learned from our tests—and how you can run similar comparisons yourself.
In this embedding model evaluation, we will compare four embedding models: OpenAI's text-embedding-3-small and text-embedding-3-large, plus two open-source models, BGE large and nomic-embed-text.
We chose these models because they represent both popular closed-source and open-source options for embeddings. OpenAI's models are widely used and considered industry standard, while BGE large and nomic-embed-text are leading open-source alternatives that can run locally.
Given this text chunk from Paul Graham's essay "Cities and Ambition":
"Cambridge as a result feels like a town whose main industry is ideas, while New York's is finance and Silicon Valley's is startups."
We tested questions like the following: a short, direct one ("What is Cambridge's main industry?"), a longer, more detailed one ("How does the main industry of Cambridge compare with those of New York and Silicon Valley?"), and a vague one ("What are different cities known for?").
Essentially, we are testing how different embedding models handle different ways of asking for the same information. By taking text chunks and generating various types of questions about them—from direct to paraphrased questions—we can see if the model understands not just exact matches but also meaning and context, which is typically how humans ask questions.
Instead of building the test infrastructure from scratch, we will use pgai Vectorizer to simplify the test massively. You can follow the pgai Vectorizer quick start guide to set up pgai with Ollama in just a few minutes.
Using pgai Vectorizer saves you significant development time by handling embedding operations directly in PostgreSQL: it automatically creates embeddings for your data and keeps them in sync as the underlying rows change, takes care of chunking and formatting, and works with multiple embedding providers such as OpenAI and Ollama.
Since pgai Vectorizer is built on PostgreSQL, you can use familiar SQL commands and integrate them with your existing database. Who needs bespoke vector databases?
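If your database ships with the pgai extension (the quick start's Docker image does), enabling it is a single command. This is a sketch assuming the extension-based install from the quick start; adjust if you installed pgai differently:

CREATE EXTENSION IF NOT EXISTS ai CASCADE;

The CASCADE clause also installs the extensions pgai depends on, such as pgvector.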
First, let's load Paul Graham's essays into the database. We have a convenient function that loads datasets from Hugging Face directly into your PostgreSQL database.
SELECT ai.load_dataset('sgoel9/paul_graham_essays');
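Once the load finishes, the essays are ordinary rows in a regular table (referenced as pg_essays by the vectorizers below), so a quick sanity check is plain SQL:

-- Confirm the essays loaded and peek at a few rows
SELECT count(*) AS essay_count FROM pg_essays;

SELECT title, date
FROM pg_essays
LIMIT 3;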
Testing different embedding models is super simple with pgai Vectorizer.
You simply create multiple vectorizers, each using a different model you want to evaluate. The vectorizer handles all the complexity of creating and managing embeddings for each model. Here's how we set up two of the models (nomic-embed-text and text-embedding-3-small, both small embedding models) for comparison:
-- Set up Nomic embed-text
SELECT ai.create_vectorizer(
'pg_essays'::regclass,
destination => 'essays_nomic_embeddings',
embedding => ai.embedding_ollama('nomic-embed-text', 768),
chunking => ai.chunking_recursive_character_text_splitter('text', 512, 50)
);
-- Set up OpenAI's small embedding model
SELECT ai.create_vectorizer(
'pg_essays'::regclass,
destination => 'essays_openai_small_embeddings',
embedding => ai.embedding_openai('text-embedding-3-small', 768),
chunking => ai.chunking_recursive_character_text_splitter('text', 512, 50)
);
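For completeness, the other two models in our comparison can be wired up the same way. Here's a sketch; the Ollama model tag (bge-large) and the dimension counts (1024 and 3072) are assumptions based on the models' published sizes, so adjust them to match your setup:

-- Set up BGE large via Ollama (model tag and dimensions assumed)
SELECT ai.create_vectorizer(
'pg_essays'::regclass,
destination => 'essays_bge_large_embeddings',
embedding => ai.embedding_ollama('bge-large', 1024),
chunking => ai.chunking_recursive_character_text_splitter('text', 512, 50)
);

-- Set up OpenAI's large embedding model
SELECT ai.create_vectorizer(
'pg_essays'::regclass,
destination => 'essays_openai_large_embeddings',
embedding => ai.embedding_openai('text-embedding-3-large', 3072),
chunking => ai.chunking_recursive_character_text_splitter('text', 512, 50)
);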
You can query the generated embeddings directly in the embedding view for each model.
SELECT id, title, date, chunk, embedding
FROM essays_nomic_embeddings
LIMIT 5;
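To eyeball retrieval quality before running a full evaluation, you can issue a semantic search straight from SQL. A minimal sketch, assuming the nomic vectorizer above has finished processing and that pgai's ai.ollama_embed function can reach your Ollama instance from the database:

-- Find the five chunks closest to a natural-language query
SELECT chunk,
embedding <=> ai.ollama_embed('nomic-embed-text', 'Which city has ideas as its main industry?') AS distance
FROM essays_nomic_embeddings
ORDER BY distance
LIMIT 5;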
Pgai Vectorizer also comes with a number of useful helpers, such as ai.vectorizer_status for monitoring all of your vectorizers. It's a good way to get an overview of every vectorizer you've created and the progress of its embedding tasks.
SELECT * FROM ai.vectorizer_status;
If you want to try it out, here’s the full API reference for pgai Vectorizer.
This evaluation will focus on how well each model can find the relevant text when given different types of questions. The methodology is as follows: we sample a set of chunks from the embedded essays, use an LLM to generate several types of questions about each chunk, and then run every question as a vector search against each model's embedding view. A question counts as a hit if its source_chunk_id appears in the TOP_K results, and each model's accuracy is the number of hits divided by the total number of questions (NUM_CHUNKS * NUM_QUESTIONS_PER_CHUNK).
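As a concrete illustration of what counts as a hit, here's roughly what the check looks like for a single question in SQL. Treat this as a sketch rather than the exact evaluation query: the id and chunk_seq values are hypothetical, and the real harness batches this across all questions and models:

-- Did the question's source chunk land in the top 10 (TOP_K) results?
SELECT EXISTS (
SELECT 1
FROM (
SELECT id, chunk_seq
FROM essays_nomic_embeddings
ORDER BY embedding <=> ai.ollama_embed('nomic-embed-text', 'Which city has ideas as its main industry?')
LIMIT 10
) top_k
WHERE top_k.id = 42 -- hypothetical source_chunk_id
AND top_k.chunk_seq = 3 -- hypothetical source_chunk_seq
) AS found;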
The advantage of this evaluation method is its simplicity and the fact that you don't have to curate ground truth manually. In practice, we've seen it work well, but it does have its limitations (as all methods do): if the content in the dataset is too semantically similar, the questions generated by the LLM may not be specific enough to retrieve the exact chunk they were generated from, and the evaluation doesn't account for where the correct chunk ranks within the top-k results. You should always spot-check any eval method on your particular dataset. /rant
The full evaluation code is available on GitHub if you want to run your own tests on different embedding models or tweak the evaluation parameters. If you're a video person, we've got you covered, too.
Here are some key highlights from the evaluation code.
Create questions for testing model understanding:
def generate_questions(self, chunk: str, question_type: str, count: int, max_retries: int = 3) -> List[str]:
    prompts = {
        'short': "Generate {count} short, simple questions about this text. Questions should be direct and under 10 words:",
        'long': "Generate {count} detailed, comprehensive questions about this text. Include specific details:",
        'direct': "Generate {count} questions that directly ask about explicit information in this text:",
        'implied': "Generate {count} questions that require understanding context and implications of the text:",
        'unclear': "Generate {count} vague, ambiguous questions about the general topic of this text:"
    }
    prompt = prompts[question_type].format(count=count) + f"\n\nText: {chunk}"

    questions = []
    for attempt in range(max_retries):
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Generate different types of questions about the given text. Each question must be on a new line. Do not include empty lines or blank questions."},
                {"role": "user", "content": prompt}
            ],
        )
        # Split the response into one question per line, dropping any blanks
        questions = [q.strip() for q in response.choices[0].message.content.split("\n") if q.strip()]
        if questions:
            break
    return questions[:count]
Evaluate how well each model finds relevant content:
def step3_evaluate_models(self):
    """Test how well each model finds the relevant chunks"""
    for table in Config.EMBEDDING_TABLES:
        scores = []
        for q in self.questions_data:
            search_results = self.db.vector_search(
                table,
                q['question'],
                Config.TOP_K
            )
            # Check if the source chunk appears anywhere in the top-K results
            found = any(
                r[0] == q['source_chunk_id'] and
                r[1] == q['source_chunk_seq']
                for r in search_results
            )
            scores.append(1 if found else 0)
        # Accuracy for this model: fraction of questions whose source chunk was retrieved
        accuracy = sum(scores) / len(scores) if scores else 0.0
        print(f"{table}: {accuracy:.1%}")
OpenAI's large model performed best overall with 80.5% accuracy, while their smaller model achieved 75.8%. The open-source models were competitive: BGE large reached 71.5% accuracy, and nomic-embed-text hit 71%.
For our test data of 215 essays, our vectorizer created 8,426 chunks using a 512-character chunk size and a 50-character overlap. The embedding costs were reasonable: $0.03 for text-embedding-3-small and $0.15 for text-embedding-3-large. The open-source models had no API costs, only the compute resources and time needed to run them locally.
All models handled detailed questions surprisingly (or perhaps unsurprisingly) well. Even the open-source options achieved around 90% accuracy, with OpenAI's large model reaching a stunning 97.5%. It appears that when users provide more context in their queries, the models do better at finding relevant information.
However, the biggest differences showed up in the questions requiring contextual understanding. OpenAI's large model reached 88.8% accuracy here, while the other models stayed around 75-78%. We suspect the extra dimensions in the larger model help it capture more subtle relationships in the text; the open-source models still have some catching up to do on these context-dependent queries.
Vague questions were challenging for every model, with accuracy between 42% and 57%. This isn't surprising: even humans struggle with ambiguous queries.
After reviewing the results, we noticed some critical points to consider around model choice, chunking strategy, and input data quality:

Context matters a lot: detailed, context-rich questions were retrieved far more reliably than vague ones across every model, so anything that encourages users (or your query pipeline) to be specific pays off.

Size vs. performance trade-off: text-embedding-3-large was the most accurate, especially on questions requiring contextual understanding, but its higher dimensionality means larger embeddings to store and search, and higher embedding costs than the smaller models.

Cost considerations: the API-based models carry a per-token cost (modest for this dataset, $0.03 to $0.15), while the open-source models are free to use but need local compute resources and time.
Before choosing a model, consider the following:
Cost constraints: OpenAI charges about $0.020 per 1M tokens for text-embedding-3-small and $0.130 per 1M tokens for text-embedding-3-large, while open-source models have no usage fees but require compute resources and time.
Choosing the right embedding model can make or break your AI application because of its implications for cost and efficiency. In this blog post, we used pgai Vectorizer to evaluate four embedding models and shared a reusable process you can use to test other models you may be interested in.
If you want to try this process on your own dataset, install pgai Vectorizer and take it out for a spin. While you’re at it, start saving time by using it to handle embedding operations without leaving PostgreSQL—no specialized databases required.