Dec 20, 2024
Posted by Jacky Liang
When building a search or RAG application, you face a crucial decision—which embedding model should you use? The choice is no longer just between proprietary models like OpenAI and open-source alternatives. Now, you also need to consider domain-specific models trained for particular fields like finance, healthcare, or legal text. Once you've identified potential candidates, the testing process begins: you need to set up test infrastructure, generate embeddings for your data, run representative queries against each model, and compare accuracy and cost.
Who's got time for all of this when you're trying to ship your application? 🚀
Oh, and there's another challenge!
If you already have a working search system, testing new models often means either disrupting your production environment or building a separate testing setup. This frequently leads to postponing model evaluation, potentially missing out on significant accuracy improvements and cost savings—especially critical when dealing with specialized domain knowledge.
We wanted to find a straightforward way to evaluate different embedding models and understand their real-world performance on domain-specific text.
In this guide, we'll demonstrate this process using financial data as our example. We'll show you how to use pgai Vectorizer, an open-source tool for embedding creation and sync, to test different embedding models on your own data. We'll walk through our evaluation comparing a general-purpose model (OpenAI's text-embedding-3-small) against a finance-specialized model (Voyage AI's finance-2) using real financial statements (SEC filings) as test data and share a reusable process you can adapt for your domain.
Here's what we learned from our tests—and how you can run similar comparisons yourself.
In this embedding model evaluation, we will compare two models with different specializations: OpenAI's text-embedding-3-small, a widely used general-purpose embedding model, and Voyage AI's voyage-finance-2, a model trained specifically for financial text.
We chose these models because they represent an interesting comparison: OpenAI's model is widely used and considered an industry standard for general-purpose embeddings, while Voyage AI's finance-2 model is specifically trained on financial text and documentation.
Given this text chunk from an SEC filing: "The Company's adjusted EBITDA increased by 15% year-over-year, primarily driven by operational efficiencies and market expansion, while maintaining a healthy debt-to-equity ratio of 0.8."
We tested questions like the following: direct ones such as "What was the year-over-year growth in adjusted EBITDA?" and implied ones such as "What does the debt-to-equity ratio suggest about the company's financial health?"
Essentially, we are testing how different embedding models handle financial terminology and relationships. By taking SEC filing chunks and generating various types of questions—from direct metrics to implied financial health—we can see if the models understand not just explicit financial data but also financial context and implications, which is how analysts typically analyze companies.
Instead of building the test infrastructure from scratch, we'll use pgai Vectorizer to simplify the test massively. You can follow the pgai Vectorizer quick start guide to set up pgai in just a few minutes.
Using pgai Vectorizer saves you significant development time by handling embedding operations directly in PostgreSQL. Here's what it can do: it creates embeddings from your source data, keeps them in sync automatically as rows are added or changed, handles chunking for you, and lets you configure different embedding models declaratively in SQL.
Since pgai Vectorizer is built on PostgreSQL, you can use familiar SQL commands and integrate them with your existing database. Who needs bespoke vector databases?
First, let's create our SEC filings table and load the data into the database. We have a convenient function that loads datasets from Hugging Face directly into your PostgreSQL database:
CREATE TABLE sec_filings (
id SERIAL PRIMARY KEY,
text text
);
SELECT ai.load_dataset(
name => 'MemGPT/example-sec-filings',
table_name => 'sec_filings',
batch_size => 1000,
max_batches => 10,
if_table_exists => 'append'
);
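With batch_size set to 1,000 and max_batches set to 10, the load is capped at roughly 10,000 rows, which matches the dataset size used in the results below. A quick count confirms the load worked:

-- Sanity check: should report on the order of 10,000 rows
SELECT count(*) FROM sec_filings;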
Testing different embedding models is super simple with pgai Vectorizer. You simply create multiple vectorizers, each using a different model you want to evaluate. The vectorizer handles all the complexity of creating and managing embeddings for each model. Here's how we set up our two models to compare:
-- Set up OpenAI's general-purpose model
SELECT ai.create_vectorizer(
'sec_filings'::regclass,
destination => 'sec_filings_openai_embeddings',
embedding => ai.embedding_openai(
'text-embedding-3-small',
768
),
chunking => ai.chunking_recursive_character_text_splitter(
'text',
chunk_size => 512,
chunk_overlap => 50
)
);
-- Set up Voyage's finance-specialized model
SELECT ai.create_vectorizer(
'sec_filings'::regclass,
destination => 'sec_filings_voyage_embeddings',
embedding => ai.embedding_voyageai(
'voyage-finance-2',
1024
),
chunking => ai.chunking_recursive_character_text_splitter(
'text',
chunk_size => 512,
chunk_overlap => 50
)
);
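The vectorizers embed chunks in the background, so it can take a little while before every chunk has an embedding. Recent pgai versions expose a status view you can poll to see when each vectorizer has caught up (the exact columns depend on your pgai version):

-- Check progress for each vectorizer; pending items should drop to zero
SELECT * FROM ai.vectorizer_status;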
You can query the generated embeddings directly in the embedding view for each model:
SELECT * FROM sec_filings_voyage_embeddings LIMIT 5;
If you want to try it out, here’s the full API reference for pgai Vectorizer.
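Before running the full evaluation, you can also sanity-check retrieval with a one-off semantic search straight from SQL. The sketch below assumes your pgai version provides ai.openai_embed with a dimensions argument and that the embedding view exposes its default id, chunk_seq, chunk, and embedding columns; the query embedding must use the same 768 dimensions configured for the vectorizer:

-- A sketch of an ad-hoc search against the OpenAI embedding view
SELECT id, chunk_seq, chunk
FROM sec_filings_openai_embeddings
ORDER BY embedding <=> ai.openai_embed(
    'text-embedding-3-small',
    'What drove the increase in adjusted EBITDA?',
    dimensions => 768
)
LIMIT 5;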
This evaluation will focus on how well each model can find relevant text when given different types of questions. The methodology is as follows: we sample NUM_CHUNKS chunks from the SEC filings dataset, use an LLM to generate NUM_QUESTIONS_PER_CHUNK questions for each chunk, and then run every question as a vector search against each model's embeddings. A question counts as answered correctly if its source_chunk_id appears in the TOP_K results, and overall accuracy is the number of correct answers divided by (NUM_CHUNKS * NUM_QUESTIONS_PER_CHUNK).
The advantage of this evaluation method is its simplicity and the fact that you don't have to curate the ground truth manually. In practice, we've seen this method work well, but it does have its limitations (as all methods do): if the content in the dataset is too semantically similar, the questions generated by the large language model may not be specific enough to retrieve the chunk they were generated from, and the evaluation does not account for the rank of the answer within the top-k results, among other caveats. You should always spot-check any eval method on your particular dataset.
The full evaluation code is available on GitHub if you want to run your own tests on different embedding models. Here are the key highlights from our financial evaluation code.
Create financially-focused test questions:
from typing import List

import openai  # assumes OPENAI_API_KEY is set in the environment

def generate_questions(self, chunk: str, question_type: str, count: int, max_retries: int = 3) -> List[str]:
    prompts = {
        'short': "Generate {count} short but challenging finance-specific questions about this SEC filing text. Questions should be under 10 words but test deep understanding:",
        'long': "Generate {count} detailed questions that require analyzing financial metrics, trends, and implications from this SEC filing text:",
        'direct': "Generate {count} questions about specific financial data, numbers, or statements explicitly mentioned in this SEC filing:",
        'implied': "Generate {count} questions about potential business risks, market implications, or strategic insights that can be inferred from this SEC filing:",
        'unclear': "Generate {count} intentionally ambiguous questions about financial concepts or business implications that require careful analysis of this SEC filing:"
    }

    system_prompt = """You are an expert in financial analysis and SEC filings.
    Generate challenging, finance-specific questions that test deep understanding of financial concepts,
    business implications, and regulatory compliance. Questions should be difficult enough to
    challenge both general-purpose and finance-specialized language models."""

    prompt = prompts[question_type].format(count=count) + f"\n\nSEC Filing Text: {chunk}"

    questions = []
    for attempt in range(max_retries):
        # Raise the temperature slightly on each retry to encourage varied questions
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7 + (attempt * 0.1)
        )
        # Parse one question per non-empty line and stop once we have enough
        questions = [line.strip() for line in response.choices[0].message.content.split("\n") if line.strip()]
        if len(questions) >= count:
            return questions[:count]
    return questions
Evaluate how well each model understands financial context:
def step3_evaluate_models(self):
    """Test how well each model understands financial content"""
    print("Step 3: Evaluating models...")
    self.results = {}
    detailed_results = []

    for table in Config.EMBEDDING_TABLES:
        print(f"Testing {table}...")
        scores = []

        for q in self.questions_data:
            # For Voyage's finance model
            if 'voyage' in table:
                search_results = self.db.vector_search(
                    table,
                    q['question'],
                    Config.TOP_K
                )
            # For OpenAI's general model
            elif 'openai' in table:
                search_results = self.db.vector_search(
                    table,
                    q['question'],
                    Config.TOP_K
                )

            # A question counts as correct if its source chunk shows up
            # anywhere in the top-k search results
            found = any(
                r[0] == q['source_chunk_id'] and
                r[1] == q['source_chunk_seq']
                for r in search_results
            )
            scores.append(1 if found else 0)

            detailed_results.append({
                'model': table,
                'question': q['question'],
                'question_type': q['question_type'],
                'found_correct_chunk': found,
                'num_results': len(search_results)
            })

        # Calculate accuracy overall and by question type
        self.results[table] = {
            'overall_accuracy': sum(scores) / len(scores),
            'by_type': {
                q_type: sum(scores[i] for i, q in enumerate(self.questions_data)
                            if q['question_type'] == q_type) /
                        Config.QUESTION_DISTRIBUTION[q_type] /
                        Config.NUM_CHUNKS
                for q_type in Config.QUESTION_DISTRIBUTION.keys()
            }
        }
Our evaluation comparing the finance-specialized Voyage model against OpenAI's general-purpose model revealed significant differences in their ability to handle financial text. Testing with about 10,000 rows of SEC filings data, the Voyage finance-2 model achieved 54% overall accuracy, significantly outperforming OpenAI's text-embedding-3-small at 38.5%.
The gap was most dramatic in direct financial queries, where Voyage reached 63.75% accuracy compared to OpenAI's 40%. Even with ambiguous financial questions, the specialized model maintained its edge at 62.5% versus 48.75%. This suggests that domain-specific training substantially improves the handling of financial terminology and concepts.
Cost and processing times showed interesting trade-offs. While Voyage took a few minutes to process our test data, OpenAI completed the task in under a minute at a cost below $0.01 for roughly 190,000 tokens. However, for applications heavily focused on financial data, the specialized model's 15.5-percentage-point overall accuracy advantage justifies the slight additional resource investment.
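For context on the sub-cent figure: at text-embedding-3-small's list price of $0.02 per million tokens at the time of writing, 190,000 tokens works out to roughly 190,000 × 0.02 / 1,000,000 ≈ $0.004.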
These results indicate that your choice between general and finance-specialized models should consider your needs: document volume, search patterns, accuracy requirements, and cost constraints. While specialized models require more resources, they could provide significant value through improved financial search accuracy and a better understanding of complex financial relationships.
After reviewing the results, weigh the practical factors that matter for your own workload before committing to a model: how many documents you need to embed and how often they change, the mix of direct and exploratory queries you serve, the accuracy bar your users expect, and the processing time and per-token cost you can absorb.
Choosing between general and finance-specialized embedding models significantly impacts your application's effectiveness and costs. We used pgai Vectorizer to evaluate both approaches and provided a framework for making this decision.
Ready to test these models on your financial data? Install pgai Vectorizer and try it yourself. Handle all your embedding operations in PostgreSQL—no specialized databases needed.