Nov 15, 2024
Search technologies are the backbone of modern data retrieval systems. Over the years, they have evolved from simple keyword matching to semantic search algorithms that capture context and user intent.
However, you often encounter situations where a single search method is insufficient. Imagine the following search query: "Recent studies on the impact of climate change on Arctic wildlife." The keyword match method would retrieve documents with specific words like "climate change" and "Arctic wildlife," whereas semantic search can help identify relevant studies even without exact keywords, such as articles on "climate change effects on polar bears."
Therefore, the new approach is to combine full-text search for precise retrieval with semantic search for higher relevance. This paradigm is known as hybrid search, and it offers the best of both worlds. In this article, we'll explain how to implement hybrid search in PostgreSQL using Cohere, pgvector, and pgai (and a few more Timescale innovations). This will allow you to build a fast, relevant, and accurate search system using your existing PostgreSQL database.
Before diving into the implementation, let's take a look at how full-text search differs from semantic search.
Full-text search, or keyword search, finds precise matches for keywords in a query. Full-text search algorithms index text documents by breaking them into terms and creating an inverted index that maps terms to their locations in the documents. During the search, the algorithm retrieves relevant documents by matching query terms against the index and ranks the results using algorithms like TF-IDF or BM25.
Which industries use full-text search? E-commerce sites use it to help customers find products; stock market applications use it to help users find ticker symbols; news websites use it to surface relevant articles; healthcare systems use it to find patient records.
Despite its utility, full-text search has certain limitations. Since it does not capture user intent, it often leads to irrelevant results. For example, a full-text search with the query “apple phone” might retrieve results about the fruit apple and even Android phones.
Semantic search goes beyond just matching keywords; it understands the meaning of words and how they relate to each other through vectors. Vectors are numerical representations generated by AI models that capture the contextual meaning and relationships between words. This allows semantic search to find relevant results even when the exact keywords aren't present.
Semantic search has applications in a wide range of domains. You can use it to build product search and recommendation systems by finding relevant products based on user queries. You can leverage it for customer support by using it to accurately interpret user queries. In healthcare, you can use it to match patient queries with relevant medical information.
In many scenarios, semantic search is a better choice than keyword search. Since it can capture the meaning behind queries, it can understand related terms or even related concepts across languages. This often leads to more relevant search results.
If semantic search performs better than full-text search in understanding user queries, why would you need the latter? Most domains have specific vocabulary, abbreviations, annotations, symbols, or names that semantic models may not capture. A keyword search ensures that these specific terms aren’t missed.
So, whenever you need to combine exact matches with relevant results, using hybrid search will yield better outcomes.
Let’s take an example from finance. Using hybrid search, a query like “AAPL news and predictions” will retrieve documents containing the term “AAPL,” along with other documents related to Apple stock market news.
Hybrid search enables you to combine the precision of keyword search with the contextual understanding of vector search. This ensures results are both precise and contextually relevant. It also adapts to various query types, from specific technical terms to broader searches.
Due to its flexibility, hybrid search can be used in a range of industries:
Let’s take a look at the steps to implement hybrid search. We will use the Cohere embedding model embed-english-v3.0
and leverage the pgvector and pgai extensions to bring AI capabilities to the PostgreSQL database.
embed-english-v3.0
: A text embedding model from Cohere designed to generate high-quality vector embeddings from text.
pgai
: Enables integration with AI models and allows you to perform AI-powered queries within the PostgreSQL database. While pgai enables you to perform complex AI operations (including text embedding), with pgai Vectorizer, you can also automate the process of creating and managing the embeddings without leaving your PostgreSQL database environment.
pgvector
: Allows you to store and query vector embeddings, enabling efficient semantic searches on your data. In this tutorial, we use pgvectorscale which adds query performance and scalability to pgvector’s functionality, supercharging your PostgreSQL database to handle large datasets with ease.
Before proceeding further, you will need an API key from Cohere. Save the API key as an environment variable.
You will also need a working setup of PostgreSQL along with the extensions pgai and pgvector. You can install these manually, use a pre-built Docker container, or simply use Timescale Cloud, which comes preinstalled with pgai and pgvector.
Install the required libraries…
!pip install pandas pyarrow pgvector cohere psycopg2-binary
…and then import them.
import pandas as pd
import cohere
import psycopg2
import os
from ast import literal_eval
Next, initialize and set up the PostgreSQL database connection.
conn = psycopg2.connect("postgres://<username>:<password>@<host>:<port>/<database>")
We also need to enable the extensions.
with conn:
with conn.cursor() as cursor:
cursor.execute("CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;")
cursor.execute("CREATE EXTENSION IF NOT EXISTS ai CASCADE;")
To use Cohere within psql
, you should do the following:
export COHERE_API_KEY=YOUR_API_KEY
PGOPTIONS="-c ai.cohere_api_key=$COHERE_API_KEY"
psql -d "postgres://postgres:password@localhost/postgres"
We are using the Cohere/Movies dataset from Hugging Face. Download the dataset and create a CSV file for convenience, using the first one hundred rows from the dataset.
# Load the dataset
df = pd.read_parquet("hf://datasets/Cohere/movies/movies.parquet")
df = df.head(100)
df.to_csv('movies.csv', index=False)
The dataset has the following fields for each ‘movie’:
Field | Description |
title | A string containing the title of the movie |
overview | A string containing a brief description of the movie |
genres | A string containing a comma-separated list of genres the movie belongs to |
producer | A string containing a comma-separated list of producers |
cast | A string containing a comma-separated list of cast |
For our hybrid search example, we will use a combination of the title and overview fields.
We will now create the table, populate it with data, and then generate embeddings using the cohere_embed
function and embed-english-v3.0
embedding model.
The first two steps are straightforward: we will use SQL queries to create the “movies” table with an embedding column to store the vector embeddings. We then populate the table from our CSV file.
Finally, we generate the embeddings
. To do this, we will use the pgai function cohere_embed
. With just one SQL query, pgai allows us to generate embeddings and store them in the embedding column.
COHERE_API_KEY = os.getenv("COHERE_API_KEY")
def ingest_data(conn):
with conn:
with conn.cursor() as cur:
# Create the table
cur.execute("""
CREATE TABLE IF NOT EXISTS movies (
title TEXT NOT NULL,
overview TEXT,
genres TEXT,
producer TEXT,
"cast" TEXT,
embedding VECTOR(1024)
)
""")
# Copy data from CSV file
with open('movies.csv', 'r') as f:
cur.copy_expert('COPY movies(title, overview, genres, producer, "cast") FROM STDIN WITH CSV HEADER', f)
with conn.cursor() as cursor:
# Use cohere_embed to create embeddings
cursor.execute("""
UPDATE movies
SET embedding = ai.cohere_embed(
'embed-english-v3.0'
, CONCAT_WS('. ',
title,
COALESCE(overview, '')
)
, input_type=>'search_document'
, api_key=>%s
) where embedding is null;
""", (COHERE_API_KEY, ))
ingest_data(conn)
At this stage, you should have a table of data with vector embeddings generated from the title and overview columns.
Next, we will implement keyword search.
Traditional search operators like ~
, ~*
, LIKE
, and ILIKE
, were simplistic: they couldn’t handle derivatives of words, didn’t rank results, and lacked indexing support. To overcome these limitations, PostgreSQL introduced the datatypes tsquery
and tsvector
for more advanced text search capabilities. We will use these to perform our keyword search:
def keyword_search(conn, query):
sql = """
SELECT title, overview
FROM movies, plainto_tsquery('english', %s) query
WHERE to_tsvector('english', title || ' ' || COALESCE(overview, '')) @@ query
ORDER BY ts_rank_cd(to_tsvector('english', title || ' ' || COALESCE(overview, '')), query) DESC
LIMIT 5
"""
with conn:
with conn.cursor() as cur:
cur.execute(sql, (query,))
return cur.fetchall()
Here’s a breakdown of the SQL statement:
plainto_tsquery
converts plain text into a tsquery (text search query).to_tsvector('english', ...)
converts the concatenated title and overview into a tsvector (text search vector).@@
is the text search match operator.ts_rank_cd()
is a function that ranks the matches and arranges the results in descending order.You can now test the function by passing a query.
query = "The Avengers"
results = keyword_search(conn, query)
The results will contain exact keyword matches from the “movies” table.
[('Avengers: Age of Ultron',
'When Tony Stark tries to jumpstart a dormant peacekeeping program, things go awry and Earth's Mightiest Heroes are put to the ultimate test as the fate of the planet hangs in the balance. As the villainous Ultron emerges, it is up to The Avengers to stop him from enacting his terrible plans, and soon uneasy alliances and unexpected action pave the way for an epic and unique global adventure.'),
('The Avengers',
'When an unexpected enemy emerges and threatens global safety and security, Nick Fury, director of the international peacekeeping agency known as S.H.I.E.L.D., finds himself in need of a team to pull the world back from the brink of disaster. Spanning the globe, a daring recruitment effort begins!'),
('Captain America: Civil War',
'Following the events of Age of Ultron, the collective governments of the world pass an act designed to regulate all superhuman activity. This polarizes opinion amongst the Avengers, causing two factions to side with Iron Man or Captain America, which causes an epic battle between former allies.'),
('Captain America: The Winter Soldier',
'After the cataclysmic events in New York with The Avengers, Steve Rogers, aka Captain America is living quietly in Washington, D.C. and trying to adjust to the modern world. But when a S.H.I.E.L.D. colleague comes under attack, Steve becomes embroiled in a web of intrigue that threatens to put the world at risk. Joining forces with the Black Widow, Captain America struggles to expose the ever-widening conspiracy while fighting off professional assassins sent to silence him at every turn. When the full scope of the villainous plot is revealed, Captain America and the Black Widow enlist the help of a new ally, the Falcon. However, they soon find themselves up against an unexpected and formidable enemy--the Winter Soldier.')]
Next, let’s build the semantic search capability. In semantic search, we first convert the query into an embedding using the cohere_embed function and the same embedding model we used before. Then, we use a distance metric (or distance function) to find embeddings from our “movies” table that are closest to our query embedding. The smaller the distance between a vector and our query vector, the higher the similarity.
pgvector supports several distance metrics.
Operator | Distance Metric |
<-> | L2 Distance |
<=> | Cosine Similarity |
<#> | Negative Inner Product |
<+> | L1 Distance |
We will use the negative inner product in our SQL query, represented by the operator <=>
.
# Semantic search function
def semantic_search(conn, query):
sql = """
WITH query_embedding AS (
SELECT cohere_embed(
'embed-english-v3.0'
, %s
, _input_type=>'search_query'
, _api_key=>%s
) AS embedding
)
SELECT title, overview
FROM movies, query_embedding
ORDER BY movies.embedding <=> query_embedding.embedding
LIMIT 5
"""
with conn:
with conn.cursor() as cur:
cur.execute(sql, (query, COHERE_API_KEY,))
return cur.fetchall()
You can test the semantic search function by passing a query.
query = "Exciting space adventure movies"
results = semantic_search(conn, query)
Next, let’s combine keyword search and semantic search to create our hybrid search function.
[('Interstellar',
'Interstellar chronicles the adventures of a group of explorers who make use of a newly discovered wormhole to surpass the limitations on human space travel and conquer the vast distances involved in an interstellar voyage.'),
('Transformers: Dark of the Moon',
"Sam Witwicky takes his first tenuous steps into adulthood while remaining a reluctant human ally of Autobot-leader Optimus Prime. The film centers around the space race between the USSR and the USA, suggesting there was a hidden Transformers role in it all that remains one of the planet's most dangerous secrets."),
('Star Wars',
'Princess Leia is captured and held hostage by the evil Imperial forces in their effort to take over the galactic Empire. Venturesome Luke Skywalker and dashing captain Han Solo team together with the loveable robot duo R2-D2 and C-3PO to rescue the beautiful princess and restore peace and justice in the Empire.'),
('Avatar',
'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'),
('The Martian',
'During a manned mission to Mars, Astronaut Mark Watney is presumed dead after a fierce storm and left behind by his crew. But Watney has survived and finds himself stranded and alone on the hostile planet. With only meager supplies, he must draw upon his ingenuity, wit and spirit to subsist and find a way to signal to Earth that he is alive.')]
We’ll combine the results of both search functions to implement hybrid search. To sort the results by relevance, we’ll use a Cohere reranker model, rerank-english-v3.0.
Reranker models reorder the results by calculating a relevance score between the query and the data objects, then sorting them from most to least relevant. Since reranking is computationally expensive, we typically use it at the final stage after hybrid search has retrieved an initial set of results.
def hybrid_search(conn, query):
sem_results = semantic_search(conn, query)
key_results = keyword_search(conn, query)
# Combine results and remove duplicates
combined_results = list({(title, overview) for title, overview in sem_results + key_results})
# Prepare documents for reranking
documents = [f"{title}. {overview}" for title, overview in combined_results]
# Prepare the JSON array of documents
documents_json = json.dumps([{"text": doc} for doc in documents])
# SQL query for reranking using pgai
sql = """
SELECT
x."index",
x.document->>'text' as "text",
x.relevance_score
FROM jsonb_to_recordset(
ai.cohere_rerank(
'rerank-english-v3.0',
%s,
%s::jsonb,
return_documents => true,
api_key=>%s
)->'results'
) AS x("index" int, "document" jsonb, relevance_score float8)
ORDER BY relevance_score DESC
LIMIT 5;
"""
reranked_results = []
with conn.cursor() as cur:
cur.execute(sql, (query, documents_json, COHERE_API_KEY,))
reranked_results = cur.fetchall()
# Map reranked results back to original data
final_results = []
for result in reranked_results:
index = documents.index(result[1])
final_results.append(combined_results[index])
return final_results
The function cohere_rerank
uses a language model to assess the relevance of each document to the query and then reorders the documents based on their relevance.
We can now use our hybrid search function to retrieve documents that leverage both keyword matching and semantic search.
query = "Space adventure movies with advanced alien technology"
results = hybrid_search(conn, query)
[('Interstellar',
'Interstellar chronicles the adventures of a group of explorers who make use of a newly discovered wormhole to surpass the limitations on human space travel and conquer the vast distances involved in an interstellar voyage.'),
('Avatar',
'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'),
('Transformers: Dark of the Moon',
"Sam Witwicky takes his first tenuous steps into adulthood while remaining a reluctant human ally of Autobot-leader Optimus Prime. The film centers around the space race between the USSR and the USA, suggesting there was a hidden Transformers role in it all that remains one of the planet's most dangerous secrets."),
('Indiana Jones and the Kingdom of the Crystal Skull',
"Set during the Cold War, the Soviets - led by sword-wielding Irina Spalko - are in search of a crystal skull which has supernatural powers related to a mystical Lost City of Gold. After being captured and then escaping from them, Indy is coerced to head to Peru at the behest of a young man whose friend - and Indy's colleague - Professor Oxley has been captured for his knowledge of the skull's whereabouts."),
('Mission: Impossible - Ghost Protocol',
'In the 4th installment of the Mission Impossible series, Ethan Hunt (Cruise) and his team are racing against time to track down a dangerous terrorist named Hendricks (Nyqvist), who has gained access to Russian nuclear launch codes and is planning a strike on the United States. An attempt to stop him ends in an explosion causing severe destruction to the Kremlin and the IMF to be implicated in the bombing, forcing the President to disavow them. No longer being aided by the government, Ethan and his team chase Hendricks around the globe, although they might still be too late to stop a disaster.')]
This completes our implementation of hybrid search in PostgreSQL. You can use this methodology to build search systems, Retrieval-Augmented Generation (RAG), or other applications that need to use precise retrieval.
In the demo above, we implemented hybrid search in PostgreSQL. For your specific use case, you might want to optimize the approach further using one of the following methods:
In this article, we discussed full-text search and semantic search and explained how they differ. We demonstrated how to combine them to implement a hybrid search that leverages the benefits of both approaches to deliver more precise and contextually relevant results.
In our implementation, we used:
Together, these technologies helped us create a powerful search engine capable of handling complex queries and improving search quality. To get started with building your AI application today with PostgreSQL, pgai, and pgvector, create a free Timescale account.
Timescale’s PostgreSQL open-source stack, including pgai, pgai Vectorizer, and pgvectorscale, enables you to build AI applications while simplifying your workflows and supercharging your performance. You can access this simplified stack directly from Timescale’s PostgreSQL cloud platform or get installation instructions from each of the GitHub repositories (pgai, pgai Vectorizer, and pgvectorscale). And don’t forget to leave us a GitHub ⭐!
In case you missed the pgai Vectorizer launch, learn why vector databases are the wrong abstraction and how you can create AI embeddings right within your PostgreSQL database.