Category: All posts
Nov 22, 2024
Posted by
Haziqa Sajid
Google was founded more than 25 years ago and has set high standards for search engines. In the early days, search engines were heavily dependent on keyword searching, but as time passed, the algorithms became more powerful to understand the context behind the words.
Nowadays, search engines use hybrid search (a combination of multiple search algorithms) to perform search queries. This article is a take on creating one by ourselves. Afterward, using the power of large language models or LLMs, we'll leverage RAG (retrieval-augmented generation) to have an LLM answer questions using our retrieved results.
In this article, we'll combine keyword and semantic search to achieve better search results. Keyword search matches words directly or to synonyms, while semantic search seeks to match the meaning behind the words in the query. We'll leverage Cohere and pgvector to implement hybrid search on PostgreSQL and see how they make our work easier.
Hybrid search merges different search methods to enhance result quality. This article focuses on keyword and semantic search to create a hybrid search engine that combines the best of both worlds. Here are the benefits:
Precision | Context | Relevance | |
Keyword | Matches words exactly | Ignores context | Results contain exact phrase |
Semantic | Matches context and meaning | Favors context over keyword | Results contain similar meanings |
Hybrid | Ranks word matches by context relevance | Incorporates context as a factor | Results contain exact phrases and relevant meanings |
This section will implement a hybrid search engine leveraging Cohere and pgvector on Timescale Cloud. Before that, let's build an overall understanding of the architecture and concepts required. Here’s the schematic diagram:
The architecture comprises five steps:
1. Documents: The process starts with a collection of documents.
2. Embedding generation:
a. Semantic search using Cohere: Documents are processed using Cohere to create dense embeddings, which capture the semantic meaning of the text.
b. Keyword search using PostgreSQL: Documents are processed using PostgreSQL to create sparse embeddings focusing on the text's specific keywords.
3. Storage: Both dense and sparse embeddings are stored on Timescale’s PostgreSQL.
4. Retrieval and reranking: The system retrieves results from the Timescale database in response to a query. The results are then reranked using Cohere to prioritize the most relevant documents.
5. Results generation: The reranked results are compiled into a final list, generating a ranking of the most relevant documents for the query.
The implementation detailed in the following sections has numerous applications. One such application, discussed later in the article, demonstrates the development of advanced RAG systems.
Now that we have a cursory look at the process, we can implement our components. Let’s start with setup.
Note: You will find the code below in our Colab notebook.
In this section, we'll install and import the necessary libraries for hybrid search and show you how to set up Cohere. It offers advanced LLMs and tailored RAG capabilities for enterprise use cases.
In our example, we'll leverage Cohere embedding models for semantic search. To get started, head to Cohere and sign up. After that, generate your API key, which you will need to access the embedding model, as seen below:
You will need the following libraries for our hybrid search.
!pip install -q datasets psycopg2 pgvector cohere
import cohere
import asyncio
import psycopg2
import itertools
import numpy as np
Let’s also create our Cohere client.
co = cohere.Client('Your API Key')
The CNN-DailyMail Dataset consists of over 300,000 unique English-language news articles. It supports both extractive and abstractive summarization. It was originally designed for machine reading, comprehension, and abstractive question answering. The data fields are as follows:
Fields | Description |
id | A string containing the hexadecimal SHA1 hash of the URL where the story was retrieved. |
article | A string containing the body of the news article. |
highlights | A string containing the article's highlights, written by the author. |
We'll utilize the article field for our hybrid search engine. Given that the training set includes over 280,000 articles, we will use a smaller subset of approximately 1,000 articles for this demonstration.
Let’s load the dataset from Huggingface🤗
using the datasets
library and split it on the training set as shown below:
from datasets import load_dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")
content = dataset["train"]
Now, we'll use datasets
functions like shuffle
and select
to randomize the thousand selected articles.
content = content.shuffle(seed=42).select(range(0,1000))
After subsetting the dataset, let’s connect to our database.
A critical component for a hybrid search application is a vector database, which enables querying indexed documents to retrieve the most relevant ones for providing context to the search. In this tutorial, I’ll use Timescale Cloud, a cloud PostgreSQL platform tailored for AI applications, to host the PostgreSQL database that stores the sparse embeddings for our keyword search feature.
To start, sign up, create a new database, and follow the provided instructions. For more information, refer to this guide.
After signing up, connect to the Timescale database by providing the service URI, which can be found under the service section on the dashboard. The URI will look something like this:
postgres://tsdbadmin:@.tsdb.cloud.timescale.com:/tsdb?sslmode=require
You can create the password in Project settings by clicking Create credentials.
The following code checks if the CONNECTION
is valid and runs a basic query to confirm database access. If it runs successfully, the database is accessible.
CONNECTION = "..."
conn = psycopg2.connect(CONNECTION)
cursor = conn.cursor()
# use the cursor to interact with your database
cursor.execute("SELECT 'hello world'")
print(cursor.fetchone())
After connecting to your PostgreSQL database on Timescale, it's time to create a table to store our documents and embeddings. In this section, we'll create a table named documents
, which holds the following columns:
id
: All relational databases always have a primary key to identify a row uniquely. id
will serve this purpose. contents
: The content of the articles will be stored as TEXT
. embedding
: The embeddings of the articles will be stored as VECTOR
. As demonstrated in the code, the VECTOR
size is 1024. This is because the Cohere model embed-english-v3.0
has an embedding dimension of 1024, as shown below:We'll install an extension on PostgreSQL called vector to use the operations on embeddings.
extension = """CREATE EXTENSION IF NOT EXISTS vector"""
cursor.execute(extension)
conn.commit()
The query is as follows for creating the table in the schema specified above:
document_table = """CREATE TABLE IF NOT EXISTS documents (
id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
contents TEXT,
embedding VECTOR(1024)
)"""
cursor.execute(document_table) # Execute the query
conn.commit() #Commit the query
The current table is empty 😞. Let's populate it with some data.
To fill up the embedding
rows, we'll generate embeddings using Cohere’s newest embedding model, embed-english-v3.0
. With just a few lines of code, we will have our embeddings.
The embed
API from Cohere generates embeddings. The arguments texts
and model
are obvious here. For the other two arguments:
input_type
parameter in v3 embeddings models allows users to specify the purpose of the embeddings, with "search_document" indicating texts intended for storage in a vector database.embeddings_types
parameter in the endpoint offers compression options such as float
, int8
, unint8
, binary
, and ubinary
.Let’s insert the content and the embeddings into our table.
sql = 'INSERT INTO documents (contents, embedding) VALUES ' + ', '.join(['(%s, %s)' for _ in doc_embeddings.embeddings.float_])
params = list(itertools.chain(*zip(content["article"], doc_embeddings.embeddings.float_)))
cursor.execute(sql, params)
conn.commit()
Now, let’s write our functions for searching the articles.
This section will discuss and implement the core functions of our hybrid search engine, keyword, and semantic search functions.
Text search, or full-text searching, enables the identification of natural-language documents matching a query and potentially sorting them by relevance. Traditional database textual search operators like ~
, ~*
, LIKE
and ILIKE
lack crucial properties needed for modern information systems:
PostgreSQL introduced the tsvector
data type, which stores preprocessed documents, while tsquery
represents processed queries. Various functions and operators are available for these data types, the most significant being the match operator @@
. Let’s see an example:
sql = """SELECT id, contents FROM documents, plainto_tsquery('english', %s) query WHERE to_tsvector('english', contents) @@ query ORDER BY ts_rank_cd(to_tsvector('english', contents), query) DESC LIMIT 5"""
This SQL query selects the id
and contents
columns from the documents
table. It then generates a tsquery
using the plainto_tsquery
function with the language specified as English and a parameterized input. This tsquery
is aliased as query
.
The WHERE
clause filters rows where the tsvector
representation of the contents
column matches the generated tsquery
.
The results are ordered by their rank, calculated using the ts_rank_cd
function. This function considers the frequency of the terms and the proximity of matching lexemes. The results are sorted in descending order by rank, and only the top five are returned, as specified by the LIMIT 5
clause.
def keyword_search(conn, query):
with conn.cursor() as cur:
cur.execute(sql, (query[0],))
return cur.fetchall()
query = ["A rare meeting of U.N. Security Council heads of state"]
keyword_search(conn, query)
The above query will match the article containing the text. Now, over to semantic search.
First, we'll convert the provided query into an embedding using the Cohere model for semantic search. We'll then compare this embedding to existing vectors in the vector database. We compare the embedding with a distance metric in which a smaller distance implies a more similar meaning. There are many such metrics available in pgvector
, such as:
Operators | Distance Function |
<-> | L2 Distance |
<=> | Negative inner product |
<#> | Cosine distance |
<+> | L1 distance |
First, let’s create the embeddings for our query as shown below:
query_embeddings = co.embed(texts=query,
model=model,
input_type="search_query", # This time, it is a search query
embedding_types=['float'])
We'll use the inner product ( ⇔ ) to calculate distances to determine the similarity. Here is the function:
def semantic_search(conn, query):
query_embeddings = co.embed(texts=query,
model=model,
input_type="search_query",
embedding_types=['float'])
with conn.cursor() as cur:
cur.execute('SELECT id, contents FROM documents ORDER BY embedding <=> %s::vector LIMIT 5', (query_embeddings.embeddings.float_[0],))
return cur.fetchall()
query = ["News related to people who died due to Carbon Monoxide"]
print(semantic_search(conn, query))
That finishes the most crucial part of the implementation. In the next section, we'll use rerank to ensure relevance.
Reranking improves search relevance by reordering a retriever's result using a different model. It computes a relevance score between the query and each data object, sorting them from most to least relevant. This two-stage process ensures efficiency by retrieving relevant objects before reranking them.
In the code below, we'll retrieve the semantic and keyword search results and combine them. Then, given the query, we'll use Cohere's rerank API to retrieve the most relevant documents. We’ll save reranking for later, as it is a computationally expensive algorithm.
def hybrid_search(conn, query):
sem_search, key_search = semantic_search(conn, query), keyword_search(conn, query)
combined_search = [text for (text,) in sem_search + key_search]
rerank = co.rerank(model="rerank-english-v3.0", query=query[0], documents=set(combined_search), top_n=2, return_documents=True)
return rerank.results
We have our function ready for hybrid search. Now, let's format our results in a document to showcase their application in RAG.
query = ["News related to people who died due to Carbon Monoxide"]
rerank_results = hybrid_search(conn, query )
documents = [Document(page_content=item.document.text) for item in rerank_results]
Having a list of documents will help us inherit a BaseRetriever
class in Langchain to make our RAG application. Let’s see that in action.
In this section, we'll use a hybrid search in our retrieval-augmented generation (RAG) application so it can chat on the news. This section is divided into three sub-sections: prompts
, retriever
, and chains
. We'll use Langchain for the RAG pipeline as it is the most popular choice among developers.
Our first step in chatting with our document is to tell our LLM what it needs to do. LLMs are well-trained on a large corpus of data, so they can quickly learn if prompted well.
Prompt engineering customizes chatbots from a general to a specialized level. LangChain's prompt templates simplify this process by integrating default messages, user input, chat history, and optional additional context. The provided prompt can be iterated for improvement.
instructions = """You are a friendly chatbot capable of answering questions related to CNN news."""
human = """
The context is provided as: {context}
New human question: {question}
"""
prompt = ChatPromptTemplate(
messages=[
SystemMessagePromptTemplate.from_template(instructions),
HumanMessagePromptTemplate.from_template(human), #User query will go here
],
input_variables=['question','context'], # context provided by retriever and question by the user
)
Retrievers are used to retrieve relevant documents given a query. As our hybrid and reranked function will manage that, we just need to wrap our documents in a retriever class to create our RAG pipeline.
class CustomRetriever(BaseRetriever):
documents: List[Document]
"""List of documents to retrieve from."""
k: int
"""Number of top results to return"""
def _get_relevant_documents(
self, query: str, *, run_manager: CallbackManagerForRetrieverRun
) -> List[Document]:
"""Sync implementations for retriever."""
matching_documents = []
for document in self.documents:
if len(matching_documents) > self.k:
return matching_documents
if query.lower() in document.page_content.lower():
matching_documents.append(document)
return matching_documents
Chains in LangChain combine everything, including our LLM, prompts, and retriever. In this article, our choice for LLM is Cohere’s chat API. Let’s see that in action.
llm = ChatCohere(cohere_api_key = 'Your API key')
def advanced_RAG(query):
rerank_results = hybrid_search(conn,query)
documents = [Document(page_content=item.document.text) for item in rerank_results]
retriever = CustomRetriever(documents=documents, k=len(documents))
qa_chain = RetrievalQA.from_chain_type(
llm, retriever= retriever, chain_type_kwargs={"prompt": prompt}
)
return qa_chain({"query": query[0] })
In the function above, we used hybrid search and rerank functions to retrieve our results. These results are converted to documents and then given to our custom retriever. Later, the chain combines all components and is given a query. Let’s test it out.
advanced_RAG(["Is there any news related to people who died due to Carbon Monoxide"])
Here’s the response:
>>> 'Yes, there have been several recent news stories related to people who died due to carbon monoxide poisoning. Here are a few examples- In December 2022, a family of four, including two young children, was found dead in their home in Pennsylvania. … … … \n\nThese tragic incidents serve as a reminder of the importance of carbon monoxide detectors and regular maintenance of fuel-burning appliances to prevent future deaths from this colorless, odorless, and deadly gas.
This is true. As we can see in our dataset, there has been a family who died due to carbon monoxide (CO).
In this article, we discussed the nuts and bolts of hybrid search and used it in our RAG application. By combining keyword and semantic search, the hybrid search system ensures precise, contextually relevant results. Here are the benefits of the tech stack we used:
Together, these technologies create a robust search engine capable of handling complex queries and improving search quality. With Timescale Cloud, developers can access pgvector, pgvectorscale, and pgai—extensions that turn PostgreSQL into an easy-to-use and high-performance vector database, plus a fully managed cloud database experience. Build your AI application with Timescale Cloud today.
If you want to keep learning about embeddings and how to build LLM applications, check out these resources: