PostgreSQL Hybrid Search Using Pgvector and Cohere

Google was founded more than 25 years ago, and its inception has set high standards for search engines. In the early days, search engines were heavily dependent on keyword searching, but as time passed, the algorithms became more powerful to understand the context behind the words.

Nowadays, search engines use hybrid search (a combination of multiple search algorithms) to perform search queries. This article is a take on creating one by ourselves. Afterward, using the power of large language models or LLMs, we will leverage RAG (retrieval-augmented generation) to have an LLM answer questions using  our retrieved results.

In this article, we will combine keyword and semantic search to achieve better search results. Keyword search matches words directly or to synonyms, while semantic search seeks to match the meaning behind the words in the query. We will leverage Cohere and pgvector to implement hybrid search on PostgreSQL and see how they make our work easier. 

What Is a Hybrid Search Engine?

Hybrid search merges different search methods to enhance result quality. This article focuses on keyword and semantic search to create a hybrid search engine that combines the best of both worlds. Here are the benefits:

  • Precision: Keyword search enables exact matches to the query, leaving no room for ambiguity. 
  • Context: Semantic search allows algorithms to understand the intent of the query. If no keywords are matched, the semantic search will step in to analyze the context and meaning behind the query, ensuring that relevant results are still provided and covering any gaps in keyword-based matching.
  • Relevance: Both techniques complement each other and improve relevance for unseen queries.

Implementing Hybrid Search With Cohere and Pgvector 

This section will implement a hybrid search engine leveraging Cohere and pgvector on Timescale Cloud. Before that, we will build an overall understanding of the architecture and concepts required. Here’s the schematic diagram:

Hybrid search engine architecture diagram

The architecture comprises five steps: 

1. Documents: The process starts with a collection of documents. 

2. Embedding generation:

a. Semantic search using Cohere: Documents are processed using Cohere to create dense embeddings, which capture the semantic meaning of the text.

b. Keyword search using PostgreSQL: Documents are processed using PostgreSQL to create sparse embeddings focusing on the text's specific keywords.

3. Storage: Both dense and sparse embeddings are stored on Timescale’s PostgreSQL.

4. Retrieval and reranking: When a query is made, the system retrieves results from the Timescale database. The results are then reranked using Cohere to prioritize the most relevant documents.

5. Results generation: The reranked results are compiled into a final list, generating a ranking of the most relevant documents for the query.

The implementation detailed in the following sections has numerous applications. One such application, discussed later in the article, demonstrates the development of advanced RAG systems. 

Now that we have a cursory look at the process, we can implement our components. Let’s start with setup.

Note: You will find the code below in our Colab notebook.

Setup and Imports

In this section, we will install and import the necessary libraries for hybrid search and show you how to set up Cohere. It offers advanced LLMs and tailored RAG capabilities for enterprise use cases, effectively addressing real-world challenges. This article will leverage Cohere embedding models for semantic search. To get started, head to Cohere and sign up. After that, generate your API key, which you will need to access the embedding model, as seen below:

Cohere dashboard

You will need the following libraries for our hybrid search. 

!pip install -q datasets psycopg2 pgvector cohere
import cohere
import asyncio
import psycopg2
import itertools
import numpy as np

Let’s also create our Cohere client.

co = cohere.Client('Your API Key')

Introduction to Our Dataset

The CNN-DailyMail Dataset consists of over 300,000 unique English-language news articles. It supports both extractive and abstractive summarization. It was originally designed for machine reading, comprehension, and abstractive question answering. The data fields are as follows: 

Fields

Description

id

A string containing the hexadecimal SHA1 hash of the URL where the story was retrieved.

article

A string containing the body of the news article.

highlights

A string containing the article's highlights, written by the author.

We will utilize the article field for our hybrid search engine. It contains complete information about the incidents. Additionally, given that the training set includes over 280,000 articles, we will use a smaller subset of approximately 1,000 articles for this demonstration.

Let’s load the dataset from Huggingface🤗 using the datasets library and split it on the training set as shown below:

from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")

content = dataset["train"]

Now, we will use datasets functions like shuffle and select to randomize the thousand articles selected.

content = content.shuffle(seed=42).select(range(0,1000))

After subsetting the dataset, let’s connect to our database.

Connecting to PostgreSQL Using Timescale

A critical component for a hybrid search application is a vector database, which enables querying indexed documents to retrieve the most relevant ones for providing context to the search. In this tutorial, I’ll use Timescale Cloud, a cloud PostgreSQL platform tailored for AI applications, to host the PostgreSQL database that stores the sparse embeddings for our keyword search feature. 

To start, sign up, create a new database, and follow the provided instructions. For more information, refer to the Get started with Timescale guide

After signing up, connect to the Timescale database by providing the service URI, which can be found under the service section on the dashboard. The URI will look something like this: 

postgres://tsdbadmin:@.tsdb.cloud.timescale.com:/tsdb?sslmode=require

Configuration dashboard for connecting to the service

The password can be created in the  Project settings by clicking Create credentials

Project settings page for creating credentials

The following code checks if the CONNECTION is valid and runs a basic query to confirm database access. If it runs successfully, the database is accessible.

CONNECTION = "..."

conn = psycopg2.connect(CONNECTION)
cursor = conn.cursor()
# use the cursor to interact with your database
cursor.execute("SELECT 'hello world'")
print(cursor.fetchone())

Table Creation

After connecting to your PostgreSQL database on Timescale, it's time to create a table to store our documents and embeddings. In this section, we will create a table named documents, which holds the following columns: 

  • id: All relational databases always have a primary key to identify a row uniquely. id will serve this purpose. 
  • contents: The content of the articles will be stored as TEXT
  • embedding: The embeddings of the articles will be stored as VECTOR. As demonstrated in the code, the VECTOR size is 1024. This is because the Cohere model embed-english-v3.0 has an embedding dimension of 1024, as shown below:
Embedding model from Cohere dimension and performance table

We will install an extension on PostgreSQL called vector to use the operations on embeddings.

extension = """CREATE EXTENSION IF NOT EXISTS vector"""

cursor.execute(extension)
conn.commit()

The query is as follows for creating the table in the schema specified above:

document_table = """CREATE TABLE IF NOT EXISTS documents  (
    id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    contents TEXT,
    embedding VECTOR(1024)
)"""

cursor.execute(document_table) # Execute the query
conn.commit() #Commit the query

The current table is empty 😞. Let's populate it with some data.

Data Insertion

As the schema is defined, this section will populate the data. To fill up the embedding rows, we will generate embeddings using Cohere’s newest embedding model, embed-english-v3.0. With just a few lines of code, we will have our embeddings.

The embed API from Cohere generates embeddings. The arguments, texts, and modelare obvious here. The explanation of the other two is as follows:

  • The input_type parameter in v3 embeddings models allows users to specify the purpose of the embeddings, with "search_document" indicating texts intended for storage in a vector database.
  • The embeddings_types parameter in the endpoint offers compression options such as float, int8, unint8, binary, and ubinary.

Let’s insert the content and the embeddings into our table.

sql = 'INSERT INTO documents (contents, embedding) VALUES ' + ', '.join(['(%s, %s)' for _ in doc_embeddings.embeddings.float_])

params = list(itertools.chain(*zip(content["article"], doc_embeddings.embeddings.float_)))

cursor.execute(sql, params)
conn.commit()

Now, let’s write our functions for searching the articles. 

This section will discuss and implement the crux of our hybrid search engine, keyword, and semantic search functions.

Text search, or full-text searching, enables the identification of natural-language documents matching a query and potentially sorting them by relevance. Traditional database textual search operators like ~, ~*, LIKE and ILIKE lack crucial properties needed for modern information systems:

  • They lack linguistic support, making it difficult to handle derived words effectively, e.g., hang and hanging.
  • They don't rank search results, rendering them ineffective for large result sets.
  • They tend to be slow due to the absence of index support, which requires processing all documents for each search.

PostgreSQL introduced the tsvector data type, which stores preprocessed documents, while tsquery represents processed queries. Various functions and operators are available for these data types, the most significant being the match operator @@. Let’s see an example:

sql = """SELECT id, contents FROM documents, plainto_tsquery('english', %s) query WHERE to_tsvector('english', contents) @@ query ORDER BY ts_rank_cd(to_tsvector('english', contents), query) DESC LIMIT 5"""

This SQL query selects the id and contents columns from the documents table. It then generates a tsquery using the plainto_tsquery function with the language specified as English and a parameterized input. This tsquery is aliased as query.

The WHERE clause filters rows where the tsvector representation of the contents column matches the generated tsquery.

The results are ordered by their rank, calculated using the ts_rank_cd function. This function considers the frequency of the terms and the proximity of matching lexemes. The results are sorted in descending order by rank, and only the top five are returned, as specified by the LIMIT 5 clause.

def keyword_search(conn, query):

    with conn.cursor() as cur:
        cur.execute(sql, (query[0],))
        return cur.fetchall()

query = ["A rare meeting of U.N. Security Council heads of state"]
keyword_search(conn, query)

The above query will match the article containing the text. Now, over to semantic search.

The provided query will first be converted into an embedding using the Cohere model for semantic search. This embedding will then be compared to existing vectors in the vector database. The distances between the vectors are measured, with smaller distances indicating greater similarity in the meaning of the sentences or articles. There are many distances used in ​pgvector, as provided below:

Operators

Distance Function

<->

L2 Distance

<=>

Negative inner product

<#>

Cosine distance

<+>

L1 distance

First, let’s create the embeddings for our query as shown below:

query_embeddings = co.embed(texts=query,
                model=model,
                input_type="search_query", # This time, it is a search query
                embedding_types=['float'])

We will use the inner product ( ⇔ ) to calculate distances to determine the similarity. Here is the function:

def semantic_search(conn, query):

    query_embeddings = co.embed(texts=query,
                model=model,
                input_type="search_query",
                embedding_types=['float'])

    with conn.cursor() as cur:
        cur.execute('SELECT id, contents FROM documents ORDER BY embedding <=> %s::vector LIMIT 5', (query_embeddings.embeddings.float_[0],))
        return cur.fetchall()

query = ["News related to people who died due to Carbon Monoxide"]
print(semantic_search(conn, query))

We are done with the most crucial part of the implementation. In the next section, we will use rerank to ensure relevance.

Rerank

Reranking improves search relevance by reordering the result set from a retriever using a different model. It computes a relevance score between the query and each data object, sorting them from most to least relevant. This two-stage process ensures efficiency by retrieving relevant objects before reranking them. 

In the code below, we will retrieve the semantic and keyword search results and combine them. Then, given the query, we will use Cohere's rerank API to retrieve the most relevant documents. We’ll save reranking for later, as it is a computationally expensive algorithm.

def hybrid_search(conn, query):

  sem_search, key_search = semantic_search(conn, query), keyword_search(conn, query)

  combined_search = [text for (text,) in sem_search + key_search]


  rerank = co.rerank(model="rerank-english-v3.0", query=query[0],   documents=set(combined_search), top_n=2, return_documents=True)

  return rerank.results

Results

We have our function ready for hybrid search, and we will now format our results in a document to showcase their application in RAG.

query = ["News related to people who died due to Carbon Monoxide"]

rerank_results = hybrid_search(conn, query )

documents = [Document(page_content=item.document.text) for item in rerank_results]

Having a list of documents will help us inherit a BaseRetriever class in Langchain to make our RAG application. Let’s see that in action.

Application to RAG

In this section, we will use a hybrid search in our retrieval-augmented generation (RAG) application so it can chat on the news. This section is divided into three sub-sections: prompts, retriever, and chains. We will use Langchain for the RAG pipeline as it is the most popular choice among developers.

Prompts

Our first step in chatting with our document is to tell our LLM what it needs to do. LLMs are well-trained on a large corpus of data, so they can quickly learn if prompted well. 

Prompt engineering customizes chatbots from a general to a specialized level. LangChain's prompt templates simplify this process by integrating default messages, user input, chat history, and optional additional context. The provided prompt can be iterated for improvement.

instructions = """You are a friendly chatbot capable of answering questions related to CNN news."""

human = """
The context is provided as: {context}
New human question: {question}
"""
prompt = ChatPromptTemplate(
    messages=[
        SystemMessagePromptTemplate.from_template(instructions),
        HumanMessagePromptTemplate.from_template(human), #User query will go here
    ],
    input_variables=['question','context'], # context provided by retriever and question by the user
)

Hybrid search custom retriever

Retrievers are used to retrieve relevant documents given a query. As our hybrid and reranked function will manage that, we just need to wrap our documents in a retriever class to create our RAG pipeline.

class CustomRetriever(BaseRetriever):
    documents: List[Document]
    """List of documents to retrieve from."""
    k: int
    """Number of top results to return"""

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Sync implementations for retriever."""
        matching_documents = []
        for document in self.documents:
            if len(matching_documents) > self.k:
                return matching_documents

            if query.lower() in document.page_content.lower():
                matching_documents.append(document)
        return matching_documents

Chain

Chains in LangChain combine everything, including our LLM, prompts, and retriever. In this article, our choice for LLM is Cohere’s chat API. Let’s see that in action.

llm = ChatCohere(cohere_api_key = 'Your API key')
def advanced_RAG(query):

  rerank_results = hybrid_search(conn,query)

  documents = [Document(page_content=item.document.text) for item in rerank_results]

  retriever = CustomRetriever(documents=documents, k=len(documents))

  qa_chain = RetrievalQA.from_chain_type(
    llm, retriever= retriever, chain_type_kwargs={"prompt": prompt}
)
  return qa_chain({"query": query[0] })

In the function above, we used hybrid search and rerank functions to retrieve our results. These results are converted to documents and then given to our custom retriever. Later, the chain combines all components and is given a query. Let’s test it out.

advanced_RAG(["Is there any news related to people who died due to Carbon Monoxide"])

Here’s the response:

>>> 'Yes, there have been several recent news stories related to people who died due to carbon monoxide poisoning. Here are a few examples- In December 2022, a family of four, including two young children, was found dead in their home in Pennsylvania. … … … \n\nThese tragic incidents serve as a reminder of the importance of carbon monoxide detectors and regular maintenance of fuel-burning appliances to prevent future deaths from this colorless, odorless, and deadly gas.

This is true. As we can see in our dataset, there has been a family who died due to carbon monoxide (CO).

Next Steps

In this article, we discussed the nuts and bolts of hybrid search and used it in our RAG application. By combining keyword and semantic search, the hybrid search system ensures precise, contextually relevant results. Here are the benefits of the tech stack we used:

  • Cohere provides advanced embeddings for semantic search and reranking, enhancing result relevance.
  • PostgreSQL and pgvector on Timescale Cloud offer efficient storage and retrieval of dense and sparse embeddings. 

Together, these technologies create a robust search engine capable of handling complex queries and improving search quality. With Timescale Cloud, developers can access pgvector, pgvectorscale, and pgai—extensions that turn PostgreSQL into an easy-to-use and high-performance vector database, plus a fully managed cloud database experience. Build your AI application with Timescale Cloud today.

Learn more

If you want to keep learning about embeddings and how to build LLM applications, check out these resources: