Retrieval-Augmented Generation With Claude Sonnet 3.5 and Pgvector

The AI chatbot race, initiated by OpenAI with the release of ChatGPT, is now seeing new competitors like Gemini, Cohere, and many others. OpenAI’s latest release of GPT-4o amazed users with its advanced visual processing, auditory recognition, and conversational abilities. While Sam Altman’s company, OpenAI, appears to lead the AI race, Anthropic provided tougher competition with its models. Anthropic, a prominent AI research company dedicated to developing safe and ethical AI systems, is making waves in the field with its Claude family of large language models (LLMs). 

The new Claude launch is a notable upgrade from its predecessor. Anthropic claims it can surpass OpenAI's GPT-4o model on key benchmarks like GPQA (Graduate-Level Google-Proof Q&A), multilingual math (MGSM), and more.

In this article, we’ll discuss Claude’s new top-tier model, its strengths and usefulness, and compare it to other models in the Claude family. We will also use Sonnet 3.5 and pgvector to build a retrieval-augmented generation (RAG) application.

What Is RAG?

RAG (retrieval-augmented generation) is a natural language processing (NLP) technique that combines generative large language models (LLMs) with traditional information retrieval systems (databases). RAG systems can process and consolidate knowledge to create context-aware answers, explanations, and instructions in human-like language.

Everything About Claude Sonnet 3.5

Claude Sonnet 3.5 is a new model from Anthropic, a U.S.-based company. On various evaluations, it outperforms competitor models and Claude 3 Opus, matching the speed and cost of Claude 3 Sonnet.

This new version is available for free on the Claude.ai web application and the Claude iOS app, with Claude Pro and Team plan subscribers receiving higher rate limits. It can also be accessed through the Anthropic API, which is the focus of this article, as well as through Amazon Bedrock and Google Cloud’s Vertex AI. (If you want to build RAG applications using Amazon Bedrock, here's a beginner-friendly guide.)

Fig 1: Cost and intelligence showcase of Claude 3.5 Sonnet

Metrics and comments:

  • Speed: Claude 3.5 Sonnet is twice as fast as its predecessor, Claude 3 Opus. Figure 1 shows its elevated speed.
  • Cost: The model costs one-fifth as much as Claude 3 Opus: $3 per million input tokens and $15 per million output tokens, compared with $15 and $75 for Opus.
  • Performance: Claude 3.5 Sonnet excels in performance across multiple evaluations (for more, see Figure 2), including graduate-level reasoning (GPQA), undergraduate-level knowledge (MMLU), and coding proficiency (HumanEval). It surpasses competitors like OpenAI's GPT-4o and Google's Gemini 1.5 Pro.
  • User reviews: Many developers reported 3-4x productivity increases with Claude 3.5 Sonnet for coding tasks. Users praised its natural, pleasant interaction style, finding it more intuitive and less frustrating than other AI assistants. Early adopters described it as "absolutely phenomenal" and superior to GPT-4 for automation and troubleshooting.

Fig 2: Claude 3.5 Sonnet measured on different evaluation datasets

Claude Sonnet 3.5 vs. Anthropic family of AI models

The Anthropic family contains multiple models, including Claude Sonnet 3.5, and each model provides different types of performance and costs. Below is a side-by-side comparison of the Sonnet model and other members of the Anthropic family. 

If you’re unfamiliar with any of the terms in the table below, the following provides you with descriptions of each of the features:

  • Input context window: the maximum number of tokens the model can accept as input in a single request
  • Maximum output tokens: the maximum number of tokens the model can generate in a single request
  • Input token pricing: the cost of input data provided to the model
  • Output token pricing: the cost of output tokens generated by the model
  • MMLU: evaluates LLM knowledge acquisition in zero-shot and few-shot settings
  • MMMU: a wide-ranging multi-discipline and multimodal benchmark

| Features | Claude Sonnet 3.5 | Claude 3 Sonnet | Claude 3 Opus | Claude 3 Haiku |
| --- | --- | --- | --- | --- |
| Input context window | 200K | 200K | 200K | 200K |
| Maximum output tokens | 4096 | 4096 | 4096 | 4096 |
| Input token pricing (per million tokens) | $3 | $3 | $15 | $0.25 |
| Output token pricing (per million tokens) | $15 | $15 | $75 | $1.25 |
| MMLU benchmark | 90.4 | 81.5 | 88.2 | 76.7 |
| MMMU benchmark | 68.3 | 53.1 | 59.4 | 50.2 |
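
To make the pricing rows concrete, here is a minimal sketch that estimates the cost of a single request from the per-million-token prices listed above. The PRICES dictionary, its keys, and the estimate_cost helper are hypothetical names used only for illustration; the numbers are copied from the comparison and may change over time.

```python
# Prices in USD per million tokens, copied from the comparison above.
# The keys are illustrative labels, not API model identifiers.
PRICES = {
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-opus": {"input": 15.00, "output": 75.00},
    "claude-3-haiku": {"input": 0.25, "output": 1.25},
}


def estimate_cost(model, input_tokens, output_tokens):
    """Return the estimated cost in USD of one request."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Compare the same hypothetical request (2,000 tokens in, 500 out) across models.
    for name in PRICES:
        print(f"{name}: ${estimate_cost(name, 2_000, 500):.4f}")
```

For a RAG workload, the input side usually dominates because every request carries the retrieved context, so the input price is the number to watch.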

RAG Implementation With Sonnet 3.5 and Pgvector

Now that you know more about Claude Sonnet 3.5, let’s use it along with pgvector to implement a retrieval-augmented generation engine. 

Before we do that, however, we’ll look at the architecture and necessary concepts, starting with the schematic diagram.

The schematic diagram for our RAG application

The diagram above represents a RAG pipeline with the following steps:

  • Documents: the process begins with collecting documents that need to be indexed. 
  • Data indexing: these documents are indexed and stored in a vector database, in this case PostgreSQL with pgvector. 
  • Query: a user query is input into the system, which retrieves relevant information from the vector database.
  • Vector database: this database stores the documents' embeddings using pgvector. When a query is made, it retrieves the top results based on relevance.
  • LLM (Claude Sonnet 3.5): the retrieved results and the original query are fed into a language model for further processing and understanding.
  • Results: the final output is generated, giving the user a response incorporating the retrieved data and the language model's processing. 

This pipeline combines document retrieval with language understanding to generate relevant responses.
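
Before wiring up real components, the flow above can be sketched in a few lines of plain Python. This is a toy illustration only: word overlap stands in for embedding similarity, a Python list stands in for the vector database, and the generate argument stands in for the LLM call. All function names here are hypothetical.

```python
def tokenize(text):
    """Lowercase a string and split it into a set of words, stripping punctuation."""
    words = {word.strip(".,!?;:\"'()").lower() for word in text.split()}
    return words - {""}


def retrieve(query, documents, top_k=2):
    """Rank documents by word overlap with the query (a stand-in for vector search)."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, retrieved):
    """Combine the retrieved documents and the original query into one prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"


def run_pipeline(query, documents, generate, top_k=2):
    """Retrieve context, build the augmented prompt, and hand it to `generate`."""
    return generate(build_prompt(query, retrieve(query, documents, top_k)))
```

In the rest of this article, retrieve is replaced by a pgvector similarity query and generate by a call to Claude.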

Setup and Imports

To begin, we will install the necessary libraries for RAG. The datasets package is included because it is used later to load the news articles.

%pip install psycopg2 pgvector anthropic sentence-transformers datasets opendatasets pandas

We then need to import them into our environment.

import itertools  # flattens (content, embedding) pairs for the bulk insert

import anthropic  # Anthropic API client
import psycopg2  # PostgreSQL driver

To set it up, each API call needs a valid API key. The SDKs can retrieve this key from the ANTHROPIC_API_KEY environment variable, or you can provide it when initializing the Anthropic client. On Windows, we can simply add an environment variable like this:

setx ANTHROPIC_API_KEY "your-api-key-here"

After this, we can set our client.

client = anthropic.Anthropic()
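
If the variable is missing, the failure only shows up on the first API call, so it can help to check for it up front. Below is a minimal sketch, assuming you prefer to pass the key explicitly; get_api_key is a hypothetical helper, not part of the SDK. On macOS or Linux, the equivalent of setx is export ANTHROPIC_API_KEY="your-api-key-here", and note that setx only affects terminals opened after it runs.

```python
import os


def get_api_key(env=None):
    """Return the Anthropic API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; set it in your shell or pass api_key explicitly."
        )
    return key


# Usage (requires the anthropic package):
# import anthropic
# client = anthropic.Anthropic(api_key=get_api_key())
```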

Let's ensure it’s working by checking the client's response. The code below uses an API client to query Anthropic and prints the response. Here’s the code:

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1000,
    temperature=0,
    system="Answer the questions provided",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is Pgvector?"
                }
            ]
        }
    ]
)
print(message.content)

The code sends a message request to the model (claude-3-5-sonnet-20240620) with specific settings:

  • It asks the model to generate a response to the user query, "What is Pgvector?".
  • The max_tokens=1000 parameter caps the length of the response at 1,000 tokens. Tokens are sub-word units, so this is not the same as 1,000 words.
  • temperature=0 makes the response as deterministic as possible by minimizing sampling randomness.
  • The system parameter sets the system prompt, which here instructs the model to answer the questions provided.
  • The query is sent as a text content block. Note that the model supports multimodal inputs, meaning it could also process image queries alongside text.
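
Note that message.content is a list of content blocks rather than a plain string, so printing it shows the block objects. The sketch below pulls out just the text; extract_text is a hypothetical helper, and the stand-in object only mimics the type and text attributes of a text block so the snippet can run without an API call.

```python
from types import SimpleNamespace


def extract_text(message):
    """Concatenate the text of every text block in a Messages API response."""
    return "".join(
        block.text
        for block in message.content
        if getattr(block, "type", None) == "text"
    )


# Stand-in shaped like a response, so the helper can be tried offline.
fake_message = SimpleNamespace(
    content=[SimpleNamespace(type="text", text="pgvector is a PostgreSQL extension.")]
)
print(extract_text(fake_message))
```

With a real response, print(extract_text(message)) would print only the model's answer.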

Next, we will set up the embedding model.

Setting Up the Embedding Model

To convert the text chunks to embeddings, we will use Sentence Transformers (SBERT). SBERT is a Python module for text and image embeddings, ideal for applications like semantic search and similarity scoring. It offers over 5,000 pre-trained models on Hugging Face🤗 and supports easy training or fine-tuning for custom use cases. 

Let's load the model paraphrase-MiniLM-L6-v2 and generate embeddings on an example sentence. Here’s the code:

from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']

# Sentences are encoded by calling model.encode()
embedding = embedding_model.encode(sentence)
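
Because we passed a one-element list, encode() returns a 2D array with one row per sentence. To build intuition for how these vectors are compared later, here is a minimal sketch of cosine similarity in plain Python. The helper names are ours; pgvector performs this computation inside the database, where its <=> operator returns the cosine distance, that is, 1 minus the cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance used for ranking: 0 means same direction, 2 means opposite."""
    return 1.0 - cosine_similarity(a, b)
```

With the model loaded, two sentences could be compared with cosine_similarity(vectors[0].tolist(), vectors[1].tolist()), where vectors is the result of encoding a two-sentence list.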

Connecting to PostgreSQL Using Timescale Cloud

A key component for a RAG application is a vector database, which is essential for querying indexed documents and retrieving relevant context for searches. In this tutorial, we'll use PostgreSQL and pgvector as our vector database. The pgvector extension turns PostgreSQL into a fully featured vector database, with support for similarity search, as well as sparse embeddings for keyword searches.

  1. To get started, sign up, create a new database, and follow the provided instructions. For detailed guidance, refer to the Timescale guide.
  2. After signing up, connect to the Timescale database using the service URI from the dashboard, which looks like this:

postgres://tsdbadmin:<password>@<host>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require

  3. Create a password in the project settings by clicking on Create credentials.
Project settings page to create credentials in the Timescale console

  4. Connect to the service from your environment using the service URI and the password you created.

Connecting to the Timescale service

  5. The snippet below opens the connection and runs a basic query to confirm database access, ensuring it’s ready for use:

CONNECTION = "<Your Connection String>"

conn = psycopg2.connect(CONNECTION)
cursor = conn.cursor()
# use the cursor to interact with your database
cursor.execute("SELECT 'hello world'")
print(cursor.fetchone())
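
Since the connection string contains a password, you may prefer not to paste it into the notebook. Here is a minimal sketch that reads it from an environment variable and checks its shape before connecting. The variable name TIMESCALE_SERVICE_URL and both helper functions are assumptions made for this example, not Timescale conventions.

```python
import os
from urllib.parse import parse_qs, urlparse


def load_service_uri(env=None, var_name="TIMESCALE_SERVICE_URL"):
    """Read the service URI from the environment and validate its basic shape."""
    env = os.environ if env is None else env
    uri = env.get(var_name, "")
    parsed = urlparse(uri)
    if parsed.scheme not in ("postgres", "postgresql") or not parsed.hostname:
        raise ValueError(f"{var_name} is missing or is not a PostgreSQL URI")
    if not parsed.password:
        raise ValueError(f"{var_name} does not include a password")
    return uri


def describe_uri(uri):
    """Return the non-secret parts of a service URI, safe to print or log."""
    parsed = urlparse(uri)
    return {
        "host": parsed.hostname,
        "port": parsed.port,
        "database": parsed.path.lstrip("/"),
        "sslmode": parse_qs(parsed.query).get("sslmode", [None])[0],
    }


# Usage (requires psycopg2 and a running service):
# conn = psycopg2.connect(load_service_uri())
```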

Basic RAG

In this implementation, we'll construct a RAG pipeline using relevant news articles to enhance the accuracy of the Claude model. This example serves as a foundational demonstration for our advanced RAG implementation.

Dataset overview

The CNN/DailyMail Dataset comprises over 300,000 English-language news articles from CNN and the Daily Mail. It supports both extractive and abstractive summarization and was originally designed for machine reading, comprehension, and abstractive question answering.

Data fields

  • id: SHA-1 hash of the URL where the story was retrieved.
  • article: Body of the news article.
  • highlights: Highlights of the article, authored by the article's writer.

For our RAG pipeline, we will use the article field, which contains comprehensive information about the incidents. Given the dataset's size, we will shuffle the training split and select a subset of 1,000 articles, then embed only the first 10 to keep this demonstration small.

from datasets import load_dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")
content = dataset["train"]
content = content.shuffle(seed=42).select(range(0,1000))
sample_dataset = content["article"][:10]  # article bodies, as described above

Let's generate embeddings for the sample_dataset. The size of the embedding array will be required later when creating the table.

embeddings = embedding_model.encode(sample_dataset)

Next, convert the numpy.ndarray to a plain Python list, which psycopg2 can pass to pgvector:

embeddings = embeddings.tolist()

Table creation and ingesting documents

We'll create a documents table to store our documents and their embeddings. This table will have the following columns:

  • id: Serves as the primary key to identify each row uniquely.
  • contents: Stores the content of the articles as TEXT.
  • embedding: Stores the embeddings of the articles as VECTOR. The VECTOR size is set to 384, matching the embedding dimension of the paraphrase-MiniLM-L6-v2 model.

  1. To enable operations on embeddings, we will install the vector extension in PostgreSQL.

extension = """CREATE EXTENSION IF NOT EXISTS vector"""

cursor.execute(extension)
conn.commit()

2.  Now, it’s time to create the table given the details above:

document_table = """CREATE TABLE IF NOT EXISTS documents  (
    id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    contents TEXT,
    embedding VECTOR(384)
)"""

cursor.execute(document_table)
conn.commit()
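
The VECTOR size must match the embedding model's output, so one option is to derive it from the embeddings instead of hard-coding 384. The following is a minimal sketch under that assumption; documents_table_sql is a hypothetical helper that produces the same statement as above for 384-dimensional embeddings.

```python
def documents_table_sql(embeddings):
    """Build the CREATE TABLE statement with a VECTOR size taken from the data."""
    dims = {len(vector) for vector in embeddings}
    if len(dims) != 1:
        raise ValueError("expected a non-empty list of equal-length embeddings")
    dim = dims.pop()  # an integer, so it is safe to place in the statement
    return (
        "CREATE TABLE IF NOT EXISTS documents (\n"
        "    id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,\n"
        "    contents TEXT,\n"
        f"    embedding VECTOR({dim})\n"
        ")"
    )


# Usage with the list created earlier:
# cursor.execute(documents_table_sql(embeddings))
# conn.commit()
```

If you later switch to a model with a different dimension, the existing table has to be recreated or altered, because IF NOT EXISTS leaves an existing table untouched.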

3.  Next, we need to insert the elements in the table. The code below inserts article contents and their corresponding embeddings into the documents table in PostgreSQL. It constructs a single SQL INSERT statement with one (%s, %s) placeholder pair per row, interleaves the ten sample articles with their embeddings as parameters, and executes the insertion using the database cursor. Finally, it commits the transaction to save the changes.

sql = 'INSERT INTO documents (contents, embedding) VALUES ' + ', '.join(['(%s, %s)' for _ in embeddings])
params = list(itertools.chain(*zip(sample_dataset, embeddings)))

cursor.execute(sql, params)
conn.commit()

4.  Let's check the documents inserted into the table in the DB.

cursor.execute("SELECT * FROM documents")
print(cursor.fetchone())

Now we need to create a function to retrieve relevant documents based on vector similarity. This function will leverage the index to efficiently search for the K nearest vectors, significantly reducing computation time.

Indexing

pgvector offers two indexing algorithms, IVFFlat and HNSW, to ensure fast search performance.

  • IVFFlat: This algorithm divides vectors into clusters, creating lists for each centroid. Only a subset of lists (those whose centroids are closest to the search vector) are examined during a search. This reduces the number of distance calculations.
  • HNSW (hierarchical navigable small world): This algorithm builds a graph where nodes represent vectors and edges connect nearby vectors. It uses hierarchical layers to navigate the graph, efficiently finding the nearest neighbors. HNSW is known for its high recall and speed, making it suitable for large-scale searches.

We will create an index on the embedding column using IVFFlat to optimize search performance. Timescale has introduced another indexing technique via the PostgreSQL open-source extension pgvectorscale (GitHub stars welcome!), readily available on Timescale Cloud. This new indexing technique, StreamingDiskANN, along with Statistical Binary Quantization (also a Timescale innovation), unlocks high-performance AI use cases, making PostgreSQL as fast as Pinecone.

ivfflat = """CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)"""
cursor.execute(ivfflat)
conn.commit()
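
The statement above relies on IVFFlat's default number of lists, which is far more than a ten-row demo table needs. As a rough sketch, the helpers below apply the starting point suggested in the pgvector README (rows / 1000 for up to 1M rows, the square root of the row count beyond that) and also show the equivalent HNSW statement, which requires pgvector 0.5.0 or later. The helper names are hypothetical, and the right settings for your data may differ.

```python
import math


def suggest_lists(row_count):
    """Starting value for IVFFlat `lists`, following the pgvector README rule of thumb."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return math.isqrt(row_count)


def ivfflat_index_sql(row_count):
    """IVFFlat index on documents.embedding with an explicit `lists` setting."""
    return (
        "CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) "
        f"WITH (lists = {suggest_lists(row_count)})"
    )


# HNSW alternative: no `lists` to tune, slower to build, often better recall.
HNSW_INDEX_SQL = "CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)"

# Usage:
# cursor.execute(ivfflat_index_sql(row_count=10))
# conn.commit()
```

An IVFFlat index is best created after the table has data, since the clusters are computed from the rows present at build time.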

The code below retrieves the contents of the two documents closest to the given query embeddings from the documents table in PostgreSQL. Here's the code for it:

def relevant_search(conn, query):
    # Accept either a single string or a one-element list such as the example query below
    text = query[0] if isinstance(query, (list, tuple)) else query
    query_embedding = embedding_model.encode(text).tolist()  # flat list of floats

    with conn.cursor() as cur:
        # Parameters are passed as a one-element tuple; note the trailing comma
        cur.execute('SELECT contents FROM documents ORDER BY embedding <=> %s::vector LIMIT 2', (query_embedding,))
        return cur.fetchall()

query = ["News related to people who died due to Carbon Monoxide"]

relevant_search(conn, query)

Here’s the breakdown of the code:

  • embedding_model.encode(text).tolist(): Encodes the query text and converts the resulting embedding to a flat Python list.
  • with conn.cursor() as cur: Opens a cursor to interact with the PostgreSQL database.
  • cur.execute(...): Executes an SQL query.
  • SELECT contents FROM documents ORDER BY embedding <=> %s::vector LIMIT 2: Selects the contents column from the documents table, ordering the results by the distance between the embedding column and the provided query embedding. The <=> operator computes the cosine distance between two vectors.
  • (query_embedding,): Supplies the query embedding as a one-element parameter tuple to the SQL query.
  • return cur.fetchall(): Fetches and returns the query results (the contents of the two closest documents).
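
When tuning retrieval, it is often useful to see how close each match is and to vary the number of results. Here is a minimal sketch of such a variant, assuming an open cursor and a query embedding that is already a flat Python list; search_with_distance is a hypothetical name and is not used elsewhere in this article.

```python
def search_with_distance(cur, query_embedding, k=2):
    """Return the k nearest documents as (contents, cosine_distance) rows."""
    cur.execute(
        "SELECT contents, embedding <=> %s::vector AS distance "
        "FROM documents ORDER BY distance LIMIT %s",
        (query_embedding, k),  # both values are bound as query parameters
    )
    return cur.fetchall()


# Usage:
# with conn.cursor() as cur:
#     vector = embedding_model.encode("carbon monoxide deaths").tolist()
#     for contents, distance in search_with_distance(cur, vector, k=5):
#         print(round(distance, 3), contents[:80])
```

A distance close to 0 indicates a near match, so unusually large distances for the top results are a hint that the table holds nothing relevant to the query.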

This final step of the RAG application integrates language model generation with semantic search. Once the relevant documents are retrieved, they are passed alongside the query to the Claude Sonnet 3.5 model for response generation.

def rag_function(conn, client, model_name, query):

    # Step 1: Retrieve relevant documents

    relevant_docs = relevant_search(conn, query)
    relevant_text = " ".join([doc[0] for doc in relevant_docs])  # Combine the contents of the retrieved documents

    # Step 2: Use the retrieved documents to augment the query for the Claude model
    full_query = (f"Context: The following are relevant news articles related to the query.\n"
        f"{relevant_text}\n\n"
        f"Based on the above context, please answer the following question:\n"
        f"Question: {query[0]}")

    # Step 3: Query the Claude model with the augmented query
    message = client.messages.create(
        model=model_name,
        max_tokens=1000,
        temperature=0,
        system="Given a query and context, provide accurate information. If the context does not contain relevant information, say so instead of guessing.",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": full_query
                    }
                ]
            }
        ]
    )
    
    return message.content

# Example usage:
query = ["News related to people who died due to Carbon Monoxide"]
response = rag_function(conn, client, "claude-3-5-sonnet-20240620", query)
print(response)
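
The function above joins the retrieved articles with spaces, which can blur where one article ends and the next begins. One possible refinement, sketched below, wraps each article in numbered XML-style tags and trims very long articles. The function names, tag names, and the 2,000-character limit are illustrative choices of ours, not requirements of the API.

```python
def format_context(rows, max_chars=2000):
    """Wrap each retrieved row (a one-element tuple) in numbered document tags."""
    parts = []
    for index, row in enumerate(rows, start=1):
        text = row[0][:max_chars]  # trim long articles to keep the prompt small
        parts.append(f'<document index="{index}">\n{text}\n</document>')
    return "\n".join(parts)


def build_rag_prompt(question, rows):
    """Build the user message from the question and the retrieved rows."""
    if not rows:
        return f"No context was retrieved.\n\nQuestion: {question}"
    return (
        f"{format_context(rows)}\n\n"
        "Using only the documents above, answer the question.\n"
        f"Question: {question}"
    )
```

To try it, the full_query assignment in rag_function could be replaced with build_rag_prompt(query[0], relevant_docs).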

Alright! We have completed a basic RAG example using Claude Sonnet 3.5. 

Conclusion

This article covered everything you need to know about Claude Sonnet 3.5. We discussed its superior intelligence, multimodality, and reduced cost compared to other family members. We then put the model to work to implement RAG using pgvector on Timescale Cloud.

We started with a basic RAG example to retrieve relevant news content. Stay tuned for the advanced version involving an AI image gallery! You, too, can start using Timescale to build your AI application with Claude and pgvector. Timescale Cloud has a complete open-source stack for your AI applications, with pgvector, pgai, and pgvectorscale. If you’re implementing RAG, pgai makes it easier to build search and RAG applications by bringing more AI workflows into PostgreSQL.

To make your AI application more scalable and performant, try pgvectorscale. This extension adds a third approximate nearest-neighbor (ANN) search algorithm to pgvector (StreamingDiskANN) and utilizes a streaming model that allows the index to continuously retrieve the “next closest” item for a given query, revving your application’s performance.

Both pgai and pgvectorscale are open source under the PostgreSQL license. To install them, check out the pgai and pgvectorscale GitHub repos (stars are always welcome!). To get started more quickly, sign up for Timescale Cloud and create a free cloud PostgreSQL database for your RAG application.
