Category: All posts
Nov 08, 2024
Large language models (LLMs) are data-intensive systems that generate responses by analyzing vast amounts of information. However, as statistical models, they don't truly understand it. For example, LLMs don't inherently "know" that the sun rises in the east; they simply generate this fact based on patterns in the data they've been trained on. When faced with less common queries, LLMs may produce inaccurate or fabricated information, a problem known as hallucination.
Retrieval-augmented generation (RAG) helps mitigate hallucinations by combining information retrieval with text generation. Instead of relying solely on the model's internal knowledge, RAG pulls relevant documents or data to enhance the response. However, when using proprietary or confidential documents, you need to ensure their security during the retrieval process.
So, in this post, we will build a fully local RAG application to avoid sending private information to the LLM. We'll use PostgreSQL to store documents and Ollama to host a local model like Mistral.
In this section, we will discuss RAG and the tools required to build it locally. We will use PostgreSQL and Ollama to build RAG. PostgreSQL will act as a vector storage house, while Ollama provides us with embedding and generation models.
RAG, or retrieval-augmented generation, combines AI models with search algorithms to retrieve information from external sources and incorporate it into a pre-trained LLM.
RAG takes input, retrieves relevant documents (e.g., from Wikipedia), and concatenates them with the input prompt. This combined context is fed to a text generator, producing the final output. This approach adapts to evolving facts, bypassing the need for retraining static LLMs and enabling access to up-to-date information.
RAG has now emerged as a go-to solution for question-answering tasks in natural language processing (NLP). However, many enterprises hesitate due to data privacy and security concerns. Building RAG locally addresses these issues and offers several advantages:
RAG serves many applications, including:
Let’s shift our focus to the technologies required to build the local RAG system:
Unlike other RDBMS, or relational database management systems, PostgreSQL supports both relational and non-relational data types, making it one of the most compliant, stable, and mature relational databases available today. Here are some of its features:
Building RAG with PostgreSQL was challenging due to its component requirements until pgai. Pgai builds on top of pgvector and simplifies RAG implementation by integrating embedding and generative AI models with PostgreSQL. This way, no additional libraries are required to get the embeddings and invoke a generation model. All will be managed by PostgreSQL and in PostgreSQL. We will use pgai in our RAG application workflow for this tutorial.
Mistral 7B is a 7.3B parameter model that outperforms Llama 2 13B on all benchmarks and Llama 1 34B on many benchmarks. It excels in code tasks, approaching CodeLlama 7B performance while remaining strong in English tasks.
Mistral 7B uses grouped-query attention for faster inference and sliding window attention for handling longer sequences efficiently. Released under the Apache 2.0 license, it can be used without restrictions, downloaded, and deployed on any cloud. It’s easy to fine-tune and includes a chat model that surpasses Llama 2 13B in performance.
Its small size (~4.1 GB) makes it appropriate for this tutorial, as it can work on consumer-based machines without requiring a GPU.
Ollama (Omni-Layer Learning Language Acquisition Model) redefines language acquisition and NLP by enabling users to run LLMs locally on their machines. Developed to empower individuals and organizations, Ollama offers a user-friendly interface and seamless integration capabilities.
To build local RAG, we need two things: PostgreSQL with pgai and models hosted on Ollama. Let’s start with setting up PostgreSQL:
In this tutorial, we will use the fastest way to set up PostgreSQL, which is using a pre-built docker container.
To connect to a PostgreSQL installation on Docker, you need:
First, let’s create a Docker network that TimescaleDB and Ollama containers will use to communicate.
docker network create local-rag
Follow these steps to install PostgreSQL and TimescaleDB using Timescale's provided packages:
docker pull timescale/timescaledb-ha:pg16
The TimescaleDB extension is pre-created in the default postgres
database in both the -ha
and non-ha
Docker images. TimescaleDB is added to any new database you create in these images by default.
docker run -d --network local-rag --name timescaledb -p 5432:5432 -e POSTGRES_PASSWORD=password timescale/timescaledb-ha:pg16
Here, the --network local-rag
option ensures that the container uses the Docker network created above. Remember to change the port for each Docker instance if multiple container instances are running.
psql
is not installed on the system, follow this guide. For the timescaledb-ha:pg16
image, the default user and database are both postgres
. The default command to connect to PostgreSQL in this image is:psql -d "postgres://postgres:password@localhost:5432/postgres"
CREATE EXTENSION IF NOT EXISTS "ai" VERSION '0.4.0' CASCADE;
The CASCADE
option automatically installs the pgvector
and plpython3u
extensions. After installation, you can ensure that the correct versions were installed successfully by listing all extensions using this command in psql
:
\dx
Then exit the connection using the following command:
\q
Let’s look at setting up Ollama.
The following steps will set up Ollama on the local machine:
In this tutorial, we'll install Ollama using Docker following this guide. We first pull the official Ollama Docker image:
docker pull ollama/ollama
Then, run an Ollama container on the same network as our TimescaleDB container (i.e., local–rag
) and remember to note the ports.
docker run -d --network local-rag -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama pull mistral
(The model size is 7B, so downloading takes a few minutes.)
For the nomic-embed-text-v1.5 model, the embedding model we are using:
docker exec -it ollama ollama pull nomic-embed-text
You can also simplify these commands by using the following command:
ollama pull mistral or ollama pull nomic-embed-text
ollama run mistral
This starts an Ollama REPL where you can interact with the Mistral model.
We have completed the setup; let’s start developing now.
Before going into the nitty-gritty of the details, let’s take a look at the architecture:
Here’s a step-by-step explanation of the process, following the stages in the architecture:
1. Documents: The process begins with collecting documents that must be indexed and stored.
2. Data indexing: These documents are indexed and stored in a vector database. This database uses a PostgreSQL instance hosted on Timescale.
3. Query: A user query is received, which initiates the retrieval and generation process.
4. Embedding models: The query is sent to the embedding models running on ollama:11434
. These models process the query to generate embeddings, which are numerical representations of the query's content. In this tutorial, we will use nomic-embed-text-v1.5, a high-performing embedding model with a large token context window. The embedding dimension is 768
.
5. Vector database: The generated embeddings are then used to search the vector database for the top results that are most relevant to the query.
6. Top results: The top results from the vector database are retrieved and passed to the generation models.
7. Generation models: The generation models, also running on ollama:11434
and managed by Ollama, take the top results and generate a final response. The model used will be Mistral, a 7B model.
8. Result: The generated response is returned as the result, completing the query process.
This architecture ensures data is processed locally, leveraging both embedding and generation models to provide accurate and relevant responses to user queries.
In this section, we will develop the components one by one with an explanation of the code. You can also access the code in this Jupyter Notebook.
The required library psycopg2
is imported to handle PostgreSQL database operations.
%pip install psycopg2-binary
import psycopg2
In this tutorial, we will create dummy data of different locations with their description. Here’s what it looks like:
dummy_data = [
{"title": "Seoul Tower", "content": "Seoul Tower is a communication and observation tower located on Namsan Mountain in central Seoul, South Korea."},
{"title": "Gwanghwamun Gate", "content": "Gwanghwamun is the main and largest gate of Gyeongbokgung Palace, in Jongno-gu, Seoul, South Korea."},
{"title": "Bukchon Hanok Village", "content": "Bukchon Hanok Village is a Korean traditional village in Seoul with a long history."},
{"title": "Myeong-dong Shopping Street", "content": "Myeong-dong is one of the primary shopping districts in Seoul, South Korea."},
{"title": "Dongdaemun Design Plaza", "content": "The Dongdaemun Design Plaza is a major urban development landmark in Seoul, South Korea."}
]
In a real-world scenario, the data might be confidential and private documents for a local RAG application.
This component establishes a connection to the PostgreSQL database using the psycopg2
library.
def connect_db():
return psycopg2.connect(
host="your_host",
database="your_database",
user="your_user",
password="your_password"
port="your_port"
)
We can extract the information from the connection string as follows:
postgres://postgres:password@localhost:5432/postgres
Username: postgres
Password: password
Host: localhost
Port: 5432
Database name: postgres
We can also pass the connection string to the function like this:
def connect_db():
return psycopg2.connect(
"postgres://postgres:password@localhost:5432/postgres")
This component creates the documents table if it does not already exist. The table includes columns for id
, title
, content
, and embedding
.
with connect_db() as conn:
with conn.cursor() as cur:
cur.execute("""
CREATE TABLE IF NOT EXISTS documents (
id SERIAL PRIMARY KEY,
title TEXT,
content TEXT,
embedding VECTOR(768)
);
""")
This component inserts data into the documents table. It first inserts the title
and content
then updates the embedding
field using the ollama_embed
function from pgai.
with connect_db() as conn:
with conn.cursor() as cur:
for doc in dummy_data:
cur.execute("""
INSERT INTO documents (title, content, embedding)
VALUES (
%(title)s,
%(content)s,
ai.ollama_embed(
'nomic-embed-text',
concat(%(title)s, ' - ', %(content)s),
host=>'http://ollama:11434'
)
);
""", doc)
The most advantageous aspect is that the generation of embeddings occurs directly within the database, eliminating the need for any additional libraries.
This component fetches and prints all data from the documents table, including the dimensions of the embeddings. This way, we can ensure the data insertion:
with connect_db() as conn:
with conn.cursor() as cur:
cur.execute("""
SELECT title, content, embedding
FROM documents;
""")
rows = cur.fetchall()
for row in rows:
print(f"Title: {row[0]}\nContent: {row[1]}\nEmbedding: {row[2][0:150]}...\n\n")
This component takes a query, embeds it, retrieves the most relevant documents based on cosine similarity, and generates a response using the ollama_generate
function.
query = 'Tell me about landmarks in Seoul.'
with connect_db() as conn:
with conn.cursor() as cur:
# Embed the query using the ollama_embed function
cur.execute("""
SELECT ai.ollama_embed('nomic-embed-text', %s, host=>'http://ollama:11434');
""", (query,))
query_embedding = cur.fetchone()[0]
# Retrieve relevant documents based on cosine distance
cur.execute("""
SELECT title, content, 1 - (embedding <=> %s) AS similarity
FROM documents
ORDER BY similarity DESC
LIMIT 3;
""", (query_embedding,))
rows = cur.fetchall()
# Prepare the context for generating the response
context = "\n\n".join([f"Landmark: {row[0]}\nDescription: {row[1]}" for row in rows])
# Generate the response using the ollama_generate function
cur.execute("""
SELECT ai.ollama_generate('mistral', %s, host=>'http://ollama:11434');
""", (f"Query: {query}\nContext: {context}",))
model_response = cur.fetchone()[0]
print(model_response['response'])
8. Results
After querying, here is the result:
>> Seoul is home to several iconic landmarks that are must-visit attractions. One of the prominent landmarks is Seoul Tower, also known as Namsan Tower, which offers panoramic views of the city from its vantage point on Namsan Mountain. Another significant landmark is Gwanghwamun Gate, the main and largest gate of Gyeongbokgung Palace, showcasing traditional Korean architecture and historical significance.
Here we have it: our very own local RAG system powered by PostgreSQL and Ollama. No data is sent to public LLMs, ensuring complete privacy and security for your information.
Our friend Dave Ebbelaar has also put together a great video on this topic, so be sure to check it out.
Large language models rely on vast data but can hallucinate when queried with uncommon topics. RAG combines information retrieval with text generation to mitigate this, allowing for efficient updates without full retraining. However, standard RAG methods often send data to external LLMs, risking confidentiality breaches. Local RAG addresses this challenge by processing and generating responses entirely within a secure local environment, ensuring data privacy and security.
In this article, we created a local RAG application using PostgreSQL with pgai, Mistral, and Ollama. With the provided code, you can develop your own RAG application for note-taking, code comprehension, or simply as a personal AI on your laptop.
Start using pgai to simplify your AI pipelines and use AI within the database—no need to resort to other libraries. To get started, head over to the pgai GitHub repo and check out our tutorial to learn more about Ollama and pgai so you can start building more private, lower-cost AI applications today.
Create a free Timescale account to access pgvector, pgai, and pgvectorscale in one place—with the full benefits of a mature cloud PostgreSQL platform.