Build a Fully Local RAG App With PostgreSQL, Mistral, and Ollama
Large language models (LLMs) are data-intensive systems that generate responses by analyzing vast amounts of information. However, as statistical models, they don't truly understand it. For example, LLMs don't inherently "know" that the sun rises in the east; they simply generate this fact based on patterns in the data they've been trained on. When faced with less common queries, LLMs may produce inaccurate or fabricated information, a problem known as hallucination.
Retrieval-augmented generation (RAG) helps mitigate hallucinations by combining information retrieval with text generation. Instead of relying solely on the model's internal knowledge, RAG pulls relevant documents or data to enhance the response. However, when using proprietary or confidential documents, you need to ensure their security during the retrieval process.
So, in this post, we will build a fully local RAG application to avoid sending private information to the LLM. We'll use PostgreSQL to store documents and Ollama to host a local model like Mistral.
How to Build a Local RAG Application: Definition and Tools
In this section, we will discuss RAG and the tools required to build it locally. We will use PostgreSQL and Ollama to build RAG. PostgreSQL will act as a vector storage house, while Ollama provides us with embedding and generation models.
What is retrieval-augmented generation?
RAG, or retrieval-augmented generation, combines AI models with search algorithms to retrieve information from external sources and incorporate it into a pre-trained LLM.
RAG takes input, retrieves relevant documents (e.g., from Wikipedia), and concatenates them with the input prompt. This combined context is fed to a text generator, producing the final output. This approach adapts to evolving facts, bypassing the need for retraining static LLMs and enabling access to up-to-date information.
RAG has now emerged as a go-to solution for question-answering tasks in natural language processing (NLP). However, many enterprises hesitate due to data privacy and security concerns. Building RAG locally addresses these issues and offers several advantages:
- Privacy: keeps sensitive information on local hardware, ensuring data security
- Latency: reduces response times by processing data locally
- Control: provides greater control over data and model management
- Cost: eliminates costs associated with cloud-based services
- Customization: allows for tailored adjustments and fine-tuning specific to local needs
- Reliability: operates independently of internet connectivity or remote server availability
RAG serves many applications, including:
- Healthcare: RAG models assist in diagnosing diseases by retrieving and summarizing relevant medical literature, enhancing decision-making for healthcare professionals.
- Journalism: News agencies use RAG to automatically generate articles and summarize reports, improving efficiency and content accuracy.
- Customer service: Chatbots use RAG to fetch relevant information, providing accurate and context-aware responses to customer inquiries.
- Software development: Tools like OpenSauced use RAG to build a "copilot" for git history. By retrieving relevant code snippets, commit messages and documentation, developers can gain insights into past decisions and better understand the evolution of their projects. Learn more about how OpenSauced leverages RAG with Timescale and pgvector.
Introduction to PostgreSQL, Mistral, and Ollama
Let’s shift our focus to the technologies required to build the local RAG system:
PostgreSQL
Unlike other RDBMS, or relational database management systems, PostgreSQL supports both relational and non-relational data types, making it one of the most compliant, stable, and mature relational databases available today. Here are some of its features:
- Multi-version concurrency control for concurrent reads and writes
- Highly scalable with Unicode and international character sets
- Cross-platform compatibility (Linux, Windows, OSX, FreeBSD, Solaris)
- Dynamic loading for user-written code
- Full ACID (Atomicity, Consistency, Isolation, and Durability) compliance
- Extensible with custom functions and procedural languages
- Rich data types (arrays, JSON, PostGIS)
- Built-in replication and partitioning for large datasets
- Versatility via a rich ecosystem of extensions: for example, pgvector turns PostgreSQL into a vector database
Building RAG with PostgreSQL was challenging due to its component requirements until pgai. Pgai builds on top of pgvector and simplifies RAG implementation by integrating embedding and generative AI models with PostgreSQL. This way, no additional libraries are required to get the embeddings and invoke a generation model. All will be managed by PostgreSQL and in PostgreSQL. We will use pgai in our RAG application workflow for this tutorial.
Mistral
Mistral 7B is a 7.3B parameter model that outperforms Llama 2 13B on all benchmarks and Llama 1 34B on many benchmarks. It excels in code tasks, approaching CodeLlama 7B performance while remaining strong in English tasks.
Mistral 7B uses grouped-query attention for faster inference and sliding window attention for handling longer sequences efficiently. Released under the Apache 2.0 license, it can be used without restrictions, downloaded, and deployed on any cloud. It’s easy to fine-tune and includes a chat model that surpasses Llama 2 13B in performance.
Its small size (~4.1 GB) makes it appropriate for this tutorial, as it can work on consumer-based machines without requiring a GPU.
Ollama
Ollama (Omni-Layer Learning Language Acquisition Model) redefines language acquisition and NLP by enabling users to run LLMs locally on their machines. Developed to empower individuals and organizations, Ollama offers a user-friendly interface and seamless integration capabilities.
- Local execution: LLMs are run locally, ensuring data privacy, faster processing, and reduced reliance on external servers. With the help of pgai, the models can run from PostgreSQL.
- Extensive model library: Access to a vast library of pre-trained LLMs, including popular models like Llama 3, tailored for various tasks and hardware.
- Seamless integration: Integrates with tools like Python, LangChain, pgai, and LlamaIndex for easy AI application development.
- Customization and fine-tuning: Allows users to customize and fine-tune LLMs for specific needs through prompt engineering and few-shot learning.
Setup
To build local RAG, we need two things: PostgreSQL with pgai and models hosted on Ollama. Let’s start with setting up PostgreSQL:
Setting up PostgreSQL
In this tutorial, we will use the fastest way to set up PostgreSQL, which is using a pre-built docker container.
Prerequisites
To connect to a PostgreSQL installation on Docker, you need:
Create a docker network
First, let’s create a Docker network that TimescaleDB and Ollama containers will use to communicate.
docker network create local-rag
Install and configure TimescaleDB on PostgreSQL
Follow these steps to install PostgreSQL and TimescaleDB using Timescale's provided packages:
- Pull the TimescaleDB Docker image. Here’s how to pull the Docker image:
docker pull timescale/timescaledb-ha:pg16
The TimescaleDB extension is pre-created in the default postgres
database in both the -ha
and non-ha
Docker images. TimescaleDB is added to any new database you create in these images by default.
- Run the Docker container. To run the Docker container, do the following:
docker run -d --network local-rag --name timescaledb -p 5432:5432 -e POSTGRES_PASSWORD=password timescale/timescaledb-ha:pg16
Here, the --network local-rag
option ensures that the container uses the Docker network created above. Remember to change the port for each Docker instance if multiple container instances are running.
- Connect to a database on your PostgreSQL instance. If
psql
is not installed on the system, follow this guide. For thetimescaledb-ha:pg16
image, the default user and database are bothpostgres
. The default command to connect to PostgreSQL in this image is:
psql -d "postgres://postgres:password@localhost:5432/postgres"
- Installing pgai: use the following query to install the ai extension (latest version: 0.4.0).
CREATE EXTENSION IF NOT EXISTS "ai" VERSION '0.4.0' CASCADE;
The CASCADE
option automatically installs the pgvector
and plpython3u
extensions. After installation, you can ensure that the correct versions were installed successfully by listing all extensions using this command in psql
:
\dx
Then exit the connection using the following command:
\q
Let’s look at setting up Ollama.
Setting up Ollama
The following steps will set up Ollama on the local machine:
- Download Ollama. Ollama supports MacOS, Windows, and Linux. You can download it from the official GitHub repo or the official website.
In this tutorial, we'll install Ollama using Docker following this guide. We first pull the official Ollama Docker image:
docker pull ollama/ollama
Then, run an Ollama container on the same network as our TimescaleDB container (i.e., local–rag
) and remember to note the ports.
docker run -d --network local-rag -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
- Get the model. Visit the model library to see all available models. The default download is the latest model. For the Mistral model, the generation model we are using:
docker exec -it ollama ollama pull mistral
(The model size is 7B, so downloading takes a few minutes.)
For the nomic-embed-text-v1.5 model, the embedding model we are using:
docker exec -it ollama ollama pull nomic-embed-text
You can also simplify these commands by using the following command:
ollama pull mistral or ollama pull nomic-embed-text
- Run the model. Run the model with:
ollama run mistral
This starts an Ollama REPL where you can interact with the Mistral model.
Development of Local RAG
We have completed the setup; let’s start developing now.
Architecture overview
Before going into the nitty-gritty of the details, let’s take a look at the architecture:
Here’s a step-by-step explanation of the process, following the stages in the architecture:
1. Documents: The process begins with collecting documents that must be indexed and stored.
2. Data indexing: These documents are indexed and stored in a vector database. This database uses a PostgreSQL instance hosted on Timescale.
3. Query: A user query is received, which initiates the retrieval and generation process.
4. Embedding models: The query is sent to the embedding models running on ollama:11434
. These models process the query to generate embeddings, which are numerical representations of the query's content. In this tutorial, we will use nomic-embed-text-v1.5, a high-performing embedding model with a large token context window. The embedding dimension is 768
.
5. Vector database: The generated embeddings are then used to search the vector database for the top results that are most relevant to the query.
6. Top results: The top results from the vector database are retrieved and passed to the generation models.
7. Generation models: The generation models, also running on ollama:11434
and managed by Ollama, take the top results and generate a final response. The model used will be Mistral, a 7B model.
8. Result: The generated response is returned as the result, completing the query process.
This architecture ensures data is processed locally, leveraging both embedding and generation models to provide accurate and relevant responses to user queries.
Development
In this section, we will develop the components one by one with an explanation of the code. You can also access the code in this Jupyter Notebook.
- Install and import libraries
The required library psycopg2
is imported to handle PostgreSQL database operations.
%pip install psycopg2-binary
import psycopg2
- Data preparation
In this tutorial, we will create dummy data of different locations with their description. Here’s what it looks like:
dummy_data = [
{"title": "Seoul Tower", "content": "Seoul Tower is a communication and observation tower located on Namsan Mountain in central Seoul, South Korea."},
{"title": "Gwanghwamun Gate", "content": "Gwanghwamun is the main and largest gate of Gyeongbokgung Palace, in Jongno-gu, Seoul, South Korea."},
{"title": "Bukchon Hanok Village", "content": "Bukchon Hanok Village is a Korean traditional village in Seoul with a long history."},
{"title": "Myeong-dong Shopping Street", "content": "Myeong-dong is one of the primary shopping districts in Seoul, South Korea."},
{"title": "Dongdaemun Design Plaza", "content": "The Dongdaemun Design Plaza is a major urban development landmark in Seoul, South Korea."}
]
In a real-world scenario, the data might be confidential and private documents for a local RAG application.
- Database connection component
This component establishes a connection to the PostgreSQL database using the psycopg2
library.
def connect_db():
return psycopg2.connect(
host="your_host",
database="your_database",
user="your_user",
password="your_password"
port="your_port"
)
We can extract the information from the connection string as follows:
postgres://postgres:password@localhost:5432/postgres
Username: postgres
Password: password
Host: localhost
Port: 5432
Database name: postgres
We can also pass the connection string to the function like this:
def connect_db():
return psycopg2.connect(
"postgres://postgres:password@localhost:5432/postgres")
- Table creation component
This component creates the documents table if it does not already exist. The table includes columns for id
, title
, content
, and embedding
.
with connect_db() as conn:
with conn.cursor() as cur:
cur.execute("""
CREATE TABLE IF NOT EXISTS documents (
id SERIAL PRIMARY KEY,
title TEXT,
content TEXT,
embedding VECTOR(768)
);
""")
- Data insertion component
This component inserts data into the documents table. It first inserts the title
and content
then updates the embedding
field using the ollama_embed
function from pgai.
with connect_db() as conn:
with conn.cursor() as cur:
for doc in dummy_data:
cur.execute("""
INSERT INTO documents (title, content, embedding)
VALUES (
%(title)s,
%(content)s,
ai.ollama_embed(
'nomic-embed-text',
concat(%(title)s, ' - ', %(content)s),
host=>'http://ollama:11434'
)
);
""", doc)
The most advantageous aspect is that the generation of embeddings occurs directly within the database, eliminating the need for any additional libraries.
- Data fetching component
This component fetches and prints all data from the documents table, including the dimensions of the embeddings. This way, we can ensure the data insertion:
with connect_db() as conn:
with conn.cursor() as cur:
cur.execute("""
SELECT title, content, embedding
FROM documents;
""")
rows = cur.fetchall()
for row in rows:
print(f"Title: {row[0]}\nContent: {row[1]}\nEmbedding: {row[2][0:150]}...\n\n")
- Retrieve and generate response component
This component takes a query, embeds it, retrieves the most relevant documents based on cosine similarity, and generates a response using the ollama_generate
function.
query = 'Tell me about landmarks in Seoul.'
with connect_db() as conn:
with conn.cursor() as cur:
# Embed the query using the ollama_embed function
cur.execute("""
SELECT ai.ollama_embed('nomic-embed-text', %s, host=>'http://ollama:11434');
""", (query,))
query_embedding = cur.fetchone()[0]
# Retrieve relevant documents based on cosine distance
cur.execute("""
SELECT title, content, 1 - (embedding <=> %s) AS similarity
FROM documents
ORDER BY similarity DESC
LIMIT 3;
""", (query_embedding,))
rows = cur.fetchall()
# Prepare the context for generating the response
context = "\n\n".join([f"Landmark: {row[0]}\nDescription: {row[1]}" for row in rows])
# Generate the response using the ollama_generate function
cur.execute("""
SELECT ai.ollama_generate('mistral', %s, host=>'http://ollama:11434');
""", (f"Query: {query}\nContext: {context}",))
model_response = cur.fetchone()[0]
print(model_response['response'])
8. Results
After querying, here is the result:
>> Seoul is home to several iconic landmarks that are must-visit attractions. One of the prominent landmarks is Seoul Tower, also known as Namsan Tower, which offers panoramic views of the city from its vantage point on Namsan Mountain. Another significant landmark is Gwanghwamun Gate, the main and largest gate of Gyeongbokgung Palace, showcasing traditional Korean architecture and historical significance.
Here we have it: our very own local RAG system powered by PostgreSQL and Ollama. No data is sent to public LLMs, ensuring complete privacy and security for your information.
Our friend Dave Ebbelaar has also put together a great video on this topic, so be sure to check it out.
Conclusion
Large language models rely on vast data but can hallucinate when queried with uncommon topics. RAG combines information retrieval with text generation to mitigate this, allowing for efficient updates without full retraining. However, standard RAG methods often send data to external LLMs, risking confidentiality breaches. Local RAG addresses this challenge by processing and generating responses entirely within a secure local environment, ensuring data privacy and security.
In this article, we created a local RAG application using PostgreSQL with pgai, Mistral, and Ollama. With the provided code, you can develop your own RAG application for note-taking, code comprehension, or simply as a personal AI on your laptop.
Start using pgai to simplify your AI pipelines and use AI within the database—no need to resort to other libraries. To get started, head over to the pgai GitHub repo and check out our tutorial to learn more about Ollama and pgai so you can start building more private, lower-cost AI applications today.
Create a free Timescale account to access pgvector, pgai, and pgvectorscale in one place—with the full benefits of a mature cloud PostgreSQL platform.