Build a Fully Local RAG App With PostgreSQL, Mistral, and Ollama

Large language models (LLMs) are trained on vast amounts of data and generate responses based on the patterns they learn from it. However, as statistical models, they don't truly understand that data. For example, LLMs don't inherently "know" that the sun rises in the east; they simply reproduce this fact because it appears consistently in their training data. When faced with less common queries, LLMs may produce inaccurate or fabricated information, a problem known as hallucination.

Retrieval-augmented generation (RAG) helps mitigate hallucinations by combining information retrieval with text generation. Instead of relying solely on the model's internal knowledge, RAG pulls relevant documents or data to enhance the response. However, when using proprietary or confidential documents, you need to ensure their security during the retrieval process.

So, in this post, we will build a fully local RAG application to avoid sending private information to an external LLM provider. We'll use PostgreSQL to store documents and Ollama to host a local model like Mistral.

How to Build a Local RAG Application: Definition and Tools

In this section, we will discuss RAG and the tools required to build it locally. We will use PostgreSQL and Ollama: PostgreSQL will act as the vector store, while Ollama provides the embedding and generation models.

What is retrieval-augmented generation?

RAG, or retrieval-augmented generation, combines search and retrieval techniques with AI models: it fetches information from external sources and supplies it as additional context to a pre-trained LLM.

RAG takes input, retrieves relevant documents (e.g., from Wikipedia), and concatenates them with the input prompt. This combined context is fed to a text generator, producing the final output. This approach adapts to evolving facts, bypassing the need for retraining static LLMs and enabling access to up-to-date information.
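To make this concrete, here is a minimal, purely illustrative sketch of that loop in Python. The retrieve() and generate() helpers are hypothetical placeholders standing in for the vector search and LLM call we will build later in this post:

def retrieve(query, top_k=5):
    # Hypothetical placeholder: a real implementation would run a vector
    # similarity search over stored documents.
    return ["(retrieved document 1)", "(retrieved document 2)"][:top_k]

def generate(prompt):
    # Hypothetical placeholder: a real implementation would call an LLM
    # such as Mistral.
    return f"(answer generated from: {prompt[:40]}...)"

def rag_answer(query):
    documents = retrieve(query)                      # 1. retrieve relevant documents
    context = "\n\n".join(documents)                 # 2. concatenate them with the prompt
    prompt = f"Query: {query}\nContext: {context}"
    return generate(prompt)                          # 3. generate the final output

print(rag_answer("Where does the sun rise?"))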

RAG has now emerged as a go-to solution for question-answering tasks in natural language processing (NLP). However, many enterprises hesitate due to data privacy and security concerns. Building RAG locally addresses these issues and offers several advantages:

  • Privacy: keeps sensitive information on local hardware, ensuring data security
  • Latency: reduces response times by processing data locally
  • Control: provides greater control over data and model management
  • Cost: eliminates costs associated with cloud-based services
  • Customization: allows for tailored adjustments and fine-tuning specific to local needs
  • Reliability: operates independently of internet connectivity or remote server availability

RAG serves many applications, including:

  • Healthcare: RAG models assist in diagnosing diseases by retrieving and summarizing relevant medical literature, enhancing decision-making for healthcare professionals.
  • Journalism: News agencies use RAG to automatically generate articles and summarize reports, improving efficiency and content accuracy.
  • Customer service: Chatbots use RAG to fetch relevant information, providing accurate and context-aware responses to customer inquiries.
  • Software development: Tools like OpenSauced use RAG to build a "copilot" for git history. By retrieving relevant code snippets, commit messages and documentation, developers can gain insights into past decisions and better understand the evolution of their projects. Learn more about how OpenSauced leverages RAG with Timescale and pgvector.

Introduction to PostgreSQL, Mistral, and Ollama

Let’s shift our focus to the technologies required to build the local RAG system:

PostgreSQL

Unlike many other relational database management systems (RDBMS), PostgreSQL supports both relational and non-relational data types, and it is one of the most standards-compliant, stable, and mature relational databases available today. Here are some of its features:

  • Multi-version concurrency control for concurrent reads and writes
  • Highly scalable with Unicode and international character sets
  • Cross-platform compatibility (Linux, Windows, macOS, FreeBSD, Solaris)
  • Dynamic loading for user-written code
  • Full ACID (Atomicity, Consistency, Isolation, and Durability) compliance
  • Extensible with custom functions and procedural languages
  • Rich data types (arrays, JSON, PostGIS)
  • Built-in replication and partitioning for large datasets
  • Versatility via a rich ecosystem of extensions: for example, pgvector turns PostgreSQL into a vector database

Building RAG with PostgreSQL used to require stitching together several external components; pgai changes that. Pgai builds on top of pgvector and simplifies RAG implementation by integrating embedding and generative AI models with PostgreSQL. This way, no additional libraries are required to generate embeddings or invoke a generation model: everything is managed by PostgreSQL, in PostgreSQL. We will use pgai in our RAG application workflow for this tutorial.
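To give a flavor of what this looks like in practice, here is a minimal sketch, assuming the setup described later in this post (the ai extension installed in PostgreSQL and Ollama serving the models on localhost:11434). Both embedding and text generation become plain SQL function calls:

import psycopg2

# Minimal sketch; assumes the database and Ollama setup described later in this post.
conn = psycopg2.connect("postgres://postgres:password@localhost:5432/postgres")
cur = conn.cursor()

# Embedding happens inside PostgreSQL; no client-side embedding library is needed.
cur.execute("SELECT ollama_embed('nomic-embed-text', 'PostgreSQL is a database');")
embedding = cur.fetchone()[0]

# Text generation is just another SQL function call.
cur.execute("SELECT ollama_generate('mistral', 'Explain RAG in one sentence.');")
print(cur.fetchone()[0])

cur.close()
conn.close()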

Mistral

Mistral 7B is a 7.3B parameter model that outperforms Llama 2 13B on all benchmarks and Llama 1 34B on many benchmarks. It excels in code tasks, approaching CodeLlama 7B performance while remaining strong in English tasks.

Performance comparison of Mistral with other models on different benchmarks (Source)

Mistral 7B uses grouped-query attention for faster inference and sliding window attention for handling longer sequences efficiently. Released under the Apache 2.0 license, it can be used without restrictions, downloaded, and deployed on any cloud. It’s easy to fine-tune and includes a chat model that surpasses Llama 2 13B in performance.

Its small size (~4.1 GB) makes it appropriate for this tutorial, as it can run on consumer-grade machines without requiring a GPU.

Ollama

Ollama enables users to run LLMs locally on their own machines. Developed to empower individuals and organizations, Ollama offers a user-friendly interface and seamless integration capabilities.

  • Local execution: LLMs run locally, ensuring data privacy, faster processing, and reduced reliance on external servers. With the help of pgai, these models can be invoked directly from PostgreSQL.
  • Extensive model library: Access to a vast library of pre-trained LLMs, including popular models like Llama 3, tailored for various tasks and hardware.
  • Seamless integration: Integrates with tools like Python, LangChain, pgai, and LlamaIndex for easy AI application development.
  • Customization and fine-tuning: Allows users to customize and fine-tune LLMs for specific needs through prompt engineering and few-shot learning.

Setup

To build local RAG, we need two things: PostgreSQL with pgai and models hosted on Ollama. Let’s start with setting up PostgreSQL:

Setting up PostgreSQL

In this tutorial, we will use the fastest way to set up PostgreSQL, which is using a pre-built Docker image.

Prerequisites

To connect to a PostgreSQL installation on Docker, you need:

  • Docker
  • psql

Install and configure TimescaleDB on PostgreSQL

Follow these steps to install PostgreSQL and TimescaleDB using Timescale's pre-built Docker image:

  1. Pull the TimescaleDB Docker image:
docker pull timescale/timescaledb-ha:pg16

The TimescaleDB extension is pre-created in the default postgres database in both the -ha and non-ha Docker images. TimescaleDB is added to any new database you create in these images by default.

  2. Run the Docker container:
docker run -d --name timescaledb --network host -e POSTGRES_PASSWORD=password timescale/timescaledb-ha:pg16

Here, the --network host option makes the container use the host's network stack: the container shares the host's IP address and network interfaces and behaves as if it were running directly on the host rather than inside an isolated Docker network. This means PostgreSQL listens directly on the host's port 5432 (no -p port mapping is needed), and it will help later when pgai connects to the Ollama service on localhost.

If you run multiple container instances, make sure each PostgreSQL instance listens on a different port.

  3. Connect to a database on your PostgreSQL instance. For the timescaledb-ha:pg16 image, the default user and database are both postgres, so the default way to connect is:
psql -d "postgres://postgres:password@localhost:5432/postgres"

If psql is not installed on the system, follow this guide. A successful connection confirms that the previous steps were completed correctly.

  4. Install pgai. After connecting to the database, check whether the ai extension is already installed:
\dx+ ai

If ai is not listed among the installed extensions, install it with the following query:

CREATE EXTENSION IF NOT EXISTS ai CASCADE;

The CASCADE option automatically installs the pgvector and plpython3u extensions as well. After installation, exit the psql session.

  5. Open a shell inside the container. We will build our RAG application inside the Docker container, so run an interactive shell in it and create the application there using Python:
docker exec -it timescaledb bash

Inside the shell, create a Python file, which we will fill in later:

touch RAG.py

  6. Install the required library. psycopg2 is needed to create a Python client that communicates with PostgreSQL. Since the Docker image comes pre-packaged with Python and pip, you can install it easily (if the build fails, psycopg2-binary is a drop-in alternative):
pip install psycopg2

Let’s look at setting up Ollama.

Setting up Ollama

The following steps will set up Ollama on the local machine:

  1. Download Ollama. Ollama supports macOS, Windows, and Linux. You can download it from the official GitHub repo or the official website.
Ollama homepage

For Linux (Ubuntu):

curl -fsSL https://ollama.com/install.sh | sh

Ensure the NVIDIA/AMD GPU drivers are installed for optimal performance. The CPU-only mode works but is slower.

  2. Get the models. Visit the model library to see all available models; by default, the latest version of a model is downloaded. Pull the Mistral model:
ollama pull mistral

Mistral has 7B parameters (roughly 4.1 GB), so the download may take a few minutes. We will also need the embedding model used later in this tutorial, which is pulled the same way:

ollama pull nomic-embed-text

  3. Run the model. Start it with:
ollama run mistral

This starts an Ollama REPL where you can interact with the Mistral model. 

  4. Serve the model. By default, Ollama serves models at localhost:11434. You can verify this with a curl request:
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt":"What is PostgreSQL?"
}'
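The same check can be done from Python using only the standard library. Note that, by default, the /api/generate endpoint streams its output as a sequence of JSON lines; the sketch below passes "stream": false to receive a single JSON response instead:

import json
import urllib.request

# A small Python equivalent of the curl check above (standard library only).
payload = json.dumps({
    "model": "mistral",
    "prompt": "What is PostgreSQL?",
    "stream": False,  # return one JSON object instead of a stream of JSON lines
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    print(body["response"])  # the generated text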

Development of Local RAG

We have completed the setup; let’s start developing now.

Architecture overview

Before diving into the details, let's take a look at the architecture:

Architecture diagram for local RAG application using PostgreSQL and Ollama

Here’s a step-by-step explanation of the process, following the stages in the architecture:

1. Documents: The process begins with collecting documents that must be indexed and stored.

2. Data indexing: These documents are indexed and stored in a vector database. In our setup, this is the local PostgreSQL instance (with pgvector) running in the Docker container we created earlier.

3. Query: A user query is received, which initiates the retrieval and generation process.

4. Embedding models: The query is sent to the embedding model running on localhost:11434. The model processes the query to generate an embedding, a numerical representation of the query's content. In this tutorial, we will use nomic-embed-text (v1.5), a high-performing embedding model with a large token context window. The embedding dimension is 768 (a quick way to verify this is sketched after this overview).

5. Vector database: The generated embeddings are then used to search the vector database for the top results that are most relevant to the query.

6. Top results: The top results from the vector database are retrieved and passed to the generation models.

7. Generation models: The generation models, also running on localhost:11434 and managed by Ollama, take the top results and generate a final response. The model used will be Mistral, a 7B model.

8. Result: The generated response is returned as the result, completing the query process.

This architecture ensures data is processed locally, leveraging both embedding and generation models to provide accurate and relevant responses to user queries.
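As a quick sanity check for step 4, once the setup from the previous section is in place, you can confirm the 768-dimension embeddings directly from SQL: pgai's ollama_embed returns a pgvector value, which pgvector's vector_dims function can inspect. A minimal sketch:

import psycopg2

# Assumes the local setup from above: the ai extension installed and Ollama
# serving the nomic-embed-text model on localhost:11434.
conn = psycopg2.connect("postgres://postgres:password@localhost:5432/postgres")
cur = conn.cursor()

cur.execute("SELECT vector_dims(ollama_embed('nomic-embed-text', 'hello world'));")
print(cur.fetchone()[0])  # expected output: 768

cur.close()
conn.close()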

Development

In this section, we will develop the components one by one, explaining the code as we go:

  1. Importing libraries

The required library psycopg2 is imported to handle PostgreSQL database operations.

import psycopg2

  2. Data preparation

In this tutorial, we will create dummy data describing different locations. Here's what it looks like:

dummy_data = [
    {"title": "Seoul Tower", "content": "Seoul Tower is a communication and observation tower located on Namsan Mountain in central Seoul, South Korea."},
    {"title": "Gwanghwamun Gate", "content": "Gwanghwamun is the main and largest gate of Gyeongbokgung Palace, in Jongno-gu, Seoul, South Korea."},
    {"title": "Bukchon Hanok Village", "content": "Bukchon Hanok Village is a Korean traditional village in Seoul with a long history."},
    {"title": "Myeong-dong Shopping Street", "content": "Myeong-dong is one of the primary shopping districts in Seoul, South Korea."},
    {"title": "Dongdaemun Design Plaza", "content": "The Dongdaemun Design Plaza is a major urban development landmark in Seoul, South Korea."}
]

In a real-world scenario, the data might be confidential and private documents for a local RAG application.

  3. Database connection component

This component establishes a connection to the PostgreSQL database using the psycopg2 library.

def connect_db():
    return psycopg2.connect(
        host="your_host",
        database="your_database",
        user="your_user",
        password="your_password",
        port="your_port"
    )

We can extract the information from the connection string as follows:

postgres://postgres:password@localhost:5432/postgres

Username: postgres
Password: password
Host: localhost
Port: 5432
Database name: postgres

We can also pass the connection string to the function like this:

def connect_db():
    return psycopg2.connect(
"postgres://postgres:password@localhost:5432/postgres")

  4. Table creation component

This component creates the documents table if it does not already exist. The table includes columns for id, title, content, and embedding.

def create_table():
    conn = connect_db()
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id SERIAL PRIMARY KEY,
            title TEXT,
            content TEXT,
            embedding VECTOR(768)
        );
    """)
    conn.commit()
    cur.close()
    conn.close()

  5. Data insertion component

This component inserts data into the documents table. It inserts the title and content and computes the embedding in the same statement using the ollama_embed function from pgai.

def insert_data():
    conn = connect_db()
    cur = conn.cursor()

    for doc in dummy_data:
        cur.execute("""
            INSERT INTO documents (title, content, embedding)
            VALUES (
                %(title)s,
                %(content)s,
                ollama_embed('nomic-embed-text', concat(%(title)s, ' - ', %(content)s))
            )
        """, doc)

    conn.commit()
    cur.close()
    conn.close()

The most advantageous aspect is that the generation of embeddings occurs directly within the database, eliminating the need for any additional libraries.

  6. Data fetching component

This component fetches and prints all data from the documents table, including the dimensions of the embeddings, so we can verify that the data was inserted correctly:

def fetch_data():
    conn = connect_db()
    cur = conn.cursor()
    cur.execute("""
        SELECT title, content, vector_dims(embedding) 
        FROM documents;
    """)
    rows = cur.fetchall()
    for row in rows:
        print(f"Title: {row[0]}, Content: {row[1]}, Embedding Dimensions: {row[2]}")
    cur.close()
    conn.close()

  7. Retrieve and generate response component

This component takes a query, embeds it, retrieves the most relevant documents based on cosine similarity, and generates a response using the ollama_generate function.

def retrieve_and_generate_response(query):
    conn = connect_db()
    cur = conn.cursor()
    
    # Embed the query using the ollama_embed function
    cur.execute("""
        SELECT ollama_embed('nomic-embed-text', %s);
    """, (query,))
    query_embedding = cur.fetchone()[0]
    
    # Retrieve relevant documents based on cosine distance
    cur.execute("""
        SELECT title, content, 1 - (embedding <=> %s) AS similarity
        FROM documents
        ORDER BY similarity DESC
        LIMIT 5;
    """, (query_embedding,))
    
    rows = cur.fetchall()
    
    # Prepare the context for generating the response
    context = "\n\n".join([f"Title: {row[0]}\nContent: {row[1]}" for row in rows])
    
    # Generate the response using the ollama_generate function
    cur.execute("""
        SELECT ollama_generate('mistral', %s);
    """, (f"Query: {query}\nContext: {context}",))
    
    response = cur.fetchone()[0]
    print(f"Response: {response}")
    
    cur.close()
    conn.close()

  8. Main

This function orchestrates the entire process by calling the necessary functions in sequence.

def main():
    create_table()
    insert_data()
    fetch_data()
    retrieve_and_generate_response("Tell me about landmarks in Seoul")

if __name__ == "__main__":
    main()
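
With everything in place, run the script from the interactive shell inside the container (for example, python3 RAG.py, or python RAG.py, depending on how Python is invoked in the image).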

  9. Results

After querying, here is the result:

>> Seoul is home to several iconic landmarks that are must-visit attractions. One of the prominent landmarks is Seoul Tower, also known as Namsan Tower, which offers panoramic views of the city from its vantage point on Namsan Mountain. Another significant landmark is Gwanghwamun Gate, the main and largest gate of Gyeongbokgung Palace, showcasing traditional Korean architecture and historical significance.

Here we have it: our very own local RAG system powered by PostgreSQL and Ollama. No data is sent to public LLMs, ensuring complete privacy and security for your information.

Conclusion

Large language models rely on vast amounts of data but can hallucinate when queried about uncommon topics. RAG combines information retrieval with text generation to mitigate this, allowing knowledge to be updated efficiently without retraining the model. However, standard RAG setups often send data to external LLMs, risking confidentiality breaches. Local RAG addresses this challenge by processing and generating responses entirely within a secure local environment, ensuring data privacy and security.

In this article, we created a local RAG application using PostgreSQL with pgai, Mistral, and Ollama. With the provided code, you can develop your own RAG application for note-taking, code comprehension, or simply as a personal AI on your laptop.

Start using pgai to simplify your AI pipelines and use AI within the database—no need to resort to other libraries. To get started, head over to the pgai GitHub repo and check out our tutorial to learn more about Ollama and pgai so you can start building more private, lower-cost AI applications today. 

Create a free Timescale account to access pgvector, pgai, and pgvectorscale in one place—with the full benefits of a mature cloud PostgreSQL platform. 
