Parsing All the Data With Open-Source Tools: Unstructured and Pgai
Data isn't always available in clean, structured formats, which is why you need to parse it. Data parsing is the process of converting your data from one format to another to make it more structured. As a developer, I'll never forget the incredible number of data parsers I built throughout my career. Pulling data from diverse sources like PDFs, scraping web pages, and adapting to the latest HTML changes (in sum, parsing data) has always been a recurring task in development.
Today, I’d like to share an example of how you can use unstructured.io’s open-source libraries for data parsing. This tool promises to structure unstructured documents, powering your document analysis and allowing you to use PostgreSQL to store all the knowledge spread across different types of company files. As part of today’s example, we’ll also save the raw text and the vectorized content for future semantic searches.
The Tools: Data Parsing With Unstructured and Pgai
For this data parsing example, our final objective is a functional command-line utility that lets you run ./import.sh my/docs/*.{pdf,doc,html} to load those documents into your PostgreSQL database. Let’s break down the tooling that will make it possible:
1. Unstructured: This powerful open-source library excels at extracting and structuring information from various document types. It's your Swiss Army knife for turning unstructured data into analyzable content.
2. pgai: An open-source extension for PostgreSQL that brings AI capabilities directly into your database. It allows you to perform complex AI operations (including text embedding), and with pgai Vectorizer, you can automate the process of creating and managing embeddings without leaving your PostgreSQL database environment.
Unstructured will help you ingest data from diverse sources (PDFs, emails, web pages, and more). They offer some of their parsers as a service, but we’ll focus on running everything locally.
PostgreSQL will be used to store and manage the unstructured data in a structured format.
We’ll also use pgai Vectorizer to generate embeddings of the content, calling the OpenAI API and storing the vectorized version of the data. This PostgreSQL extension will allow us to build powerful semantic search capabilities through all types of documents.
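As a taste of what that looks like, here's a minimal sketch of generating an embedding from SQL with pgai (it assumes the ai extension is installed and your OpenAI API key is available to the database):
-- Embed a string directly from SQL using OpenAI's small embedding model
SELECT ai.openai_embed('text-embedding-3-small', 'parse all the things');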
By the end of this tutorial, you'll have a process pipeline capable of parsing all your documents into searchable information directly from your favorite database: PostgreSQL. If you’re handling internal knowledge bases, customer communications, or vast document repositories, this setup will enable you to extract valuable insights with ease and speed. Let’s get started.
Structuring the Unstructured
We’ll use Unstructured to ingest the document and output it to a PostgreSQL database. This instance will have the pgai extension installed and configured to automatically create and manage vectors for the extracted content using one SQL query.
Here is a simple workflow diagram:
Data Parsing Project: Setting Up the Environment
Our example repository has a docker-compose file if you want to set up everything locally. If you run into any questions about this local setup, check the pgai pre-built Docker container and the guide to run the pgai Vectorizer worker.
For production use, we recommend using Timescale Cloud. If you haven't used our PostgreSQL cloud platform yet, you can sign up for a free trial and enable the pgai extension to test this tutorial.
The local setup defines two services:
- Main: a PostgreSQL instance with pgai capabilities
- Unstructured: the Unstructured service for document processing
Make sure to set the following key environment variables:
- DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD: PostgreSQL connection details
- OPENAI_API_KEY: your OpenAI API key for text processing tasks
- UNSTRUCTURED_API_KEY: the API key for the Unstructured service (if using the hosted version)
You can adapt the OpenAI model or the embedding model to your needs directly in the schema.sql file. Adjust other settings as needed for your specific use case.
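For instance, here is a sketch of what that adaptation might look like (swapping to OpenAI's text-embedding-3-large model with 3,072 dimensions is an assumption; the rest mirrors the vectorizer definition we'll walk through below):
-- Hypothetical variant of the vectorizer using a larger embedding model
SELECT ai.create_vectorizer(
    'public.elements'::regclass
    , embedding=>ai.embedding_openai('text-embedding-3-large', 3072)
    , chunking=>ai.chunking_recursive_character_text_splitter('text')
    , formatting=>ai.formatting_python_template('type: $type, url: $url, chunk: $chunk')
);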
After setting up the environment variables, run install.sh to install the necessary dependencies. Then run import.sh to import the data into PostgreSQL.
Repository Overview
The repository includes a command line utility with a few sensible defaults that can be useful to get started with any type of document. You can check it on GitHub.
The key files are:
- README.md: provides an overview and setup instructions
- schema.sql: defines the database schema
- install.sh: installs necessary dependencies
- import.sh: handles document import and processing
- .envrc.example: default environment variables needed
To get started, run the install.sh script:
git clone https://github.com/timescale/unstructured-pgai-example
cd unstructured-pgai-example
./install.sh
This script installs the required Python packages and sets up the database schema.
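If you want to confirm the schema landed correctly, a quick sanity check from psql might look like this (a minimal sketch; it only assumes the extensions and the elements table created by schema.sql):
-- Check that the pgai, pgvector, and pgvectorscale extensions are installed
SELECT extname, extversion FROM pg_extension WHERE extname IN ('ai', 'vector', 'vectorscale');
-- The source table should exist and be empty before the first import
SELECT count(*) FROM elements;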
The Database Schema
Unstructured is able to convert multiple types of files into a list of elements. Some docs are composed of multiple elements, while others are just a single element.
The heart of our setup lies in the schema.sql file. Let's break it down:
CREATE EXTENSION IF NOT EXISTS ai CASCADE;
-- Create the table for the elements (source table).
CREATE TABLE elements (
id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
element_id VARCHAR,
text TEXT,
-- ... (other fields)
);
-- Create the vectorizer for the source table.
SELECT ai.create_vectorizer(
'public.elements'::regclass
, embedding=>ai.embedding_openai('text-embedding-3-small', 1536)
, chunking=>ai.chunking_recursive_character_text_splitter('text')
, formatting=>ai.formatting_python_template('type: $type, url: $url, chunk: $chunk')
);
Key points here:
- We enable the ai extension for pgai functionality. It also installs dependent extensions like pgvector and pgvectorscale (a complement to pgvector for high-performance, cost-efficient vector search on large workloads).
- The elements table stores extracted document information, including fields like url, filename, and extra properties that depend on the type of data sourced. Some files are represented by a single element, and others extract multiple elements.
- We use the function ai.create_vectorizer to create a pgai Vectorizer on the elements table. Using this vectorizer, we:
  - Split the text into chunks and enrich each chunk with the text’s type and url to enhance the richness of the generated vector embeddings.
  - Automatically generate embeddings for the text using OpenAI’s text-embedding-3-small model.
  - Store the generated embeddings in a dedicated table, elements_embedding_store.
  - Create a view, elements_embedding, which combines fields from elements and elements_embedding_store.
This vectorizer also automatically updates embeddings if there are any changes in the text stored in the source table, elements. Check the pgai Vectorizer API reference to learn more!
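To see this in action, here is a small sketch (it only assumes the elements table and the elements_embedding view defined in schema.sql, and that the vectorizer worker is running to pick up the change asynchronously):
-- Correct the text of one element; the vectorizer detects the change
UPDATE elements
SET text = 'corrected text for this element'
WHERE id = 1;
-- After the vectorizer worker processes the change, the chunks and embeddings for this row are regenerated
SELECT text
FROM elements_embedding
WHERE id = 1;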
Importing and Processing Documents
The import.sh script handles all the variables and configuration needed to make the unstructured-ingest command line easily accessible, passing the files you provide as its input path.
unstructured-ingest \
local \
--input-path "$*" \
--output-dir local-output-to-SQL \
--num-processes 2 \
--verbose \
--work-dir data \
--metadata-include "$metadata_includes" \
--additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
postgres \
--db-type $SQL_DB_TYPE \
--username $PGUSER \
--password $PGPASSWORD \
--host $PGHOST \
--port $PGPORT \
--database $PGDATABASE
This script achieves the following:
1. It takes input documents from the command line.
2. It processes them using the Unstructured command line application.
3. It inserts the extracted data into our PostgreSQL instance.
Note that a few arguments of unstructured-ingest are worth mentioning:
1. --work-dir is responsible for caching pre-processed files in case you run ingest multiple times over the same file.
2. --output-dir local-output-to-SQL points it to use the database connection parameters.
3. --additional-partition-args lets you go deeper into the different parser options and optimize for your use case.
With the import script available, you can import any files from the command line. Let's import one example from the unstructured repository's example-docs folder: ./import.sh ../unstructured/example-docs/pdf/embedded-images.pdf.
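A quick way to confirm the import landed (a minimal sketch; it only assumes the elements table from schema.sql):
-- Count how many elements each imported file produced
SELECT filename, count(*) AS elements
FROM elements
GROUP BY filename
ORDER BY count(*) DESC;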
This is a preview of the PDF we imported into the database. It's not a simple web page but a document with a complex, multi-column layout. You can check the PDF here.
Now, let’s explore and query the data around the PDF information directly in the database.
Querying the Parsed Data
Let's start by summarizing the elements to understand what content types are available in each file, querying the database and filtering by the latest imported file:
SELECT type, count(*)
FROM elements
WHERE filename = 'embedded-images.pdf'
GROUP BY 1;
type | count
-------------------+-------
UncategorizedText | 3
Footer | 1
Header | 2
NarrativeText | 5
(4 rows)
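To peek at the actual content behind those counts, you can filter by element type (an optional exploration step; it only uses columns from the elements table):
-- Preview the narrative text extracted from the PDF
SELECT left(text, 80) AS preview
FROM elements
WHERE filename = 'embedded-images.pdf'
AND type = 'NarrativeText';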
Now, let’s filter through the content using the vectors. The <=> operator computes the cosine distance between two vectors, so we can filter for the closest matches by similarity:
SELECT text
FROM elements_embedding
WHERE filename ~ 'embedded-images.pdf'
AND embedding <=> ai.openai_embed('text-embedding-3-small', 'Hands Free') < 0.7;
"HANDS-FREE" CARD: use (2/4)
Hands-free unlocking, when approaching the vehicle; With the card in access zone 3, the ve- hicle will unlock. Unlocking is indicated by one flash of the hazard warning lights and the indicator lights.
Hands-free locking when moving away from the vehicle With the card on you, doors and lug- gage compartment door closed, move away from the vehicle: it will lock auto- matically once you have left the access zone 3. Note: the distance at which the vehicle locks depends on the surroundings.
(3 rows)
This query uses vector similarity search to find the most relevant document elements based on your input query. The output seems odd in a few cases, like 've- hicle', because the parser picked up line breaks from the three-column layout of the content.
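If those hyphenated line breaks bother you, one option is to clean them up in the source table and let the vectorizer regenerate the affected embeddings. This is only a sketch: the regular expression is an assumption and could also join words that were legitimately hyphenated, so review it against your content first.
-- Join words that were split across line breaks ("ve- hicle" -> "vehicle")
UPDATE elements
SET text = regexp_replace(text, '(\w)- (\w)', '\1\2', 'g')
WHERE filename = 'embedded-images.pdf';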
At the beginning of my career, I would have never imagined I'd have such tools to parse all types of files at once and with such ease. Now, it’s becoming a reality. Combining amazing open-source tools allows you to easily jump from that old drive full of archives and unreadable files into structured data accessible via SQL.
Start Parsing Your Unstructured Data
Now, it’s time to try it on your own files. You can clone the repository and test it with your data. I also created a Python example that fetches all the Timescale docs pages and uses the HTML parser behind the scenes, making it easy to add custom logic during the transition from unstructured to structured data.
Exploring the resources and tools mentioned in this blog post will help you supercharge your document analysis and text embedding at scale. If you’re new to AI but are familiar with PostgreSQL, pgai can help you create new opportunities and ways to utilize data by bringing AI workflows into the database.
In this tutorial, pgai Vectorizer played a crucial role by automating the generation and management of vector embeddings directly within our PostgreSQL database. With a single SQL command, we configured a vectorizer that can do the following:
- Processes unstructured text from various PDF documents and generates embeddings using an OpenAI model, enabling powerful semantic search capabilities
- Continuously monitors changes in the unstructured text and automatically updates embeddings to keep them in sync with the content
With pgai and pgai Vectorizer, complex AI operations are seamlessly executed within your PostgreSQL database, simplifying workflows and enabling scalable document analysis and search functionalities, all while keeping development streamlined. This open-source PostgreSQL AI stack showcases that PostgreSQL alone can power AI applications, making it accessible to AI engineers of any skill level, from novice to expert.
You can find installation instructions on the pgai and pgai Vectorizer GitHub repository (GitHub ⭐s welcome!). You can also access pgai and pgai Vectorizer on any database service on Timescale’s cloud PostgreSQL platform.
Join the Postgres for AI Community
Your journey doesn't end here! Join our community of developers building AI applications with PostgreSQL:
- Join the pgai Discord server: Share your projects, get help, and connect with peers.
- Spread the word: Share our announcements on X/Twitter or LinkedIn (we'll RT you back!).
- Contribute: Submit issues, feature requests, or code contributions on GitHub.
- Cloud providers: Interested in offering pgai on your PostgreSQL cloud? Let's talk!