Parsing All the Data With Open-Source Tools: Unstructured and Pgai

Data isn't always available in clean, structured formats, which is why you need to parse it. Data parsing is the process of converting data from one format to another to make it more structured. As a developer, I'll never forget the incredible number of data parsers I built throughout my career. Pulling data from PDFs, scraping web pages, adapting to the latest HTML changes: in short, parsing data is a recurring task in development.

Today, I’d like to share an example of how you can use unstructured.io’s open-source libraries for data parsing. This tool promises to structure unstructured documents, powering your document analysis and allowing you to use PostgreSQL to store all the knowledge spread across different types of company files. As part of today’s example, we’ll also save the raw text and the vectorized content for future semantic searches.

The Tools: Data Parsing With Unstructured and Pgai

For this data parsing example, our final objective is a functional command-line utility that lets you run ./import.sh my/docs/*.{pdf,doc,html} to load those documents into your PostgreSQL database. Let's break down the tooling that makes it possible:

1. Unstructured: This powerful open-source library excels at extracting and structuring information from various document types. It's your Swiss Army knife to ingest unstructured data into analyzable content.

2. pgai: An open-source extension for PostgreSQL that brings AI capabilities directly into your database. It allows you to perform complex AI operations, including text embedding, without leaving your database environment.

Unstructured will help you ingest data from diverse sources (PDFs, emails, web pages, and more). They offer some of their parsers as a service, but we’ll focus on running everything locally.

PostgreSQL will be used to store and manage the unstructured data in a structured format.

We’ll also use pgai to generate embeddings of the content, calling the OpenAI API and storing the vectorized version of the data. This PostgreSQL extension will allow us to build powerful semantic search capabilities through all types of documents.
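For a quick taste of what pgai exposes, you can call its embedding function directly from SQL. Here is a minimal sketch, assuming the ai extension is installed and your OpenAI API key is configured for the session:

-- Generate an embedding for a string directly in the database via pgai.
SELECT openai_embed('text-embedding-ada-002', 'PostgreSQL is also a vector store');

We'll wrap this same function in the schema below so that embeddings are generated automatically whenever content is written.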

By the end of this tutorial, you'll have a processing pipeline capable of parsing all your documents into searchable information directly from your favorite database: PostgreSQL. Whether you're handling internal knowledge bases, customer communications, or vast document repositories, this setup will enable you to extract valuable insights with ease and speed. Let's get started.

Structuring the Unstructured

We'll use Unstructured to ingest documents and output them to a PostgreSQL database. This instance will have the pgai extension installed and configured to automatically create vectors for the extracted content.

Here is a simple workflow diagram:

A diagram representing this tutorial's workflow

Data Parsing Project: Setting Up the Environment

Our example repository comes with a docker-compose file in case you want to set everything up locally. If you have any doubts about this local setup, check the pgai pre-built Docker container.

For production use, we recommend using Timescale Cloud. If you haven't used our PostgreSQL cloud platform yet, you can sign up for a free trial and enable the pgai extension to test this tutorial.

The Docker Compose file defines two services:

  1. Main: a PostgreSQL instance with pgai capabilities
  2. Unstructured: the Unstructured service for document processing

Make sure to set the following key environment variables:

  • DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD: PostgreSQL connection details
  • OPENAI_API_KEY: your OpenAI API key for text processing tasks
  • UNSTRUCTURED_API_KEY: the API key for the Unstructured service (if using the hosted version)

You can adapt the OpenAI model used for embeddings directly in the schema.sql file. Adjust other settings as needed for your specific use case.

After setting up the environment variables, run install.sh to install the necessary dependencies.

Run import.sh to import the data into PostgreSQL.

Repository Overview

The repository includes a command line utility with a few sensible defaults that make it easy to get started with any type of document. You can check it out on GitHub.

The key files are:

  • README.md: provides an overview and setup instructions
  • schema.sql: defines the database schema
  • install.sh: installs necessary dependencies
  • import.sh: handles document import and processing
  • .envrc.example: default environment variables needed

To get started, run the install.sh script:

git clone https://github.com/timescale/unstructured-pgai-example
cd unstructured-pgai-example
./install.sh

This script installs the required Python packages and sets up the database schema.

The Database Schema

Unstructured can convert many types of files into a list of elements. Some documents are composed of multiple elements, while others map to a single element.

The heart of our setup lies in the schema.sql file. Let's break it down:

CREATE EXTENSION IF NOT EXISTS ai CASCADE;

-- Wrap pgai's openai_embed() so it can be used in a generated column.
-- Marked IMMUTABLE so PostgreSQL accepts it there, even though it calls an external API.
CREATE OR REPLACE FUNCTION embedding(content TEXT) RETURNS VECTOR AS $$
DECLARE
    vectorized_content VECTOR;
BEGIN
    -- Call the OpenAI embeddings API through pgai and cast the result to VECTOR.
    vectorized_content := openai_embed(
        'text-embedding-ada-002',
        content
    )::VECTOR;
    RETURN vectorized_content;
END;
$$ IMMUTABLE LANGUAGE plpgsql;

-- Each row is one element extracted by Unstructured; the embedding is
-- generated and stored automatically whenever text is written.
CREATE TABLE elements (
    id UUID PRIMARY KEY,
    element_id VARCHAR,
    text TEXT,
    embeddings vector GENERATED ALWAYS AS (embedding(text)) STORED,
    -- ... (other fields)
);

PostgreSQL's GENERATED ALWAYS columns look magical and, combined with pgai, let you integrate vector embeddings with your data without touching the application code.

Key points here:

  • The generated column is created and stored with the vector data at write time.
  • We enable the ai extension for pgai functionality. Installing it with CASCADE also brings in related extensions like pgvector and pgvectorscale (a complement to pgvector for high-performance, cost-efficient vector search on large workloads).
  • The embedding function uses OpenAI's text embedding model to vectorize text.
  • The elements table stores extracted document information, including automatically generated text embeddings. Other fields, like url, filename, and extra properties, depend on the type of data sourced. Some files are represented by a single element, while others produce multiple elements.
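To see the generated column in action, here is a minimal sketch. The id and text values are made up for illustration, and it assumes the elided columns are nullable:

-- Hypothetical insert: the embeddings column is filled automatically
-- because embedding(text) runs at write time.
INSERT INTO elements (id, element_id, text)
VALUES (gen_random_uuid(), 'demo-element', 'Hands-free unlocking of the vehicle');

-- The stored vector is immediately available for similarity queries.
SELECT element_id, vector_dims(embeddings)
FROM elements
WHERE element_id = 'demo-element';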

Importing and Processing Documents

The import.sh script handles all variables and configuration so that the unstructured-ingest command line is easy to invoke, taking your files as its first argument.

unstructured-ingest \
  local \
    --input-path "$*" \
    --output-dir local-output-to-SQL \
    --num-processes 2 \
    --verbose \
    --work-dir data \
    --metadata-include "$metadata_includes" \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  sql \
    --db-type $SQL_DB_TYPE \
    --username $PGUSER \
    --password $PGPASSWORD \
    --host $PGHOST \
    --port $PGPORT \
    --database $PGDATABASE

This script achieves the following:

1. It takes input documents from the command line.

2. It processes them using the Unstructured command line application.

3. It inserts the extracted data into our PostgreSQL instance.

Note that a few arguments of unstructured-ingest are worth mentioning:

1. --work-dir is responsible for caching pre-processed files in case you run ingest multiple times over the same file.

2. --output-dir local-output-to-SQL defines the local directory where the processed output is staged before it is written to the database.

3. --additional-partition-args lets you go deeper into the different parser options and optimize them for your use case.

With the import script in place, you can import any files from the command line. Let's import an example from the Unstructured repository's example-docs folder: ./import.sh ../unstructured/example-docs/embedded-images.pdf.

This is a preview of the PDF we imported into the database. It's not a simple web page but a document with a complex layout. You can check the PDF here.

The PDF we imported into the database. It contains information about hands-free unlocking in a car

Now, let’s explore and query the data around the PDF information directly in the database.

Querying the Parsed Data

Let's start by summarizing the elements to understand what content types are available in each file. We'll query the database, filtering by the most recently imported file:

SELECT type, count(*) FROM elements WHERE filename = 'embedded-images.pdf' GROUP BY 1;
       type        | count
-------------------+-------
 UncategorizedText |     3
 Footer            |     1
 Header            |     2
 NarrativeText     |     5
(4 rows)
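
If you have imported more than one document, a quick way to see everything in the table is to group by filename; a small sketch:

-- List each imported file and how many elements it produced.
SELECT filename, count(*) AS elements
FROM elements
GROUP BY filename
ORDER BY elements DESC;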

Now, let's use the embedding function to filter using the vectors. The <=> operator computes the cosine distance between two vectors, so we can filter for the closest matches by similarity:

SELECT text
FROM elements
WHERE filename ~ 'embedded-images.pdf'
AND embeddings <=> embedding('Hands Free') < 0.2;
                                                                                                                                               text
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 “HANDS-FREE” CARD: use (2/4)
 Hands-free unlocking, when approaching the vehicle; With the card in access zone 3, the ve- hicle will unlock. Unlocking is indicated by one flash of the hazard warning lights and the indicator lights.
 Hands-free locking when moving away from the vehicle With the card on you, doors and lug- gage compartment door closed, move away from the vehicle: it will lock auto- matically once you have left the access zone 3. Note: the distance at which the vehicle locks depends on the surroundings.
(3 rows)

This query uses vector similarity search to find the most relevant document elements based on your input query. The output looks odd in a few places, like "ve- hicle," because the parser picks up the line breaks from the PDF's three-column layout.
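If you prefer ranked results over a fixed distance threshold, you can order by the cosine distance instead; a minimal sketch:

-- Return the five elements closest to the query text, most similar first.
SELECT text, embeddings <=> embedding('Hands Free') AS distance
FROM elements
WHERE filename = 'embedded-images.pdf'
ORDER BY distance
LIMIT 5;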

At the beginning of my career, I would never have imagined having tools that parse all types of files at once and with such ease. Now, it's becoming a reality. Combining amazing open-source tools lets you easily turn that old drive full of archives and unreadable files into structured data accessible via SQL.

Start Parsing Your Unstructured Data

Now, it’s time to try it on your own files. You can clone the repository and test it with your data. I also created a Python example that fetches all the Timescale docs pages and uses the HTML parser behind the scenes, making it easy to add custom logic during the transition from unstructured to structured data.

Exploring the resources and tools mentioned in this blog post will help you supercharge your document analysis and text embedding at scale. If you’re new to AI but are familiar with PostgreSQL, pgai can help you create new opportunities and ways to utilize data by bringing AI workflows into the database. Pgai is open source under the PostgreSQL License, making it easier for PostgreSQL developers to become AI engineers and build search and RAG applications. 

You can find installation instructions on the pgai GitHub repository (GitHub ⭐s welcome!). You can also access pgai on any database service on Timescale’s cloud PostgreSQL platform.

Join the Postgres for AI Community

Your journey doesn't end here! Join our community of developers building AI applications with PostgreSQL.
