Nov 15, 2024
Data isn't always available in clean, structured formats, which is why you need to parse it. Data parsing is the process of converting data from one format to another to make it more structured. As a developer, I'll never forget the sheer number of data parsers I built throughout my career. Pulling data out of PDFs, scraping web pages, adapting to the latest HTML revision: in short, parsing data is a recurring task in development.
Today, I’d like to share an example of how you can use unstructured.io’s open-source libraries for data parsing. This tool promises to structure unstructured documents, powering your document analysis and allowing you to use PostgreSQL to store all the knowledge spread across different types of company files. As part of today’s example, we’ll also save the raw text and the vectorized content for future semantic searches.
For this data parsing example, our final objective is a functional command-line utility that lets you run ./import.sh my/docs/*.{pdf,doc,html} to load those documents into your PostgreSQL database. Let's break down the tooling that will make it possible:
1. Unstructured: This powerful open-source library excels at extracting and structuring information from various document types. It's your Swiss Army knife for turning unstructured data into analyzable content.
2. pgai: An open-source extension for PostgreSQL that brings AI capabilities directly into your database. It allows you to perform complex AI operations (including text embedding), and with pgai Vectorizer, you can automate the process of creating and managing embeddings without leaving your PostgreSQL database environment.
Unstructured will help you ingest data from diverse sources (PDFs, emails, web pages, and more). They offer some of their parsers as a service, but we’ll focus on running everything locally.
PostgreSQL will be used to store and manage the unstructured data in a structured format.
We’ll also use pgai Vectorizer to generate embeddings of the content, calling the OpenAI API and storing the vectorized version of the data. This PostgreSQL extension will allow us to build powerful semantic search capabilities through all types of documents.
By the end of this tutorial, you'll have a processing pipeline capable of parsing all your documents into searchable information directly from your favorite database: PostgreSQL. Whether you're handling internal knowledge bases, customer communications, or vast document repositories, this setup will enable you to extract valuable insights with ease and speed. Let's get started.
We’ll use Unstructured to ingest the document and output it to a PostgreSQL database. This instance will have the pgai extension installed and configured to automatically create and manage vectors for the extracted content using one SQL query.
Here is a simple workflow diagram:
Our example repository has a docker-compose file if you want to set up everything locally. If you run into any questions about this local setup, check the pgai pre-built Docker container and the guide to run the pgai Vectorizer worker.
For production use, we recommend using Timescale Cloud. If you haven't used our PostgreSQL cloud platform yet, you can sign up for a free trial and enable the pgai extension to test this tutorial.
Make sure to set the following key environment variables:
- DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD: PostgreSQL connection details
- OPENAI_API_KEY: your OpenAI API key for text processing tasks
- UNSTRUCTURED_API_KEY: the API key for the Unstructured service (if using the hosted version)

You can adapt the OpenAI model or the embedding model to your needs directly in the schema.sql file. Adjust other settings as needed for your specific use case.
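For reference, here is a minimal .envrc sketch with placeholder values; only the variable names listed above are assumed, the values are yours to fill in:
export DB_HOST=localhost
export DB_PORT=5432
export DB_NAME=postgres
export DB_USER=postgres
export DB_PASSWORD=postgres
# Your OpenAI API key for text processing tasks (placeholder value).
export OPENAI_API_KEY=sk-...
# Only needed if you use the hosted Unstructured service (placeholder value).
export UNSTRUCTURED_API_KEY=...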
After setting up the environment variables, run install.sh to install the necessary dependencies. Then run import.sh to import the data into PostgreSQL.
The repository includes a command-line utility with a few sensible defaults that are useful for getting started with any type of document. You can check it out on GitHub.
The key files are:
- README.md: provides an overview and setup instructions
- schema.sql: defines the database schema
- install.sh: installs necessary dependencies
- import.sh: handles document import and processing
- .envrc.example: default environment variables needed

To get started, run the install.sh script:
git clone https://github.com/timescale/unstructured-pgai-example
cd unstructured-pgai-example
./install.sh
This script installs the required Python packages and sets up the database schema.
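To double-check that the schema step worked, you can list the installed extensions (a quick sanity check; the names below are the ones pgai, pgvector, and pgvectorscale register in the catalog):
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('ai', 'vector', 'vectorscale');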
Unstructured can convert multiple types of files into a list of elements. Some docs are composed of multiple elements, while others are just a single element.
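To get a feel for what those elements look like before they reach the database, you can partition a file directly in Python. This is a minimal sketch using unstructured's auto partitioner; the file path is only an example:
from unstructured.partition.auto import partition

# Partition any supported file type (PDF, HTML, DOCX, ...) into a list of elements.
elements = partition(filename="example-docs/pdf/embedded-images.pdf")

for element in elements:
    # Each element has a category (Title, NarrativeText, Footer, ...) and its text.
    print(type(element).__name__, "->", element.text[:60])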
The heart of our setup lies in the schema.sql file. Let's break it down:
CREATE EXTENSION IF NOT EXISTS ai CASCADE;
-- Create the table for the elements (source table).
CREATE TABLE elements (
id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
element_id VARCHAR,
text TEXT,
-- ... (other fields)
);
-- Create the vectorizer for the source table.
SELECT ai.create_vectorizer(
'public.elements'::regclass
, embedding=>ai.embedding_openai('text-embedding-3-small', 1536)
, chunking=>ai.chunking_recursive_character_text_splitter('text')
, formatting=>ai.formatting_python_template('type: $type, url: $url, chunk: $chunk')
);
Key points here:
- The ai extension enables pgai functionality. It also introduces dependent extensions like pgvector and pgvectorscale (a complement to pgvector for high-performance, cost-efficient vector search on large workloads).
- The elements table stores extracted document information, including fields like url, filename, and extra properties that depend on the type of data sourced. Some files are represented by a single element, and others extract multiple elements.
- We call ai.create_vectorizer to create a pgai Vectorizer on the elements table. Using this vectorizer, we:
  - Format each chunk with its type and url to enhance the richness of the generated vector embeddings.
  - Generate embeddings with OpenAI's text-embedding-3-small model.
  - Store the embeddings in the elements_embedding_store table that the vectorizer creates.
  - Query them through the elements_embedding view, which combines fields from elements and elements_embedding_store.

This vectorizer also automatically updates embeddings whenever the text stored in the source data, elements, changes. Check the pgai Vectorizer API reference to learn more!
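Once data starts flowing in, you can check whether the vectorizer has caught up with the source table. A quick way, assuming the ai.vectorizer_status view that recent pgai versions provide, is:
SELECT * FROM ai.vectorizer_status;
In the versions I've used, the view includes a pending_items column showing how many rows still need embeddings.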
The import.sh script handles all variables and configuration to make the unstructured-ingest command line easily accessible; the files to import are passed as its first argument.
unstructured-ingest \
local \
--input-path "$*" \
--output-dir local-output-to-SQL \
--num-processes 2 \
--verbose \
--work-dir data \
--metadata-include "$metadata_includes" \
--additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
postgres \
--db-type $SQL_DB_TYPE \
--username $PGUSER \
--password $PGPASSWORD \
--host $PGHOST \
--port $PGPORT \
--database $PGDATABASE
This script achieves the following:
1. It takes input documents from the command line.
2. It processes them using the Unstructured command line application.
3. It inserts the extracted data into our PostgreSQL instance.
Note that a few arguments of unstructured-ingest are worth mentioning:
1. --work-dir caches pre-processed files in case you run the ingest multiple times over the same file.
2. --output-dir local-output-to-SQL sets the local directory where processed output is staged before it is written to the database.
3. --additional-partition-args lets you go deeper into the different parser options and optimize for your use case.
With the import script in place, you can import any files from the command line. Let's import one example from the unstructured examples folder (unstructured/example-docs): ./import.sh ../unstructured/example-docs/pdf/embedded-images.pdf.
This is a preview of the PDF we imported into the database. It's not a simple web page but a document with a complex layout. You can check the PDF here.
Now, let’s explore and query the data around the PDF information directly in the database.
Let's start by summarizing the elements to understand what content types are available in each file. We'll query the database, filtering by the latest imported file:
SELECT type, count(*)
FROM elements
WHERE filename = 'embedded-images.pdf'
GROUP BY 1;
type | count
-------------------+-------
UncategorizedText | 3
Footer | 1
Header | 2
NarrativeText | 5
(4 rows)
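Once you have imported more than one document, the same kind of aggregation shows how many elements of each type every file produced (this only uses the existing elements table):
SELECT filename, type, count(*)
FROM elements
GROUP BY 1, 2
ORDER BY 1, 3 DESC;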
Now, let's filter through the content using the vectors. The <=> operator computes the cosine distance between two vectors, which lets us filter by similarity:
SELECT text
FROM elements_embedding
WHERE filename ~ 'embedded-images.pdf'
AND embedding <=> ai.openai_embed('text-embedding-3-small', 'Hands Free') < 0.7;
"HANDS-FREE" CARD: use (2/4)
Hands-free unlocking, when approaching the vehicle; With the card in access zone 3, the ve- hicle will unlock. Unlocking is indicated by one flash of the hazard warning lights and the indicator lights.
Hands-free locking when moving away from the vehicle With the card on you, doors and lug- gage compartment door closed, move away from the vehicle: it will lock auto- matically once you have left the access zone 3. Note: the distance at which the vehicle locks depends on the surroundings.
(3 rows)
This query uses vector similarity search to find the most relevant document elements based on your input query. The output looks odd in a few places, like ve- hicle, because the parser preserves the line breaks from the three-column layout of the source content.
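If you prefer ranked results over a fixed distance threshold, a small variation of the query above orders by the distance instead (same table and embedding model as before):
SELECT text,
       embedding <=> ai.openai_embed('text-embedding-3-small', 'Hands Free') AS distance
FROM elements_embedding
WHERE filename ~ 'embedded-images.pdf'
ORDER BY distance
LIMIT 3;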
At the beginning of my career, I would have never imagined I'd have such tools to parse all types of files at once and with such ease. Now, it’s becoming a reality. Combining amazing open-source tools allows you to easily jump from that old drive full of archives and unreadable files into structured data accessible via SQL.
Now, it’s time to try it on your own files. You can clone the repository and test it with your data. I also created a Python example that fetches all the Timescale docs pages and uses the HTML parser behind the scenes, making it easy to add custom logic during the transition from unstructured to structured data.
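As a rough sketch of that HTML flow, using unstructured's partition_html helper (the URL below is just an illustration):
from unstructured.partition.html import partition_html

# Fetch a page and partition it into elements with the HTML parser.
elements = partition_html(url="https://docs.timescale.com/")

for element in elements:
    # Map each element's category and text onto the columns of the elements
    # table before inserting, adding any custom logic you need along the way.
    print(type(element).__name__, "->", element.text[:80])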
Exploring the resources and tools mentioned in this blog post will help you supercharge your document analysis and text embedding at scale. If you’re new to AI but are familiar with PostgreSQL, pgai can help you create new opportunities and ways to utilize data by bringing AI workflows into the database.
In this tutorial, pgai Vectorizer played a crucial role by automating the generation and management of vector embeddings directly within our PostgreSQL database. With a single SQL command, we configured a vectorizer that chunks the extracted text, formats it with metadata, generates embeddings with OpenAI, and keeps them in sync with the source data.
With pgai and pgai Vectorizer, complex AI operations are executed seamlessly within your PostgreSQL database, simplifying workflows and enabling scalable document analysis and search functionality, all while keeping development streamlined. This open-source PostgreSQL AI stack shows that PostgreSQL alone can power AI applications, making them accessible to AI engineers of any skill level, from novice to expert.
You can find installation instructions on the pgai and pgai Vectorizer GitHub repository (GitHub ⭐s welcome!). You can also access pgai and pgai Vectorizer on any database service on Timescale’s cloud PostgreSQL platform.
Your journey doesn't end here! Join our community of developers building AI applications with PostgreSQL:
In case you missed the pgai Vectorizer launch, learn why vector databases are the wrong abstraction and how you can create AI embeddings right within your PostgreSQL database.