Scaling Document Data Extraction With LLMs & Vector Databases
Extracting structured data from unstructured documents is a powerful use case for large language models (LLMs). Extracting data from complex documents has always been a challenge: doing it completely manually, or with current intelligent document processing (IDP) platforms built on previous-generation machine learning or natural language processing (NLP) techniques, is time-consuming and tedious.
Data extraction is critical because it transforms unstructured or semi-structured data into a format that can be systematically queried, analyzed, and used to drive decisions. When done using LLMs, it can help better automate critical business processes with relatively little effort.
What's easy to overlook is that vector databases play an important role in extracting structured data from unstructured documents, especially for lengthier documents whose contents won't fit into the context window of the LLM being used for extraction.
While vector databases are a great fit for this use case in general, there are some serious challenges to contend with. In this article, we'll look at how vector databases have a significant (and positive) impact on cost while presenting their own unique challenges, and how to overcome those challenges. Overall, though, it's a great story: for document data extraction use cases, vector databases are indispensable.
Extracting Data With LLMs and Vector Databases: The Tools
Background on the Unstract Platform
One of the key limitations of current intelligent document processing (IDP) platforms is the sheer amount of manual annotation needed to extract data. If the documents are long, if the data to be extracted is spread across pages, and if there are a large number of variants of the document (e.g., bank statements from dozens of different banks), the amount of manual annotation required becomes ridiculously, almost debilitatingly, high. Large language models are disrupting IDP platforms by making manual annotation, the greatest pain of IDP, completely redundant.
Unstract is an IDP 2.0 platform powered by LLMs. It can process documents that are far more complex than what current IDP 1.0 platforms can handle, and it can deal with far more variants of the same document type, simple or otherwise. All this without the need for manual annotations! You can read more about how large language models are ushering in the IDP 2.0 era on our blog.
Unstract is an open-source, no-code platform. While you can use LLMs for a lot of things like doing Q&A or building agents, and there are libraries and frameworks that allow you to do this, Unstract is purpose-built for one use case: extracting structured data from unstructured documents—and it shows.
We will use multiple examples on the Unstract Platform to demonstrate how incredibly useful vector databases can be when dealing with unstructured documents.
The screenshot above shows Prompt Studio, the purpose-built prompt engineering environment for structured data extraction from unstructured documents. Once prompt engineering is done and an extraction schema is defined, the Prompt Studio project can be launched as an API or as an ETL (extract, transform, load) pipeline that takes unstructured documents of a particular type as input and returns clean, structured JSON data.
We’ll use quarterly reports from U.S. public companies, also known as 10-Q reports, to demonstrate the need for and challenges with vector databases and strategies to overcome them.
When you launch an API based on a Prompt Studio project, it can return the count of tokens consumed by the LLM configured, along with the cost of those tokens. We’ll use this ability to see how the use of vector databases affects the cost of extraction.
About Timescale Cloud
With the open-source TimescaleDB at its core, Timescale Cloud is a PostgreSQL-based managed service designed for scale, speed, and savings. Timescale’s vector database offerings can be used for various large language model use cases, like Q&As based on retrieval-augmented generation (RAG) and intelligent document processing, as powered by the Unstract Platform.
These capabilities are enabled by Timescale Cloud’s open-source PostgreSQL stack tailored for AI applications, which includes pgvector as well as additional extensions built by the Timescale team—pgai and pgvectorscale, along with pgai Vectorizer, one of our latest releases that enables you to automate embedding creation within your PostgreSQL database.
In Timescale Cloud, this streamlined AI stack comes as a fully managed service, so there is little to worry about operationally. TimescaleDB instances are powerful yet simple to manage. Best of all, Timescale is based on PostgreSQL, so the familiarity and comfort of Postgres are always there.
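To make that stack a little more concrete, here's a minimal sketch of using pgvector directly from Python against a Timescale Cloud service. The connection string, table name, and tiny three-dimensional vectors are placeholders for illustration; real embeddings from common models have hundreds or thousands of dimensions.

```python
import psycopg2

# Placeholder connection string: use your Timescale Cloud service URL.
conn = psycopg2.connect("postgresql://user:password@host:5432/tsdb")
cur = conn.cursor()

# pgvector ships with Timescale Cloud; enable it and create a table whose
# vector column matches your embedding model's dimensions (3 is a toy size).
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS doc_chunks (
        id BIGSERIAL PRIMARY KEY,
        content TEXT,
        embedding VECTOR(3)
    );
""")

# Store a document chunk alongside its (toy) embedding.
cur.execute(
    "INSERT INTO doc_chunks (content, embedding) VALUES (%s, %s::vector)",
    ("…a chunk of layout-preserved 10-Q text…", "[0.1, 0.2, 0.3]"),
)
conn.commit()

# Fetch the chunks closest to a query embedding using cosine distance (<=>).
cur.execute(
    "SELECT content FROM doc_chunks ORDER BY embedding <=> %s::vector LIMIT 3",
    ("[0.1, 0.2, 0.25]",),
)
print(cur.fetchall())

cur.close()
conn.close()
```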
It's easy to get started with Timescale Cloud, with a hassle-free, 30-day free trial that requires no credit card.
Extracting Data Without Vector Databases
When extracting structured data from PDFs like 10-Qs, we first need to extract the raw text from them. We use LLMWhisperer to do that. One powerful thing about LLMWhisperer is its Layout Preservation technology, which maintains the layout of the original document in the raw output text. This retains a lot of information in terms of column layouts, tables, etc., that LLMs can use to interpret the document in question, bumping up extraction accuracy quite a bit, especially for documents that use multiple columns or tables to present data (super common).
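If you want to try this step outside the platform, LLMWhisperer also exposes an API with a Python client. The sketch below is only indicative: the client class, method names, parameters, and response shape are assumptions and may not match the current client exactly, so check the LLMWhisperer documentation before using it.

```python
# Indicative sketch only: class, method, and response-field names are
# assumptions; consult the LLMWhisperer docs for the exact client API.
from unstract.llmwhisperer import LLMWhispererClientV2

client = LLMWhispererClientV2(api_key="<your-llmwhisperer-api-key>")

# Convert a 10-Q PDF into layout-preserved raw text suitable for LLMs.
result = client.whisper(
    file_path="apple-10q.pdf",
    wait_for_completion=True,  # assumed flag: block until extraction finishes
)

raw_text = result["extraction"]["result_text"]  # assumed response shape
print(raw_text[:500])  # columns and tables survive as plain-text layout
```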
This Unstract Prompt Studio project titled 10-Q Parser without Vector DB is publicly shared so you can take a look around.
In this project, we've enabled GPT-4o and Anthropic's Claude 3.5 Sonnet as extraction LLMs so we can compare their extraction abilities side by side.
Although a vector database is technically configured for the Prompt Studio project, we've set the chunk size and overlap to zero. With these settings, the vector database is skipped entirely while processing documents in this project. That's exactly what we want here, since we're trying to measure the impact on cost when we don't use a vector database.
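To build intuition for what the chunk size and overlap settings control, here's a generic sketch of fixed-size chunking with overlap. It's an illustration of the concept only, not Unstract's internal implementation, and it chunks by characters for simplicity where real pipelines typically chunk by tokens.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks, each overlapping the previous one.

    With chunk_size=0, the whole text is returned as a single "chunk",
    mirroring the "skip the vector database" behavior described above.
    """
    if chunk_size <= 0:
        return [text]
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


raw_text = "..."  # layout-preserved text of the 30-page 10-Q
chunks = chunk_text(raw_text, chunk_size=1024, overlap=128)
print(f"{len(chunks)} chunks of up to 1,024 characters each")
```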
We’re trying to extract a bunch of fields and we’re getting pretty good results, as seen in the Combined Output pane screenshot below. You can also see this in the project’s public share.
The only trouble is going to be the cost of the extraction. Since we haven't enabled the vector DB in this project, each of these prompts is run against the full context of the document's extracted raw text, which runs to 30 pages. That's a lot of tokens!
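A quick back-of-the-envelope calculation shows why. The token count and per-token price below are placeholder assumptions purely for illustration; plug in your document's actual token count and your model's current pricing.

```python
# Placeholder numbers: substitute your document's real token count and
# your LLM's current input-token pricing.
doc_tokens = 25_000               # a 30-page 10-Q easily runs to tens of thousands of tokens
num_prompts = 10                  # one prompt per field to extract
usd_per_1k_input_tokens = 0.0025  # assumed input price, USD per 1K tokens

# Without a vector database, every prompt re-sends the full document.
input_cost = doc_tokens * num_prompts * usd_per_1k_input_tokens / 1_000
print(f"Approximate input-token cost per document: ${input_cost:.2f}")
```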
Of course, Unstract has features like SinglePass extraction that avoid this by combining all your prompts with the help of a large language model and executing them against the full context only once. For educational purposes, though, we haven't enabled it. Similarly, if you were to do this programmatically yourself, you'd be running each of these prompts against the full context and incurring a huge cost.
Let’s look at what this cost actually is so that when we enable the use of a vector database, we can compare the cost with other approaches we’ll be using in the future. For this, we’ll be deploying this Prompt Studio project as an API. We can then download a Postman Collection to be able to call that API easily from Postman.
Unstract can return various bits of metadata, including tokens consumed and their cost. We’ll make a note of this so we can compare once we enable the vector database in a subsequent project we’ll create.
Once we download the Postman Collection, we can import it into Postman and make our API call. To engineer our prompts, we used a Q1 report. Now, to make an API call for structuring, let's use a Q2 report, which should contain the same fields we're trying to extract. Let's see what it costs us.
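If you'd rather call the API from code than from Postman, the request boils down to a single file upload. The URL, header, and form-field names below are placeholders; copy the real values from the Postman Collection downloaded for your deployment.

```python
import requests

# Placeholders: take the actual URL, auth header, and form fields from the
# Postman Collection generated for your Unstract API deployment.
url = "https://<your-unstract-host>/deployment/api/<org>/<api-name>/"
headers = {"Authorization": "Bearer <your-unstract-api-key>"}

with open("apple-10q-q2.pdf", "rb") as pdf:
    response = requests.post(url, headers=headers, files={"files": pdf})

result = response.json()
print(result)  # extracted fields plus token-usage and cost metadata
```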
It’s a pretty penny—$1.43 just for a single file! This can get expensive quickly. Let’s solve this problem.
The Impact of Enabling the Vector Database
To demonstrate the positive impact that enabling a vector database in the extraction process can have on cost, we cloned the Prompt Studio project we've discussed so far and then turned on the vector database.
In this case, we'll be using TimescaleDB as our vector database, configured in our Unstract Platform account. The project 10-Q Parser with Simple Retrieval is shared publicly for you to examine.
As you can see in the screenshot above, we’ve set the vector database to Timescale Cloud in the Prompt Studio project’s settings (we’re using their managed cloud offering) and set the chunk size and overlap values. This ensures that any input document is split into chunks and stored in the vector database.
In the previous setup, we saw that when the vector database was not used, each prompt was run against the full extracted raw context. In this project, since we've set up the vector database and set non-zero chunk and overlap sizes, a different approach is used when prompts are run. Note that the Retrieval Strategy has been set to “simple,” which is the default. The steps that are followed are (see the sketch after the list):
- The document is split into chunks with the specified chunk size and stored in the vector database.
- Each prompt is first sent to the vector database, asking it to return the relevant chunks that might contain data that the prompt is trying to extract.
- The prompt and chunks returned from the vector database are used to extract the information requested by the user. The token count for this should be significantly less than when the full document is sent.
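Putting those steps together, a stripped-down version of simple retrieval looks roughly like the sketch below. It uses the OpenAI embeddings and chat APIs with an in-memory cosine-similarity search purely for illustration; in the Unstract setup, the chunks and embeddings live in Timescale Cloud, and the extraction prompts are far more elaborate. The `chunk_text` helper and `raw_text` come from the earlier chunking sketch.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def embed(texts: list[str]) -> np.ndarray:
    """Embed a list of texts with an (illustratively chosen) embedding model."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])


# 1. Split the document into chunks and embed each one.
#    (In the Unstract setup these embeddings are stored in Timescale Cloud.)
chunks = chunk_text(raw_text, chunk_size=1024, overlap=128)
chunk_vectors = embed(chunks)


def retrieve(query: str, top_k: int = 3) -> str:
    """2. Return the top_k chunks most similar to the query (cosine similarity)."""
    q = embed([query])[0]
    scores = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    return "\n\n".join(chunks[i] for i in np.argsort(scores)[::-1][:top_k])


def extract(prompt: str, context: str) -> str:
    """3. Run the extraction prompt against the retrieved context only."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\n{prompt}"}],
    )
    return resp.choices[0].message.content


# Simple retrieval: the prompt itself doubles as the retrieval query.
prompt = "What is the net income for the quarter?"
print(extract(prompt, retrieve(prompt)))
```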
It is important to understand that vector databases do not use LLMs internally. They use text embeddings to retrieve the document chunks most relevant to a prompt. Because this is a similarity search over embeddings of natural-language text rather than an exact lookup, retrieval is always an approximation; unlike an LLM, an embedding search can't reason its way to an exact answer.
Let's see how much this approach costs for data extraction from the same document versus the previous approach's $1.43. We've deployed this Prompt Studio project as an API, downloaded a Postman Collection, and made the API call, passing along the Q2 10-Q document we used last time. Here are the results:
It's a mere 17 cents! That's a dramatic difference, and it makes sense: earlier, we were sending the whole raw context for each field extraction, whereas now we ask the vector database for the relevant chunks and send only those to the LLM to do the extraction.
The problem with simple retrieval
It is very common for many large prompts to have a lot of extraneous instructions meant for the LLM (e.g., formatting, conditional instructions, etc.) apart from the detail on exactly what needs to be actually extracted. For example, let’s look at the following prompt:
Extract the customer name from the given context. Note that the customer’s middle name should always be ignored. Also, respond with the name formatted as “Last name, First name.” Note that the context might contain the name of the customer’s relationship manager. Please note that their name needs to be ignored. We are interested in the customer’s name only.
In the above prompt, what we’re intending to extract is the name of the customer only. However, the prompt contains various other instructions to the LLM. When such prompts are sent to the vector database, the relevance of the chunks returned becomes very questionable.
With the so-called “Simple” strategy, this can very much happen, since prompts are sent to the vector database unmodified as the retrieval query. In the example projects we've seen so far, each of the prompts is pretty simple. As a result, we were able to retrieve fairly relevant chunks from the vector database and, thus, good results from the large language model.
Sub-question retrieval: A more sophisticated approach
Great! We now know the trouble that the Simple retrieval strategy can cause and how it might affect the quality or relevance of the chunks retrieved from the vector database. In the example prompt provided earlier, the only thing we're interested in is the chunk (or chunks) containing the name of the customer. However, because of the extraneous information in the prompt, chunk retrieval quality can suffer.
Sub-question retrieval is a technique where, rather than sending the prompt unmodified to the vector database, we first send the prompt to an LLM and ask it what different bits of information the user is trying to extract. In other words, we try to break the user's prompt into different sub-questions that can then be sent to the vector database; hence the name. This ensures that the questions contain only the exact information to be extracted, with the other bits ignored. For instance, the only question the sub-question LLM might produce from our long extraction prompt could be:
What is the name of the customer?
This is brilliant because now we can expect the vector database’s retrieval quality to be very high, especially compared to the Simple retrieval strategy for the same prompt. While the current project works well with a Simple retrieval strategy because the prompts used there are pretty straightforward, for the sake of completeness, let’s look at what the cost impact is for a clone of the project with just the retrieval strategy switched to Sub-question.
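To make the idea concrete, here's a hedged sketch of sub-question retrieval built on top of the `retrieve` and `extract` helpers from the earlier simple-retrieval sketch. The decomposition prompt shown is an illustration of the technique, not the prompt Unstract uses internally.

```python
def sub_questions(prompt: str) -> list[str]:
    """Ask an LLM which plain questions a verbose extraction prompt boils down to."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the following extraction prompt as one or more short, "
                "standalone questions (one per line) that name only the data to "
                "be extracted. Drop all formatting and conditional instructions.\n\n"
                f"{prompt}"
            ),
        }],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]


verbose_prompt = (
    "Extract the customer name from the given context. Note that the customer's "
    "middle name should always be ignored. Also, respond with the name formatted "
    "as 'Last name, First name'. Ignore the relationship manager's name."
)

# Retrieve chunks using the clean sub-question(s), e.g. "What is the name of
# the customer?", then run the original, fully instructed prompt on them.
context = "\n\n".join(retrieve(q) for q in sub_questions(verbose_prompt))
print(extract(verbose_prompt, context))
```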
Cost Impact of the Sub-Question Retrieval Strategy
We’ve now cloned our previous project, which used the Timescale vector database, and made just one change: we switched the retrieval strategy from “simple” to “subquestion.”
Next, we deploy this Prompt Studio project as an API, download the Postman Collection, import it into Postman, and call the API with the same Apple Q2 report PDF file we used in API calls before. Recall that the cost spent on LLMs when no vector database was used was $1.43, and it was $0.17 when Timescale Cloud was enabled with chunking and overlap set to non-zero values with a “Simple” retrieval strategy. Let’s look at the cost now for the sub-question retrieval strategy.
It's 23 cents, or $0.23. Why is it higher than the 17 cents spent on the “Simple” strategy? The reason is that we make an extra call to the large language model to generate the sub-questions from our prompts; that extra call accounts for the difference.
Let’s Explore Sub-Question Retrieval Further
Sub-question retrieval is a simple yet surprisingly effective strategy that is worth exploring further. Also, the examples discussed so far have used pretty simple prompts. In practice, though, Unstract users hardly ever use simple prompts. They almost always use more sophisticated prompts that usually output complex JSON data.
Let's take the same 10-Q example, but this time extract all 10 fields with a single prompt that specifies a compound JSON schema. When we used simple, one-field-at-a-time prompts, the extraction always worked irrespective of whether we used the simple or the sub-question strategy; even with the simple strategy in place, the vector database was able to respond with good-quality, relevant chunks.
We’ll now create two “Compact” prompt projects that will combine all 10 extractions we did earlier into just one prompt. With the “Simple” retrieval strategy, the extraction should fail outright since the vector database will no longer be able to respond with relevant chunks due to the increased complexity of our prompt. This is to demonstrate the limitations of the simple retrieval strategy in real-world use cases.
Complex prompt with a simple retrieval strategy
As you can see in the following screenshot or in this publicly shared project Compact 10-Q Parser with Simple Retrieval, we’re using a single compound prompt in place of the 10 prompts we used in the earlier projects.
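For reference, a compound prompt of this kind typically spells out every field and the JSON shape expected in the response. The field names below are illustrative rather than copied verbatim from the shared project, but the structure is representative:

```
From the given 10-Q filing, extract the following and respond with a single
JSON object using exactly these keys:

{
  "company_name": "string",
  "quarter_end_date": "YYYY-MM-DD",
  "total_revenue": "number, USD",
  "net_income": "number, USD",
  "earnings_per_share_basic": "number",
  "earnings_per_share_diluted": "number",
  "total_assets": "number, USD",
  "total_liabilities": "number, USD",
  "cash_and_cash_equivalents": "number, USD",
  "shares_outstanding": "number"
}

If a value cannot be found in the context, use null for that key.
```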
Now, let’s look at the results we’re getting from the two configured LLMs for this large prompt.
With the simple retrieval strategy, not all values come back, as anticipated: the vector database can no longer return relevant chunks for such a broad prompt. In the second “Compact” project, which uses the sub-question retrieval strategy, all the values are available, since the quality of the retrieval goes up. We do see that the two configured LLMs don't agree on what one particular value should be, but that's another problem altogether. It's also possible to inspect the chunks that were retrieved from the vector database, which can be incredibly useful for debugging any issues.
Conclusion
Large language models are changing intelligent document processing and ushering in the IDP 2.0 era. Vector databases like Timescale Cloud can play an important role in reducing the cost of structured data extraction, especially for extraction from long, unstructured documents. Sign up for Unstract and start for free.
We saw practical examples of how Timescale Cloud can significantly reduce the number of LLM tokens consumed, leading to non-trivial cost savings. We also saw how the complexity of prompts can affect a vector database’s retrieval quality and how to solve that problem using more sophisticated retrieval techniques.
Build your AI application with Timescale Cloud today and start saving on structured data extraction. With Timescale Cloud, developers can access pgvector, pgvectorscale, pgai, and pgai Vectorizer—extensions that turn PostgreSQL into an easy-to-use and high-performance vector database, plus a fully managed cloud database experience.