Cost-Effective PDF Parsing and Chunking with LLMs
This blog post was automatically generated (and translated). It is based on the following original, which I selected for publication on this blog:
Sergey’s Blog.
Converting PDFs into machine-readable text chunks for Retrieval-Augmented Generation (RAG) systems presents significant challenges. While both open-source and proprietary solutions exist, achieving an optimal balance of accuracy, scalability, and cost-effectiveness remains elusive. Current solutions often struggle with complex layouts and can be prohibitively expensive, especially when dealing with large datasets.
The Promise of Large Language Models
Large Language Models (LLMs) appear to be a natural fit for this task. However, until recently, their cost-effectiveness compared to proprietary solutions has been questionable, and inconsistencies have posed challenges for real-world applications. Recent advancements, particularly with models like Gemini Flash 2.0, are changing this landscape. These models demonstrate near-perfect OCR accuracy at a significantly lower cost, making them viable for large-scale document processing.
Figure: PDF to Markdown, pages per dollar. Gemini Flash 2.0 converts ≈ 12,000 pages per dollar. (All LLM providers are quoted with their batch pricing.)
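For scale: at roughly 12,000 pages per dollar, converting a 100,000-page corpus costs about 100,000 / 12,000 ≈ $8.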
Accuracy Considerations
Table identification and extraction represent a particularly difficult aspect of document parsing due to complex layouts and inconsistent data quality. Evaluating LLMs on real-world challenges, such as poor scans and intricate table structures, reveals nuanced performance. While specialized models may currently outperform general-purpose LLMs on specific benchmarks, a closer examination often reveals that discrepancies are primarily minor structural variations that do not significantly impact an LLM's understanding of the table's content. Crucially, numerical values are rarely misread, suggesting that most errors are superficial formatting choices rather than substantive inaccuracies.
Beyond table parsing, models like Gemini Flash 2.0 consistently deliver near-perfect accuracy across other facets of PDF-to-markdown conversion, resulting in a simple, scalable, and cost-effective indexing pipeline.
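To make that pipeline concrete, here is a minimal sketch of the conversion step using the google-generativeai Python SDK. The model id "gemini-2.0-flash", the prompt wording, and the input file name are assumptions; adapt them to your provider and use case.

import google.generativeai as genai

# Assumed setup: the google-generativeai SDK and an API key for your account.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")  # assumed model id

def pdf_to_markdown(pdf_path: str) -> str:
    # Upload the PDF via the File API so the model can read it directly.
    pdf_file = genai.upload_file(path=pdf_path)
    prompt = "OCR this document into Markdown. Tables should be formatted as HTML."
    response = model.generate_content([pdf_file, prompt])
    return response.text

markdown = pdf_to_markdown("report.pdf")  # hypothetical input file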
Chunking for RAG Pipelines
Effective RAG pipelines require splitting documents into smaller, semantically meaningful chunks. Using LLMs for this task has been shown to improve retrieval accuracy, as LLMs excel at understanding context and identifying natural boundaries in text. While historically cost-prohibitive, the affordability of models like Gemini Flash 2.0 now makes LLM-based chunking feasible at scale.
CHUNKING_PROMPT = """\
OCR the following page into Markdown. Tables should be formatted as HTML.
Do not surround your output with triple backticks.

Chunk the document into sections of roughly 250 - 1000 words. Our goal is
to identify parts of the page with the same semantic theme. These chunks will
be embedded and used in a RAG pipeline.

Surround the chunks with <chunk> </chunk> html tags.
"""
The Challenge of Lost Bounding Boxes
Markdown extraction and chunking can lead to a critical limitation: the loss of bounding box information. This makes it difficult to link extracted information back to its exact location in the source PDF, which can create a trust gap. Bounding boxes are essential for verifying the accuracy of extracted data.
While LLMs have demonstrated spatial understanding, accurately mapping text to its location within a document remains a challenge. Further training and fine-tuning with a focus on document layouts could potentially bridge this gap.
GET_NODE_BOUNDING_BOXES_PROMPT = """\
Please provide strict bounding boxes that encompass the following text in the attached image. I'm trying to draw a rectangle around the text.

- Use the top-left coordinate system
- Values should be percentages of the image width and height (0 to 1)

{nodes}
"""
Towards Effortless Document Ingestion
Addressing the challenges of parsing, chunking, and bounding box detection represents a significant step towards "solving" document ingestion into LLMs. This progress brings us closer to a future where document parsing is efficient and practically effortless for any use case. What innovative applications will become possible when document ingestion is no longer a bottleneck?