Tree Reasoning: A New Approach to PDF Search
Traditional RAG treats PDFs as bags of text. Chunk everything, embed it, hope vector similarity finds what you need.
This fails for complex documents. A research paper’s “Results” section might be semantically similar to its “Methods” section—both discuss experiments. A legal contract’s “Indemnification” clause might share vocabulary with “Limitation of Liability.” Vector search can’t tell the difference without structural awareness.
We built something different.
The Problem with Chunking
Standard PDF processing:
- Extract text
- Split into fixed-size chunks (512-1024 tokens)
- Generate embeddings
- Store in vector database
- Search by cosine similarity
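In code, the naive pipeline reduces to something like this (a minimal sketch: whitespace splitting stands in for a real tokenizer, and paper.txt is a hypothetical input):

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Fixed-size token windows, blind to any document structure."""
    tokens = text.split()  # stand-in for a real tokenizer
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

# A chunk from Results and a chunk from Methods look identical here:
# just text, with no record of where it came from.
chunks = chunk_text(open("paper.txt").read())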
This destroys document structure. A chunk doesn’t know if it’s from the introduction or the conclusion. When you ask “What are the main findings?”, you might get text from the methodology section because it uses similar words.
User intent matters. “Results” means something specific. So does “page 47” or “the section about transformers.”
Structure-Aware Indexing
PDFs have structure. Table of contents. Sections. Subsections. Page numbers.
We extract this hierarchy during indexing:
Document
├── Abstract (pages 1-2)
├── Introduction (pages 2-5)
├── Methods (pages 5-12)
│ ├── Data Collection (pages 5-8)
│ └── Model Architecture (pages 8-12)
├── Results (pages 12-18)
│ ├── Quantitative Analysis (pages 12-15)
│ └── Qualitative Analysis (pages 15-18)
└── Conclusion (pages 18-20)
Every chunk gets tagged with its location in this tree. Not just the page number—the full path. A chunk knows it belongs to Results > Quantitative Analysis on page 14.
This metadata becomes filterable. You can query “all chunks from the Results section” or “pages 12-18” without relying on semantic similarity.
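A minimal sketch of that tagging (the TreeNode shape is an assumption; the chunk fields mirror the format shown under Implementation below):

from dataclasses import dataclass, field

@dataclass
class TreeNode:
    node_id: str      # e.g. "0005"
    title: str        # e.g. "Quantitative Analysis"
    path: str         # e.g. "Results > Quantitative Analysis"
    start_page: int
    end_page: int
    children: list["TreeNode"] = field(default_factory=list)

def tag_chunk(text: str, page: int, node: TreeNode) -> dict:
    """Attach the chunk's full location in the tree, not just its page."""
    return {
        "text": text,
        "page_number": page,
        "tree_node_id": node.node_id,
        "tree_path": node.path,
        "tree_title": node.title,
    }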
The Page Number Problem
Here’s something nobody talks about: PDF page numbers lie.
A document’s table of contents might say “Chapter 2 starts on page 10.” But in the actual PDF file, that’s physical page 15—because the cover, title page, and table of contents itself aren’t numbered in the TOC.
We solve this with offset detection. Sample entries from the TOC, find where they actually appear in the document, calculate the offset. Now tree node assignments point to real pages.
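A sketch of that offset detection (taking the most common claimed-vs-actual difference is one reasonable heuristic, not necessarily the exact one used):

from collections import Counter

def detect_offset(toc_entries: list[tuple[str, int]], pages: list[str]) -> int:
    """For sampled TOC entries (title, claimed_page), find the physical page
    where each title actually appears, then take the most common difference."""
    diffs = []
    for title, claimed in toc_entries[:10]:  # a small sample is enough
        for physical, text in enumerate(pages, start=1):
            if title.lower() in text.lower():
                diffs.append(physical - claimed)
                break
    return Counter(diffs).most_common(1)[0][0] if diffs else 0

# TOC says "Chapter 2" starts on page 10; it actually appears on physical
# page 15. The offset is 5, so every tree node's page range shifts by 5.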
Tree-Guided Search
The real innovation is in retrieval.
When a query comes in, we don’t immediately search vectors. First, an LLM analyzes the document structure:
Query: "What accuracy did the model achieve?"
LLM reasoning: "The user is asking about model performance metrics.
These would be in the Results section, likely under Quantitative Analysis.
Relevant nodes: 0004 (Results), 0005 (Quantitative Analysis)"
Only then do we search—but filtered to those specific sections. The vector search happens within a constrained scope, not across the entire document.
This is hybrid search, but not the keyword-vector hybrid everyone talks about. It’s a reasoning-vector hybrid: the LLM decides where to look based on document structure, and vector similarity finds what’s relevant within that location.
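Put together, retrieval might look like this (a sketch: select_nodes stands in for the LLM reasoning call, and chunks are assumed to carry unit-normalized vectors):

import numpy as np

def select_nodes(query: str, tree_outline: str) -> list[str]:
    """Stand-in for the reasoning step: prompt an LLM with the query and
    the serialized tree, then parse node IDs out of its answer."""
    ...

def tree_guided_search(query_vec: np.ndarray, chunks: list[dict],
                       node_ids: list[str], top_k: int = 10) -> list[dict]:
    # Vector search runs only inside the sections the LLM chose.
    scoped = [c for c in chunks if c["tree_node_id"] in node_ids]
    scored = [(float(np.dot(query_vec, c["vector"])), c) for c in scoped]
    return [c for _, c in sorted(scored, key=lambda p: p[0], reverse=True)[:top_k]]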
Why This Matters
Three concrete improvements:
1. Precision over recall. When you ask about results, you get results. Not methodology that happens to mention accuracy numbers.
2. Structural queries work. “Summarize the introduction” or “What’s on page 47?” become exact operations, not fuzzy similarity searches.
3. Smaller search space. Instead of searching 500 chunks, you might search 50. Faster, cheaper, more relevant.
Implementation
The pipeline:
- Extract pages — PyMuPDF for text, vision models for image-heavy pages
- Build tree — TOC detection, hierarchy extraction, offset calculation
- Tag chunks — Every chunk gets tree_node_id, tree_path, page_number
- Index — Embeddings plus structured metadata in TurboPuffer
- Search — LLM reasons over tree, filters, then vector similarity within scope
Chunks carry full context:
{
  "text": "The model achieved 94.2% accuracy on the test set...",
  "page_number": 14,
  "tree_node_id": "0005",
  "tree_path": "Results > Quantitative Analysis",
  "tree_title": "Quantitative Analysis"
}
The tree itself is stored separately, passed to the reasoning step before any search happens.
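One way to serialize the tree for that reasoning step (the outline format is illustrative, reusing the TreeNode sketch from earlier):

def tree_to_outline(node: TreeNode, depth: int = 0) -> str:
    """Render one line per node: ID, title, page range. The LLM reads this
    outline, never the chunk text, when deciding where to search."""
    line = f"{'  ' * depth}[{node.node_id}] {node.title} (pages {node.start_page}-{node.end_page})"
    return "\n".join([line] + [tree_to_outline(c, depth + 1) for c in node.children])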
Tools That Understand Structure
This architecture enables tools that actually respect document structure:
/tree — Returns the full document hierarchy, not just a list of pages
/read — Read by page number or section ID, not just path
/grep — Regex search scoped to specific sections or page ranges
An agent can explore a PDF the way a human would. Browse the structure, read a specific section, search within a chapter.
When It Fails
This approach requires detectable structure. A scanned receipt with no TOC falls back to standard chunking.
Documents with unusual formatting—multi-column layouts, embedded tables, figure-heavy pages—need vision model fallbacks. We detect low-text pages and describe them visually.
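Detecting those low-text pages is straightforward with PyMuPDF (the character threshold here is an arbitrary example):

import fitz  # PyMuPDF

def low_text_pages(path: str, min_chars: int = 200) -> list[int]:
    """Pages with little extractable text get routed to a vision model."""
    doc = fitz.open(path)
    return [i for i, page in enumerate(doc) if len(page.get_text().strip()) < min_chars]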
The LLM reasoning step adds latency. For simple keyword lookups, pure vector search is faster. Tree reasoning pays off on intent-rich queries.
Results
On research papers, tree-guided search returns relevant chunks 40% more often than pure vector search when queries reference document structure (“in the methods section”, “the main results”, “early in the paper”).
For agents working with PDFs—research assistants, document analysis tools, legal review systems—structure awareness changes what’s possible.
Try It
This is live in Nia. Index any arXiv paper and use nia_explore to see the tree, nia_read to read by section, nia_grep to search within specific parts.
# See document structure
nia_explore(source_type="documentation", doc_source_id="arxiv-paper")
# Read a specific section
nia_read(source_type="documentation", doc_source_id="arxiv-paper", tree_node_id="0005")
# Search within Results only
nia_grep(source_type="documentation", doc_source_id="arxiv-paper",
pattern="accuracy", tree_node_id="0004")
PDFs have structure. Your search should use it.
Try Nia at trynia.ai.