Tree Reasoning: A New Approach to PDF Search
Traditional RAG treats PDFs as bags of text. Chunk everything, embed it, hope vector similarity finds what you need.
This fails for complex documents. A research paper’s “Results” section might be semantically similar to its “Methods” section—both discuss experiments. A legal contract’s “Indemnification” clause might share vocabulary with “Limitation of Liability.” Vector search can’t tell the difference without structural awareness.
We built something different.
The Problem with Chunking
Standard PDF processing:
- Extract text
- Split into fixed-size chunks (512-1024 tokens)
- Generate embeddings
- Store in vector database
- Search by cosine similarity
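In code, the naive pipeline reduces to something like this (a minimal sketch: whitespace splitting stands in for a real tokenizer, and paper.txt is a hypothetical input):

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Fixed-size token windows, blind to any document structure."""
    tokens = text.split()  # stand-in for a real tokenizer
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

# A chunk from Results and a chunk from Methods look identical here:
# just text, with no record of where it came from.
chunks = chunk_text(open("paper.txt").read())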
This destroys document structure. A chunk doesn’t know if it’s from the introduction or the conclusion. When you ask “What are the main findings?”, you might get text from the methodology section because it uses similar words.
User intent matters. “Results” means something specific. So does “page 47” or “the section about transformers.”
Structure-Aware Indexing
PDFs have structure. Table of contents. Sections. Subsections. Page numbers.
We extract this hierarchy during indexing:
Document
├── Abstract (pages 1-2)
├── Introduction (pages 2-5)
├── Methods (pages 5-12)
│ ├── Data Collection (pages 5-8)
│ └── Model Architecture (pages 8-12)
├── Results (pages 12-18)
│ ├── Quantitative Analysis (pages 12-15)
│ └── Qualitative Analysis (pages 15-18)
└── Conclusion (pages 18-20)
Every chunk gets tagged with its location in this tree. Not just the page number—the full path. A chunk knows it belongs to Results > Quantitative Analysis on page 14.
This metadata becomes filterable. You can query “all chunks from the Results section” or “pages 12-18” without relying on semantic similarity.
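A minimal sketch of that tagging (the TreeNode shape is an assumption; the chunk fields mirror the format shown under Implementation below):

from dataclasses import dataclass, field

@dataclass
class TreeNode:
    node_id: str      # e.g. "0005"
    title: str        # e.g. "Quantitative Analysis"
    path: str         # e.g. "Results > Quantitative Analysis"
    start_page: int
    end_page: int
    children: list["TreeNode"] = field(default_factory=list)

def tag_chunk(text: str, page: int, node: TreeNode) -> dict:
    """Attach the chunk's full location in the tree, not just its page."""
    return {
        "text": text,
        "page_number": page,
        "tree_node_id": node.node_id,
        "tree_path": node.path,
        "tree_title": node.title,
    }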
The Page Number Problem
Here’s something nobody talks about: PDF page numbers lie.
A document’s table of contents might say “Chapter 2 starts on page 10.” But in the actual PDF file, that’s physical page 15—because the cover, title page, and table of contents itself aren’t numbered in the TOC.
We solve this with offset detection. Sample entries from the TOC, find where they actually appear in the document, calculate the offset. Now tree node assignments point to real pages.
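A sketch of that offset detection (taking the most common claimed-vs-actual difference is one reasonable heuristic, not necessarily the exact one used):

from collections import Counter

def detect_offset(toc_entries: list[tuple[str, int]], pages: list[str]) -> int:
    """For sampled TOC entries (title, claimed_page), find the physical page
    where each title actually appears, then take the most common difference."""
    diffs = []
    for title, claimed in toc_entries[:10]:  # a small sample is enough
        for physical, text in enumerate(pages, start=1):
            if title.lower() in text.lower():
                diffs.append(physical - claimed)
                break
    return Counter(diffs).most_common(1)[0][0] if diffs else 0

# TOC says "Chapter 2" starts on page 10; it actually appears on physical
# page 15. The offset is 5, so every tree node's page range shifts by 5.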
Tree-Guided Search
The real innovation is in retrieval.
When a query comes in, we don’t immediately search vectors. First, an LLM analyzes the document structure:
Query: "What accuracy did the model achieve?"
LLM reasoning: "The user is asking about model performance metrics.
These would be in the Results section, likely under Quantitative Analysis.
Relevant nodes: 0004 (Results), 0005 (Quantitative Analysis)"
Only then do we search—but filtered to those specific sections. The vector search happens within a constrained scope, not across the entire document.
This is hybrid search, but not the keyword-vector hybrid everyone talks about. It’s a reasoning-vector hybrid: the LLM decides where to look based on document structure, and vector similarity finds what’s relevant within that location.
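Put together, retrieval might look like this (a sketch: select_nodes stands in for the LLM reasoning call, and chunks are assumed to carry unit-normalized vectors):

import numpy as np

def select_nodes(query: str, tree_outline: str) -> list[str]:
    """Stand-in for the reasoning step: prompt an LLM with the query and
    the serialized tree, then parse node IDs out of its answer."""
    ...

def tree_guided_search(query_vec: np.ndarray, chunks: list[dict],
                       node_ids: list[str], top_k: int = 10) -> list[dict]:
    # Vector search runs only inside the sections the LLM chose.
    scoped = [c for c in chunks if c["tree_node_id"] in node_ids]
    scored = [(float(np.dot(query_vec, c["vector"])), c) for c in scoped]
    return [c for _, c in sorted(scored, key=lambda p: p[0], reverse=True)[:top_k]]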
Why This Matters
Three concrete improvements:
1. Precision over recall. When you ask about results, you get results. Not methodology that happens to mention accuracy numbers.
2. Structural queries work. “Summarize the introduction” or “What’s on page 47?” become exact operations, not fuzzy similarity searches.
3. Smaller search space. Instead of searching 500 chunks, you might search 50. Faster, cheaper, more relevant.
Implementation
The pipeline:
- Extract pages — PyMuPDF for text, vision models for image-heavy pages
- Build tree — TOC detection, hierarchy extraction, offset calculation
- Tag chunks — Every chunk gets tree_node_id, tree_path, page_number
- Index — Embeddings plus structured metadata in TurboPuffer
- Search — LLM reasons over tree, filters, then vector similarity within scope
Chunks carry full context:
{
  "text": "The model achieved 94.2% accuracy on the test set...",
  "page_number": 14,
  "tree_node_id": "0005",
  "tree_path": "Results > Quantitative Analysis",
  "tree_title": "Quantitative Analysis"
}
The tree itself is stored separately, passed to the reasoning step before any search happens.
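One way to serialize the tree for that reasoning step (the outline format is illustrative, reusing the TreeNode sketch from earlier):

def tree_to_outline(node: TreeNode, depth: int = 0) -> str:
    """Render one line per node: ID, title, page range. The LLM reads this
    outline, never the chunk text, when deciding where to search."""
    line = f"{'  ' * depth}[{node.node_id}] {node.title} (pages {node.start_page}-{node.end_page})"
    return "\n".join([line] + [tree_to_outline(c, depth + 1) for c in node.children])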
Tools That Understand Structure
This architecture enables tools that actually respect document structure:
/tree — Returns the full document hierarchy, not just a list of pages
/read — Read by page number or section ID, not just path
/grep — Regex search scoped to specific sections or page ranges
An agent can explore a PDF the way a human would. Browse the structure, read a specific section, search within a chapter.
When It Fails
This approach requires detectable structure. A scanned receipt with no TOC falls back to standard chunking.
Documents with unusual formatting—multi-column layouts, embedded tables, figure-heavy pages—need vision model fallbacks. We detect low-text pages and describe them visually.
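Detecting those low-text pages is straightforward with PyMuPDF (the character threshold here is an arbitrary example):

import fitz  # PyMuPDF

def low_text_pages(path: str, min_chars: int = 200) -> list[int]:
    """Pages with little extractable text get routed to a vision model."""
    doc = fitz.open(path)
    return [i for i, page in enumerate(doc) if len(page.get_text().strip()) < min_chars]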
The LLM reasoning step adds latency. For simple keyword lookups, pure vector search is faster. Tree reasoning pays off on intent-rich queries.
Results
On research papers, tree-guided search returns relevant chunks 40% more often than pure vector search when queries reference document structure (“in the methods section”, “the main results”, “early in the paper”).
For agents working with PDFs—research assistants, document analysis tools, legal review systems—structure awareness changes what’s possible.
Try It
This is live in Nia. Index any arXiv paper and use nia_explore to see the tree, nia_read to read by section, nia_grep to search within specific parts.
# See document structure
nia_explore(source_type="documentation", doc_source_id="arxiv-paper")
# Read a specific section
nia_read(source_type="documentation", doc_source_id="arxiv-paper", tree_node_id="0005")
# Search within Results only
nia_grep(source_type="documentation", doc_source_id="arxiv-paper",
pattern="accuracy", tree_node_id="0004")
PDFs have structure. Your search should use it.
Try Nia at trynia.ai.