PageIndex from VectifyAI offers embedding-free search for long documents

AI-processed from Habr AI; edited by Hamidun News

PageIndex is one of the most notable new contenders for the role of 'RAG without a vector database'. Instead of the familiar scheme with embeddings and chunks, the system builds a hierarchical table of contents for a document, with brief descriptions of each section, and then asks an LLM to reason over that structure and select the relevant nodes and their pages. The approach looks fresh, and for long, well-structured PDFs it can indeed work closer to how a human reads a document.

PageIndex's mechanics are relatively simple. The document is first split into pages, after which the model and supporting code compile an expanded TOC: a tree of sections with headings, page ranges, and a summary for each node. When a question comes in, the prompt contains neither the entire document nor a set of arbitrary text chunks, but this structure itself. The LLM selects the relevant branches of the tree, and only the pages attached to them are inserted into the final prompt.
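The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not PageIndex's actual code: the `TocNode` class, its field names, and the helper functions are all hypothetical, and `select_nodes` replaces the LLM reasoning step with naive word overlap so that the example runs without an API key.

```python
from dataclasses import dataclass, field


@dataclass
class TocNode:
    """One section in the document tree: heading, page range, short summary."""
    title: str
    start_page: int  # 1-based, inclusive
    end_page: int    # 1-based, inclusive
    summary: str
    children: list["TocNode"] = field(default_factory=list)


def render_outline(node: TocNode, depth: int = 0) -> str:
    """Flatten the tree into the indented outline that would go into the prompt."""
    line = f"{'  ' * depth}- {node.title} (pp. {node.start_page}-{node.end_page}): {node.summary}"
    return "\n".join([line] + [render_outline(c, depth + 1) for c in node.children])


def select_nodes(node: TocNode, question: str) -> list[TocNode]:
    """Hypothetical stand-in for the LLM step: keep the deepest nodes whose
    title or summary shares at least one word with the question."""
    words = set(question.lower().split())
    picked = [n for child in node.children for n in select_nodes(child, question)]
    if picked:
        return picked
    text = f"{node.title} {node.summary}".lower().split()
    return [node] if words & set(text) else []


def pages_for(nodes: list[TocNode]) -> list[int]:
    """Collect the sorted, de-duplicated page numbers of the selected nodes."""
    return sorted({p for n in nodes for p in range(n.start_page, n.end_page + 1)})


# Toy tree with made-up sections and page ranges.
report = TocNode("Annual report", 1, 40, "Company results for the year", [
    TocNode("Revenue", 5, 12, "Revenue by segment and region"),
    TocNode("Risk factors", 13, 20, "Legal and market risks"),
])
selected = select_nodes(report, "What was revenue by region?")
print(render_outline(report))
print([n.title for n in selected], pages_for(selected))
```

In a real setup the outline produced by `render_outline` would be sent to the model, and the model's answer would take the place of `select_nodes`. Only the text of the pages returned by `pages_for` would then reach the final prompt.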

As a result, the system gets by without embeddings, without vector storage, and without artificial chunking, which often breaks meaning at fragment boundaries. This explains much of the interest in PageIndex. On long financial reports, legal documents, technical manuals, and educational materials, the approach looks natural: humans also typically start with the table of contents rather than sifting through fragments of text.

In the project repository, VectifyAI developers directly position the system as reasoning-based retrieval and claim that on FinanceBench it achieved 98.7% accuracy. For teams working with a single large document or a small collection of complex PDFs, this sounds like a strong alternative to the conventional RAG pipeline, especially if you want more interpretable search with clear references to sections and pages.

But the main question is not whether vector search can be replaced by PageIndex, but where this approach reaches its limits. The criticism here is rather pragmatic. First, the TOC also needs to be stored somewhere, especially if there's more than one document, so the talk of 'completely without an index' is slightly misleading.

Second, large collections still lack a convincing strategy for choosing which documents to open: metadata, keyword search, TF-IDF, and BM25 don't disappear and often remain a cheap first filter. Third, reasoning-based retrieval is almost inevitably more expensive in tokens and slower to respond. If a good vector RAG already delivers about 90% quality, the additional percentage points of accuracy can cost several times more, and that trade-off is not reasonable for every product.
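To make the 'cheap first filter' concrete, here is a small self-contained BM25 ranker over short document descriptions. It is an illustrative sketch and not part of PageIndex: the function names, the toy catalog, and the parameter defaults (`k1=1.5`, `b=0.75`) are assumptions, and the IDF uses the smoothed `log(1 + ...)` variant so that scores never go negative.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer; a real system would do more."""
    return text.lower().split()


def bm25_rank(query: str, docs: dict[str, str],
              k1: float = 1.5, b: float = 0.75) -> list[tuple[str, float]]:
    """Score every document against the query with Okapi BM25, best first."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for t in tokens.values() for term in set(t))
    query_terms = set(tokenize(query))
    scores = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up catalog: file name -> short description or metadata text.
catalog = {
    "annual_report.pdf": "annual financial report revenue profit segments",
    "user_manual.pdf": "installation steps and troubleshooting for the device",
    "lease_contract.pdf": "lease agreement terms termination and liability",
}
ranking = bm25_rank("revenue by segments", catalog)
shortlist = [name for name, score in ranking if score > 0][:2]
print(shortlist)
```

Only the documents in `shortlist` would then be passed on to the slower, token-heavy tree selection, which is where the cost saving comes from.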

Practice also shows limitations. Reviewers note that PageIndex performed poorly on literary text without explicit structure: if a document has no sections or headings, there is simply nothing to build a 'smart table of contents' from. Results were better on academic text, which has a proper hierarchy of sections.

You can run the system locally from the open repository: install the dependencies, set an API key for a compatible LLM via LiteLLM, and run a PDF or markdown file through run_pageindex.py. There are caveats here too: the author of the original article warns about the LiteLLM version, advises against installing the cloud pageindex package from pip for local work, and reports that with weak local models the tree quality degrades noticeably, while the process itself can take dozens of minutes even on a relatively small document.

What does this mean in practice? PageIndex looks less like a vector search killer and more like a useful new layer in RAG architecture. It is best seen not as a direct replacement but as a specialized tool for long, structured documents where explainability, navigation precision, and page-by-page processing matter. The most realistic scenario is hybrid: first a cheap search by metadata or vectors, then PageIndex for precise section selection. That compromise better reflects reality: there is no universal replacement for vector RAG yet, but document-first approaches like PageIndex already have a clear niche of their own.
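The hybrid scenario can be expressed as a two-stage function. The sketch below is an assumption about how such a pipeline might be wired, not an API from PageIndex: `hybrid_retrieve`, the two callable types, the tag table, and both stand-in functions are hypothetical, and the section selector returns a fixed page range instead of calling a model.

```python
from typing import Callable

# Hypothetical signatures for the two stages.
DocFilter = Callable[[str, list[str]], list[str]]              # (question, doc ids) -> shortlist
SectionSelector = Callable[[str, str], list[tuple[int, int]]]  # (question, doc id) -> page ranges


def hybrid_retrieve(question: str, doc_ids: list[str], cheap_filter: DocFilter,
                    section_selector: SectionSelector,
                    max_docs: int = 3) -> dict[str, list[tuple[int, int]]]:
    """Stage 1 narrows the collection cheaply; stage 2 runs the costly
    tree-based selection only on the documents that survive."""
    shortlist = cheap_filter(question, doc_ids)[:max_docs]
    return {doc_id: section_selector(question, doc_id) for doc_id in shortlist}


# Toy stand-ins so the sketch runs end to end.
TAGS = {
    "annual_report.pdf": {"finance"},
    "user_manual.pdf": {"hardware"},
    "lease_contract.pdf": {"legal"},
}
expensive_calls: list[str] = []


def filter_by_tag(question: str, doc_ids: list[str]) -> list[str]:
    """Keep documents whose metadata tag appears as a word in the question."""
    words = set(question.lower().split())
    return [d for d in doc_ids if TAGS[d] & words]


def fake_tree_selector(question: str, doc_id: str) -> list[tuple[int, int]]:
    """Placeholder for the LLM-driven section selection; logs each call."""
    expensive_calls.append(doc_id)
    return [(5, 12)]


result = hybrid_retrieve("finance question about revenue", list(TAGS),
                         filter_by_tag, fake_tree_selector)
print(result, expensive_calls)
```

The `expensive_calls` log shows the point of the design: the costly step runs once per shortlisted document rather than once per document in the collection.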
