NVIDIA introduces NeMo Retriever — agentic search for complex enterprise data
AI-processed from Hugging Face Blog; edited by Hamidun News
NVIDIA unveiled a new agentic pipeline for NeMo Retriever, a search system that goes beyond semantic similarity matching. Instead of issuing a single query, it runs a loop of search, evaluation of intermediate results, and strategy refinement, which helped the solution take first place on the ViDoRe v3 leaderboard and second place on BRIGHT.
Why one search is not enough
Classical dense retrieval works well when it is sufficient to find documents semantically similar to the query. But in enterprise scenarios, this is often not enough: documents can be visually complex, queries can be composite, and answers can be scattered across multiple sources. In such tasks, you need not just embedding matching, but the ability to break down a question into parts, test hypotheses, and change search direction several times.
NVIDIA describes this as a gap between two types of systems. Retrievers can quickly scan huge amounts of data, but barely reason. Large language models can plan and make logical inferences, but cannot immediately process millions of documents.
Agentic retrieval should close this gap by combining both approaches in a single cycle.
How the cycle works
The pipeline is built on the ReAct architecture. The agent does not treat a task as "one query, one result" but acts step by step: it thinks, calls the retrieve(query, top_k) tool, analyzes what was found, and decides what to do next. The final answer is assembled through a separate final_results tool that returns a list of the most relevant documents. According to the team, several useful patterns emerged naturally during the process:
- generation of more precise queries as new facts appear;
- constant rephrasing until the system finds a useful signal;
- breaking down a complex question into several simple subtasks;
- re-ranking the found documents before final selection.
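The think, act, observe cycle described above can be sketched in a few lines. This is a minimal illustration, not NVIDIA's implementation: only the tool names retrieve and final_results come from the article, while AgentState, run_agent, the toy corpus, and the scripted policy (a stand-in for the LLM) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# An action is (tool_name, arguments). Tool names mirror the article;
# every other name in this sketch is a hypothetical placeholder.
Action = Tuple[str, dict]


@dataclass
class AgentState:
    question: str
    queries: List[str] = field(default_factory=list)         # queries issued so far
    attempts: List[List[str]] = field(default_factory=list)  # one ranked list per call


def run_agent(
    question: str,
    policy: Callable[[AgentState], Action],
    retrieve: Callable[[str, int], List[str]],
    max_steps: int = 10,
) -> Tuple[List[str], AgentState]:
    """Loop think -> act -> observe until the policy calls final_results."""
    state = AgentState(question)
    for _ in range(max_steps):
        tool, args = policy(state)  # "think": choose the next action
        if tool == "final_results":
            return list(args["doc_ids"]), state
        if tool != "retrieve":
            raise ValueError(f"unknown tool: {tool}")
        hits = retrieve(args["query"], args["top_k"])  # "act"
        state.queries.append(args["query"])
        state.attempts.append(hits)  # "observe"
    # Step budget exhausted: this sketch returns no answer and leaves
    # any fallback to the caller.
    return [], state


CORPUS = {
    "d1": "q3 revenue by region table",
    "d2": "emea headcount chart",
    "d3": "q3 emea revenue commentary",
}


def toy_retrieve(query: str, top_k: int) -> List[str]:
    """Keyword-overlap stand-in for a dense retriever."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.split())), doc_id) for doc_id, text in CORPUS.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:top_k]]


def scripted_policy(state: AgentState) -> Action:
    """Stand-in for the LLM: a broad query, then a refined one, then finish."""
    plan = ["q3 revenue", "emea revenue commentary"]
    if len(state.queries) < len(plan):
        return "retrieve", {"query": plan[len(state.queries)], "top_k": 2}
    # Keep the top hit of the most recent (refined) search.
    return "final_results", {"doc_ids": state.attempts[-1][:1]}


results, trace = run_agent("How did EMEA revenue do in Q3?", scripted_policy, toy_retrieve)
```

In a real system the policy would be an LLM call that sees the question plus all previous observations; keeping every ranked list in the state is what makes later re-ranking or a fallback possible.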
If the agent hits the step limit or exhausts its context window, the pipeline does not fail outright. Reciprocal Rank Fusion (RRF) serves as a fallback: each document receives a final score based on its rank across the different search attempts, so the system still returns a meaningful set of results.
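As a rough illustration of that fallback, the standard RRF formula gives each document the score sum over attempts of 1 / (k + rank). The sketch below assumes k = 60, a commonly used default; the article does not say which constant or cutoff NVIDIA uses, and the function name and document ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(attempts: List[List[str]], k: int = 60, top_n: int = 10) -> List[str]:
    """Fuse several ranked lists of doc ids into one ranking.

    Assumes each ranked list contains no duplicate ids. k=60 is an
    assumed default, not a value taken from the article.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in attempts:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Three search attempts made before the agent ran out of budget.
attempts = [["d1", "d3", "d2"], ["d3", "d1"], ["d3", "d4"]]
fused = reciprocal_rank_fusion(attempts)
# fused == ["d3", "d1", "d4", "d2"]
```

A document that appears near the top of several attempts (d3 here) outranks one that tops a single attempt, which is why the fused list stays useful even when no individual query was ideal.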
Where the pipeline won
The main result is not just a leaderboard position but generality. The same pipeline, with no change to the base architecture, took first place on ViDoRe v3 with an NDCG@10 of 69.22 and second place on BRIGHT with an NDCG@10 of 50.90. The first benchmark targets visually rich and diverse enterprise documents, the second tasks that require more multi-step reasoning. The authors also compared their approach with more specialized solutions.
For example, INF-X-Retriever leads BRIGHT with a score of 63.40, but on ViDoRe v3, in the same configuration with nemotron-colembed-vl-8b-v2, it scored 62.31, lower even than plain dense retrieval with the same embedding model at 64.36. NVIDIA uses this comparison to argue for a generalizable approach: the agentic loop transfers better across task types than pipelines tailored to a single benchmark.
The team also reworked the infrastructure for speed.
Initially, the retriever was deployed as an MCP server, which is logical for LLM access to external tools. But in practice, this added extra network calls, a separate process, risk of silent configuration errors and failures under load. As a result, the MCP scheme was replaced with a thread-safe singleton retriever inside the process: the model and embeddings are loaded once, access is synchronized through a lock, and the retrieve() interface remains the same.
This eliminated an entire class of operational problems and accelerated experiments.
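A thread-safe, in-process singleton of the kind described here might look like the sketch below. It is an assumption-laden illustration: the Retriever class, its placeholder document list, and the load_count counter are hypothetical, and the real NeMo Retriever code may be organized quite differently.

```python
import threading
from typing import List, Optional


class Retriever:
    """Hypothetical in-process retriever whose load cost is paid once."""

    _instance: Optional["Retriever"] = None
    _init_lock = threading.Lock()
    load_count = 0  # instrumentation for this example only

    def __init__(self) -> None:
        # Placeholder for loading the embedding model and document index.
        Retriever.load_count += 1
        self._docs = ["alpha report", "beta report", "gamma memo"]
        self._call_lock = threading.Lock()

    @classmethod
    def get(cls) -> "Retriever":
        # Double-checked locking: unlocked fast path, locked slow path.
        if cls._instance is None:
            with cls._init_lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def retrieve(self, query: str, top_k: int) -> List[str]:
        # Serialize calls so concurrent agent threads never touch
        # shared model state at the same time.
        with self._call_lock:
            terms = query.lower().split()
            hits = [doc for doc in self._docs if any(t in doc for t in terms)]
            return hits[:top_k]


def run_concurrently(n_threads: int = 8) -> List[List[str]]:
    """Call the shared retriever from several threads at once."""
    results: List[List[str]] = [[] for _ in range(n_threads)]

    def worker(i: int) -> None:
        results[i] = Retriever.get().retrieve("report", top_k=2)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


outputs = run_concurrently(8)
```

Because callers only ever see retrieve(query, top_k), swapping a networked tool server for this in-process object does not change the agent code, which matches the article's point that the interface stayed the same.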
The cost of autonomous search
NVIDIA states directly that this quality comes at a price. Agentic retrieval is noticeably slower and more expensive than plain dense retrieval. On ViDoRe v3, one query took 136.3 seconds on average, consumed approximately 760 thousand input tokens and 6.3 thousand output tokens, and involved an average of 9.2 search calls. For real-time tasks this is a heavy profile, especially under high query volume.
The team also compared closed and open models. On ViDoRe v3, the combination with Opus 4.5 performed best, but switching to the open-weight gpt-oss-120b caused only a moderate drop in quality, from 69.22 to 66.38. On BRIGHT the gap was larger, which suggests that complex reasoning tasks depend more heavily on powerful frontier models. NVIDIA's next step is to try to transfer these agentic patterns to more compact, specialized open models to reduce cost and latency without significant quality loss.
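To put the per-query ViDoRe v3 averages quoted in this section in perspective, the short calculation below derives a few rough ratios from them. The inputs are the article's figures; the function names are hypothetical, and the throughput estimate assumes each worker handles one query at a time with no batching, which may not match a real deployment.

```python
# Averages reported in the article for one ViDoRe v3 query.
SECONDS_PER_QUERY = 136.3
INPUT_TOKENS = 760_000   # "approximately 760 thousand"
OUTPUT_TOKENS = 6_300
RETRIEVE_CALLS = 9.2


def per_call_averages() -> dict:
    """Spread the per-query totals evenly over the retrieve calls.

    These are coarse ratios: the time and tokens also cover the model's
    reasoning between calls, not just the retrieval itself.
    """
    return {
        "seconds_per_retrieve_call": SECONDS_PER_QUERY / RETRIEVE_CALLS,
        "input_tokens_per_retrieve_call": INPUT_TOKENS / RETRIEVE_CALLS,
        "input_to_output_ratio": INPUT_TOKENS / OUTPUT_TOKENS,
    }


def queries_per_hour(workers: int = 1) -> float:
    """Throughput assuming each worker runs one query at a time."""
    return workers * 3600.0 / SECONDS_PER_QUERY


averages = per_call_averages()
hourly = queries_per_hour()
```

Under these assumptions the figures work out to roughly 14.8 seconds and about 82,600 input tokens per retrieve call, around 120 input tokens for every output token, and about 26 queries per hour for a single sequential worker.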
What this means
Search across enterprise data is rapidly moving away from the "enter a query — get similar documents" model. NVIDIA shows that the next level is an agent that can search iteratively, change tactics, and combine reasoning with retrieval. While this approach is currently expensive and slow, for complex high-stakes scenarios it already looks like a working architecture, not a laboratory experiment.