NVIDIA Introduced Nemotron 3 Nano Omni for Long Documents, Audio, Video, and AI Agents

NVIDIA introduced Nemotron 3 Nano Omni, a multimodal model for documents, audio, video, and agent tasks in interfaces. It can process documents of 100+ pages, work with long speech recordings, and analyze screenshots. According to the company's published benchmarks, the model significantly outperforms the previous Nemotron Nano V2 VL and is more efficient than several open-weight alternatives.

Khamidun Zhemal

AI monitoring · Hugging Face Blog

Apr 28, 2026 · 3 min

AI-processed from Hugging Face Blog; edited by Hamidun News

Image source: Hugging Face Blog. Collage: Hamidun News.


On 28 April 2026, NVIDIA introduced Nemotron 3 Nano Omni, a long-context multimodal model for documents, audio, video and agent scenarios in interfaces. The company is betting on practical tasks: from parsing complex PDFs and screen recordings to speech recognition and reasoning across several data types at once.

What tasks does Nemotron 3 Nano Omni solve?

Nemotron 3 Nano Omni is positioned as more than an OCR model or yet another VLM for images. NVIDIA describes it as a general-purpose system for five classes of workloads: analysis of real documents, automatic speech recognition, understanding of long videos with audio, assistance in GUI scenarios, and general multimodal reasoning. The target is not short demo examples but documents with tables, formulas and cross-references between pages, as well as slides, screenshots and screen recordings with voice commentary.

In the document scenario, according to the company, the model handles files of more than 100 pages and has to keep track of both fine details and overall structure. For audio and video, the emphasis is on long material: educational videos, meetings with slides, product demos and support recordings. For agent tasks, the key capability is working with screenshots and interface state: the model can interpret what is on the screen and help choose the next action.

Multi-page contracts, reports and technical documents
Screen recordings and tutorials with voice accompaniment
Recognition of long speech with noise, accents and different speakers
Analysis of GUI and screenshots for computer-use scenarios

What's inside the model

The architecture is built around the Nemotron 3 Nano 30B-A3B language backbone and two specialized encoders: C-RADIOv4-H for visual data and Parakeet-TDT-0.6B for audio. Lightweight projectors connect each encoder to the LLM, mapping all modalities into a single sequence of tokens.

Inside the backbone, NVIDIA uses a hybrid approach: 23 Mamba layers for long context, 23 MoE layers with 128 experts and top-6 routing, and 6 attention layers for global connections and complex reasoning. Special emphasis is placed on handling dense visual data efficiently. Instead of the tiling used in the previous version, the model uses dynamic resolution at the native aspect ratio, allocating between 1,024 and 13,312 visual patches per image.
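To make "128 experts with top-6 routing" concrete, here is a minimal sketch of generic top-k expert routing for a single token. Only the two constants come from the article. The softmax gate, the renormalization of the selected weights, and the names route_token and softmax are illustrative assumptions, not NVIDIA's implementation.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 6          # from the article


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token (hypothetical helper).

    Returns (expert_index, weight) pairs, highest weight first,
    with the weights renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    # Indices of the top_k largest gate probabilities.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


# Random logits stand in for the output of a learned router layer.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
assignment = route_token(logits)
print(len(assignment), "of", NUM_EXPERTS, "experts active")
```

With routing of this kind, each token runs through only 6 of the 128 expert networks, which is how a model with a large total parameter count can keep per-token compute low.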

For video, two compression mechanisms are applied. Conv3D merges pairs of adjacent frames before they are fed into the ViT, and EVS discards static tokens at inference time, keeping only the ones that change. For audio, the key step is the move to native input: the model works not only with a transcript but with the audio track itself. It was trained on segments of up to 20 minutes, with the overall LLM context claimed to span more than five hours.
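The two video compression steps can be illustrated with a toy sketch. Everything here is an assumption made for illustration: each token is a single number, plain averaging stands in for the learned Conv3D, and a fixed change threshold stands in for the EVS criterion, whose real form the article does not describe.

```python
def merge_frame_pairs(frames):
    """Average each pair of adjacent frames (stand-in for a learned Conv3D)."""
    merged = []
    for i in range(0, len(frames) - 1, 2):
        first, second = frames[i], frames[i + 1]
        merged.append([(x + y) / 2 for x, y in zip(first, second)])
    if len(frames) % 2:  # keep an unpaired trailing frame unchanged
        merged.append(list(frames[-1]))
    return merged


def prune_static_tokens(frames, threshold=0.1):
    """Keep every token of the first frame, then only tokens that changed.

    Returns (frame_index, token_index, value) triples for the kept tokens.
    The threshold is a hypothetical parameter.
    """
    kept = [(0, j, v) for j, v in enumerate(frames[0])]
    for t in range(1, len(frames)):
        for j, v in enumerate(frames[t]):
            if abs(v - frames[t - 1][j]) > threshold:
                kept.append((t, j, v))
    return kept


# Six frames of four tokens each; only token 1 changes over time.
video = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.5, 1.0, 1.0],
    [0.0, 0.5, 1.0, 1.0],
    [0.0, 1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0, 1.0],
]
merged = merge_frame_pairs(video)
kept = prune_static_tokens(merged)
print(len(video) * 4, "->", len(merged) * 4, "->", len(kept))  # 24 -> 12 -> 6
```

In this toy input, pair merging halves the token count and pruning halves it again, because the static background tokens are sent only once.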

Results and availability

In published benchmarks, Nemotron 3 Nano Omni improves significantly on Nemotron Nano V2 VL and often outperforms Qwen3-Omni 30B-A3B. According to NVIDIA, the model scores 57.5 on MMLongBench-Doc versus 38.0 for the previous version, 65.8 on OCRBenchV2-En and 63.6 on CharXiv reasoning.

In GUI tasks it scores 47.4 on OSWorld versus 11.0 for the previous model, and in multimodal video it reaches 72.2 on Video-MME, 55.4 on WorldSense and 74.1 on DailyOmni.

For audio, NVIDIA claims 89.4 on VoiceBench and a WER of 5.95 on HF Open ASR (for WER, lower is better).
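For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a standard textbook computation. The function name and the sample sentences are illustrative, and leaderboards typically normalize text (case, punctuation) before scoring, which is skipped here. A reported WER of 5.95 is presumably a percentage, roughly 6 errors per 100 reference words.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the model reads long documents and audio"
hypothesis = "the model read long documents audio"
wer = word_error_rate(reference, hypothesis)
print(f"{wer * 100:.2f}% WER")  # one substitution + one deletion over 7 words
```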

Cost and speed matter just as much to developers. NVIDIA reports a 7.4x gain in system efficiency in multi-document scenarios and 9.2x in video use cases compared to other open multimodal models at comparable interactivity. The company also claims up to 2.9x higher speed for single-stream reasoning in multimodal tasks.

Checkpoints are already posted on Hugging Face in BF16, FP8 and NVFP4 formats, so the model can be tested not only as a research release, but also as a foundation for applied pipelines.
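As a rough guide to what the three formats mean in practice, the sketch below estimates weight storage from bits per parameter. The 30 billion figure is an assumption read off the "30B" in the backbone name. The estimate ignores the vision and audio encoders, quantization scale factors, any layers left in higher precision, and runtime memory such as activations and cache, so real checkpoint sizes will differ.

```python
# Nominal storage per parameter for each format, in bytes.
BYTES_PER_PARAM = {
    "BF16": 2.0,   # 16 bits
    "FP8": 1.0,    # 8 bits
    "NVFP4": 0.5,  # 4 bits, before any scale-factor overhead
}

# Assumption: total backbone parameters, taken from the "30B" in the model name.
TOTAL_PARAMS = 30e9


def weight_gigabytes(params, fmt):
    """Back-of-the-envelope weight size in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[fmt] / 1e9


for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: ~{weight_gigabytes(TOTAL_PARAMS, fmt):.0f} GB of weights")
```

Under these assumptions the weights shrink from about 60 GB in BF16 to about 30 GB in FP8 and about 15 GB in NVFP4, which is the main reason lower-precision checkpoints are attractive for deployment on smaller GPUs.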

What this means

NVIDIA is clearly aiming past yet another showcase demo and toward practical enterprise scenarios that require reading long documents, understanding speech, seeing the interface and holding a large context at once, without a sharp increase in cost. If the claimed metrics hold up in real integrations, Nemotron 3 Nano Omni will be a strong open-weight candidate for document AI, video understanding and computer-use agents.

