NVIDIA Releases Nemotron 3 Nano 4B, a Compact Hybrid Model for On-Device Deployment
AI-processed from Hugging Face Blog; edited by Hamidun News
NVIDIA has opened access to Nemotron 3 Nano 4B, a compact language model with 4 billion parameters developed specifically for edge deployment on Jetson devices, RTX GPUs, and DGX Spark. It is NVIDIA's first 4B model built on a hybrid Mamba-Transformer architecture, with a focus on minimal memory consumption and high inference speed.
A New-Generation Hybrid Architecture
At the core of Nemotron 3 Nano 4B is a 42-layer stack: 21 Mamba blocks, 4 attention blocks, and 17 MLP blocks. This ratio is atypical for language models of this size, since most competitors are built exclusively on transformers. The main source of efficiency is the Mamba layers: their compute grows linearly with sequence length, where self-attention grows quadratically, and they carry a fixed-size state instead of a key-value cache that grows with every token.
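The layer mix and the scaling argument can be checked with a few lines of Python. This is a minimal sketch: the layer counts come from the article, while the two cost functions are deliberately simplified stand-ins (hypothetical names) that only count how work grows with sequence length, not a model of the real kernels.

```python
# Layer counts as stated in the article.
LAYERS = {"mamba": 21, "attention": 4, "mlp": 17}


def layer_shares(layers):
    """Return each block type's share of the total layer count."""
    total = sum(layers.values())
    return {name: count / total for name, count in layers.items()}


def attention_score_entries(seq_len):
    """Simplified cost: one attention head scores every token pair (quadratic)."""
    return seq_len * seq_len


def scan_steps(seq_len):
    """Simplified cost: a recurrent state-space scan visits each token once (linear)."""
    return seq_len


total_layers = sum(LAYERS.values())
shares = layer_shares(LAYERS)
print(f"total layers: {total_layers}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# The gap between the two simplified costs widens as the input gets longer.
for n in (1_000, 8_000, 49_000):
    ratio = attention_score_entries(n) // scan_steps(n)
    print(f"seq_len={n}: quadratic/linear ratio = {ratio}")
```

Under this simplification the quadratic-to-linear ratio equals the sequence length itself, which is why the benefit is most visible on long inputs.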
The attention blocks are placed strategically to preserve accuracy where global context understanding is critical. Compared to the parent model, Nemotron Nano 9B v2, the embedding dimension was reduced from 4,480 to 3,136, the number of Mamba heads from 128 to 96, and the number of layers from 56 to 42. The result is the smallest VRAM footprint in the 4B class when tested on an RTX 4070, along with the lowest Time-to-First-Token (TTFT) latency in that class for long input sequences.
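To put those reductions in proportion, the short sketch below computes how much of each dimension the 4B model keeps relative to the 9B parent. The numbers are the ones quoted above; the dictionary keys and the helper name are illustrative only.

```python
# Dimensions quoted in the article for the parent and the compressed model.
PARENT_9B = {"embedding_dim": 4480, "mamba_heads": 128, "layers": 56}
NANO_4B = {"embedding_dim": 3136, "mamba_heads": 96, "layers": 42}


def retained_fraction(parent, child):
    """Fraction of each parent dimension that the smaller model keeps."""
    return {key: child[key] / parent[key] for key in parent}


retained = retained_fraction(PARENT_9B, NANO_4B)
for key, fraction in retained.items():
    print(f"{key}: kept {fraction:.0%} of the parent")
```

The embedding dimension shrinks to 70% of the parent's, while Mamba heads and depth both shrink to 75%.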
Four Training Stages
Nemotron 3 Nano 4B is not simply a trimmed 9B model: it has its own four-stage training pipeline. The first stage is compression through Nemotron Elastic, in which neural architecture search (NAS) with a trained router determined where to prune the 9B network. The router operated along four axes: Mamba heads, hidden dimension, FFN channels, and model depth. The second stage is distillation for accuracy recovery:
- Short context (8K, 63B tokens): 70% post-training + 30% pretraining data
- Long context (49K, 150B tokens): window expansion for complex reasoning tasks
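The distillation budget in the two bullets above can be checked with quick arithmetic. The sketch below uses only the token counts and the 70/30 mix stated there; variable and function names are illustrative, and the article does not give a data mix for the long-context phase, so none is assumed.

```python
# Token budgets from the article, in billions of tokens.
SHORT_CONTEXT_TOKENS_B = 63
LONG_CONTEXT_TOKENS_B = 150

# Data mix stated for the short-context phase only.
SHORT_CONTEXT_MIX = {"post_training": 0.70, "pretraining": 0.30}


def split_tokens(total_b, mix):
    """Split a token budget (in billions) according to a fractional data mix."""
    return {name: total_b * fraction for name, fraction in mix.items()}


short_split = split_tokens(SHORT_CONTEXT_TOKENS_B, SHORT_CONTEXT_MIX)
distillation_total_b = SHORT_CONTEXT_TOKENS_B + LONG_CONTEXT_TOKENS_B

for name, tokens_b in short_split.items():
    print(f"short context, {name}: {tokens_b:.1f}B tokens")
print(f"distillation total: {distillation_total_b}B tokens")
```

That works out to roughly 44.1B post-training and 18.9B pretraining tokens in the short-context phase, and 213B tokens across both phases.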
The third stage is supervised fine-tuning (SFT), done in two passes covering math, code, science, chat, agent tasks, and safety. The fourth is three-stage reinforcement learning via NeMo-RL: from single-turn instruction following, to multi-turn tasks with JSON/XML outputs, and on to tool calling. Reasoning and non-reasoning data are mixed 50/50, and the KL penalty is tightened progressively.
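For readers unfamiliar with KL penalties in reinforcement learning, the sketch below shows the common pattern of subtracting a scaled KL estimate from the task reward so the policy stays close to a reference model. It is not NeMo-RL code and not NVIDIA's recipe: the function, the stage names, and every coefficient are hypothetical, and it assumes that "tightening" means a larger coefficient in later stages.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta):
    """Task reward minus beta times a sample-based KL estimate.

    logp_policy and logp_ref hold per-token log-probabilities of the sampled
    response under the policy being trained and under a frozen reference model.
    """
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return reward - beta * kl_estimate


# Hypothetical coefficients: a larger beta penalizes drift from the reference more.
STAGE_BETAS = {
    "single_turn": 0.01,
    "multi_turn_structured": 0.02,
    "tool_calling": 0.05,
}

# Toy log-probabilities for a three-token response (illustrative values only).
logp_policy = [-0.5, -1.0, -0.2]
logp_ref = [-0.7, -1.5, -0.4]

stage_rewards = {
    stage: kl_penalized_reward(1.0, logp_policy, logp_ref, beta)
    for stage, beta in STAGE_BETAS.items()
}
for stage, value in stage_rewards.items():
    print(f"{stage}: penalized reward = {value:.3f}")
```

With the same response and task reward, the penalized reward falls as the coefficient rises, which is the sense in which a tighter penalty discourages the policy from drifting.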
Numbers That Matter
On a Jetson Orin Nano with 4-bit GGUF (Q4_K_M) quantization, the model delivers 18 tokens/sec, twice as fast as Nemotron Nano 9B v2 on the same hardware. FP8 quantization via ModelOpt maintains 100% median accuracy recovery with up to a 1.8X latency/throughput improvement over BF16.
"FP8 quantization achieved 100% median accuracy recovery with up to 1.8X latency/throughput improvement over BF16" (from NVIDIA technical documentation).
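To make the throughput figures concrete, the sketch below converts them into wall-clock time for a single reply. The 18 tokens/sec, 2X, and 1.8X figures are from the article; the 360-token reply length is a hypothetical example, and the 9B throughput is only what the "twice as fast" claim implies, not a measured value.

```python
# Figures quoted in the article.
NANO_4B_TOKENS_PER_SEC = 18.0  # Jetson Orin Nano, GGUF Q4_K_M
SPEEDUP_OVER_9B = 2.0          # "twice as fast" as Nemotron Nano 9B v2
FP8_MAX_SPEEDUP = 1.8          # upper bound versus BF16


def generation_seconds(num_tokens, tokens_per_sec):
    """Seconds needed to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_sec


# Throughput implied for the 9B parent on the same hardware.
implied_9b_tokens_per_sec = NANO_4B_TOKENS_PER_SEC / SPEEDUP_OVER_9B

reply_tokens = 360  # hypothetical reply length
seconds_4b = generation_seconds(reply_tokens, NANO_4B_TOKENS_PER_SEC)
seconds_9b = generation_seconds(reply_tokens, implied_9b_tokens_per_sec)

# Best-case FP8 latency as a fraction of BF16 latency.
fp8_latency_fraction = 1.0 / FP8_MAX_SPEEDUP

print(f"4B: {seconds_4b:.1f} s, 9B (implied): {seconds_9b:.1f} s")
print(f"FP8 best case: {fp8_latency_fraction:.0%} of BF16 latency")
```

For this example reply, the 4B model needs about 20 seconds against an implied 40 seconds for the 9B model, and the FP8 upper bound corresponds to roughly 56% of BF16 latency.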
On key benchmarks, the model leads among competitors in its class:
- IFBench and IFEval: instruction following
- Orak: game intelligence (Super Mario, Darkest Dungeon, Stardew Valley)
- Tool-use: tool invocation and hallucination avoidance
- TTFT: minimal latency on long input sequences
The model is available in three variants: BF16 (full precision), FP8 (optimized for RTX and server GPUs), and GGUF Q4_K_M (for Jetson and llama.cpp). Supported inference engines include vLLM, TRT-LLM, and Hugging Face Transformers.
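The variant guidance above can be expressed as a small lookup. This is an illustrative helper only: the target labels, the function name, and the fallback to BF16 are assumptions made for this sketch, and no repository names or engine flags are implied.

```python
# Variant descriptions as given in the article.
VARIANTS = {
    "BF16": "full precision",
    "FP8": "optimized for RTX and server GPUs",
    "GGUF Q4_K_M": "for Jetson and llama.cpp",
}

# Hypothetical deployment targets mapped to the variant the article pairs them with.
TARGET_TO_VARIANT = {
    "rtx": "FP8",
    "server": "FP8",
    "jetson": "GGUF Q4_K_M",
    "llama.cpp": "GGUF Q4_K_M",
}


def pick_variant(target):
    """Return the variant paired with a target, falling back to BF16 (an assumption)."""
    return TARGET_TO_VARIANT.get(target.strip().lower(), "BF16")


for target in ("Jetson", "RTX", "workstation"):
    variant = pick_variant(target)
    print(f"{target}: {variant} ({VARIANTS[variant]})")
```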
What This Means
A 4B model with a 2X speed advantage over 9B on Jetson changes the edge AI equation: robotics, IoT, local agents, and game NPCs get an industrial-grade tool without expensive hardware and without sending data to the cloud. Open weights allow fine-tuning the model for a specific domain without licensing restrictions.