NVIDIA Explains the Difference Between VLA and WAM — Two Approaches to Robot Control

Q: What is the source?

Originally published on NVIDIA Developer Blog. Hamidun News processes and adapts the material with AI.

Q: When was it published?

Jun 15, 2026. Reading time: 3 min.

Hamidun News Editorial

Image source: NVIDIA Developer Blog. Collage: Hamidun News.

NVIDIA has published an overview of two competing architectures for robotic AI, VLA and WAM, and explains why the second approach may become the next industry standard.

Two Classes of Robotic Models

Today there are two dominant ways to build a model that controls a robot. The first is to take a pre-trained vision-language model (VLM) and fine-tune it to generate commands for a manipulator. Such systems are called Vision-Language-Action models, or VLA. Examples already in operation: Pi-0 from Physical Intelligence and GR00T N1 from NVIDIA. Both start from a powerful VLM backbone that has absorbed knowledge about the world from text and images, and then adapt it to real-world motor tasks. The key advantage: rich semantics and the ability to generalize to unfamiliar instructions.

The second path is World-Action Models, or WAM. Here the foundation is not a language model but a world model: a system trained to predict future video frames conditioned on the action performed. Such a backbone has not read the internet, but it has seen how objects move, deform, and respond to physical impact.
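
To make the contrast concrete, here is a minimal Python sketch of the two interfaces described above. Everything in it is a hypothetical toy stand-in, not anything taken from NVIDIA's overview or from a real model: an observation is a one-element list holding a mug position, an action is a single float, and the names `toy_vla_policy`, `toy_world_model`, and `toy_wam_policy` are invented for illustration.

```python
from typing import List, Sequence

Observation = List[float]  # toy stand-in for an image or video frame
Action = float             # toy stand-in for a manipulator command


def toy_vla_policy(observation: Observation, instruction: str) -> Action:
    """VLA-style interface: (observation, instruction) -> action in one step."""
    # Hypothetical rule standing in for a fine-tuned VLM backbone:
    # the instruction text decides the push direction.
    direction = -1.0 if "left" in instruction else 1.0
    return direction * 0.1


def toy_world_model(observation: Observation, action: Action) -> Observation:
    """World-model interface: (observation, action) -> predicted next observation."""
    # Hypothetical dynamics: a push shifts the mug by the action amount.
    return [observation[0] + action]


def toy_wam_policy(
    observation: Observation, goal: Observation, candidates: Sequence[Action]
) -> Action:
    """WAM-style control: imagine each candidate's outcome and keep the best one."""

    def distance_after(action: Action) -> float:
        predicted = toy_world_model(observation, action)
        return abs(predicted[0] - goal[0])

    return min(candidates, key=distance_after)
```

In this toy setup the VLA-style policy maps an instruction straight to a command, while the WAM-style policy never sees language at all: it asks the world model what each candidate push would do and picks the push whose predicted outcome lands closest to the goal.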

Why Imagination Is More Important Than Language

The key idea of WAM is that predicting "what will happen if I push this mug" is fundamentally more useful for a robot than the ability to parse complex instructions. World models that grew out of video generation tasks accumulate precisely this type of knowledge. In practice, the two approaches differ as follows:

A VLM backbone provides rich semantics and generalizes across language commands
A world-model backbone embeds physical intuition without explicit physics programming
VLA is fine-tuned predominantly on human teleoperation datasets
WAM can use synthetic video as an internal simulator
The two approaches are not mutually exclusive: researchers are already experimenting with hybrids
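
The "internal simulator" point in the list above can be illustrated as planning over imagined rollouts. The sketch below is a toy under stated assumptions: `step` is a hypothetical hand-written stand-in for a learned world model (real WAMs predict video, not a single number), and the exhaustive search in `plan` is chosen here for simplicity, not a method described in the source.

```python
from itertools import product
from typing import List, Sequence, Tuple


def step(position: float, push: float) -> float:
    """Hypothetical dynamics: only half of each push becomes displacement."""
    return position + 0.5 * push


def imagine(position: float, pushes: Sequence[float]) -> List[float]:
    """Roll the model forward over a push sequence without touching a real robot."""
    trajectory = [position]
    for push in pushes:
        position = step(position, push)
        trajectory.append(position)
    return trajectory


def plan(
    position: float,
    goal: float,
    options: Sequence[float] = (-1.0, 0.0, 1.0),
    horizon: int = 3,
) -> Tuple[float, ...]:
    """Try every push sequence and keep the one whose imagined end state is closest to the goal."""
    return min(
        product(options, repeat=horizon),
        key=lambda pushes: abs(imagine(position, pushes)[-1] - goal),
    )
```

Calling `plan(0.0, 1.0)` evaluates all 27 three-step push sequences purely in imagination and returns one whose predicted final position reaches the goal. No real-world trial is needed to compare the options, which is the property the article attributes to world-model backbones.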

NVIDIA in Both Camps

Notably, NVIDIA is present in both directions at once. GR00T N1 is its flagship VLA model for humanoid robots. Cosmos is a world-model platform that could serve as a WAM backbone for the next generation of systems.

"We are at the beginning of an era of physical AI": this is the narrative NVIDIA is cementing with this glossary-style conceptual overview.

By standardizing terminology before the market fully divides into camps, the company positions itself as an architect of the discourse. This is not just a blog post; it is a bid to dictate how the industry will think about the next generation of robots.

What This Means

The choice between VLA and WAM is a strategic decision for everyone building robotics today. VLA reaches deployment faster when teleoperation data is available; WAM potentially scales better without expensive manual annotation. As video generation models become cheaper and better, World-Action Models are likely to become increasingly attractive, and NVIDIA intends to hold leading positions in both camps simultaneously.

Hamidun News

AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
