NVIDIA Shows Efficient Method to Train Cosmos for Robot Video Generation Using LoRA

AI-processed from Hugging Face Blog; edited by Hamidun News

NVIDIA presented a practical guide for fine-tuning its Cosmos Predict 2.5 model with LoRA and DoRA, two parameter-efficient adaptation methods. The approach turns expensive full fine-tuning into a process that a team can run on a single H100 GPU.

Why This Matters

Cosmos Predict 2.5 is a 2-billion-parameter video model that generates physically plausible videos from text, images, or other videos. Full fine-tuning of such a model requires substantial compute and often leads to catastrophic forgetting, where the model loses general knowledge while adapting to a specific task.

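LoRA (Low-Rank Adaptation) addresses both problems: the 2 billion base parameters stay frozen, and only small low-rank adapters in the attention and feedforward layers are trained. DoRA (Weight-Decomposed Low-Rank Adaptation) is a variant that splits each weight into a magnitude and a direction and applies the low-rank update to the direction. Training only the adapters reduces memory consumption by about an order of magnitude and makes the work feasible on a single GPU.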
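To make the adapter idea concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer. It is an illustration under stated assumptions, not NVIDIA's implementation: the function names, the `alpha / r` scaling convention, and the 2048-wide layer used in the parameter count are all hypothetical and are not taken from the Cosmos code.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) is the frozen base weight.
    A (r x d_in) and B (d_out x r) are the only trainable matrices.
    """
    r = len(A)                       # adapter rank
    base = matvec(W, x)              # frozen path
    low_rank = matvec(B, matvec(A, x))  # trainable low-rank path
    return [b + (alpha / r) * l for b, l in zip(base, low_rank)]


def lora_param_counts(d_in, d_out, r):
    """Return (full_layer_params, adapter_params) for one linear layer."""
    return d_in * d_out, r * (d_in + d_out)


# Hypothetical layer width, chosen only for illustration.
full, adapter = lora_param_counts(2048, 2048, 32)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {full / adapter:.0f}x")
```

With `B` initialized to zeros, the adapted layer reproduces the base layer exactly, so training starts from the pretrained behavior. The adapter parameter count grows linearly with the rank `r`, which is why a lower rank needs less memory.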

How It Works in Practice

Using the GR1-100 dataset (92 robot manipulation videos), NVIDIA demonstrated the following results:

  • Training on 1× H100 GPU: 17 hours
  • Training on 8× H100 GPUs: 2.5 hours
  • Adapters occupy only a few MB (versus many GB for full checkpoints)
  • Adapters are easily swappable—different versions for different domains
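The two training times above imply how well the job scales across GPUs. The short calculation below is a back-of-the-envelope sketch that uses only the 17-hour and 2.5-hour figures from the list. The helper name `scaling_summary` is made up for illustration.

```python
def scaling_summary(hours_single, hours_multi, n_gpus):
    """Summarize multi-GPU scaling from two wall-clock training times."""
    speedup = hours_single / hours_multi
    return {
        "speedup": speedup,                       # how much faster the multi-GPU run is
        "efficiency": speedup / n_gpus,           # 1.0 would be perfect linear scaling
        "gpu_hours_single": hours_single,         # total GPU time on one GPU
        "gpu_hours_multi": hours_multi * n_gpus,  # total GPU time across all GPUs
    }


summary = scaling_summary(hours_single=17.0, hours_multi=2.5, n_gpus=8)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

The 8-GPU run is about 6.8 times faster, or roughly 85% scaling efficiency. It uses 20 GPU-hours in total against 17 on a single GPU, so the extra hardware mostly buys shorter wall-clock time.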

The model was trained for 500 epochs on manipulation videos such as picking up objects from a mat and placing them in a bowl, or bringing juice to a green cup. Each video came with a text instruction describing what should be generated.

What Training Delivered

The base model struggled: it generated human hands instead of robot hands, produced shaky video, and moved objects implausibly. After fine-tuning via LoRA/DoRA:

Fine-tuned models (LoRA r=32, DoRA r=32) use the specified hand correctly, eliminate jitter, and improve video stability.

Qualitatively: hallucinations disappeared, the model consistently uses the correct hand, objects move with physical plausibility, and instructions are followed more precisely.

Quantitatively: scores for geometric stability (Sampson error), physical plausibility, and instruction following improved in all three configurations (LoRA rank 8, LoRA rank 32, DoRA rank 32). Rank 32 gives better instruction accuracy, while rank 8 requires less memory.
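For readers unfamiliar with the geometric-stability metric, the sketch below shows the standard Sampson error for one point correspondence under a fundamental matrix `F`. The article does not say how NVIDIA obtains correspondences or aggregates the score over a video, so this shows only the textbook per-point formula. The function name and the example matrix are illustrative assumptions.

```python
def sampson_error(F, x1, x2):
    """First-order geometric error of the correspondence x1 -> x2.

    F is a 3x3 fundamental matrix (list of rows); x1 and x2 are (x, y)
    pixel coordinates in two frames. Lower values mean the pair is more
    consistent with the epipolar geometry described by F.
    """
    p1 = (x1[0], x1[1], 1.0)  # homogeneous coordinates
    p2 = (x2[0], x2[1], 1.0)
    Fp1 = [sum(F[i][j] * p1[j] for j in range(3)) for i in range(3)]   # F x1
    Ftp2 = [sum(F[i][j] * p2[i] for i in range(3)) for j in range(3)]  # F^T x2
    numerator = sum(p2[i] * Fp1[i] for i in range(3)) ** 2             # (x2^T F x1)^2
    denominator = Fp1[0] ** 2 + Fp1[1] ** 2 + Ftp2[0] ** 2 + Ftp2[1] ** 2
    return numerator / denominator  # undefined when the denominator is zero


# Illustrative F for a camera translating purely along the x axis:
# matched points must then stay on the same image row.
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]

print(sampson_error(F, (10.0, 5.0), (14.0, 5.0)))  # same row: consistent
print(sampson_error(F, (10.0, 5.0), (14.0, 7.0)))  # two rows off: penalized
```

A point that stays on its epipolar line scores zero, and the error grows as the match drifts away from it, which is why a lower Sampson error is read as a more geometrically stable video.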

What This Means

Synthetic robot video is in high demand because collecting real manipulation data is expensive and time-consuming. With Cosmos + LoRA, robotics teams can generate thousands of examples overnight on a single GPU. That is cheaper and faster than recording real footage, and it gives robots more diverse movement variations to train on.

NVIDIA released complete code, recipes, and pre-made adapters, ready to copy and run.
