Yann LeCun Presents LeWorldModel — JEPA Model Without Representation Collapse from Pixels
Yann LeCun and colleagues introduced LeWorldModel — a new JEPA world model that learns directly from pixel data without stop-gradient, EMA, and frozen…
AI-processed from MarkTechPost; edited by Hamidun News
A team of researchers led by Yann LeCun presented LeWorldModel, or LeWM—a new world model for training agents directly on pixel data. The authors claim that the model solves one of the main problems of the JEPA approach—representation collapse—while also significantly accelerating planning.
Why JEPA Breaks
Agents use world models to build a compact internal map of the environment and to plan actions in latent space rather than on raw frames. When trained directly on images, however, such systems often suffer representation collapse: different scenes are encoded almost identically, so the model nominally solves the prediction task while losing the useful structure of the world. To guard against this, developers prop up training with auxiliary techniques such as stop-gradient, EMA, frozen encoders, and multi-component loss functions. The problem is especially painful for agents that must plan long chains of actions: if the latent space degenerates, the planner can no longer distinguish good scenarios from bad ones.
How LeWM Works
LeWM attempts to remove this complexity. The architecture has two main parts: an encoder that maps a frame to a compact latent representation, and a predictor that estimates the next latent state from the current state and the action. The implementation uses a ViT-Tiny encoder with approximately 5 million parameters and a transformer predictor with approximately 10 million, so the entire system fits into roughly 15 million parameters and, according to the authors, trains on a single GPU in a few hours.
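To make the division of labor concrete, here is a minimal sketch of how an encoder and a predictor combine into a latent rollout. The functions `encode`, `predict`, and `rollout` are hypothetical names, and the toy arithmetic inside them only stands in for the ViT-Tiny encoder and transformer predictor; none of it is taken from the authors' code.

```python
from typing import List, Sequence

Latent = List[float]


def encode(frame: Sequence[float]) -> Latent:
    """Toy stand-in for the encoder: average the two halves of a flat frame."""
    half = len(frame) // 2
    return [sum(frame[:half]) / half, sum(frame[half:]) / (len(frame) - half)]


def predict(z: Latent, action: Sequence[float]) -> Latent:
    """Toy stand-in for the predictor: shift the latent by the action."""
    return [zi + ai for zi, ai in zip(z, action)]


def rollout(frame: Sequence[float], actions: Sequence[Sequence[float]]) -> List[Latent]:
    """Encode the first frame once, then step forward purely in latent space."""
    z = encode(frame)
    trajectory = [z]
    for action in actions:
        z = predict(z, action)
        trajectory.append(z)
    return trajectory


frame = [0.0, 0.0, 1.0, 1.0]          # a 4-"pixel" frame (assumed toy input)
actions = [[0.5, 0.0], [0.5, -1.0]]   # two candidate actions
traj = rollout(frame, actions)
print(traj)
```

The point of the structure is that pixels are touched only once, at the start: every later step works on the small latent vector, which is what makes evaluating many candidate action sequences cheap.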
The key idea is to drop auxiliary objectives and keep only two terms: prediction of the next embedding and the SIGReg regularizer. SIGReg pushes latent vectors to stay diverse and close to an isotropic Gaussian distribution. Rather than examining the full latent space at once, it projects the latents onto a set of random one-dimensional directions and checks the statistics of each projection. This approach should reduce the risk of degenerate representations without heavy engineering overhead.
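The random-projection idea can be illustrated with a deliberately simplified sketch. It is not the SIGReg statistic from the paper: as a stand-in, it only penalizes each one-dimensional projection for having a mean away from 0 or a variance away from 1. The function names, the number of projections, and the penalty form are all assumptions made for illustration.

```python
import math
import random
from typing import List


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Draw a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def projection_penalty(embeddings: List[List[float]],
                       num_projections: int = 64,
                       seed: int = 0) -> float:
    """Simplified stand-in for a SIGReg-style check (moment matching only).

    For each random direction, project all embeddings onto it and penalize
    deviation of the projected mean from 0 and the projected variance from 1.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    n = len(embeddings)
    total = 0.0
    for _ in range(num_projections):
        u = random_unit_vector(dim, rng)
        proj = [sum(e_i * u_i for e_i, u_i in zip(e, u)) for e in embeddings]
        mean = sum(proj) / n
        var = sum((p - mean) ** 2 for p in proj) / n
        total += mean ** 2 + (var - 1.0) ** 2
    return total / num_projections


data_rng = random.Random(1)
healthy = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(512)]
collapsed = [[0.3] * 8 for _ in range(512)]  # every scene maps to the same point

healthy_penalty = projection_penalty(healthy)
collapsed_penalty = projection_penalty(collapsed)
print(healthy_penalty, collapsed_penalty)
```

Collapsed embeddings have zero variance along every direction, so the penalty stays near or above 1, while samples that are already roughly isotropic Gaussian score close to 0. Working on one-dimensional projections keeps each check cheap regardless of the latent dimension.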
In practical terms, LeWM retains only one truly important hyperparameter, the regularization weight λ, whereas the nearest end-to-end alternative, PLDM, has significantly more settings. The authors also note that two details improved stability: dropout of 0.1 in the predictor and a small projection layer after the encoder.
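Put together, the two-term objective described above can be written schematically as follows. The symbols are introduced here for illustration, and the squared-error form of the prediction term is an assumption; the article only states that the loss combines next-embedding prediction with SIGReg, weighted by λ.

```latex
\mathcal{L}
  = \underbrace{\bigl\lVert \hat{z}_{t+1} - z_{t+1} \bigr\rVert_2^2}_{\text{next-embedding prediction}}
  \;+\; \lambda \, \underbrace{\mathcal{L}_{\mathrm{SIGReg}}(Z)}_{\text{regularizer}},
\qquad
\hat{z}_{t+1} = \mathrm{predictor}(z_t, a_t), \quad z_t = \mathrm{encoder}(x_t)
```

Here $x_t$ is the frame at time $t$, $a_t$ the action, and $Z$ the batch of latent vectors. A larger λ pulls the latents harder toward the isotropic Gaussian target, while a smaller λ favors prediction accuracy.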
What the Tests Showed
According to the paper's results, LeWM is not only more stable during training but also faster at the planning stage. The authors compare it with PLDM and DINO-WM on navigation, manipulation, and control tasks in 2D and 3D environments. The model works directly with pixels, without a frozen foundation encoder and without relying on reward signals, yet remains competitive on several benchmarks.
- approximately 200 times fewer tokens per frame compared to DINO-WM
- up to 48 times faster planning: approximately 0.98 seconds versus 47 seconds per cycle
- only two loss terms, compared with seven in PLDM-based approaches that use VICReg
- one main hyperparameter instead of a set of manual tunings
- latent space captures physical quantities and identifies "impossible" events like object teleportation
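The "up to 48 times" figure follows directly from the two timings quoted in the list. A quick check, with variable names chosen here for illustration (the list does not name the baseline behind the 47-second figure):

```python
lewm_seconds = 0.98       # planning time per cycle reported for LeWM
baseline_seconds = 47.0   # planning time per cycle reported for the slower baseline

speedup = baseline_seconds / lewm_seconds
print(f"speedup ≈ {speedup:.1f}x")  # prints "speedup ≈ 48.0x"
```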
The authors separately tested whether the model understands the physical logic of the scene, rather than merely predicting the next frames. In violation-of-expectation tests, the system reacted more strongly to physically implausible events, such as sudden object teleportation, than to purely visual changes. Another interesting effect is temporal latent path straightening: as training progressed, trajectories in latent space became smoother and more linear even without a separate penalty that would explicitly enforce such behavior. This is important because smoother latent trajectories typically simplify action search during planning.
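Both effects can be quantified with simple latent-space measurements. The sketch below shows one plausible way to do it, under assumptions: "surprise" is taken as the squared distance between the predicted and the observed latent, and straightness as the mean cosine similarity between consecutive latent steps. The article does not specify the exact metrics the authors used, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def surprise(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Squared distance between the predicted and the actually observed latent."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def straightness(trajectory: List[List[float]]) -> float:
    """Mean cosine similarity between consecutive steps (1.0 means a straight line).

    Assumes at least three points and no zero-length steps.
    """
    steps = [[b - a for a, b in zip(z0, z1)]
             for z0, z1 in zip(trajectory, trajectory[1:])]
    cosines = []
    for s0, s1 in zip(steps, steps[1:]):
        dot = sum(x * y for x, y in zip(s0, s1))
        norm0 = math.sqrt(sum(x * x for x in s0))
        norm1 = math.sqrt(sum(y * y for y in s1))
        cosines.append(dot / (norm0 * norm1))
    return sum(cosines) / len(cosines)


predicted = [1.0, 0.0]
plausible = [1.1, 0.0]    # object moved roughly as expected
teleported = [4.0, 3.0]   # object jumped somewhere physically implausible

print(surprise(predicted, plausible), surprise(predicted, teleported))

straight_path = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
zigzag_path = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
print(straightness(straight_path), straightness(zigzag_path))
```

In a violation-of-expectation test, the teleportation case should yield a much larger surprise than the plausible one. For planning, a straightness value near 1 means consecutive latent steps point in nearly the same direction, which makes the effect of an action sequence easier to extrapolate.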
What This Means
For teams building agents, this is an important signal: world models are becoming a practical direction again, not just an academic idea. If the results for LeWM and similar approaches hold up beyond laboratory benchmarks, developers will be able to build faster and cheaper agents that plan in a compact state space without heavy foundation encoders. This is particularly interesting for robotics, offline RL, and systems where errors and latency are costly. In essence, LeWM suggests that representation collapse can be addressed not by complicating the stack, but by framing the learning task itself more carefully.