
World Models: Will They Be the Key to Autonomous Driving?

Automakers are actively using "world models" to train and test autonomous driving systems. This allows for creating more realistic simulations and improving alg

AI-processed from 36Kr (36氪); edited by Hamidun News

In recent years, whenever automakers discuss intelligent driving, they invariably introduce new technical terms. Following end-to-end learning and VLA (vision-language-action) models, "world model" has become the trendiest phrase in intelligent driving. Different companies have even given it their own brand names: Xiaopeng introduced a "World Basic Model," NIO called its version an "End-to-End World Model," and Huawei branded its version a "World Behavior Model" (WA). Horizon Robotics, Li Auto, Yuanrong Qixing (DeepRoute.ai), and Momenta are also working on world models.

However, judging from their press conferences, it's hard to tell whether the world models these companies describe are the same thing. What problem does a world model solve, and where does it fit in the autonomous driving architecture? In the broader sense, a "world model" means recreating the real world in a virtual environment: a technology that lets AI understand the real world, including physical laws, causal relationships, and environmental dynamics, much as humans do.

Most scientists and technology companies view world models as a key element of "physical-world AI." Stanford University Professor Fei-Fei Li once noted that spatial intelligence is the next decade of AI, and that world models are the key technology for building it. The scientists and technology companies at the forefront of the field are still exploring, but the Chinese automotive industry has already staked out positions with a range of newly coined terms.

In fact, the "world models" discussed in the intelligent driving industry today differ mainly in name; there is no significant technological distinction among them. They are an update to the technological paradigm of the industry's existing simulation tools, addressing the problem of testing and validating end-to-end models in a virtual world with higher fidelity, finer detail, richer scenarios, and more degrees of freedom. All of this serves to train more efficient, more human-like end-to-end driving models.

In other words, intelligent driving suppliers and automakers aren't actually building a complete digital replica of physical reality; they're using the world model concept to build a simulator. Each company may have different expectations for world models, but as far as we know, world models in the intelligent driving industry currently run only in the cloud and are not deployed in vehicles.

The widespread adoption of end-to-end learning has highlighted the shortcomings of simulators. Over the past two to three years, leading intelligent driving solutions have moved from rule-based stacks to AI-based control and completed the integration at the level of form: perception, prediction, and planning were merged as far as possible into a single network, backed by larger models and more computing power. As automakers often say at their press conferences, "intelligent driving after end-to-end learning is more like human driving."

But in real-world use, a counterintuitive phenomenon emerged: new OTA versions built on end-to-end learning don't necessarily get better and may even "degrade." The main issue isn't that the model got worse, but that AI-based control makes evaluation and regression testing harder. At the time, many intelligent driving specialists believed that as long as the frontend was trained well enough, the vehicle would drive like a human.

This approach hasn't been fruitless, and early end-to-end learning results shocked many specialists, but the "black box" nature of end-to-end learning also has side effects. When the model makes a mistake, developers struggle to understand why the error occurred. How do you prove it won't happen again next time?

Whether a model is good depends not only on "whether it's large enough and has enough data," but also on how you detect, define, and verify problems. Manufacturers gradually realized they needed a better simulator to evaluate model performance during the validation phase.

Most leading players are building world models to use as simulators. To let its VLA model run reinforcement learning in a simulated environment, Li Auto proposed a world driving model in 2025 that includes the trajectories of both the ego vehicle and other vehicles and acts as a grading teacher. Xiaopeng has only announced a "World Basic Model," which is not essentially a world model, but according to 36Kr Auto it also uses world models in simulation testing to evaluate the capabilities of new model versions.

The widespread adoption of end-to-end learning has highlighted the inadequacies of traditional simulators. "When end-to-end learning wasn't so popular, the testing costs for everyone weren't as high, and they could still test the system in pieces. Now with end-to-end learning, there's no way to test the system piece by piece, and that's when the simulator's problems become obvious," said an industry developer.

In the rule-based era, automakers' simulations typically served two purposes. One was reproducing takeover issues: going back to the segments of a road test where the driver had to intervene and replaying them. The other was using the simulator to enrich the data for edge cases, building many variations of typical intersections, pedestrian crossings, and cut-in scenarios for the system to drive through. At that time, the simulator played the role of a "magnifying glass." After end-to-end learning, however, the model is hard to break into pieces, it's hard to systematically generate small, manageable edge cases, and it's even harder to sustain the large-scale closed-loop validation that end-to-end learning requires. This is exactly why the world model was introduced.

In the era of end-to-end learning, the world model is the "coach" for the intelligent driving model. "Currently, the world models of domestic automakers still trail Tesla's, but the gap is less than a year," said an industry insider.

Tesla doesn't use the term "world model" but rather "world simulator" (Ashok Elluswamy, Tesla's Vice President of Autonomous Driving, first mentioned it at last year's ICCV). The simulator is built on a massive dataset that Tesla created itself and generates future states from the current state and the next action. It can therefore run in a closed loop with the vehicle-side end-to-end base model to evaluate how that model would perform in the real world.
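The closed loop described here (a simulator that produces the next state from the current state and an action, wired to a driving model that picks the action) can be sketched in a few lines of Python. This is only an illustrative toy, not Tesla's system: `State`, `toy_world_step`, `toy_policy`, and `closed_loop_rollout` are hypothetical names, and the hand-written one-dimensional kinematics stand in for what would in practice be a learned generative model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class State:
    """Toy 1-D scene: the ego vehicle following a lead vehicle (assumed layout)."""
    ego_pos: float     # metres
    ego_speed: float   # metres per second
    lead_pos: float    # metres
    lead_speed: float  # metres per second


def toy_world_step(state: State, accel: float, dt: float = 0.1) -> State:
    """Stand-in for a world model: next state from current state plus action."""
    ego_speed = max(0.0, state.ego_speed + accel * dt)
    return State(
        ego_pos=state.ego_pos + ego_speed * dt,
        ego_speed=ego_speed,
        lead_pos=state.lead_pos + state.lead_speed * dt,
        lead_speed=state.lead_speed,
    )


def toy_policy(state: State) -> float:
    """Stand-in for the driving model: brake when the time gap is under 2 s."""
    gap = state.lead_pos - state.ego_pos
    if state.ego_speed > 0 and gap / state.ego_speed < 2.0:
        return -3.0  # brake, m/s^2
    return 0.5       # gentle acceleration, m/s^2


def closed_loop_rollout(
    initial: State,
    policy: Callable[[State], float],
    world_step: Callable[[State, float], State],
    steps: int,
) -> List[State]:
    """Alternate policy and simulator so each action shapes the next observation."""
    states = [initial]
    for _ in range(steps):
        accel = policy(states[-1])
        states.append(world_step(states[-1], accel))
    return states


def min_gap(states: List[State]) -> float:
    """Smallest distance to the lead vehicle over the rollout (a simple safety metric)."""
    return min(s.lead_pos - s.ego_pos for s in states)


if __name__ == "__main__":
    start = State(ego_pos=0.0, ego_speed=20.0, lead_pos=30.0, lead_speed=10.0)
    trace = closed_loop_rollout(start, toy_policy, toy_world_step, steps=100)
    print(f"minimum gap over the rollout: {min_gap(trace):.1f} m")
```

The point of the loop is that the policy's own actions change what it sees next, which is what distinguishes closed-loop evaluation from replaying recorded logs.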

An industry insider noted that Tesla's approach is closer to using neural networks to "fit" the world. Rendering is produced by the network's own computation, with as few explicitly imposed physical rules as possible; the material library isn't fully predefined by humans but keeps a degree of probabilistic weighting and room for combination. The advantage of this approach is stronger generalization.

Domestic automakers are taking a different, more "controlled" path. According to a supplier who spoke with 36Kr Auto, Li Auto uses 3D Gaussian reconstruction, which is also one of the methods most automakers currently use.

Whichever route is chosen, from an engineering perspective the world model ends up in the same place. In the era of end-to-end learning, automakers use it as a "verification and refutation system": in the cloud, they reproduce, rewrite, and expand scenarios that may occur in real driving, check whether the output of the vehicle-side large model is stable and reproducible, and turn "what went wrong and why" back into a traceable chain of evidence.
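One way to picture such a verification loop is a regression check that replays the same scenario set against two model versions and lists the scenarios that the old version passed and the new one fails. The sketch below is a minimal illustration under assumed definitions: the scenario fields, the pass criterion (commanded deceleration must reach the deceleration needed to avoid contact), and both toy "models" are hypothetical and not taken from any automaker's pipeline.

```python
import random
from typing import Callable, Dict, List

Scenario = Dict[str, float]
Model = Callable[[Scenario], float]  # returns commanded deceleration in m/s^2


def make_scenarios(seed: int, n: int) -> List[Scenario]:
    """Seeded generation, so the same scenario set can be replayed exactly."""
    rng = random.Random(seed)
    return [
        {"id": i, "gap": rng.uniform(5.0, 60.0), "closing_speed": rng.uniform(0.0, 15.0)}
        for i in range(n)
    ]


def required_decel(s: Scenario) -> float:
    """Constant deceleration needed to cancel the closing speed within the gap."""
    return s["closing_speed"] ** 2 / (2.0 * s["gap"])


def passes(model: Model, s: Scenario) -> bool:
    """Assumed pass criterion: the model brakes at least as hard as required."""
    return model(s) >= required_decel(s)


def model_v1(s: Scenario) -> float:
    """Hypothetical old version: 20% safety margin, up to 8 m/s^2."""
    return min(1.2 * required_decel(s), 8.0)


def model_v2(s: Scenario) -> float:
    """Hypothetical new version tuned for comfort: braking capped at 4 m/s^2."""
    return min(1.2 * required_decel(s), 4.0)


def find_regressions(old: Model, new: Model, scenarios: List[Scenario]) -> List[float]:
    """IDs of scenarios the old version passes and the new version fails."""
    return [s["id"] for s in scenarios if passes(old, s) and not passes(new, s)]


if __name__ == "__main__":
    scenarios = make_scenarios(seed=7, n=200)
    regressed = find_regressions(model_v1, model_v2, scenarios)
    print(f"{len(regressed)} of {len(scenarios)} scenarios regressed")
```

Because the scenario set is seeded, a failing ID can be regenerated and replayed later, which is the "traceable" part of the evidence chain.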

The role of a world model is like that of a coach, and a great coach can train great athletes. "As the cloud world model becomes stronger and stronger, theoretically the ability of the end-to-end model trained on the vehicle side should become stronger and stronger," said a developer.

The main capabilities of a world model generally include two aspects: one is digital modeling and abstraction of the physical world; the other is intelligent imagination and prediction of the physical world based on such modeling, for example, predicting how the future world will change based on given images. Whether a world model is good depends on whether it can generate sufficiently realistic and diverse data in the cloud. "If an automaker only uses real data collected for modeling, then it's clearly not creating a world model but just creating a set of data reproduction processes," said a supplier's product manager.

A world model needs to learn the world's operating patterns based on data from the physical world, so the quality of training data for the world model will substantially affect the quality of data generated by the model. Mao Jimin, head of the product line at JIJIA Vision, mentioned: "For a generative model like a world model, its generated results will ultimately conform to the distribution patterns of the input data characteristics. In the process of commercializing a real world model, we found that if the data quality is only 60 points, then the quality of data generated based on this world model may be only 55 points."

With a world model, automakers can generate the scenarios they need without limit along many dimensions during cloud simulation, and can generate videos as training data on instruction. "The efficiency isn't just a little higher than collecting data in the real world and then training on it; model iteration speed is what will lead in this era," said a supplier developer.

But these are all idealized results. "Compared with the simulators intelligent driving used before, or with having no simulation information at all and being able to validate only on a company's own driving data, a world model is a major upgrade, but it's still far from an ideal simulator."

World model algorithms haven't matured yet, and there's still plenty of "hallucination." Currently, the industry as a whole is at the "just beginning" stage.

An automaker developer told 36Kr Auto that domestic manufacturers can generate 30- to 60-second video clips with world models, but dynamic objects are not rendered very consistently, and both spatiotemporal consistency and multi-view consistency remain major problems.

Generative models form the foundation of world models, and generative models are inherently coupled with the risk of "hallucination." "The most difficult thing about world models right now is how to guarantee that generated things are real. If a person is generated, how do you guarantee that their behavior and trajectory could actually happen in the real world," said a supplier's product manager. "If the world model creates confusion, it will cause the model to learn wrong things, leading to very poor effects in the model deployed on the vehicle side."

An extreme example: if vehicles generated in the cloud move sideways, the driving model may learn that a vehicle at the front-left can instantly jump to the front-right. In actual driving, it might then brake suddenly.

If the simulator cannot approximate key causal relationships of the real world, such as how slippery roads affect braking distance, the likelihood of false detection of stationary objects under backlighting, negotiation strategies of oncoming vehicles during lane changes, etc., then the "edge case" it generates could be false; optimizing for false problems is equivalent to wasting development resources on phantoms.
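A cheap guard against the grossest of these errors, such as a vehicle that slides sideways across lanes, is to screen generated trajectories for physically impossible motion before they reach training. The following is a minimal sketch under stated assumptions: trajectories are lists of `(x, y)` positions in metres sampled at a fixed interval, and the 10 m/s^2 limit is an illustrative threshold, not an industry standard. `max_acceleration` and `is_plausible` are hypothetical helper names. A check like this only catches kinematic violations; it says nothing about subtler causal errors such as braking distance on a wet road.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position in metres


def max_acceleration(points: List[Point], dt: float) -> float:
    """Largest acceleration magnitude implied by the positions (finite differences)."""
    velocities = [
        ((x1 - x0) / dt, (y1 - y0) / dt)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]
    accels = [
        math.hypot((vx1 - vx0) / dt, (vy1 - vy0) / dt)
        for (vx0, vy0), (vx1, vy1) in zip(velocities, velocities[1:])
    ]
    return max(accels, default=0.0)


def is_plausible(points: List[Point], dt: float, accel_limit: float = 10.0) -> bool:
    """Reject trajectories whose implied acceleration exceeds the assumed limit."""
    return max_acceleration(points, dt) <= accel_limit


if __name__ == "__main__":
    dt = 0.1  # seconds between samples
    # Constant 10 m/s straight ahead in one lane.
    cruising = [(0.0, 3.5), (1.0, 3.5), (2.0, 3.5), (3.0, 3.5)]
    # Same forward motion, but the vehicle jumps 7 m sideways in a single step.
    teleport = [(0.0, 3.5), (1.0, 3.5), (2.0, -3.5), (3.0, -3.5)]
    print("cruising plausible:", is_plausible(cruising, dt))
    print("teleport plausible:", is_plausible(teleport, dt))
```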

Many believe the bottleneck of world models is data and computational power, but Xia Zhongpu, former head of the "end-to-end" model for autonomous driving at Li Auto, agrees more with LeCun's viewpoint: "There are no major breakthroughs in world model algorithms, and self-supervised learning for image models hasn't yet found a relatively smooth paradigm like language."

The reason language models can scale quickly is that language itself has high information density, and each word carries clear semantic constraints. Images have low information density, and for driving decisions only a small portion of that information is useful.

For example, models don't need to predict the trajectory of a car far behind or changes in distant buildings. These are all noise. But they must predict whether the car ahead in the same lane will suddenly brake hard, whether the car in the adjacent lane intends to change lanes, and whether a pedestrian might suddenly cross the road. The model must first know "what to pay attention to."

"Currently, intelligent driving algorithms cannot extract enough useful information from images for driving," said Xia Zhongpu. An image might contain millions of pixels, but only 20 or so are related to decision-making, with the rest being noise. The model must first learn to pull the 1‰ or even 1‱ of useful signal out of the noise, and only then can one talk about organizing those signals into a structure that supports reasoning and prediction.

In Xia Zhongpu's view, world model algorithms haven't achieved a breakthrough yet, so it is premature to ask whether there is enough data or how much computing power is needed. Precisely because the underlying technology hasn't seen a clear breakthrough, automakers' investments are more research-oriented, and even some automaker executives are unsure what to make of it.

If world models are done well enough and can be deployed on the vehicle side, if computational power can support them... "Currently in China, world models are mainly used as simulation systems, and understanding of decision-making technology for intelligent driving isn't high enough yet," said Xia Zhongpu.

This also explains an apparent contradiction: why everyone talks about world models while the difference in user experience isn't obvious. Most companies' world models are still at the first stage, "used for training and testing," rather than the second stage, "able to support decision-making and planning."

"Deploying world models on the vehicle side is the most difficult," said Xia Zhongpu. Currently, no company is applying world models on vehicles. He described the goal this way: "Use large-model methods to simulate the physical world, predict how the world will change through interaction with it, and then use decisions to steer the world in a direction favorable to oneself. If world models reach this level, problems related to autonomous driving and robotics could be solved."
