
Emu2 in Nature: Chinese Scientists Found a Single Code for Reality

The Beijing Academy of Artificial Intelligence (BAAI) has published a paper in Nature on its Emu2 model. The main breakthrough: moving all generation, for text, photos, and video, onto autoregression.

AI-processed from Jiqizhixin (机器之心); edited by Hamidun News
Source: Jiqizhixin (机器之心). Collage: Hamidun News.

For a long time, modern artificial intelligence resembled a high-tech Frankenstein's monster. We were accustomed to neural networks having different "organs" for different senses: language models like GPT handled text through autoregression, predicting the next word, while image generators like Midjourney or Stable Diffusion lived in the world of diffusion, recovering order from random noise. This division seemed fundamental and immutable, like the difference between logic and imagination.

However, researchers from the Beijing Academy of Artificial Intelligence (BAAI) decided that this architectural dualism belongs in the past. Their new work on the multimodal model Emu2, just published in the journal Nature, makes a bold claim: understanding and creating this world requires a single algorithmic principle.

The essence of the breakthrough lies in unification. The researchers showed that any information, whether a philosophical treatise, a video of a running cat, or a microchip schematic, can be reduced to a single token format. In the Emu2 system, a picture is no longer a set of pixels in the conventional sense. It becomes a sequence of "visual words" that the neural network learns to predict, just as we predict the ending of this sentence.
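To make the idea of "visual words" concrete, here is a toy sketch in Python. It is not Emu2's actual tokenizer: the vocabulary, the four-entry codebook, and every function name are hypothetical, chosen only to show how image patches and words could end up in one flat sequence that a model learns to continue token by token.

```python
# Toy illustration only: vocabulary, codebook, and helper names are hypothetical,
# not taken from Emu2.

TEXT_VOCAB = {"<bos>": 0, "a": 1, "running": 2, "cat": 3, "<img>": 4, "</img>": 5}

# A tiny "codebook" of 4 visual words; each entry is a 2-value patch prototype.
CODEBOOK = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]


def quantize_patch(patch):
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    def dist(entry):
        return sum((p - e) ** 2 for p, e in zip(patch, entry))
    return min(range(len(CODEBOOK)), key=lambda i: dist(CODEBOOK[i]))


def encode_image(patches):
    """Map patches to visual-token ids placed after the text vocabulary."""
    offset = len(TEXT_VOCAB)
    return [offset + quantize_patch(p) for p in patches]


def encode_example(words, patches):
    """Build one flat token sequence: text tokens, then the image tokens."""
    text_ids = [TEXT_VOCAB[w] for w in words]
    return ([TEXT_VOCAB["<bos>"]] + text_ids
            + [TEXT_VOCAB["<img>"]] + encode_image(patches) + [TEXT_VOCAB["</img>"]])


def next_token_pairs(sequence):
    """Training pairs for next-token prediction: (context so far, target token)."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


patches = [(0.1, 0.9), (0.8, 0.2), (0.9, 0.95)]
sequence = encode_example(["a", "running", "cat"], patches)
print(sequence)  # [0, 1, 2, 3, 4, 7, 8, 9, 5]
```

Once text and image share one id space, the training objective is the same for every position: given the tokens so far, predict the next one, whether it happens to be a word or a patch.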

This approach, called autoregressive learning, was long considered too cumbersome for graphics. But the Chinese engineers, using a model with 37 billion parameters, demonstrated that, done right, autoregression not only matches diffusion in quality but surpasses it in flexibility.

Why break something that worked well enough? The problem with current systems lies in their "seams." When you try to pair a text-based brain with visual eyes, you must build complex bridges and adapters, in which meaning and context inevitably get lost.

Emu2, however, is multimodal by design. It doesn't translate from the language of pictures to the language of words; it thinks in a language where pixel and letter are equally valid. This makes the model strikingly efficient at in-context learning: show it a couple of examples of how to edit a photo, and it grasps the task's logic without any additional training.
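In-context learning of this kind comes down to how the prompt is assembled, not to any change in the model's weights. The sketch below is a hedged illustration with made-up token ids and an assumed separator token; it only shows how a few "before and after" demonstrations and a new query might be laid out as one sequence for the model to continue.

```python
# Hypothetical token ids; a real system would get these from text and image tokenizers.
SEP = -1  # assumed separator marking the boundary between parts of the prompt


def build_few_shot_prompt(demonstrations, query):
    """Concatenate (before, after) example pairs, then the query, into one sequence."""
    prompt = []
    for before, after in demonstrations:
        prompt += before + [SEP] + after + [SEP]
    # The sequence ends right after the query, so the model's continuation is the answer.
    return prompt + query + [SEP]


demos = [([7, 8], [8, 7]), ([6, 9], [9, 6])]  # two toy "edits" that swap the tokens
prompt = build_few_shot_prompt(demos, [7, 9])
print(prompt)  # [7, 8, -1, 8, 7, -1, 6, 9, -1, 9, 6, -1, 7, 9, -1]
```

A model that has picked up the pattern from the two demonstrations would be expected to continue with the swapped query; nothing is retrained between the examples and the answer.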

This kind of in-context learning is precisely the magic that once made GPT-3 a global sensation, and it now extends to the visual domain.

The context of this event cannot be ignored. Publication in Nature is among the highest marks of quality in the scientific world, and the fact that it went to BAAI speaks volumes. While Western giants like OpenAI and Google compete behind the closed doors of their laboratories, Chinese researchers are methodically building the theoretical foundation for the next generation of AI.

Emu2 effectively draws a line under the era of specialized tools. We are entering an era of universal prediction engines for reality.

If everything around us is a sequence of data, then the winner will be whoever has the model that best predicts the next element of that sequence, regardless of its nature.

Of course, transitioning to pure autoregression requires colossal computational resources. This is a game for those with near-unlimited access to GPUs and the patience for extensive hyperparameter tuning. But the history of technology suggests that elegant universality tends to defeat specialized workarounds in the long term. We've already seen how transformers "consumed" recurrent networks in natural language processing. Now we're watching them begin to absorb the world of computer vision.

This isn't just another model; it's a manifesto of a new architectural purity that will force many to reconsider their roadmaps for the next couple of years.

The point: Beijing has officially secured its position as a leader in fundamental AI theory, proving that the future belongs to unified autoregressive models. Does this mean diffusion models are headed to the scrap heap of history, or will they find their niche in narrowly specialized tasks?
