
Nous Research sped up LLM pretraining 2.5x without changing the architecture
Source: MarkTechPost.

Nous Research has developed Token Superposition Training (TST), a two-phase pretraining method that speeds up the pretraining of large language models by a factor of 2.5 at equal quality, without requiring any changes to the architecture, tokenizer, or inference behavior.

How Token Superposition Training Works

The method is based on a simple but effective idea: in the first phase of pretraining, the embeddings of neighboring tokens are averaged into groups, or bags. Instead of predicting each token separately, the model works with these aggregated representations of the sequence. Because each bag replaces several tokens, the effective sequence length shrinks, so the model processes information in larger blocks and gradient computation during backpropagation becomes significantly cheaper. Essentially, the first phase teaches the model to find patterns at a higher level of abstraction.
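To make the bagging step concrete, here is a minimal sketch of what averaging neighboring token embeddings could look like in PyTorch. The helper name bag_embeddings, the pooling factor bag_size, and the tensor shapes are illustrative assumptions, not details of Nous Research's implementation.

```python
# Illustrative sketch of phase-1 "bagging": average each group of neighboring
# token embeddings into a single vector. Names and shapes are assumptions.
import torch

def bag_embeddings(token_embs: torch.Tensor, bag_size: int) -> torch.Tensor:
    """Average consecutive token embeddings into bags.

    token_embs: (batch, seq_len, d_model), with seq_len divisible by bag_size.
    Returns:    (batch, seq_len // bag_size, d_model)
    """
    batch, seq_len, d_model = token_embs.shape
    bags = token_embs.view(batch, seq_len // bag_size, bag_size, d_model)
    return bags.mean(dim=2)  # one averaged embedding per bag

# A 512-token sequence collapses to 128 "superposed" positions at bag_size=4.
embs = torch.randn(2, 512, 768)
print(bag_embeddings(embs, bag_size=4).shape)  # torch.Size([2, 128, 768])
```

Since attention cost grows quadratically with sequence length, a 4x shorter sequence would make each attention layer roughly 16x cheaper, which is the kind of saving the backpropagation speedup would come from.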

The second training phase is somewhat shorter: the model switches to standard next-token prediction, exactly as any LLM is normally trained. At this stage it quickly adapts to the final objective and recovers whatever quality was lost during the first phase. The transition between phases is smooth and natural for the network: no strange artifacts or incompatibilities appear.
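Below is a hedged sketch of how the two phases might be wired into one training loop. The phase split, the bag-level prediction target (here, the first token of the following bag), and the model interface (model.embed and model.backbone returning vocabulary logits) are all assumptions made for illustration; the article does not specify these details.

```python
# Hypothetical two-phase schedule: bagged training first, then standard
# next-token prediction for the shorter second phase. Specifics are assumptions.
import torch
import torch.nn.functional as F

BAG_SIZE = 4
PHASE1_STEPS = 80_000   # longer phase 1 on bagged sequences (illustrative split)

def training_step(model, tokens: torch.Tensor, step: int) -> torch.Tensor:
    """One loss computation; tokens is (batch, seq_len), divisible by BAG_SIZE."""
    embs = model.embed(tokens)                                 # (B, T, D)
    B, T, D = embs.shape
    if step < PHASE1_STEPS:
        # Phase 1: average neighbors into bags, giving a 4x shorter sequence.
        x = embs.view(B, T // BAG_SIZE, BAG_SIZE, D).mean(dim=2)
        logits = model.backbone(x)[:, :-1]                     # (B, T//BAG - 1, V)
        # Illustrative bag-level target: the first token of the next bag.
        targets = tokens[:, BAG_SIZE::BAG_SIZE]                # (B, T//BAG - 1)
    else:
        # Phase 2: ordinary next-token prediction over the full sequence.
        logits = model.backbone(embs)[:, :-1]                  # (B, T - 1, V)
        targets = tokens[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```

Note that the same weights serve both phases; only the input construction and the targets change at the boundary, which is why no architectural modifications are needed.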

The key advantage of TST is that the method does not touch the model's internal architecture. The parameter count is unchanged, and the surrounding tooling and ecosystem stay the same: the same weights, the same tokenizer, and whatever optimizer you already use, whether Adam, SGD, or anything else. During inference the model is fully compatible with existing deployment systems, which is critical for industrial applications, where an architecture change can mean rewriting a lot of code.

Models Used to Test the New Technique

Nous Research tested TST on models of different scales and architectures to verify the universality of the approach:

  • 270M parameters (small models for quick experiments)
  • 600M parameters (a standard size for research projects)
  • 3B parameters (a dense architecture)
  • 10B parameters with a Mixture of Experts (MoE) architecture

At every one of these scales, the method showed a consistent 2.5x acceleration in training compute, measured in FLOPs (floating-point operations). The results are encouraging: this is not a laboratory trick tied to one model size or architecture, but a universal approach that scales well and can be applied widely.

Why This Is Critical for the Industry

LLM pretraining is the most resource-intensive and economically costly stage of model development. Training a single large model requires thousands of hours of GPU-cluster time, and the electricity and hardware bills run into millions of dollars. A 2.5x speedup is not a 5-10% tweak: it cuts the compute needed for a run to 40% of the baseline, a reduction in total expenses that directly changes the economics of development.

For startups and small teams, this means the ability to train high-quality, competitive models on a smaller initial budget. For large labs such as Meta, Mistral, or OpenAI, it means room to try far more architecture variants, hyperparameters, and training strategies on the same infrastructure. That widens the boundaries of experimentation, accelerates the pace of innovation, and lets fresh ideas be tested faster.

What This Means

Token Superposition Training shows that even in an area as well studied as pretraining, there are still simple but powerful ways to save compute. It may inspire other researchers to look for similar optimizations at other stages of model training, from weight initialization to adaptive learning-rate schedules. For the industry, it is a positive signal that the boundary between fundamental research and industrial application is increasingly blurred, and good ideas find their way into production quickly.
