NVIDIA X-Token: distillation that beats GOLD by 3.82 points

NVIDIA released the X-Token method for optimizing small language models. X-Token is a knowledge distillation approach that corrects two structural limitations of the previous GOLD method and demonstrates significant improvements on standard benchmarks.
What is X-Token
X-Token is a Projection-Guided Cross-Tokenizer Knowledge Distillation method. Put simply, it transfers knowledge from a large model to a small one while accounting for the fact that the two may use different token vocabularies (tokens are the units into which a model breaks text). Small models often use their own tokenizers, the components that split text into tokens, and earlier distillation did not account for this. X-Token targets this problem.
The method introduces an intermediate projection layer that translates representations between two different token spaces. It's like a translator working at the very foundation level of the model. When a large model transfers knowledge to a small one, X-Token ensures that information is not lost in translation from one encoding method to another.
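The idea of projecting between token spaces can be illustrated with a toy sketch. The article does not say how X-Token builds its projection or where in the model it is applied, so everything here is an assumption for illustration: the tiny vocabularies, the hand-written row-stochastic `projection` matrix, and the choice to project output probabilities and compare them with a KL divergence. It is not NVIDIA's implementation.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project(teacher_probs, projection):
    """Map a distribution over the teacher vocabulary onto the student vocabulary.

    projection[i][j] is a hypothetical weight with which teacher token i
    contributes to student token j. Each row sums to 1, so the total
    probability mass is preserved.
    """
    out = [0.0] * len(projection[0])
    for p, row in zip(teacher_probs, projection):
        for j, w in enumerate(row):
            out[j] += p * w
    return out

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), used here as a toy distillation loss."""
    return sum(
        t * math.log((t + eps) / (q + eps))
        for t, q in zip(target, predicted)
        if t > 0
    )

# Toy vocabularies (assumed): the two tokenizers split text differently.
teacher_vocab = ["un", "happy", "unhappy", "!"]
student_vocab = ["unh", "appy", "!"]

# Hand-written projection (assumed): one row per teacher token.
projection = [
    [1.0, 0.0, 0.0],  # "un"      -> "unh"
    [0.0, 1.0, 0.0],  # "happy"   -> "appy"
    [0.5, 0.5, 0.0],  # "unhappy" -> split between "unh" and "appy"
    [0.0, 0.0, 1.0],  # "!"       -> "!"
]

teacher_probs = softmax([2.0, 1.0, 0.5, -1.0])  # teacher's next-token prediction
target = project(teacher_probs, projection)      # same prediction in student vocab
student_probs = softmax([1.5, 1.0, -0.5])        # student's own prediction
loss = kl_divergence(target, student_probs)      # what training would minimize

print("projected target:", [round(p, 4) for p in target])
print("distillation loss:", round(loss, 4))
```

In a real system the projection would presumably be learned or derived from the two tokenizers rather than written by hand. The sketch only shows why some mapping is needed before the teacher's and student's predictions can be compared at all.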
Results That Impress
On the Llama-3.2-1B model, X-Token demonstrates consistent superiority:
- Common benchmarks: 3.82 points higher on average than GOLD
- Mathematics (GSM8k): accuracy up from 2.56% to 15.54%
- MMLU: accuracy up from 24.0% to 24.7%
The gains are uneven. On mathematics the change is far from marginal: GSM8k accuracy rose roughly sixfold, from 2.56% to 15.54%. On MMLU the improvement is a modest 0.7 percentage points. For a small 1-billion-parameter model this still matters, because at that scale even small improvements in capability help it handle more complex tasks.
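The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported in this article; the helper names `absolute_gain` and `relative_factor` are made up for illustration.

```python
# Accuracy figures as reported in the article (percent).
gsm8k_before, gsm8k_after = 2.56, 15.54
mmlu_before, mmlu_after = 24.0, 24.7

def absolute_gain(before, after):
    """Improvement in percentage points."""
    return after - before

def relative_factor(before, after):
    """How many times higher the new score is than the old one."""
    return after / before

gsm8k_gain = absolute_gain(gsm8k_before, gsm8k_after)
gsm8k_factor = relative_factor(gsm8k_before, gsm8k_after)
mmlu_gain = absolute_gain(mmlu_before, mmlu_after)
mmlu_factor = relative_factor(mmlu_before, mmlu_after)

print(f"GSM8k: +{gsm8k_gain:.2f} points, x{gsm8k_factor:.2f}")  # GSM8k: +12.98 points, x6.07
print(f"MMLU:  +{mmlu_gain:.2f} points, x{mmlu_factor:.2f}")    # MMLU:  +0.70 points, x1.03
```

The "sixfold" figure therefore refers to the relative change on GSM8k (about 6.07 times), which corresponds to an absolute gain of 12.98 percentage points.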
Structural Errors in GOLD
The previous GOLD method ignored that the tokenizer in a small model could be completely different. This led to two problems: first, knowledge from the large model lost meaning when the small model translated it to its own vocabulary; second, distillation could not effectively utilize all the capabilities of the small model. X-Token embeds a projection between different token spaces into the distillation process. It's like a bridge between two information encoding systems. This is especially important when the small model is designed for fast execution on mobile or edge devices and has its own unique tokenizer to save memory.
What This Means
Small models are needed everywhere: on phones, in IoT devices, and on local servers where there is no cloud access or where latency is critical. X-Token shows that knowledge from a very large model can be efficiently 'compressed' into a small one that keeps its own vocabulary. This is a path to AI that runs everywhere, not just in the cloud. The roughly sixfold improvement on mathematics also suggests that small models are starting to gain real capabilities for practical tasks. Local AI could soon become the standard rather than the exception.