BorisovAI tested MoE on an RTX 4090 and showed why perplexity breaks LLM evaluation
AI-processed from Habr AI; edited by Hamidun News
BorisovAI published a breakdown of 22 experiments with a mixture-of-experts (MoE) architecture in which new experts are attached to a frozen language model as plugins. On a single RTX 4090, the scheme produced an almost ideal engineering picture: zero degradation of old skills, precise routing, and a notable reduction in perplexity. But when the system was tested on a mathematical benchmark, it turned out that an attractive metric could point in the wrong direction.
How the scheme was built
The researcher froze the base model entirely and added a small trainable expert to each MLP layer, plus a router on top with approximately 37 thousand parameters. The logic is simple: the backbone is left untouched, the new skill is trained separately, and then only the router is fine-tuned to send the right tokens to the right expert. Training a single new domain took about half an hour: roughly 15 minutes for the isolated expert and another 15 minutes for integration into the overall system. Across three scales, the scheme looked very convincing and with almost no trade-offs:
- GPT-2 124M with 4 domains reduced perplexity by 33.4%
- Pythia-410M with 6 domains reduced perplexity by 34.3%
- Pythia-1B with 8 domains reduced perplexity by 31.2%
- Routing accuracy reached 96%, and degradation of old skills remained at 0.000%
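The plug-in scheme described in this section can be illustrated with a toy layer. This is a minimal sketch, not the author's implementation: the class `PluggableMLP`, the hard top-1 routing, the dedicated "base only" route, and the additive way the expert output is combined are all assumptions made for illustration. It shows why a frozen backbone gives exactly zero degradation whenever the router sends a token down the base route.

```python
import math
import random


def linear(weights, x):
    """Multiply a (rows x cols) weight matrix by the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class PluggableMLP:
    """Toy MLP layer (hypothetical): frozen base weights, add-on experts, top-1 router."""

    def __init__(self, dim, rng):
        self.dim = dim
        self.rng = rng
        self.base_w = random_matrix(dim, dim, rng)  # frozen: never updated
        self.experts = []                           # one weight matrix per new domain
        self.router_w = [[0.0] * dim]               # row 0 is the "base only" route
        self.router_b = [0.0]

    def add_expert(self):
        """Attach a new expert and a matching router row; base weights stay untouched."""
        self.experts.append(random_matrix(self.dim, self.dim, self.rng))
        self.router_w.append([0.0] * self.dim)
        self.router_b.append(0.0)

    def base_forward(self, x):
        return [math.tanh(v) for v in linear(self.base_w, x)]

    def forward(self, x):
        logits = [v + b for v, b in zip(linear(self.router_w, x), self.router_b)]
        route = max(range(len(logits)), key=logits.__getitem__)
        out = self.base_forward(x)
        if route > 0:  # route 0 returns the frozen model's output unchanged
            delta = linear(self.experts[route - 1], x)
            out = [o + d for o, d in zip(out, delta)]
        return out, route


layer = PluggableMLP(dim=4, rng=random.Random(0))
x = [0.5, -1.0, 0.25, 2.0]
reference = layer.base_forward(x)

layer.add_expert()
out, route = layer.forward(x)          # router still prefers route 0
print(route, out == reference)

layer.router_b[1] = 5.0                # pretend the router learned to pick the expert
out, route = layer.forward(x)
print(route, out == reference)
```

In this sketch only the expert matrices and the router rows would ever be trained, which matches the two-step recipe in the article: train the expert in isolation, then fine-tune the router alone.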
Additionally, the author tested several popular techniques often recommended for MoE. Load balancing penalties made results worse by 11–27%, and joint training of experts and router led to quality collapse. Loss-free balancing worked best: it kept all experts "alive" without a separate additional loss. At this stage, everything looked like a strong argument for modular LLMs, where new capabilities could be plugged in without full retraining.
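The article does not describe the exact balancing rule, so the following is only a sketch of one common bias-based form of loss-free balancing: each expert gets a routing bias that is nudged up when the expert is under-used and down when it is over-used, with no extra loss term. The function names, the step size of 0.05, and the synthetic scores (built so that expert 0 initially takes every token) are all hypothetical.

```python
import random


def route_top1(scores, bias):
    """For each token, pick the expert with the highest score plus bias."""
    choices = []
    for token_scores in scores:
        biased = [s + b for s, b in zip(token_scores, bias)]
        choices.append(max(range(len(biased)), key=biased.__getitem__))
    return choices


def expert_load(choices, n_experts):
    """Count how many tokens each expert received."""
    counts = [0] * n_experts
    for c in choices:
        counts[c] += 1
    return counts


def update_bias(bias, counts, step=0.05):
    """Raise the bias of under-used experts and lower it for over-used ones."""
    target = sum(counts) / len(counts)
    return [b + step * ((n < target) - (n > target)) for b, n in zip(bias, counts)]


def sample_scores(rng, n_tokens=300):
    """Hypothetical router scores in which expert 0 always wins without a bias."""
    return [[rng.random() + 2.0, rng.random(), rng.random()] for _ in range(n_tokens)]


rng = random.Random(0)
bias = [0.0, 0.0, 0.0]
load_before = expert_load(route_top1(sample_scores(rng), bias), 3)

for _ in range(100):
    counts = expert_load(route_top1(sample_scores(rng), bias), 3)
    bias = update_bias(bias, counts)

load_after = expert_load(route_top1(sample_scores(rng), bias), 3)
print(load_before, load_after)
```

The bias only affects which expert is selected, so the training objective itself is left unchanged. That is the property the article credits for keeping every expert "alive" without an additional loss.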
Where the metric broke
Problems started when the architecture was transferred to Qwen 2.5 3B and a mathematical expert was added. By internal metrics, everything was again perfect: perplexity on mathematical texts dropped by 23.9%, the inter-domain gap grew 64.9 times, and the router selected the right expert with almost no errors. But on GSM8K, which tests the ability to solve school word problems, the model dropped from 74.4% to 65.8%.
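To see why the two numbers can diverge, it helps to recall what perplexity measures. The sketch below uses the standard definition, the exponential of the mean negative log-likelihood per token. The log-probabilities are made-up illustrative values, not measurements from the experiments.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that gives every correct next token probability 0.25 has perplexity 4.
uniform = [math.log(0.25)] * 10
print(round(perplexity(uniform), 6))

# Raising those probabilities to 0.5 halves perplexity, yet says nothing about
# whether a generated final answer is right.
sharper = [math.log(0.5)] * 10
print(round(perplexity(sharper), 6))

# GSM8K change reported in the article, in percentage points.
print(round(65.8 - 74.4, 1))
```

Perplexity is averaged over every token of a reference text, while a GSM8K score depends only on whether the final answer of a generated solution is correct, so nothing forces the two to move together.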
"A model that speaks the language of mathematics has no ability to solve problems."
This is the key finding of the entire work. The expert trained on textbooks and papers did master the statistics of mathematical language: where formulas typically appear, which terms occur together, and what "correct" text from this domain looks like. But GSM8K requires chains of reasoning, not style recognition. The perplexity reduction therefore measured better modeling of domain-specific text, not an actual improvement in reasoning. The researcher separately rechecked alternatives (unfreezing the upper layers, joint training, and a two-stage scheme), but every option landed at roughly the same drop of 8.4 to 8.6 percentage points.
What worked better
The breakthrough came with self-distillation. Instead of raw mathematical texts, the expert was trained on step-by-step solutions that the base model itself had already gotten right. For this, 750 GSM8K tasks were taken: the model solved 638 of them, and a dataset of 119 thousand tokens was assembled from those solutions. That is about 33 times smaller than the 4-million-token corpus of textbooks and papers, but the format turned out to be much closer to actual inference. The result ran against expectations. After this training, GSM8K accuracy rose to 75.5%, which is 1.1 percentage points above the base model and 9.7 points better than the variant trained on raw mathematical text. Perplexity, by contrast, worsened by 17.8%.
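The data-building step can be sketched as a filter: generate a solution with the base model, keep it only if its final answer matches the reference, and store it as a training example. This is a sketch under stated assumptions: `final_number`, `build_self_distilled_set`, the `solve` callable, and the exact "Question:/Answer:" template are hypothetical, and the canned solver below stands in for a real model call.

```python
import re


def final_number(text):
    """Return the last number in the text, or None if there is none.

    Commas are stripped first so that "1,200" parses as 1200 (a simplification).
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def build_self_distilled_set(problems, solve):
    """Keep only generated solutions whose final answer matches the gold answer."""
    kept = []
    for item in problems:
        solution = solve(item["question"])
        predicted = final_number(solution)
        if predicted is not None and predicted == float(item["answer"]):
            kept.append(f"Question: {item['question']}\nAnswer: {solution}")
    return kept


# Toy stand-in for the base model: one correct and one incorrect solution.
canned = {
    "What is 2 + 3?": "2 + 3 = 5",
    "What is 10 - 4?": "10 - 4 = 7",
}
problems = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "What is 10 - 4?", "answer": "6"},
]

dataset = build_self_distilled_set(problems, canned.get)
print(len(dataset))
print(dataset[0])
```

With the article's numbers, the same filter kept 638 of 750 solutions, or about 85%. The 4-million-token text corpus is roughly 33.6 times larger than the resulting 119 thousand tokens.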
It also turned out that even the packaging of the data matters: the "question/answer" format scored another 2–3 points higher than the more academic "problem/solution" format. In other words, it is more useful to train the expert on the form in which it will later be used than on an abstractly "high-quality" corpus.
An attempt to turn this approach into a self-improvement cycle did not take off. Early runs hinted at growth from 75.5% to 76.0%, but after the seed was fixed and the sample expanded, the effect turned out to be statistical noise. With a cold start, the new expert quickly hit a plateau. With a warm start, quality actually declined, because the same tasks were repeated too often between cycles and the expert overfitted. Label smoothing also failed: it cost another 9 points on mathematics.
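A quick back-of-the-envelope check shows why a gain of half a point is hard to distinguish from noise. The sketch below computes the binomial standard error of an accuracy estimate. The article does not state the size of the evaluation sample, so the 1,319 problems of the public GSM8K test split are used here purely as an assumption.

```python
import math


def accuracy_standard_error(accuracy, n_problems):
    """Binomial standard error of an accuracy estimate, as a fraction."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


# Assumption: evaluation on the full public GSM8K test split (1,319 problems).
n_problems = 1319
se = accuracy_standard_error(0.755, n_problems)
gain = 0.760 - 0.755

print(f"standard error: {se * 100:.2f} points")
print(f"observed gain:  {gain * 100:.2f} points")
```

Under that assumption the standard error is a little over one percentage point, more than twice the observed 0.5-point gain, which is consistent with the effect disappearing once the seed was fixed and the sample expanded.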
What this means
For LLM developers, two practical conclusions follow. First, a modular architecture with pluggable experts can indeed add domain skills without catastrophic forgetting. Second, evaluating such systems by perplexity alone is dangerous: perplexity can improve at exactly the moment the model begins to reason worse. If the task involves logic, code, or mathematics, the main criterion should be behavioral benchmarks, not language-modeling metrics that merely look good.