
ruGPT3XL Gains 8k Context: Restored Model Transcends 2k Limit with Minimal Losses


AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

ruGPT3XL, restored from an old Megatron-LM checkpoint, now has a full context window of 8 thousand tokens instead of the original 2 thousand, and it lost almost nothing in quality on short texts. Along the way, the project author fixed a critical bug in the attention mechanism that had made the early version of the model look noticeably worse than the original, even though it formally ran and generated text.

The project started as a technical restoration of an old Russian-language model: the ruGPT3XL weights were converted to Hugging Face format, a GGUF version was prepared for llama.cpp, and tests were run. At this stage it turned out that the conversion was not quite correct. Instead of the original sparse attention, the model was actually using the regular dense attention from GPT-2, so quality dropped sharply on long sequences.
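To make the difference between the two schemes concrete, here is a minimal sketch in plain Python. It is not the actual ruGPT3XL layout: the block-local pattern, the `block` and `local_blocks` parameters, and the even/odd layer alternation in `mask_for_layer` are all assumptions chosen for illustration. It only shows how a sparse causal mask drops connections that a dense GPT-2-style mask keeps.

```python
def dense_causal_mask(seq_len):
    """mask[i][j] is True when query i may attend to key j (any j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def block_sparse_causal_mask(seq_len, block, local_blocks):
    """Causal mask that keeps only keys in the `local_blocks` most recent blocks.

    Illustrative pattern only; the real model's sparsity layout may differ.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            near = (i // block) - (j // block) < local_blocks
            row.append(causal and near)
        mask.append(row)
    return mask


def mask_for_layer(layer_idx, seq_len, block=4, local_blocks=2):
    """Hypothetical alternation: even layers sparse, odd layers dense."""
    if layer_idx % 2 == 0:
        return block_sparse_causal_mask(seq_len, block, local_blocks)
    return dense_causal_mask(seq_len)


def count_allowed(mask):
    """Number of (query, key) pairs the mask lets through."""
    return sum(sum(row) for row in mask)
```

A model trained with the sparse pattern has never seen the extra connections that the dense mask allows, which is one plausible reading of why running the converted weights with dense attention hurt quality on long sequences.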

This was quickly confirmed by the perplexity metric: the first check showed a PPL of 50.1, whereas the figure reported for the original ruGPT3XL was 12.05.

After restoring alternating sparse attention, the situation changed dramatically. PPL dropped to 11.68, meaning the model returned to original values and began computing attention as intended in the original architecture.
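For readers less familiar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The sketch below uses that standard definition; the function name and the input format (a list of natural-log token probabilities) are assumptions for illustration, not the author's evaluation script.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over the given tokens.

    `token_log_probs` holds the natural-log probability the model assigned
    to each reference token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def mean_loss_from_ppl(ppl):
    """Inverse view: the average per-token loss (in nats) behind a PPL value."""
    return math.log(ppl)
```

Seen through `mean_loss_from_ppl`, the drop from 50.1 to 11.68 corresponds to the average per-token loss falling from roughly 3.91 to roughly 2.46 nats.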

In parallel, llama.cpp support had to be updated: the previous patch transferred the weights to GGUF but did not implement sparse attention itself, so the local version was also computing with a dense scheme. The author additionally fixed an error in the attention mask for batches of more than one example and added acceleration through SDPA, torch.compile, and Triton. On an RTX 4090, this sped up training by approximately 1.85x relative to the baseline eager implementation.

The main goal of the next stage was practical: remove the old 2048-token limit that hindered work with long chats and documents. For ruGPT3XL, however, it is not enough to simply change a number in the config. The model uses learned absolute positional embeddings, which cannot extrapolate properly to new positions without additional training, and the sparse attention scheme also depends on the maximum context length.

Therefore, the expansion was done in stages: first from 2k to 4k, then from 4k to 8k. For the new positions, positional embedding tiling was applied to avoid breaking already-learned short sequences, and the dataset was mixed from long and short examples in a 60 to 40 ratio. Training on the Gazeta dataset took about 2.6 hours in the first stage and 3.9 hours in the second. The result is tidy rather than flashy.
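The tiling step mentioned above can be sketched as follows. This is one reading of "tiling", namely cyclic repetition of the learned position table, and the function name and list-of-rows representation are assumptions for illustration. The property that matters is that the first rows stay untouched, so short sequences see exactly the embeddings they were trained with, while the new rows only provide a starting point for further training.

```python
def tile_position_embeddings(pos_emb, new_len):
    """Extend a learned position table to `new_len` rows by cyclic repetition.

    `pos_emb` is a list of rows (one embedding vector per position).
    Rows 0..len(pos_emb)-1 are copied unchanged; row i beyond that is
    initialised from row i % len(pos_emb).
    """
    old_len = len(pos_emb)
    if old_len == 0:
        raise ValueError("position table is empty")
    if new_len < old_len:
        raise ValueError("new_len must not be smaller than the current table")
    return [list(pos_emb[i % old_len]) for i in range(new_len)]


# Toy table with 4 positions and 2 dimensions, extended in two stages
# (4 -> 8 -> 16), mirroring the staged 2k -> 4k -> 8k expansion.
toy = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
stage1 = tile_position_embeddings(toy, 8)
stage2 = tile_position_embeddings(stage1, 16)
```

In a real model the table would be a tensor rather than nested lists, and the tiled rows would then be adjusted by the fine-tuning stages described above.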

On the original 2k window, the final 8k version showed PPL 11.77 versus 11.68 for the base model, a regression of only 0.09. At 4k the final figure was 11.99, and at the full 8k window it was 13.00, which looks very solid for a four-fold increase in context.

In terms of memory, the experiment proved viable too: thanks to sparse attention, the increase in consumption did not become catastrophic, and both training and inference fit on an RTX 4090 with 48 GB VRAM. During training at 8k, another practical problem arose, CUDA memory fragmentation, but it was worked around by setting expandable_segments, after which peak consumption dropped from 46.8 to 38.5 GB. Generation speed does decline as the prompt grows, but at the full 8k context the model still maintains about 38 tokens per second, so this is not just a research trick but a fully operational local scenario.
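The article does not say how expandable_segments was set. A common way, shown here as an assumption rather than the author's exact setup, is PyTorch's `PYTORCH_CUDA_ALLOC_CONF` environment variable, which has to be in place before the first CUDA allocation.

```python
import os

# Assumption: the allocator option is passed through the environment.
# Set it before torch initialises CUDA, e.g. at the very top of the
# training script or in the shell that launches it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# import torch  # import only after the variable is set
```

The same setting can be exported in the shell before launching training, which avoids any dependence on import order inside the script.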

For the Russian-language open-source segment, this is an important signal: old models do not have to be merely archived. They can be brought up to contemporary working condition if the architectural details are restored carefully and validation is not skimped on. In the case of ruGPT3XL, this is no longer a cosmetic update but a real increase in usefulness: the model came closer to the original in quality, received support in popular tools, and learned to work with long context without serious loss on short tasks.

