
IBM reveals how it built Granite 4.1: 15 trillion tokens, 512K context window, and focus on quality


AI-processed from Hugging Face Blog; edited by Hamidun News

On April 29, 2026, IBM published a detailed breakdown on the Hugging Face blog of how the Granite 4.1 family was created. The company revealed not only the model architecture but also the entire pipeline: from dataset composition and long context to SFT data filtering and multi-stage reinforcement learning.

How the Granite 4.1 Series is Structured

Granite 4.1 is a family of decoder-only dense models with 3B, 8B, and 30B parameters. All three variants use the same building blocks: Grouped Query Attention, Rotary Position Embeddings, SwiGLU, RMSNorm, and shared input and output embeddings.
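Two of the listed components are small enough to sketch in plain Python. The snippet below is a minimal illustration of RMSNorm and of how grouped-query attention maps query heads to shared key/value heads. It is not IBM's implementation, and the head counts (8 query heads, 2 key/value heads) and the `eps` value are hypothetical, since the article does not give them.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """In grouped-query attention, consecutive query heads share one key/value head."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# Hypothetical sizes, chosen only for illustration.
normed = rms_norm([3.0, 4.0], [1.0, 1.0])
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(normed)
print(mapping)
```

With fewer key/value heads than query heads, the key/value cache shrinks by the same ratio, which is the usual motivation for grouped-query attention in long-context models.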

The main difference between the variants is scale: the number of layers, the hidden state size, and the MLP parameters. Because the architecture is otherwise shared, the models can be compared within the family without correcting for architectural differences. IBM's main thesis is that the quality of a small model is determined not only by the compute budget but also by discipline in working with data.

Therefore, Granite 4.1 was built as a set of dense models rather than MoE models, and IBM bet on carefully adjusting the data mixture over the course of training.

All models are released under the Apache 2.0 license, and the instruct versions support 12 languages, including English, German, Spanish, Japanese, Arabic, Chinese, and Portuguese.

Five Training Stages

Pretraining of Granite 4.1 started from scratch and covered approximately 15 trillion tokens. IBM divided the process into five phases: first the model builds a broad language base on web data, then strengthens math and code, after which it gradually transitions to higher quality and specialized samples. In later phases, long reasoning trajectories, synthetic data, and instruction datasets are added to the mixture, and finally separate training occurs for handling very long context.

  • Phase 1: 10 trillion tokens of general pretraining, where about 59% of the mixture comes from CommonCrawl.
  • Phase 2: another 2 trillion tokens with a sharp increase in the share of math and code — up to 35% and 30% respectively.
  • Phase 3: 2 trillion tokens of high-quality annealing, where chain-of-thought, synthetic, and instruction data appear.
  • Phase 4: another 0.5 trillion tokens emphasizing the highest-quality mixture, with the learning rate decayed to zero.
  • Phase 5: long-context extension, which expands the window from 4K to 32K, 128K, and then to 512K.
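The budget in the list above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the article. It assumes the quoted shares apply to the whole phase, and it leaves Phase 5 out of the total because the article gives no token count for it.

```python
# Token budgets in trillions, as listed for phases 1-4.
PHASE_TOKENS_T = {
    "phase1_general": 10.0,
    "phase2_math_code": 2.0,
    "phase3_annealing": 2.0,
    "phase4_final_annealing": 0.5,
}

total_t = sum(PHASE_TOKENS_T.values())

# Shares quoted in the article; assumed here to hold across the whole phase.
commoncrawl_phase1_t = PHASE_TOKENS_T["phase1_general"] * 0.59
math_phase2_t = PHASE_TOKENS_T["phase2_math_code"] * 0.35
code_phase2_t = PHASE_TOKENS_T["phase2_math_code"] * 0.30

# Context window sizes (in thousands of tokens) across the Phase 5 stages.
CONTEXT_K = [4, 32, 128, 512]
growth = [b // a for a, b in zip(CONTEXT_K, CONTEXT_K[1:])]

print(f"phases 1-4 total: {total_t} trillion tokens")
print(f"CommonCrawl in phase 1: about {commoncrawl_phase1_t:.1f} trillion")
print(f"math in phase 2: about {math_phase2_t:.1f} trillion")
print(f"code in phase 2: about {code_phase2_t:.1f} trillion")
print(f"context growth per stage: {growth}")
```

The four listed phases sum to 14.5 trillion tokens, which is consistent with the "approximately 15 trillion" figure.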

To prevent long-context training from degrading performance on short queries, IBM merges the model after each long-context extension (LCE) stage. For the final expansion to 512K in the 8B and 30B versions, a mixture of books and code repositories was used. On the base models this produced noticeable results on the RULER benchmark: the 8B variant keeps high scores even at 128K, and the 30B variant scores higher still. This is an important signal for teams that need not only chat responses but also work with long documents, logs, and large code snippets.
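The article does not say which merging method IBM uses or which checkpoints are combined. As one common possibility, the sketch below assumes linear interpolation between the checkpoint before a long-context stage and the checkpoint after it. The function name, the `alpha` weight, and the toy parameter names are all hypothetical.

```python
def merge_checkpoints(base, extended, alpha=0.5):
    """Linearly interpolate two checkpoints with identical parameter names and sizes.

    alpha=0 returns the base weights, alpha=1 returns the extended weights.
    """
    if base.keys() != extended.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name, base_values in base.items():
        ext_values = extended[name]
        if len(base_values) != len(ext_values):
            raise ValueError(f"size mismatch for {name}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * e for b, e in zip(base_values, ext_values)
        ]
    return merged


# Toy checkpoints: flat lists stand in for real tensors.
short_ctx = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
long_ctx = {"layer0.weight": [3.0, 4.0], "layer0.bias": [1.0]}
merged = merge_checkpoints(short_ctx, long_ctx, alpha=0.5)
print(merged)
```

The intuition behind this kind of merge is that the averaged weights stay close to the short-context model while keeping part of what the long-context stage learned.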

Fine-tuning and Quality

After pretraining, IBM ran the SFT dataset through a strict quality control loop. About 4.1 million examples made it to the final selection, but before that each answer was checked through an LLM-as-Judge scheme and a set of deterministic rules.
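A quality gate of this kind, combining a judge model with deterministic rules, could look roughly like the sketch below. It is an illustration under stated assumptions, not IBM's pipeline: the record fields, the flag names, the score threshold, the length limits, and the `toy_judge` stand-in for a real LLM call are all hypothetical.

```python
import hashlib
import re

# Hypothetical flag names for hard rejections.
HARD_REJECT_FLAGS = {"hallucination", "false_premise", "computational_error"}


def normalize(text):
    """Collapse whitespace and lowercase so near-identical texts hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passes_rules(example, min_chars=1, max_chars=8000):
    """Deterministic checks: schema validation plus a length filter."""
    for key in ("prompt", "response"):
        value = example.get(key)
        if not isinstance(value, str) or not value.strip():
            return False
    return min_chars <= len(example["response"]) <= max_chars


def filter_sft(examples, judge, min_score=0.7):
    """Keep examples that pass the rules, the judge, and global deduplication."""
    seen = set()
    kept = []
    for example in examples:
        if not passes_rules(example):
            continue
        verdict = judge(example)  # hypothetical shape: {"score": float, "flags": set}
        if verdict["flags"] & HARD_REJECT_FLAGS or verdict["score"] < min_score:
            continue
        text = normalize(example["prompt"] + "\n" + example["response"])
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(example)
    return kept


def toy_judge(example):
    """Stand-in for an LLM call: flags one known-bad arithmetic answer."""
    if "2 + 2 = 5" in example["response"]:
        return {"score": 0.1, "flags": {"computational_error"}}
    return {"score": 0.9, "flags": set()}


data = [
    {"prompt": "Add 2 and 2.", "response": "2 + 2 = 4"},
    {"prompt": "Add 2 and 2.", "response": "2 + 2 = 5"},
    {"prompt": "add 2 and 2.", "response": "2 +  2 = 4"},  # duplicate after normalization
    {"prompt": "Say hi.", "response": ""},  # fails schema validation
]
clean = filter_sft(data, toy_judge)
print(len(clean), clean)
```

Running the cheap deterministic rules before the judge avoids paying for a model call on examples that would be rejected anyway.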

The evaluator model assessed instruction following, correctness, completeness, brevity, naturalness, and calibration, while hard rejection reasons included hallucinations, false premises, and computational errors. Additionally, normalization, schema validation, length filters, and global deduplication were applied.

At the RL stage, IBM didn't limit itself to one pass. The company used on-policy GRPO with DAPO loss and ran four sequential stages: multi-domain RL, RLHF for general utility and dialogue, identity and knowledge-calibration RL, and then separate math RL, which restores and improves mathematical skills after RLHF. According to IBM, RLHF alone added an average of about 18.9 points on AlpacaEval relative to SFT checkpoints.
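For readers unfamiliar with GRPO, its core idea is to score each sampled answer relative to the other answers for the same prompt, rather than against a learned value function. The sketch below shows that group-relative advantage together with a clipped surrogate term that uses separate lower and upper clipping bounds, as DAPO-style losses do. It is a generic illustration, not IBM's training code: the use of population standard deviation, the `eps` constant, and the clipping values are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token surrogate with an asymmetric clipping range (illustrative values)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
print(clipped_objective(1.5, 1.0))
print(clipped_objective(0.5, -1.0))
```

When every answer in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.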

The most notable result is that the Granite 4.1 8B instruct model is consistently on par with Granite 4.0-H-Small 32B-A9B and outperforms it on several benchmarks.

In parallel, IBM released FP8 variants, which roughly halve memory and disk space requirements.
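The "roughly halve" claim follows from bytes per parameter. The estimate below assumes the regular checkpoints store weights in a 16-bit format (2 bytes per parameter), which the article does not state, and uses the nominal parameter counts. It covers weights only, not activations or the key/value cache.

```python
# Assumed storage cost per parameter, in bytes.
BYTES_PER_PARAM = {"16-bit": 2, "fp8": 1}

# Nominal sizes from the model names; real parameter counts will differ somewhat.
NOMINAL_PARAMS = {"3B": 3_000_000_000, "8B": 8_000_000_000, "30B": 30_000_000_000}


def weight_gb(num_params, dtype):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in NOMINAL_PARAMS.items():
    print(f"{name}: {weight_gb(params, '16-bit'):.0f} GB -> {weight_gb(params, 'fp8'):.0f} GB")
```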

What This Means

IBM demonstrated that competing in open-source LLMs is possible not only through model size but also through the quality of the training recipe. For companies, this makes Granite 4.1 a practical candidate: predictable latency without long reasoning traces, long context, an open license, and lower running costs compared to heavier systems.
