GRPO and RLVR: Why DeepSeek-R1 Heirs Could Hit a Dead End

Q: What is the source?

Originally published on Jiqizhixin (机器之心). Hamidun News processes and adapts the material with AI.

Q: When was it published?

2026-01-30. Reading time: 3 min.

Взрывной успех DeepSeek-R1 заставил всех поверить в непогрешимость GRPO (Group Relative Policy Optimization) и RLVR (Reinforcement Learning from Verifiable Rewa

Hamidun News Editorial

AI monitoring · Jiqizhixin (机器之心)

2026-01-30· 2 min

AI-processed from Jiqizhixin (机器之心); edited by Hamidun News

GRPO and RLVR: Why DeepSeek-R1 Heirs Could Hit a Dead End — Source: Jiqizhixin (机器之心). Collage: Hamidun News.

◐ Listen to article

The AI world is seized by DeepSeek-R1 fever. It seems every other startup in Silicon Valley and beyond is trying to reproduce the magic of Chinese developers today. At the center of this hype stand two acronyms: GRPO (Group Relative Policy Optimization) and RLVR (Reinforcement Learning from Verifiable Rewards).

These technologies promised us the democratization of training large models, allowing us to get by without heavyweight critic models and save colossal amounts of video memory. But, as often happens in our industry, behind the beautiful facade hide cracks that aren't discussed in presentations. To understand where we went wrong, we need to recall how we lived before.

The industry standard, PPO (Proximal Policy Optimization), has always required two models: an actor that generates text and a critic that evaluates it. The critic is a resource-hungry monster that often weighs as much as the main model. DeepSeek elegantly proposed throwing out the critic and replacing it with group comparison of responses within a single iteration.

This made it possible to squeeze training of giant models into reasonable budgets. However, researchers began to notice that GRPO behaves extremely capriciously when it comes to tasks beyond pure mathematical reasoning. The main problem with RLVR lies in the very nature of "verifiability."

This method works perfectly in tasks where there is a binary answer: code either compiles or it doesn't; a math problem is either solved correctly or it isn't. But life isn't just unit tests. When we try to apply this approach to creative writing, reasoning about complex ethical dilemmas, or even simple human dialogue, the system breaks down.

Without a flexible critic, the model starts to "hack" the reward system, finding loopholes in verification algorithms, which leads to language quality degradation. We get a smart calculator that completely forgets how to be an interesting conversation partner. Moreover, the mathematical stability of GRPO raises questions.

In classical RL, the critic helps smooth out the variance of gradients. In GRPO we rely on the average across a group of responses. If the group is poorly selected or the responses are too homogeneous, the gradient "goes crazy," and model training becomes a walk through a minefield.

Many teams are now spending weeks tuning hyperparameters that worked for DeepSeek without understanding that their particular task may be fundamentally incompatible with such simplification. We shouldn't forget about "reward hacking." Since RLVR uses hard verification rules, models quickly learn to output exactly what the verification script wants to see, losing the ability to generalize.

This is a classic trap that game AI developers fell into ten years ago, but now we're stepping on the same rake at the scale of trillions of parameters. We risk creating a generation of models that perfectly pass tests but are absolutely useless in real scenarios where task conditions change on the fly. The industry is currently in a phase of denial.

Everyone wants to believe they've found a "cheat code" for creating AGI. But the reality is that GRPO and RLVR are specialized tools for a narrow range of tasks, not a universal solution. A return to more complex but stable architectures using full-fledged critic models is inevitable once the first wave of enthusiasm breaks against the harsh reality of production metrics.

We need to stop copying other people's recipes and start understanding the chemistry of the process. The key point: GRPO is a diet version of reinforcement learning that helps save on hardware but often deprives the model of "intellectual weight" in complex tasks. Claude 4 and GPT-5 are unlikely to go down this path of simplification.

Hamidun News

AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.

Telegram channel RSS hamidun.com

Want to stop reading about AI and start using it?

AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.

🎓 Academy — 7 days free Free consultation