Training

Reinforcement Learning with Verifiable Rewards (RLVR)

Reinforcement Learning with Verifiable Rewards (RLVR) is a training approach in which RL reward signals come from objective, programmatically checkable criteria — such as numerical correctness of a math answer or code passing unit tests — rather than from a learned neural reward model.

The contrast is with approaches in which the reward comes from a neural network trained on human preference annotations. The term was widely adopted following the release of DeepSeek-R1 in January 2025, though the underlying principle appeared in earlier work on code generation and mathematical reasoning.

In practice, RLVR applies policy-gradient algorithms, most commonly Group Relative Policy Optimization (GRPO) or a variant of PPO, with a deterministic checker as the reward function. For a math problem, the model's final numerical answer is compared to a ground-truth value (correct = +1, incorrect = 0); for code, the generated program is executed against hidden unit tests and scored on pass rate. Because no learned reward model is involved, this setup removes one major source of reward hacking: outputs that exploit surface-level linguistic patterns to score well without being genuinely correct. It does not eliminate reward hacking entirely, since a policy can still exploit gaps in the checker itself, such as weak test coverage or lenient answer parsing.
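The two checkers described above can be sketched in a few lines of Python. This is a minimal illustration, not the reward code of any particular framework: the function names `math_reward` and `code_reward`, the `solve` entry point, and the test-case format are all assumptions made for the example.

```python
from fractions import Fraction


def math_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final answer equals the reference, else 0.0."""
    try:
        # Fraction parses "3", "0.5" and "1/2" exactly, avoiding float rounding.
        same = Fraction(model_answer.strip()) == Fraction(ground_truth.strip())
    except (ValueError, ZeroDivisionError):
        # Unparseable output is treated as an incorrect answer.
        return 0.0
    return 1.0 if same else 0.0


def code_reward(program_source: str, tests: list, entry_point: str = "solve") -> float:
    """Pass-rate reward: fraction of (args, expected) test cases that pass.

    `entry_point` is a hypothetical convention for this sketch: the generated
    program is assumed to define a function with that name.
    """
    namespace: dict = {}
    try:
        # WARNING: exec runs untrusted code in-process. A real pipeline would
        # use a sandbox with time and memory limits instead.
        exec(program_source, namespace)
        func = namespace[entry_point]
    except Exception:
        # Syntax errors or a missing entry point earn no reward.
        return 0.0

    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case counts as a failure
    return passed / len(tests) if tests else 0.0
```

Both functions are deterministic: the same output always receives the same score, which is what distinguishes them from a learned reward model. Note that the leniency of the parser (here, accepting "1/2" for "0.5") is itself a design choice that a policy could exploit if it were too permissive.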

RLVR has become the dominant approach for training reasoning-specialized models because verifiable domains provide abundant, reliable training signal without per-response human annotation. The DeepSeek-R1 technical report showed that a base model trained with RLVR on mathematical and coding problems (DeepSeek-R1-Zero) spontaneously developed extended chain-of-thought reasoning, including self-correction and multi-step exploration, without any supervised reasoning traces. Similar results were reported by Qwen, Kimi, and several academic groups within months of that release.

As of mid-2026, RLVR is a core training stage for frontier reasoning models from most major labs. Research is expanding the verifiable reward paradigm beyond math and code to formal theorem proving with Lean 4 proof checkers, structured scientific data generation, and database query synthesis. Open-source RLVR training frameworks such as OpenRLHF and verl have lowered the barrier to replication for smaller research teams.

Example

A reasoning model trained with RLVR on competition mathematics datasets is rewarded only when it produces the exact correct numerical answer, causing it to learn longer, self-checking reasoning chains rather than surface-level pattern-matching shortcuts.
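To make the mechanism concrete, the sketch below shows how binary rewards on a group of sampled answers could be turned into GRPO-style advantages. It is an illustration under stated assumptions: the sampled answers are made up, the helper name `group_relative_advantages` is hypothetical, and the use of the population standard deviation with a small `eps` is one common choice rather than a fixed part of the algorithm.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize each reward against the mean and std of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled final answers to one problem
# whose reference answer is 42.
sampled_answers = ["42", "41", "42.0", "7"]
rewards = [1.0 if float(a) == 42.0 else 0.0 for a in sampled_answers]

# Correct samples receive positive advantages, incorrect ones negative.
advantages = group_relative_advantages(rewards)
```

When every sample in a group is correct, or every sample is wrong, all advantages are zero and that problem contributes no gradient, so the training signal comes from problems the model solves only some of the time.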

