Verifiable rewards: how AWS improves neural network training
AWS has presented an RLVR training workflow that uses objectively verifiable rewards instead of approximate evaluations. The technique is demonstrated on mathematical reasoning tasks.

AWS presented a new approach to training models with reinforcement learning — Reinforcement Learning with Verifiable Rewards (RLVR), which introduces verification and transparency into reward signals. Instead of approximate quality assessments of responses, the model receives a reward only if the result is completely correct and can be objectively verified.
The Problem with Traditional RL
In standard reinforcement learning, a reward function scores the quality of the model's actions. These scores are often inaccurate, because it is difficult to write a function that correctly evaluates complex behavior. The model can then end up optimizing for the wrong thing, an effect known as reward hacking. RLVR addresses the problem at its source: a reward is issued only for a completely correct result. This is possible only in tasks where the answer can be unambiguously verified. In those tasks the model learns from ground truth, not from approximate assessments.
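As a rough illustration of the idea, a verifiable reward for a numeric math answer can be written as a plain function that returns 1.0 for an exact match and 0.0 otherwise. This is a minimal sketch, not AWS's implementation: the names `parse_number` and `verifiable_reward` are hypothetical, and it assumes the final answer has already been extracted as a short string.

```python
from fractions import Fraction
from typing import Optional


def parse_number(text: str) -> Optional[Fraction]:
    """Parse strings such as '72', '1,200', '0.5' or '3/4' into an exact Fraction."""
    cleaned = text.strip().replace(",", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        # Not a number (or a fraction with a zero denominator): nothing to verify.
        return None


def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Hypothetical binary reward: 1.0 only for an exact numeric match, else 0.0."""
    predicted = parse_number(model_answer)
    expected = parse_number(reference_answer)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if predicted == expected else 0.0
```

There is no partial credit: an answer that is close, or that is wrapped in extra words the parser cannot read, earns nothing. That strictness is what leaves little room for reward hacking.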
Where Verifiable Rewards Work
Verification is applicable wherever the result has an objective criterion:
- Mathematical reasoning — theorem proving, equation solving. The answer is either mathematically correct or not
- Code generation — syntax is checked by a parser, functionality by tests. No room for subjectivity
- Symbolic manipulation — logical transformations, algebra. Verification is fully automated
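- Structured data extraction — when the task defines a required output format, the result can be validated automatically against it
For tasks without an objective check (e.g., open-ended text generation, design), RLVR is a poor fit: there is no verifier to produce the reward signal.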
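The code generation case from the list above can be sketched as a two-stage check: the parser rejects invalid syntax, then test cases decide whether the function behaves correctly. The function name `code_reward` and the test-case format are assumptions made for this illustration. The sketch also runs the candidate with `exec`, which a real training pipeline would replace with an isolated sandbox.

```python
import ast
from typing import List, Tuple


def code_reward(source: str, func_name: str,
                test_cases: List[Tuple[tuple, object]]) -> float:
    """Hypothetical binary reward for generated code: 1.0 only if all tests pass."""
    # Stage 1: syntax is checked by the parser.
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0.0

    # Stage 2: functionality is checked by tests.
    # WARNING: exec runs untrusted code in this process; use a sandbox in practice.
    namespace: dict = {}
    try:
        exec(compile(tree, "<candidate>", "exec"), namespace)
        func = namespace[func_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # A missing function or a runtime error counts as a failed check.
        return 0.0
    return 1.0
```

A candidate that parses but returns a wrong value on any test earns 0.0, the same as one that fails to parse.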
GRPO + Few-Shot Learning
AWS pairs RLVR with Group Relative Policy Optimization (GRPO), a policy optimization algorithm. Rather than scoring each response in isolation, GRPO samples a group of responses to the same prompt and scores each one relative to the rest of the group. This is intended to speed up convergence and reduce the risk of settling in a poor local optimum. Few-shot examples add another layer: the model first sees several solved examples (typically 3–5), after which it trains on the full dataset. This helps establish the desired behavioral pattern before optimization begins. The three parts complement each other: verifiable rewards provide a clean signal, GRPO accelerates the search for the optimum, and few-shot examples establish the format.
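The two ingredients described above can be sketched in a few lines. The first function computes group-relative advantages the way GRPO is commonly described: each reward is normalized by the mean and standard deviation of its own group. The second builds a few-shot prompt from solved examples. Both function names, the prompt template and the `eps` constant are assumptions for illustration, and implementations differ on details such as which standard deviation estimator they use.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each response relative to the other responses to the same prompt."""
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every response gets the same reward.
    return [(r - group_mean) / (group_std + eps) for r in rewards]


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Prepend solved (question, answer) pairs so the model sees the format first."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)
```

With binary verifiable rewards, a group where two of four responses are correct yields positive advantages for the correct ones and negative advantages for the rest. A group where every response gets the same reward yields all zeros, so it contributes no learning signal.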
Results on GSM8K
AWS tested the approach on GSM8K, a dataset of roughly 8,500 grade-school math word problems of varying difficulty. The model trained with RLVR showed a significant improvement in accuracy over baseline methods. The key point is that verification is built into the training process rather than added as a check at the end. This lets the model learn from verified correct answers instead of trying to satisfy an approximate reward function. The methodology is expected to transfer to adjacent domains such as code generation, logic checking, and configuration validation.
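To make the GSM8K check concrete: reference solutions in the dataset end with a line of the form `#### <final answer>`, so the ground truth can be read off mechanically. The sketch below is not AWS's evaluation code. It assumes the model's answer is the last number in its completion, which is a common but imperfect heuristic, and the helper names are hypothetical.

```python
import re
from typing import List, Optional

# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Read the final answer that follows the '####' marker in a GSM8K solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(completion: str) -> Optional[str]:
    """Assumption: the last number in the completion is the model's final answer."""
    matches = _NUMBER.findall(completion)
    return matches[-1].replace(",", "") if matches else None


def accuracy(completions: List[str], solutions: List[str]) -> float:
    """Fraction of completions whose final number equals the reference answer."""
    correct = 0
    for completion, solution in zip(completions, solutions):
        predicted = predicted_answer(completion)
        if predicted is not None and float(predicted) == float(reference_answer(solution)):
            correct += 1
    return correct / len(solutions)
```

The same per-example comparison can serve as the reward during training, so the training signal and the reported accuracy come from the same check.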
What This Means
Verifiable rewards represent a shift from heuristics to verification at the core of learning. For engineers and researchers, the practical takeaway is this: if your task admits objective verification, RLVR can deliver higher accuracy and fewer reward-hacking artifacts. AWS is preparing this approach for scaling through SageMaker AI, which should ease adoption for cloud users.