AWS Machine Learning Blog→ original

AWS explained the launch of reinforcement fine-tuning in Amazon Bedrock via OpenAI-compatible APIs

AWS released a technical breakdown of reinforcement fine-tuning in Amazon Bedrock via OpenAI-compatible APIs. The scenario is as follows: configure the…

AI-processed from AWS Machine Learning Blog; edited by Hamidun News
AWS explained the launch of reinforcement fine-tuning in Amazon Bedrock via OpenAI-compatible APIs
Source: AWS Machine Learning Blog. Collage: Hamidun News.
◐ Listen to article

AWS released a detailed breakdown of how to run reinforcement fine-tuning in Amazon Bedrock through OpenAI-compatible APIs. Essentially, the company offers a route familiar to developers: the same OpenAI SDK, but with Bedrock as the platform for training, evaluation, and inference.

How the Process Works

Reinforcement fine-tuning, or RFT, is needed in cases where it's not enough to simply show models the correct answers, as in classical supervised fine-tuning. Here, the model generates multiple answer variants for the same prompt, and then a separate reward function assigns them a numerical score. Amazon Bedrock takes this signal and runs the optimization cycle itself through the GRPO algorithm. For a team, this means there's no need to build heavy infrastructure for reinforcement learning: orchestration, parallelization, checkpoints, and metrics are handled by the service.

In a practical walkthrough, AWS shows that the entry point to this scenario has been made as similar as possible to the familiar OpenAI stack. A developer only needs to point `OPENAI_BASE_URL` at the regional Bedrock Mantle endpoint and pass an `OPENAI_API_KEY` generated for Bedrock. After that, you can use the same calls: `client.files.create()`, `client.fine_tuning.jobs.create()`, and `client.chat.completions.create()`. That is, the barrier is not in a new SDK, but in how well you've formalized the criterion for answer quality.

Data and Reward Function

In the example, AWS uses the GSM8K dataset for school math problems. Data is loaded through the Files API in JSONL format: each line contains a `messages` block, and for tasks being evaluated, a `reference_answer` is added. This format allows not only sending a question to the model, but also preserving the reference answer or verification rule.

In the walkthrough, it's separately shown that the prompt can be pre-structured so that the final answer is easy to extract automatically — for example, in a special format like `\boxed{}` or after a `####` marker.

The key node of the entire scheme is the reward function in AWS Lambda. In the demonstration, it receives trajectories, finds the last assistant response, extracts the correct answer from `reference_answer`, and returns a score from 0 to 1. For mathematics, this is simply a binary check, but the logic isn't limited to just such cases. AWS separately emphasizes that custom rules can be built into Lambda, and for less formalizable tasks, a model-as-a-judge approach can be used. Plus an important point for enterprise: data doesn't leave AWS's protected environment during the process and isn't used to train Bedrock models.

Training and Launch

The training launch itself looks quite compact: in `fine_tuning.jobs.create()`, you pass the base model, training file, method type `reinforcement`, the Lambda grader ARN, and a set of hyperparameters. The example features `openai.gpt-oss-20b`, one epoch, `batch_size=4`, and `learning_rate_multiplier=1.0`, although documentation recommends starting with a value below one for stability. Then Bedrock itself creates the job, counts steps, and saves intermediate checkpoints that can be used for quality evaluation before training is complete.

During training, AWS suggests monitoring not just job status, but also events with metrics. In the example, a job on a GSM8K subset runs 67 steps, and the reward curve rises from approximately 0.56 to the range of 0.85–0.97 by mid-training. At the same time, answers become shorter, which the authors interpret as a sign that the model has learned to solve tasks more accurately and without unnecessary verbosity.

  • `critic_rewards_mean` — the main signal: if it grows, the model is learning
  • `actor_entropy` — shows whether answer diversity is collapsing into mode collapse
  • `actor_grad_norm` — helps notice instability if gradients start jumping sharply
  • `response_length_mean` — useful against reward hacking, when the model starts inflating answers for the score

After job completion, the model doesn't need to be deployed separately. It's enough to get `fine_tuned_model` from the job details and immediately call it through the Chat Completions API or Responses API, including streaming. This is the main practical advantage of the entire scheme: training and inference remain in the same API landscape.

Bedrock documentation separately clarifies that the OpenAI-compatible path for fine-tuning is currently available for `openai.gpt-oss-20b` and `qwen.qwen3-32b` in the `us-west-2` region.

"No separate endpoint and hosting."

What This Means

AWS clearly wants to make reinforcement fine-tuning not a research curiosity, but a normal engineering tool. If a team already has code for the OpenAI SDK and clear reward logic, the entry into RFT becomes noticeably easier: you can start with 100–200 examples, check metrics, compare checkpoints, and understand whether the tuning will yield a cheaper and faster model for a specific task. This is especially interesting for mathematics, code, and other scenarios where answer quality can be verified automatically.

ZK
Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.

Want to stop reading about AI and start using it?

AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.

What do you think?
Loading comments…