AWS Shows How to Fine-Tune Amazon Nova via Nova Forge SDK and SageMaker Jobs

Q: What is the source?

Originally published on AWS Machine Learning Blog. Hamidun News processes and adapts the material with AI.

Q: When was it published?

Apr 30, 2026. Reading time: 3 min.


Hamidun News Editorial





AWS has published a practical walkthrough for customizing Amazon Nova models with Nova Forge SDK and Amazon SageMaker AI. The guide covers the entire cycle, from baseline evaluation of the base model through SFT and RFT to deploying a custom inference endpoint.

Scenario and Data

AWS positions Nova Forge SDK as a layer that removes the most tedious parts of LLM customization: infrastructure preparation, image selection, configuration validation, and launching training recipes. Instead of building a pipeline by hand, a developer gets ready-made components for loading a dataset, converting its format, starting a job in SageMaker, and then evaluating the results. The article demonstrates this not with a toy example but with a concrete practical task: automatically classifying Stack Overflow questions by quality.

For the experiment, AWS took the Stack Overflow Question Quality dataset of 60 thousand questions from 2016–2020 and randomly selected 4,700 records. The model had to assign each question to one of three categories: HQ, LQ_EDIT, or LQ_CLOSE. Of these records, 3,500 went to SFT, 500 to evaluation, and 700 to RFT. The RFT set was additionally supplemented with all 3,500 SFT records so that the model would not forget the answer format it had already learned.
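The split arithmetic can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `split_records`, the fixed seed, and the use of integers as stand-in records are all hypothetical, and the original article does not say exactly how the sampling was implemented.

```python
import random


def split_records(records, n_sft=3500, n_eval=500, n_rft=700, seed=0):
    """Draw a random sample and carve it into SFT, evaluation, and RFT sets.

    The sizes mirror the numbers quoted in the article; the procedure
    itself is an assumption made for illustration.
    """
    total = n_sft + n_eval + n_rft  # 4,700 with the default sizes
    if len(records) < total:
        raise ValueError(f"need at least {total} records, got {len(records)}")

    rng = random.Random(seed)
    sample = rng.sample(records, total)  # sampling without replacement

    sft = sample[:n_sft]
    eval_set = sample[n_sft:n_sft + n_eval]
    rft_only = sample[n_sft + n_eval:]

    # The RFT set reuses every SFT record so the learned answer format
    # keeps being reinforced during the reinforcement stage.
    rft = rft_only + sft
    return {"sft": sft, "eval": eval_set, "rft": rft}


# Stand-in for the 60 thousand questions: one integer ID per record.
records = list(range(60_000))
splits = split_records(records)
print({name: len(items) for name, items in splits.items()})
# {'sft': 3500, 'eval': 500, 'rft': 4200}
```

The evaluation set stays disjoint from both training sets, which is what makes the metrics reported later meaningful.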

How Training Progressed

The experiment is divided into four steps: a baseline evaluation of the pre-trained Nova 2.0, supervised fine-tuning (SFT), reinforcement fine-tuning (RFT), and finally deployment on Amazon SageMaker AI Inference. To load the CSV, verify the schema, and transform the data, AWS uses the CSVDatasetLoader class; to launch compute jobs, it uses SMTJRuntimeManager. In the example, SFT runs on four ml.p5.48xlarge instances, and the SDK validates environment and parameter compatibility in advance so that errors surface before the job starts rather than after.

Baseline shows how the model behaves without fine-tuning
SFT teaches the expected answer format and the domain-specific patterns
RFT refines the model's decisions through a reward function
Deployment can be done either in Bedrock or in SageMaker
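
To give a feel for what the data-loading step involves, here is a minimal standard-library sketch of schema checking and format conversion. It is not the CSVDatasetLoader API, whose interface is not described here. The column names (`Title`, `Body`, `Y`), the prompt wording, and the `prompt`/`completion` output keys are assumptions for illustration.

```python
import csv
import io

LABELS = {"HQ", "LQ_EDIT", "LQ_CLOSE"}
REQUIRED_COLUMNS = ("Title", "Body", "Y")  # assumed column names


def load_examples(csv_text):
    """Validate a CSV of questions and convert rows to prompt/completion pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))

    # Fail early if the header lacks a column the conversion depends on.
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    examples = []
    for row_no, row in enumerate(reader, start=1):
        label = row["Y"].strip()
        if label not in LABELS:
            raise ValueError(f"data row {row_no}: unknown label {label!r}")
        prompt = (
            f"Title: {row['Title']}\n"
            f"Body: {row['Body']}\n"
            "Classify this question as HQ, LQ_EDIT, or LQ_CLOSE."
        )
        examples.append({"prompt": prompt, "completion": label})
    return examples


sample_csv = (
    "Title,Body,Y\n"
    "How do I sort a dict by value?,I tried sorted() but got keys only.,HQ\n"
    "help pls,code not work,LQ_CLOSE\n"
)
examples = load_examples(sample_csv)
print(len(examples), examples[0]["completion"])
# 2 HQ
```

Catching a bad header or an unknown label locally is cheap; finding the same problem after a multi-instance training job has started is not.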

For RFT, AWS added a simple reward function implemented as a Lambda: +1 for the correct class and -1 for an incorrect one. RFT was launched on top of the SFT checkpoint on two ml.p5.48xlarge instances, and the run itself was kept short, at just 40 steps. The team also limited output length and introduced a KL penalty to keep the model from drifting too far from the behavior established during SFT. Across all of these stages, the SDK acts not just as a wrapper for launching jobs but as a single point for data preparation, training, logging, and deployment.
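
A reward of this kind fits in a few lines. The sketch below uses the standard Python Lambda handler signature, but the event and response shapes (`samples`, `response`, `reference`, `rewards`) are hypothetical; the real payload format expected by the RFT job is not given in this summary.

```python
def reward(predicted: str, reference: str) -> float:
    """Return +1.0 when the predicted class matches the reference, else -1.0."""
    return 1.0 if predicted.strip() == reference.strip() else -1.0


def lambda_handler(event, context):
    """Score a batch of model responses.

    Assumed (hypothetical) event shape:
        {"samples": [{"response": "HQ", "reference": "HQ"}, ...]}
    """
    samples = event["samples"]
    return {"rewards": [reward(s["response"], s["reference"]) for s in samples]}


# Local check with a hand-written event; no AWS resources are involved.
event = {
    "samples": [
        {"response": "HQ", "reference": "HQ"},
        {"response": "LQ_EDIT", "reference": "LQ_CLOSE"},
        {"response": "The label is HQ", "reference": "HQ"},
    ]
}
result = lambda_handler(event, None)
print(result)
# {'rewards': [1.0, -1.0, -1.0]}
```

Note that a strict comparison also penalizes verbose answers such as the third sample, which pushes the model toward the single-label format learned during SFT.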

What the Metrics Showed

The most useful part of the article is the numbers. The baseline Nova 2.0 reached only 13% exact match on a three-class task where random guessing would yield about 33.3%. Even when the verbosity of the responses is ignored and only the class label is extracted from the text, accuracy was 52.2%. AWS attributes this to two problems: the model tended to write long explanations instead of a single label, and it was biased toward HQ regardless of the actual quality of the question.
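
The gap between those two baseline figures comes from how a verbose answer is scored. The following sketch contrasts strict exact match with accuracy on a label extracted from free text. The regular expression, the "first label found wins" rule, and the toy predictions are assumptions; the exact extraction logic used by AWS is not described here.

```python
import re

# Match any of the three class names as a standalone word.
LABEL_RE = re.compile(r"\b(LQ_CLOSE|LQ_EDIT|HQ)\b")


def exact_match(prediction: str, gold: str) -> bool:
    """Strict scoring: the whole response must equal the label."""
    return prediction.strip() == gold


def extract_label(prediction: str):
    """Lenient scoring helper: return the first label mentioned, or None."""
    match = LABEL_RE.search(prediction)
    return match.group(1) if match else None


predictions = [
    "HQ",
    "This question is well researched, so the label is HQ.",
    "LQ_EDIT because the formatting is poor",
    "I think it is fine",
]
golds = ["HQ", "HQ", "LQ_CLOSE", "LQ_EDIT"]

em = sum(exact_match(p, g) for p, g in zip(predictions, golds)) / len(golds)
extracted_acc = sum(
    extract_label(p) == g for p, g in zip(predictions, golds)
) / len(golds)

print(em, extracted_acc)
# 0.25 0.5
```

The second prediction is correct in substance but fails exact match, which is the same effect that drags a verbose baseline below the random-guess rate.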

After a short SFT run, exact match rose to 77.2% and classification accuracy on extracted labels to 79.0%. RFT added a little more on top: exact match rose to 78.8%, quasi-EM to 80.6%, and F1 to 78.8%. The gain from the reinforcement stage was modest but consistent across almost all key metrics. AWS also notes that BLEU is almost useless for this kind of task: when the model answers with a single short label such as HQ or LQ_CLOSE, exact match and F1 matter more than n-gram overlap.
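
For a three-class task, F1 is computed per class and then averaged. The sketch below uses macro averaging, which is an assumption: this summary does not say which averaging scheme is behind the reported F1. The toy labels are illustrative only.

```python
def macro_f1(golds, preds, labels):
    """Average the per-class F1 scores, weighting every class equally."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(golds, preds))
        fp = sum(g != label and p == label for g, p in zip(golds, preds))
        fn = sum(g == label and p != label for g, p in zip(golds, preds))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["HQ", "LQ_EDIT", "LQ_CLOSE"]
golds = ["HQ", "HQ", "LQ_EDIT", "LQ_CLOSE"]
preds = ["HQ", "HQ", "HQ", "LQ_CLOSE"]  # one LQ_EDIT question mislabeled as HQ

score = macro_f1(golds, preds, labels)
print(f"{score:.3f}")
# 0.600
```

Accuracy on this toy set is 0.75, yet macro F1 drops to 0.600 because the LQ_EDIT class is never predicted. A metric like this makes a bias toward HQ visible.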

What This Means

AWS is trying to sell not just another model, but a shorter path to its practical customization. If Nova Forge SDK really covers validation, launching, monitoring, and deployment in one interface, then teams will find it easier to test hypotheses on niche datasets without a separate MLOps quest for each iteration.

