AWS Shows How to Build AI Agents on SageMaker and Test Models via MLflow
AWS released a practical breakdown on building AI agents with Strands Agents SDK and models deployed in SageMaker. The setup includes JumpStart for launching…
AI-processed from AWS Machine Learning Blog; edited by Hamidun News
On April 27, 2026, AWS published a practical breakdown of how to run AI agents on their own managed infrastructure, not just on fully managed services. The company demonstrated a combination of Strands Agents SDK, SageMaker AI, and Serverless MLflow, where an agent can be quickly assembled, deployed to an endpoint, observe its behavior in production, and compare several model variants without changing the overall architecture. For teams that prioritize control, predictable costs, and security requirements, this looks like an attempt to transform agent systems from an experimental layer into a normal MLOps process.
At the core of the approach is Strands Agents SDK, an open source framework for building agents from a model, prompt, and set of tools. In AWS's example, it first shows a basic scenario with a model in Bedrock, then transfers the same idea to models running in SageMaker AI. The key point is that Strands can work with SageMaker inference endpoints as a model provider if it supports an OpenAI-compatible chat completions API.
The demonstration uses two versions of Qwen3 from SageMaker JumpStart — 4B and 8B. The first is deployed as the primary endpoint, after which the agent gains access to tools like HTTP requests and a calculator, and can execute typical tasks on top of its own model infrastructure.
Why move agent logic to SageMaker at all if there are ready-made APIs on the market? AWS is betting on four arguments. First — infrastructure control: you can precisely choose instances, network settings, and scaling rules for the required latency and SLA. Second — flexibility with models: in addition to ready-made foundation models, you can use custom or fine-tuned variants, as well as open-source models. Third — more predictable economics for large workloads through dedicated endpoints and precise resource tuning. Fourth — a proper enterprise framework around agents: tracing, versioning, A/B tests, and audit, which are needed not in demos but in production.
AWS separately emphasizes observability. For this, SageMaker AI Serverless MLflow is used: the service automatically writes execution traces, agent steps, tool calls, and metrics, without forcing the team to manually instrument the code with custom telemetry. After enabling autolog, data flows into the MLflow interface, where you can view the list of runs, expand a specific trace, see the Agent Loop, a tree of spans, inputs and outputs of each step. This is important not just for debugging. This level of transparency is needed when an agent starts making decisions in sensitive business processes, and the team needs to understand exactly where it failed, why it chose a specific tool, and how its behavior changes after a model update.
The most practical part of the material is A/B testing between model variants. AWS shows how to attach two production variations to the same endpoint, in the example Qwen3 4B and Qwen3 8B, and initially split traffic between them 50/50. After that, you can either compare answers in the live stream or create two separate agents, each looking at its own target variant.
Next, MLflow GenAI evaluation is connected: the team collects a single set of test cases, sets expectations for facts and tools used, then runs both variants through the same scorers. The example uses both deterministic checks and LLM-as-a-judge metrics like correctness and relevance. This scenario turns model selection from a debate about feelings into a reproducible procedure: the new version doesn't just seem smarter, but passes the same tests, after which it can be gradually made the default by changing weights.
The conclusion is simple: AWS is not selling another agent SDK, but an engineering scheme in which an agent becomes a managed product component. If companies need their own models, their own perimeter, agent action auditing, and careful rollout of new versions, the Strands, SageMaker, and MLflow combination addresses this scenario much closer to enterprise reality than many quick demo stacks. For the market, this is another signal that the next competition in AI is no longer just about model quality, but about the quality of the infrastructure around it.
Want to stop reading about AI and start using it?
AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.