AWS open-sourced Agent-EvalKit: systematic evaluation of AI agents in six phases
AI-processed from AWS Machine Learning Blog; edited by Hamidun News
AWS released Agent-EvalKit — an open-source tool (Apache 2.0) for systematic evaluation of AI agents. The framework integrates with Claude Code, Kiro CLI, and Kilo Code and takes an agent through six sequential verification phases.
Why Agent Evaluation Matters
Building an AI agent is straightforward; understanding how well it performs is much harder. An agent can return plausible answers while calling unnecessary tools, spending orders of magnitude more tokens than needed, or skipping critical steps in its reasoning chain. Standard metrics such as accuracy fall short here: an agent is a dynamic system in which the entire path matters, not just the final result. Tool logs, call order, and intermediate decisions all affect an agent's reliability in production. This is why the AWS team built dedicated evaluation infrastructure.
Six Verification Phases
The framework sequentially runs an agent through six stages:
- Task preparation — forming a set of test cases with input data, context, and reference answers
- Agent execution — performing tasks in a controlled environment with full trace recording
- Trajectory evaluation — checking whether the agent called the required tools in the correct order
- Final answer evaluation — comparing the result against the reference by content, structure, and accuracy
- Security analysis — checking for undesired behavior and scope violations
- Report generation — aggregating metrics and forming a final score with category breakdown
Each phase can be configured and run separately: only trajectory evaluation, only the final report, or the full cycle.
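The six phases and the option to run only some of them can be sketched in a few lines of plain Python. This is a minimal illustration, not Agent-EvalKit's actual API: `EvalState`, `PHASES`, `run_pipeline`, the phase functions, and the toy calculator test case are all hypothetical names and data invented for this example, and the "agent" is a stub that returns a fixed trace.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: none of these names come from Agent-EvalKit itself.
ALLOWED_TOOLS = {"calculator"}  # assumed tool scope for the security phase


@dataclass
class EvalState:
    cases: list = field(default_factory=list)   # test cases with references
    traces: list = field(default_factory=list)  # recorded agent runs
    scores: dict = field(default_factory=dict)  # metric name -> value in [0, 1]
    report: str = ""


def _mean(flags):
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def prepare_tasks(state):
    state.cases = [
        {"input": "2 + 2", "expected_tools": ["calculator"], "reference": "4"}
    ]


def run_agent(state):
    # Stand-in for a real agent call: one fixed trace per test case.
    state.traces = [{"tools": ["calculator"], "answer": "4"} for _ in state.cases]


def evaluate_trajectory(state):
    pairs = zip(state.cases, state.traces)
    state.scores["trajectory"] = _mean(
        t["tools"] == c["expected_tools"] for c, t in pairs
    )


def evaluate_answer(state):
    pairs = zip(state.cases, state.traces)
    state.scores["answer"] = _mean(
        t["answer"].strip() == c["reference"] for c, t in pairs
    )


def analyze_security(state):
    # A trace passes if every tool it used is inside the allowed scope.
    state.scores["security"] = _mean(
        set(t["tools"]) <= ALLOWED_TOOLS for t in state.traces
    )


def build_report(state):
    state.report = "\n".join(
        f"{name}: {value:.2f}" for name, value in sorted(state.scores.items())
    )


# Phases in the order the article lists them.
PHASES = [
    ("prepare", prepare_tasks),
    ("execute", run_agent),
    ("trajectory", evaluate_trajectory),
    ("answer", evaluate_answer),
    ("security", analyze_security),
    ("report", build_report),
]


def run_pipeline(state, only=None):
    """Run all phases in order, or just the ones named in `only`."""
    for name, phase in PHASES:
        if only is None or name in only:
            phase(state)
    return state
```

Calling `run_pipeline(EvalState())` runs the full cycle, while `run_pipeline(state, only={"trajectory"})` scores only the trajectories of a state that already holds cases and traces from an earlier run.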
Example: Travel Planning Agent
As a demonstration, AWS shows an agent written with the Strands Agents SDK and running on Amazon Bedrock. The agent takes a user request, for example "Plan a seven-day trip to Tokyo with a $2000 budget", searches for flights and hotels through external tools, analyzes attractions, and returns a final itinerary. Agent-EvalKit checks such an agent across all six phases. It verifies that the flight search tool was called before the hotel search, that the final answer contains specific dates and prices, and that the agent stayed within budget and did not invent nonexistent flights. Checks like these surface errors that ordinary manual testing tends to miss.
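The checks described for the travel agent can be expressed as small functions over a recorded trace. The sketch below is illustrative only: the trace layout, the tool names `search_flights` and `search_hotels`, the flight numbers, and every function name are assumptions made for this example, not the format Agent-EvalKit or the Strands Agents SDK actually uses.

```python
import re

# Hypothetical trace format and tool names, invented for illustration.
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # ISO dates such as 2025-04-10
PRICE = re.compile(r"\$\d+(?:\.\d{2})?")      # dollar amounts such as $850


def called_before(tool_calls, first, second):
    """True if both tools were called and `first` was first called earlier."""
    if first not in tool_calls or second not in tool_calls:
        return False
    return tool_calls.index(first) < tool_calls.index(second)


def within_budget(items, budget):
    return sum(item["price"] for item in items) <= budget


def has_dates_and_prices(answer):
    return bool(DATE.search(answer)) and bool(PRICE.search(answer))


def no_invented_flights(flights_in_answer, flight_results):
    """Every flight in the answer must come from the search tool's results."""
    return set(flights_in_answer) <= set(flight_results)


def evaluate(trace, budget):
    return {
        "flights_before_hotels": called_before(
            trace["tool_calls"], "search_flights", "search_hotels"
        ),
        "within_budget": within_budget(trace["booked"], budget),
        "dates_and_prices": has_dates_and_prices(trace["answer"]),
        "no_invented_flights": no_invented_flights(
            trace["flights_in_answer"], trace["flight_results"]
        ),
    }


# Made-up example trace for the Tokyo request with a $2000 budget.
trace = {
    "tool_calls": ["search_flights", "search_hotels", "find_attractions"],
    "flight_results": ["JL005", "NH109"],
    "flights_in_answer": ["JL005"],
    "booked": [
        {"item": "flight JL005", "price": 850},
        {"item": "hotel, 7 nights", "price": 980},
    ],
    "answer": "Depart 2025-04-10 on JL005 ($850), return 2025-04-17. Hotel: $980.",
}

checks = evaluate(trace, budget=2000)
```

Each entry in `checks` is a boolean, so a failing run points directly at the violated rule, for example an itinerary that names a flight the search tool never returned.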
Integration with AI Assistants
What sets Agent-EvalKit apart from comparable tools is its deep integration with AI coding assistants. Claude Code, Kiro CLI, and Kilo Code can run evaluations directly inside the developer's working environment, without switching to a separate platform or setting up a separate pipeline. The framework is distributed under the Apache 2.0 license. The source code is available on GitHub, and the documentation includes ready-made examples for several popular AI frameworks.
"We wanted to create evaluation infrastructure that developers could plug in within minutes, without building it from scratch," the authors write on the AWS Machine Learning Blog.
What This Means
The arrival of a standardized evaluation tool is an important step toward production use of AI agents. Without a way to systematically measure how an agent performs on real tasks, it is difficult to justify deploying it in critical business processes. Agent-EvalKit offers a concrete methodology in place of manual testing.