AWS open-sourced Agent-EvalKit: systematic evaluation of AI agents in six phases


Hamidun News Editorial

AI monitoring · AWS Machine Learning Blog

Jun 30, 2026 · 2 min

AI-processed from AWS Machine Learning Blog; edited by Hamidun News



AWS released Agent-EvalKit — an open-source tool (Apache 2.0) for systematic evaluation of AI agents. The framework integrates with Claude Code, Kiro CLI, and Kilo Code and takes an agent through six sequential verification phases.

Why Agent Evaluation Matters

Building an AI agent is relatively straightforward; understanding how well it performs is much harder. An agent can return plausible answers while calling unnecessary tools, spending orders of magnitude more tokens than needed, or skipping critical steps in its reasoning chain. A single final-answer metric such as accuracy is not enough here: an agent is a dynamic system in which the entire path to the result matters, not just the result itself. Tool logs, call order, and intermediate decisions all affect how reliable an agent is in production. This is why the AWS team created specialized evaluation infrastructure.
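To make the point about path-level signals concrete, here is a minimal sketch in plain Python. It is not Agent-EvalKit code: the `ToolCall` and `Trace` structures, the `tokens` field, and the tool names are all hypothetical, chosen only to show what a trace can reveal that a final-answer score cannot.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str    # tool the agent invoked (hypothetical names below)
    tokens: int  # tokens spent on this step (assumed to be recorded in the trace)


@dataclass
class Trace:
    answer: str
    calls: list[ToolCall] = field(default_factory=list)


def summarize(trace: Trace, needed_tools: set[str]) -> dict:
    """Collect signals about the agent's path, independent of the final answer."""
    counts = Counter(call.name for call in trace.calls)
    return {
        "total_tokens": sum(call.tokens for call in trace.calls),
        "unnecessary_tools": sorted(set(counts) - needed_tools),
        "skipped_tools": sorted(needed_tools - set(counts)),
        "repeated_tools": sorted(name for name, n in counts.items() if n > 1),
    }


trace = Trace(
    answer="Itinerary ready",
    calls=[
        ToolCall("search_flights", 120),
        ToolCall("get_weather", 80),
        ToolCall("search_flights", 130),
    ],
)
report = summarize(trace, needed_tools={"search_flights", "search_hotels"})
print(report)
```

In this example the answer looks fine on its own, yet the summary shows one tool that was never needed, one required tool that was skipped, and one tool that was called twice.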

Six Verification Phases

The framework sequentially runs an agent through six stages:

1. Task preparation: forming a set of test cases with input data, context, and reference answers
2. Agent execution: performing tasks in a controlled environment with full trace recording
3. Trajectory evaluation: checking whether the agent called the required tools in the correct order
4. Final answer evaluation: comparing the result against the reference by content, structure, and accuracy
5. Security analysis: checking for undesired behavior and scope violations
6. Report generation: aggregating metrics and forming a final score with category breakdown
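Of the stages above, trajectory evaluation is the easiest to illustrate in a few lines. The sketch below is an assumption about how such a check could work, not the framework's own implementation: it treats the required tools as an ordered subsequence that must appear in the recorded call list, with other calls allowed in between. The function name and tool names are hypothetical.

```python
def follows_required_order(called: list[str], required: list[str]) -> bool:
    """Return True if `required` occurs within `called` as an ordered subsequence."""
    remaining = iter(called)
    # `tool in remaining` consumes the iterator up to the match, so every
    # required tool has to be found after the previous one.
    return all(tool in remaining for tool in required)


good = follows_required_order(
    ["search_flights", "get_weather", "search_hotels"],
    ["search_flights", "search_hotels"],
)
bad = follows_required_order(
    ["search_hotels", "search_flights"],
    ["search_flights", "search_hotels"],
)
print(good, bad)
```

A stricter variant could forbid extra calls between the required ones; which rule is appropriate depends on the agent being evaluated.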

Each phase can be configured separately: run only trajectory evaluation, only the final report, or the full cycle.
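The idea of running a subset of phases can be sketched as a small pipeline runner. Everything here is hypothetical: the phase keys, the `run_phases` function, and the stub phases are illustrative and do not reflect Agent-EvalKit's actual configuration format, which the article does not describe.

```python
from typing import Callable, Optional

Phase = Callable[[dict], dict]

# Hypothetical phase keys, listed in the order the article gives.
PHASE_ORDER = ["prepare", "execute", "trajectory", "answer", "security", "report"]


def make_stub(name: str) -> Phase:
    """Build a placeholder phase that only records that it ran."""
    def phase(state: dict) -> dict:
        return {**state, name: "done"}
    return phase


def run_phases(phases: dict, selected: Optional[list] = None) -> dict:
    """Run the selected phases in canonical order; run all when `selected` is None."""
    chosen = set(PHASE_ORDER if selected is None else selected)
    unknown = chosen - set(PHASE_ORDER)
    if unknown:
        raise ValueError(f"unknown phases: {sorted(unknown)}")
    state: dict = {"completed": []}
    for name in PHASE_ORDER:
        if name in chosen:
            state = phases[name](state)
            state["completed"].append(name)
    return state


stubs = {name: make_stub(name) for name in PHASE_ORDER}
full = run_phases(stubs)
partial = run_phases(stubs, selected=["report", "trajectory"])
print(full["completed"])
print(partial["completed"])
```

Note that the runner keeps the canonical order even when phases are requested out of order. A real partial run would also need the artifacts of earlier phases, such as a stored trace, which this sketch does not model.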

Example: Travel Planning Agent

As a demonstration, AWS shows an agent built with the Strands Agents SDK and running on Amazon Bedrock. The agent takes a user request (for example, "Plan a seven-day trip to Tokyo with a $2000 budget"), searches for flights and hotels through external tools, analyzes attractions, and returns a final itinerary. Agent-EvalKit checks such an agent across all six phases: it verifies that the flight search tool was called before the hotel search, that the final answer contains specific dates and prices, and that the agent stayed within budget and did not invent nonexistent flights. Checks like these reveal errors that ordinary manual testing tends to miss.
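Some of the checks described for the travel agent can be approximated with simple rules. The sketch below is illustrative only and makes several assumptions: the tool names `search_flights` and `search_hotels` are invented, dates are expected in ISO format, and every dollar amount in the answer is treated as a cost item. Detecting invented flights is not covered, since that would require comparing the answer against the recorded tool outputs.

```python
import re


def check_itinerary(tool_calls: list[str], answer: str, budget: float) -> dict[str, bool]:
    """Rule-based checks on a travel agent's trace and final answer."""
    # Dollar amounts such as $850 or $1,050.50; thousands separators are stripped.
    raw_prices = re.findall(r"\$([\d,]*\d(?:\.\d+)?)", answer)
    prices = [float(p.replace(",", "")) for p in raw_prices]
    # ISO dates such as 2026-09-01 (an assumed output format).
    dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", answer)
    flights_first = (
        "search_flights" in tool_calls
        and "search_hotels" in tool_calls
        and tool_calls.index("search_flights") < tool_calls.index("search_hotels")
    )
    return {
        "flights_before_hotels": flights_first,
        "has_dates": bool(dates),
        "has_prices": bool(prices),
        "within_budget": bool(prices) and sum(prices) <= budget,
    }


result = check_itinerary(
    tool_calls=["search_flights", "search_hotels", "find_attractions"],
    answer="Flight on 2026-09-01: $850. Hotel through 2026-09-08: $1,050.",
    budget=2000,
)
print(result)
```

Rules like these are brittle against free-form answers; they are meant to show the kind of property being checked, not a robust evaluator.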

Integration with AI Assistants

What sets Agent-EvalKit apart from comparable tools is its deep integration with AI coding assistants. Claude Code, Kiro CLI, and Kilo Code can run evaluations directly inside the developer's working environment, without switching to a separate platform or setting up a separate pipeline. The framework is distributed under the Apache 2.0 license. The source code is available on GitHub, and the documentation includes ready-made examples for several popular AI frameworks.

"We wanted to create evaluation infrastructure that developers could plug in within minutes, without building it from scratch," write the authors in the AWS Machine Learning Blog.

What This Means

The arrival of a standardized evaluation tool is an important step toward production use of AI agents. Without a way to systematically measure an agent's performance on real tasks, it is difficult to justify deploying it in critical business processes. Agent-EvalKit offers a concrete methodology in place of ad hoc manual testing.

Hamidun News

AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
