AWS Machine Learning Blog→ original

AWS outlines five patterns for evaluating deep AI agents

AWS published a guide for evaluating deep AI agents. The article covers five evaluation patterns and demonstrates how to set up offline tests with pytest and La

Source: AWS Machine Learning Blog. Collage: Hamidun News.

AWS and LangSmith have published a guide to evaluating deep AI agents: autonomous systems that solve multi-step tasks independently, making inferences and decisions along the way.

Five Evaluation Criteria

AWS's key point: a single metric is not enough to evaluate an agent. The company proposes five evaluation dimensions, each revealing a different aspect of the agent's behavior:

  • Result correctness: did the agent give the correct final answer to the user's question?
  • Solution trajectory: which path did the agent choose, are the steps logical, and are there obvious errors in reasoning?
  • Tool management: which APIs, services, and databases did the agent call, and did it use them efficiently?
  • Security and compliance: did the agent adhere to access policies and stay within the boundaries of permitted actions?
  • Decision transparency: can a developer understand the logic behind each of the agent's decisions?
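
As a rough illustration, the five criteria could be recorded as one report per agent run. This is a minimal sketch and not AWS's or LangSmith's implementation: the AgentRun record, its field names, the tool names, and the pass/fail rules are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical record of one agent run (field names are assumptions)."""
    final_answer: str
    tool_calls: list[str] = field(default_factory=list)
    rationales: list[str] = field(default_factory=list)  # one note per tool call


def is_subsequence(expected: list[str], actual: list[str]) -> bool:
    """True if `expected` appears inside `actual` in order (gaps allowed)."""
    remaining = iter(actual)
    return all(step in remaining for step in expected)


def evaluate(run, expected_answer, expected_path, allowed_tools, max_calls):
    """Return one boolean per evaluation criterion."""
    return {
        # Result correctness: exact match against the expected answer.
        "correctness": run.final_answer == expected_answer,
        # Solution trajectory: the key steps happened in the expected order.
        "trajectory": is_subsequence(expected_path, run.tool_calls),
        # Tool management: the agent stayed within a call budget.
        "tool_use": len(run.tool_calls) <= max_calls,
        # Security and compliance: every tool called is on the allowlist.
        "compliance": set(run.tool_calls) <= set(allowed_tools),
        # Decision transparency: every call has a non-empty rationale.
        "transparency": len(run.rationales) == len(run.tool_calls)
        and all(note.strip() for note in run.rationales),
    }


good_run = AgentRun(
    final_answer="42",
    tool_calls=["list_tables", "get_schema", "run_query"],
    rationales=["find candidate tables", "inspect columns", "execute the query"],
)
report = evaluate(
    good_run,
    expected_answer="42",
    expected_path=["get_schema", "run_query"],
    allowed_tools={"list_tables", "get_schema", "run_query"},
    max_calls=5,
)
print(report)
```

Keeping the five results separate, rather than folding them into one score, makes it visible when an agent reaches the right answer through a disallowed tool or an illogical path.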

In early prototypes, the focus is on correctness and logical consistency. In a production system, especially if it's critical, the priority shifts to security, monitoring, and the ability to explain each agent decision.

Offline Testing and Live Monitoring

AWS describes a two-level approach: pre-deployment control and post-deployment control. The first level is offline testing during development. You write tests in pytest that give the agent predefined inputs and check whether it produces the correct answer. This is classical unit testing applied to AI systems: a set of questions, the expected results, and a check that the two match.
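
An offline test of this kind might look like the sketch below. It is an assumption-laden example rather than the code from the AWS guide: run_agent is a stand-in that returns canned SQL, and the questions, table names, and normalize helper are invented for illustration. pytest collects any function whose name starts with test_, so plain assertions are enough here.

```python
# Save as test_agent_offline.py and run with: pytest test_agent_offline.py


def run_agent(question: str) -> str:
    """Stand-in for the real agent call (assumption: it returns SQL text)."""
    canned = {
        "How many orders are there?": "SELECT COUNT(*)  FROM orders;",
        "List all customer names": "select name\nfrom customers",
    }
    return canned.get(question, "")


def normalize(sql: str) -> str:
    """Make the comparison insensitive to case, whitespace and a trailing ';'.

    Note: lowercasing also changes string literals, which is acceptable for
    this sketch but too crude for queries that filter on text values.
    """
    return " ".join(sql.lower().split()).rstrip(";").strip()


# Each case pairs a predefined input with the expected result.
CASES = [
    ("How many orders are there?", "SELECT COUNT(*) FROM orders"),
    ("List all customer names", "SELECT name FROM customers"),
]


def test_agent_matches_expected_answers():
    failures = [
        (question, run_agent(question))
        for question, expected in CASES
        if normalize(run_agent(question)) != normalize(expected)
    ]
    assert not failures, f"unexpected answers: {failures}"
```

Exact string comparison is the simplest possible check. For agents whose correct output can be phrased in many ways, comparing the query results instead of the query text is usually more robust.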

LangSmith supplements this with tracing. While the agent runs, LangSmith records every step: which sub-questions the agent posed to itself, which services it called, and how it moved from one step to the next. If the result is incorrect, you can see exactly where the error occurred and fix it.
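
The idea of step-level tracing can be shown with the standard library alone. The sketch below is not the LangSmith API; the Trace class, its step context manager, and the step names are hypothetical and only illustrate how recording each step makes the failing one easy to locate.

```python
import time
from contextlib import contextmanager


class Trace:
    """Hypothetical in-memory trace: one record per agent step."""

    def __init__(self):
        self.steps = []

    @contextmanager
    def step(self, name, **inputs):
        record = {"name": name, "inputs": inputs, "error": None}
        start = time.perf_counter()
        try:
            yield record  # the caller may attach an "output" to the record
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every step is recorded.
            record["seconds"] = time.perf_counter() - start
            self.steps.append(record)

    def first_failure(self):
        """Return the first step that raised, or None if all succeeded."""
        return next((s for s in self.steps if s["error"]), None)


trace = Trace()

with trace.step("plan", question="top customers by sales") as rec:
    rec["output"] = "need tables: customers, sales"

try:
    with trace.step("run_query", sql="SELEC name FROM customers"):
        raise ValueError("syntax error near SELEC")  # simulated tool failure
except ValueError:
    pass  # the agent would normally retry or report here

failed = trace.first_failure()
print(failed["name"], failed["error"])
```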

The second level activates after production deployment. When the agent works with real users, LangSmith continues observation. The system tracks metrics in real time: request response time, error percentage, execution success rate, duration of each intermediate step. If metrics start degrading, an alert triggers automatically.
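
The alerting logic described here can be sketched as a rolling window over recent requests. This is an illustrative example, not LangSmith's monitoring: the RollingMonitor class, the window size, and the thresholds are assumptions.

```python
import math
from collections import deque


class RollingMonitor:
    """Hypothetical monitor over the most recent `window` requests."""

    def __init__(self, window=100, max_error_rate=0.05, max_p95_seconds=2.0):
        self.samples = deque(maxlen=window)  # (latency_seconds, succeeded)
        self.max_error_rate = max_error_rate
        self.max_p95_seconds = max_p95_seconds

    def record(self, seconds: float, ok: bool) -> None:
        self.samples.append((seconds, ok))

    def error_rate(self) -> float:
        if not self.samples:
            return 0.0
        failures = sum(1 for _, ok in self.samples if not ok)
        return failures / len(self.samples)

    def p95_latency(self) -> float:
        """95th percentile latency using the nearest-rank method."""
        if not self.samples:
            return 0.0
        ordered = sorted(seconds for seconds, _ in self.samples)
        index = max(0, math.ceil(0.95 * len(ordered)) - 1)
        return ordered[index]

    def alerts(self) -> list[str]:
        """Return a message for every threshold currently exceeded."""
        messages = []
        if self.error_rate() > self.max_error_rate:
            messages.append(f"error rate {self.error_rate():.0%} above threshold")
        if self.p95_latency() > self.max_p95_seconds:
            messages.append(f"p95 latency {self.p95_latency():.2f}s above threshold")
        return messages


monitor = RollingMonitor(window=10)
for _ in range(9):
    monitor.record(0.1, ok=True)
monitor.record(0.1, ok=False)  # one failed request out of ten
print(monitor.alerts())
```

A bounded window means old requests drop out automatically, so the alert clears once the agent recovers.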

Text-to-SQL Agent as a Complete Example

AWS built a demonstration agent that translates natural-language requests into SQL queries. A user writes, "Show me the top 10 customers by sales volume this quarter"; the agent parses the request, forms a SQL command, executes it against the database, and returns a results table. The example exercises all five evaluation criteria: correctness of the final result, logic of the steps, choice of tools (which tables to query), security (staying within the access boundaries of the available data), and the ability to understand why the agent formed this particular SQL command.
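
The security criterion for such an agent can be illustrated with an in-memory SQLite database. The sketch below is not the AWS demo: the schema, the ALLOWED_TABLES set, and the check_sql helper are assumptions, the "this quarter" date filter is left out for brevity, and the regex check is deliberately crude. In a real system, database-level permissions should be the actual enforcement mechanism.

```python
import re
import sqlite3

ALLOWED_TABLES = {"customers", "sales"}  # assumption: what the agent may read


def check_sql(sql: str) -> None:
    """Reject anything that is not a single SELECT over allowed tables."""
    statement = sql.strip().rstrip(";")
    if ";" in statement:
        raise PermissionError("multiple statements are not allowed")
    if not re.match(r"(?i)^select\b", statement):
        raise PermissionError("only SELECT statements are allowed")
    referenced = {
        name.lower()
        for name in re.findall(
            r"(?i)\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)", statement
        )
    }
    blocked = referenced - ALLOWED_TABLES
    if blocked:
        raise PermissionError(f"tables not permitted: {sorted(blocked)}")


def run_checked(conn: sqlite3.Connection, sql: str):
    """Validate the generated SQL first, then execute it."""
    check_sql(sql)
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (customer_id INTEGER, amount REAL);
    CREATE TABLE salaries (employee TEXT, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO sales VALUES (1, 100.0), (1, 50.0), (2, 120.0);
    """
)

# SQL the agent might generate for "top 10 customers by sales volume".
generated_sql = """
SELECT c.name, SUM(s.amount) AS total
FROM customers AS c JOIN sales AS s ON s.customer_id = c.id
GROUP BY c.name ORDER BY total DESC LIMIT 10
"""

rows = run_checked(conn, generated_sql)
print(rows)
```

The same check rejects a query such as SELECT * FROM salaries with a PermissionError, because that table is outside the allowlist.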

The agent is deployed on Amazon Bedrock, a managed cloud service for working with large language models. Bedrock takes care of infrastructure scaling, fault tolerance, and security compliance, so the developer can concentrate on agent logic.

What This Means

Until now, evaluating complex AI systems has been more art than science: you run the agent, look at the result, and guess why it behaved the way it did. AWS and LangSmith bring an engineering approach. When you can see the full trace of the agent's decisions and verify it step by step, you can not only catch an error but also prevent it during development. For large and critical systems, where the agent manages payments, controls access to confidential data, or makes important business decisions, this moves from "nice to have" to "mandatory."

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.