AWS introduces ActorSimulator for testing multi-turn AI agents in Strands Evals

AI-processed from AWS Machine Learning Blog; edited by Hamidun News

AWS has introduced ActorSimulator, a component of the Strands Evaluations SDK that helps test AI agents in multi-turn conversations with realistically simulated users. Instead of static question-and-answer pairs, teams get controlled dialogues with personas, goals, and natural branching as the conversation unfolds.

Why it's hard

Testing an agent in a single turn is relatively straightforward: there's an input, there's a response, there's a set of metrics like helpfulness or correct tool usage. But in a real product, a conversation almost never ends with one message. The user clarifies their request, changes direction, brings the dialogue back to the original task, or gets frustrated if the agent missed an important detail.

Because of this, the next turn can't be pre-recorded in a test dataset: it depends on everything that was said before. Manual testing solves the problem only partially. A team can run scenarios by hand, but re-running hundreds of multi-turn conversations after every agent update quickly becomes unmanageable.

Replacing manual runs with a simple prompt like "play the user" also yields weak results: behavior varies from run to run, the persona breaks down, and comparing scores between agent versions becomes difficult. AWS proposes a more structured approach in which realism does not come at the cost of repeatability.

How the simulator works

ActorSimulator builds a simulated user around a test case. It takes an initial request and, optionally, a task description, for example booking a trip within a budget. An LLM then constructs a character profile: communication style, level of expertise, patience, context, and final goal. After that, the simulator conducts the dialogue turn by turn, keeps the conversation history in memory, and generates each next message not from a template but according to how that specific user would respond. AWS highlights several practical mechanisms here:

  • Auto-generation of a stable user profile for a specific scenario
  • Tracking the conversation goal and checking whether it's been achieved
  • A stop signal if the task is solved, the agent is stuck, or the turn limit is exhausted
  • Structured explanation of why the simulator asked that particular next question
  • The ability to plug in custom profiles to check specific user segments
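To make the idea of a stable profile concrete, here is a minimal sketch of a persona held as an immutable record and rendered into one fixed instruction block that is reused on every turn. The `ActorProfile` class, its field names, and `render_system_prompt` are hypothetical and chosen for illustration; they are not the Strands Evals API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActorProfile:
    """Hypothetical persona record; field names are illustrative, not the SDK's."""
    communication_style: str
    expertise: str
    patience: str
    context: str
    goal: str


def render_system_prompt(profile: ActorProfile) -> str:
    """Render the profile into a fixed instruction block reused on every turn."""
    return "\n".join([
        "You are role-playing a user talking to an AI agent.",
        f"Communication style: {profile.communication_style}",
        f"Expertise: {profile.expertise}",
        f"Patience: {profile.patience}",
        f"Context: {profile.context}",
        f"Goal: {profile.goal}",
        "Stay in character and stop once the goal is met.",
    ])


# A hand-written custom profile for one user segment (assumed example values).
impatient_expert = ActorProfile(
    communication_style="terse, direct",
    expertise="expert traveler",
    patience="low",
    context="booking a work trip on short notice",
    goal="book a trip within a fixed budget",
)
prompt = render_system_prompt(impatient_expert)
```

Freezing the record means the persona cannot drift between turns, which is one simple way to keep runs comparable across agent versions.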

This is more than a matter of elegant scenarios. If an agent answers only part of a request, the simulator follows up on the missing part rather than veering off into an unrelated topic. If the agent asks for clarification, the answer comes from within the chosen persona. In addition, each turn is accompanied by structured reasoning: you can see whether the user is filling in a gap, expressing confusion, or trying to steer the conversation back to the goal. This level of transparency is especially useful for debugging.
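The turn loop, the stop conditions, and the per-turn reasoning described above can be sketched with plain Python. This is a toy under stated assumptions: `ScriptedUser` is a rule-based stand-in for the LLM-backed simulated user, `toy_agent` is a placeholder agent, and "stuck" is approximated as the agent repeating its previous reply. None of these names come from the Strands Evals SDK.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class UserTurn:
    message: str    # what the simulated user says next
    reasoning: str  # why the simulated user said it
    stop: bool      # True when the simulated user ends the conversation


@dataclass
class ScriptedUser:
    """Rule-based stand-in for an LLM-backed simulated user (hypothetical)."""
    required: List[str]  # details the agent must cover for the goal to count
    history: List[Tuple[str, str]] = field(default_factory=list)

    def act(self, agent_message: str) -> UserTurn:
        self.history.append(("agent", agent_message))
        seen = " ".join(t.lower() for role, t in self.history if role == "agent")
        missing = [item for item in self.required if item not in seen]
        if not missing:
            turn = UserTurn("Thanks, that covers it.", "goal achieved", True)
        else:
            # Follow up on the first uncovered detail instead of changing topic.
            turn = UserTurn(
                f"What about the {missing[0]}?",
                f"agent has not covered: {missing[0]}",
                False,
            )
        self.history.append(("user", turn.message))
        return turn


def run_dialogue(agent: Callable[[str], str], user: ScriptedUser,
                 first_message: str, max_turns: int):
    """Alternate agent and user until the goal is met, the agent is stuck,
    or the turn limit is exhausted."""
    transcript = []
    message = first_message
    previous_reply = None
    for _ in range(max_turns):
        reply = agent(message)
        if reply == previous_reply:
            transcript.append((message, reply, "agent repeated its previous reply"))
            return transcript, "agent_stuck"
        previous_reply = reply
        turn = user.act(reply)
        transcript.append((message, reply, turn.reasoning))
        if turn.stop:
            return transcript, "goal_reached"
        message = turn.message
    return transcript, "max_turns"


def toy_agent(message: str) -> str:
    """Placeholder agent: answers about the hotel only when asked."""
    if "hotel" in message.lower():
        return "The hotel is 90 USD per night."
    return "I found a flight for 300 USD."


user = ScriptedUser(required=["flight", "hotel"])
transcript, outcome = run_dialogue(
    toy_agent, user, "Plan a trip to Lisbon under 1000 USD.", max_turns=5
)
```

Each transcript entry pairs the user message and agent reply with the reason for the follow-up, so a failed run shows which detail the agent never covered.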

Integration into the pipeline

AWS shows that you can get started with just a few lines of code via the `strands-agents-evals` package. In the example, a travel assistant is tested: a Case is defined with a user request, then ActorSimulator drives a multi-turn dialogue until the goal is reached, it becomes clear the agent cannot complete the task, or the `max_turns` limit is hit. The resulting transcript can then be evaluated as one complete multi-turn session rather than as a set of isolated responses.

For production evaluation, this setup is combined with OpenTelemetry and Strands Evals session mapping. AWS suggests collecting spans on each turn, including tool calls, model invocations, and timings, and then passing the entire trajectory to evaluators like HelpfulnessEvaluator and GoalSuccessRateEvaluator. You can also define custom profiles manually, for instance an impatient expert or a novice, and see where the agent consistently gets lost.
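As a rough illustration of scoring whole trajectories rather than single replies, the sketch below aggregates simulated sessions by persona. The `TurnSpan` and `Session` records, the sample data, and both functions are assumptions made for this example; they do not reproduce OpenTelemetry span objects or the HelpfulnessEvaluator and GoalSuccessRateEvaluator interfaces.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TurnSpan:
    """Hypothetical per-turn record: tools called and wall-clock latency."""
    tool_calls: List[str]
    latency_ms: float


@dataclass
class Session:
    """One simulated conversation, tagged with the persona that drove it."""
    persona: str
    turns: List[TurnSpan]
    goal_reached: bool


def goal_success_rate(sessions: List[Session]) -> float:
    """Fraction of sessions in which the simulated user reached the goal."""
    if not sessions:
        return 0.0
    return sum(s.goal_reached for s in sessions) / len(sessions)


def summarize_by_persona(sessions: List[Session]) -> Dict[str, Dict[str, float]]:
    """Group sessions by persona and report success rate and average length."""
    groups: Dict[str, List[Session]] = {}
    for session in sessions:
        groups.setdefault(session.persona, []).append(session)
    return {
        persona: {
            "goal_success_rate": goal_success_rate(group),
            "avg_turns": sum(len(s.turns) for s in group) / len(group),
        }
        for persona, group in groups.items()
    }


# Made-up sample data, only to exercise the functions.
sessions = [
    Session("novice", [TurnSpan(["search_flights"], 820.0), TurnSpan([], 410.0)], True),
    Session("novice", [TurnSpan(["search_flights"], 790.0)] * 3, False),
    Session("impatient_expert", [TurnSpan(["search_flights", "book_hotel"], 1200.0)], True),
]
summary = summarize_by_persona(sessions)
```

Breaking the numbers down by persona is what makes a statement like "the agent consistently gets lost with novices" checkable between releases.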

In its recommendations, AWS suggests starting with 3–5 turns for simple tasks and 8–10 for longer scenarios.

What it means

The AI agent market is quickly moving away from one-off successful demos toward systematic validation of real user trajectories. ActorSimulator matters because it turns multi-turn dialogue testing from a manual chore into part of a regular evaluation pipeline, with clear personas, measurable goals, and traces that help catch regressions before they reach production.
