ServiceNow introduced EVA — a new framework for evaluating voice AI agents

AI-processed from Hugging Face Blog; edited by Hamidun News

ServiceNow has unveiled EVA, a framework for end-to-end evaluation of voice AI agents. It attempts to measure not only whether a task is completed but also how smooth the conversation is for the user. The project was published on the Hugging Face blog on March 24, 2026, along with an open dataset, code, and initial results for 20 systems.

Why Existing Tests Fall Short

Most existing benchmarks for voice AI test individual components one at a time: speech recognition, synthesis quality, response latency, or tool-calling ability. In practice, this is insufficient. Users don't interact with speech-to-text (STT), text-to-speech (TTS), or a large language model (LLM) in isolation. They speak with a single agent that must understand the request, maintain context, invoke tools correctly, and complete the task without confusion in a live dialogue.

This is why the EVA authors propose evaluating a voice agent as a complete product. In a phone scenario, even a small error quickly ruins the entire experience: a misheard confirmation code makes good model logic useless, a long list of options is hard to follow when read aloud, and an extra pause prompts the user to ask for clarification or abandon the call. Component-level metrics often miss these failures because they evaluate each part in isolation, outside the overall user scenario.

How EVA Works

EVA runs an end-to-end test of a multi-turn spoken conversation. The system simulates a real phone call between a voice agent and a user bot that acts according to a given goal and role. The agent must use tools, follow scenario rules, and reach a verifiable final state. In the initial release, the authors published a synthetic airline dataset with 50 scenarios and 15 tools, ranging from flight rebooking to cancellations, standby, and passenger vouchers.

  • A user simulator sets the caller's goal, behavior, and speaking style
  • The voice agent under test takes part in the call over a real audio stream
  • A tool executor returns deterministic answers and updates the state of the scenario database
  • Validators filter out low-quality runs without manual annotation
  • A set of metrics analyzes the conversation recording, transcript, and tool call logs
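
To make the idea of a deterministic tool executor and a verifiable final state more concrete, here is a minimal sketch in Python. It is not EVA's code: the database layout, the `rebook_flight` tool, the confirmation codes, and the flight numbers are all hypothetical and chosen only for illustration.

```python
import copy

# Hypothetical scenario database; field names are illustrative only.
INITIAL_DB = {
    "reservations": {
        "ABC123": {"passenger": "J. Doe", "flight": "SN100", "status": "confirmed"},
    }
}


def rebook_flight(db, confirmation_code, new_flight):
    """Hypothetical tool: the same inputs always give the same result."""
    reservation = db["reservations"].get(confirmation_code)
    if reservation is None:
        return {"ok": False, "error": "reservation_not_found"}
    reservation["flight"] = new_flight
    return {"ok": True, "flight": new_flight}


TOOLS = {"rebook_flight": rebook_flight}


def run_tool_calls(initial_db, tool_calls):
    """Replay the agent's tool calls against a fresh copy of the database."""
    db = copy.deepcopy(initial_db)
    results = [TOOLS[name](db, **args) for name, args in tool_calls]
    return db, results


def reached_expected_state(final_db, expected_db):
    """Task success is a plain comparison of final and expected state."""
    return final_db == expected_db


# Expected final state for a "move the booking to SN200" scenario (assumed).
expected = copy.deepcopy(INITIAL_DB)
expected["reservations"]["ABC123"]["flight"] = "SN200"

good_calls = [("rebook_flight", {"confirmation_code": "ABC123", "new_flight": "SN200"})]
# One misheard letter in the code: the lookup fails and nothing changes.
bad_calls = [("rebook_flight", {"confirmation_code": "ABD123", "new_flight": "SN200"})]

good_db, _ = run_tool_calls(INITIAL_DB, good_calls)
bad_db, bad_results = run_tool_calls(INITIAL_DB, bad_calls)

print(reached_expected_state(good_db, expected))  # True
print(reached_expected_state(bad_db, expected))   # False
```

Because the tools are deterministic and every run starts from a copy of the same database, the outcome can be checked by comparing states rather than by judging the conversation by hand.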

EVA has two main aggregate scores. EVA-A measures accuracy: did the agent reach the correct result, did it invent policies, did it distort important entities like flight numbers or amounts? EVA-X measures user experience: was the response short enough for a spoken channel, did the conversation move forward without repetition, and did the agent speak at the right time? The authors also calculate pass@3 and pass^3 to see not only the best run, but also behavioral stability across multiple attempts in the same scenario.
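
The article does not spell out how EVA computes pass@3 and pass^3. The sketch below assumes the common definitions: with three runs per scenario, pass@3 counts a scenario as solved if at least one run succeeds, and pass^3 only if all three succeed. The scenario names and outcomes are invented for illustration and are not EVA results.

```python
def pass_at_k(runs):
    """True if at least one of the runs succeeded (best-case view)."""
    return any(runs)


def pass_hat_k(runs):
    """True only if every run succeeded (consistency view)."""
    return all(runs)


def aggregate(results):
    """Average both indicators over all scenarios."""
    n = len(results)
    return {
        "pass@3": sum(pass_at_k(runs) for runs in results.values()) / n,
        "pass^3": sum(pass_hat_k(runs) for runs in results.values()) / n,
    }


# Hypothetical outcomes: three runs per scenario, True means task success.
results = {
    "scenario_a": [True, True, True],
    "scenario_b": [True, False, False],
    "scenario_c": [False, False, False],
    "scenario_d": [False, True, True],
}

scores = aggregate(results)
print(scores)  # {'pass@3': 0.75, 'pass^3': 0.25}
```

In this toy example three of four scenarios are solved at least once, but only one is solved every time, which is the kind of gap between best-run and consistent performance that the two scores are meant to expose.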

What the Tests Showed

The team ran 20 systems through EVA, both proprietary and open-source, cascaded and audio-native, and reached a key conclusion: there is a persistent trade-off between accuracy and conversation quality. No configuration dominates on both axes at once. Some agents complete tasks more reliably but make the conversation less pleasant; others sound more natural but make more errors at critical steps and in long multi-turn scenarios. This makes model comparison noticeably more informative than a typical binary pass/fail.

"Agents that are better at task completion often provide worse user experience, and vice versa."
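
The statement that no configuration dominates on both axes can be read as a Pareto comparison. The sketch below shows one way to check it, assuming each system is summarized by an (accuracy, experience) pair where higher is better. The system names and scores are made up for illustration and are not EVA's published results.

```python
def dominates(a, b):
    """a dominates b: at least as good on both axes, strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(scores):
    """Names of systems that no other system dominates."""
    return sorted(
        name
        for name, score in scores.items()
        if not any(
            dominates(other, score)
            for other_name, other in scores.items()
            if other_name != name
        )
    )


def overall_winner(scores):
    """Name of a system that dominates all others, or None if there is none."""
    for name, score in scores.items():
        if all(dominates(score, other) for o, other in scores.items() if o != name):
            return name
    return None


# Hypothetical (accuracy, experience) scores, not real measurements.
scores = {
    "system_a": (0.80, 0.40),
    "system_b": (0.55, 0.70),
    "system_c": (0.50, 0.60),
    "system_d": (0.30, 0.85),
}

print(pareto_front(scores))    # ['system_a', 'system_b', 'system_d']
print(overall_winner(scores))  # None
```

Here `system_c` is worse than `system_b` on both axes and drops out, but the remaining three each win on one axis and lose on the other, so there is no overall winner. A single pass/fail number would hide exactly this structure.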

Another notable failure mode involves named entities. A single misheard letter in a confirmation code or flight number can break authentication and collapse the entire scenario. The authors also note that multi-step operations proved particularly challenging, for example rebooking a flight while preserving ancillary services such as baggage and seat selection. The gap between pass@3 and pass^3 was also large for many systems: an agent might solve a task once but fail to do so consistently. The current release is still limited to English-language airline scenarios, so future extensions include noisy conditions, accents, other languages, and new domains.

What This Means

The voice agent market is shifting from flashy demos to more rigorous engineering evaluation. If EVA or similar frameworks take hold, the winners will be not the systems that merely sound natural but those that are simultaneously accurate, concise, and able to carry conversations through to a result reliably in real scenarios, not just in lucky single runs. For enterprise deployments, this is a particularly important shift.

