Голосовые агенты не готовы к двуязычным клиентам. Исследование ServiceNow-AI
Новое исследование ServiceNow-AI показало, что большинство голосовых агентов с трудом понимают людей, говорящих одновременно на двух языках (code-switching). Ко
AI-processed from Hugging Face Blog; edited by Hamidun News
Voice agents perform poorly with bilingual clients. This was shown by research from the ServiceNow-AI team, which tested seven popular speech recognition systems on examples of code-switching — when people smoothly transition from one language to another in a single utterance, interweaving words and phrases.
The problem remains acute
More than half of the world's population speaks two or more languages. For them, it is natural to mix languages in speech — especially when discussing specialized terms or in informal settings. However, voice assistants and agents are trained primarily on monolingual data and often do not understand when a client switches between languages. This becomes a critical problem for companies serving multilingual markets through voice interfaces. Support services, booking, consultations — all of this works worse if the client is accustomed to speaking in two languages simultaneously. ServiceNow-AI decided to measure the scale of the problem and find which systems handle it better.
How and what was tested
Researchers created a synthetic dataset of 918 utterances across four language pairs: Spanish-English, French-English (Canadian dialect), German-English. Examples were taken from real-world scenarios of HR and IT support operations — dialogues that actually occur in multilingual corporations. Each of the seven automatic speech recognition (ASR) systems was evaluated on three metrics:
- WER — standard transcription accuracy (Word Error Rate)
- SWER — errors that change the meaning of an utterance (Semantic WER)
- AER — errors that break the system's understanding of meaning (Answer Error Rate)
This set of metrics helps understand not just whether the system makes errors, but how critical those errors are.
Results: there are leaders
In terms of WER, two leaders pulled ahead: ElevenLabs Scribe V2 and AssemblyAI Universal-3 Pro. Google Gemini Flash 3 took third place, demonstrating a solid result. Meanwhile, OpenAI Whisper showed weak results — the system by default does not transcribe bilingual speech, but translates it, which does not align with the task at all. When transcription of code-switching speech is needed, Whisper becomes of little use. Interestingly, the best models show minimal accuracy degradation compared to monolingual baseline levels. This means that code-switching for them is not a catastrophe, but simply a somewhat more complex task.
Strange pattern of errors
Error analysis revealed something unexpected: words in the embedded English language made more errors than in the base language (Spanish, French, or German). This is paradoxical because English is normally easier for these models to recognize in a monolingual context. Researchers suggest two reasons. First, this could be an issue with technical and specialized vocabulary, which is more common in English. Second, models experience difficulty adapting when switching languages in the middle of a phrase. The model's brain, so to speak, becomes "distracted" by the new language and misses details.
What this means
Voice systems of the new generation are becoming better, but bilingualism is still a complex case. For business, this means two things. First, if you use voice agents to support multilingual customers, the choice of ASR system is critical — the difference between ElevenLabs and Whisper could be in the tens of percent. Second, this is an area of active development, and in subsequent versions, results will likely improve.
Want to stop reading about AI and start using it?
AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.