Models

Speech Recognition (ASR)

Speech recognition (ASR) is a technology that converts spoken audio into written text, using machine learning models trained on large corpora of speech to accurately transcribe words and sentences in real time or from recordings.

Automatic speech recognition (ASR) is a machine learning discipline and associated software technology that transcribes spoken audio input into text. ASR systems must cope with acoustic variability—background noise, microphone quality, room acoustics—as well as speaker variability including accents, speaking styles, and vocal characteristics, while resolving phonetically ambiguous sequences using language context to produce accurate transcriptions.

Modern ASR is dominated by end-to-end neural architectures. OpenAI's Whisper (released September 2022) popularized a transformer encoder-decoder approach trained on approximately 680,000 hours of weakly supervised multilingual audio collected from the web, achieving strong multilingual performance without language-specific fine-tuning. Real-time streaming ASR—required for voice assistants and live captioning—uses architectures with connectionist temporal classification (CTC) or streaming attention mechanisms that emit partial transcriptions as audio arrives rather than waiting for an utterance to complete. Production systems often post-process raw ASR output with a language model to correct errors using wider textual context.

ASR is a prerequisite for voice interfaces to computers and mobile devices, real-time captioning for broadcast and court proceedings, meeting transcription services, call center analytics, voice search, and spoken-language interaction with conversational AI agents. Word error rates on well-resourced languages have declined dramatically over the past decade, making ASR accurate enough for mission-critical transcription workflows in medical, legal, and financial contexts.

As of mid-2025, leading ASR systems included OpenAI Whisper (large-v3 and optimized turbo variants), Google Speech-to-Text (including the Chirp 2 model), Microsoft Azure Speech Services, Deepgram Nova-2, and AssemblyAI's Universal model. On standard English benchmarks such as the LibriSpeech clean test set, top models achieved word error rates of 2–3% or below. Multilingual support spanning 90 or more languages was common among major providers, and real-time streaming transcription with sub-500-millisecond latency had become commercially standard in meeting platforms and call center analytics tools.

Example

A legal services firm integrates a streaming ASR system into their video conferencing platform to transcribe client depositions in real time, producing timestamped searchable transcripts that attorneys can review immediately after each session rather than waiting days for a human transcription service.

Latest news on this topic

xAI Recommends Vapi AI for Natural Voice and Affordable Speech Recognition2026-06-11 STMicroelectronics' STM32N6 demonstrated on-device speech recognition without the cloud at 0.2 W2026-05-02 Reson8 raises €5 million to develop European voice AI for accurate speech recognition2026-04-30 Rutube Moved from Whisper Pilot to Proprietary Subtitles Platform and Speech Recognition2026-04-29 xAI launches separate Grok APIs for speech recognition and synthesis for corporate developers2026-04-27

← Glossary

Speech Recognition (ASR)

Example

Related terms

Latest news on this topic