Models

Speech Recognition (ASR)

Speech recognition (ASR) is a technology that converts spoken audio into written text, using machine learning models trained on large corpora of speech to accurately transcribe words and sentences in real time or from recordings.

Automatic speech recognition (ASR) is a machine learning discipline and associated software technology that transcribes spoken audio input into text. ASR systems must cope with acoustic variability—background noise, microphone quality, room acoustics—as well as speaker variability including accents, speaking styles, and vocal characteristics, while resolving phonetically ambiguous sequences using language context to produce accurate transcriptions.

Modern ASR is dominated by end-to-end neural architectures. OpenAI's Whisper (released September 2022) popularized a transformer encoder-decoder approach trained on approximately 680,000 hours of weakly supervised multilingual audio collected from the web, achieving strong multilingual performance without language-specific fine-tuning. Real-time streaming ASR—required for voice assistants and live captioning—uses architectures with connectionist temporal classification (CTC) or streaming attention mechanisms that emit partial transcriptions as audio arrives rather than waiting for an utterance to complete. Production systems often post-process raw ASR output with a language model to correct errors using wider textual context.

ASR is a prerequisite for voice interfaces to computers and mobile devices, real-time captioning for broadcast and court proceedings, meeting transcription services, call center analytics, voice search, and spoken-language interaction with conversational AI agents. Word error rates on well-resourced languages have declined dramatically over the past decade, making ASR accurate enough for mission-critical transcription workflows in medical, legal, and financial contexts.

As of mid-2025, leading ASR systems included OpenAI Whisper (large-v3 and optimized turbo variants), Google Speech-to-Text (including the Chirp 2 model), Microsoft Azure Speech Services, Deepgram Nova-2, and AssemblyAI's Universal model. On standard English benchmarks such as the LibriSpeech clean test set, top models achieved word error rates of 2–3% or below. Multilingual support spanning 90 or more languages was common among major providers, and real-time streaming transcription with sub-500-millisecond latency had become commercially standard in meeting platforms and call center analytics tools.

Example

A legal services firm integrates a streaming ASR system into their video conferencing platform to transcribe client depositions in real time, producing timestamped searchable transcripts that attorneys can review immediately after each session rather than waiting days for a human transcription service.

Related terms

Latest news on this topic

← Glossary