Alice in smart homes: 41 million devices and the on-device model shift
Background
Yandex launched Alice in 2017 as a mobile voice assistant. By 2024 the ecosystem had grown to 41M active devices: Station speakers, Station Mini, Station Max, TV boxes, and automotive media platforms. Alice handles 4.2 billion voice queries per month, comparable to Yandex search volume. Smart Home is the fastest-growing segment: 12 million devices are controlled via Alice.
Problem
Every "turn on bedroom lights" voice command made a round trip through the cloud: mic → speaker → datacenter → ASR (speech-to-text) → NLU → device dispatcher → response back to speaker → command to lamp. Round-trip latency averaged 1.8s and reached 3.2s at p95. This broke the UX: users perceived Alice as "slow". Worse, when home internet dropped (not rare in Russia), Alice became a decorative paperweight that couldn't even turn off the lights by voice.
The second problem was cost. 4.2B queries per month × $0.0003 per query for STT processing = $1.26M/month for speech recognition alone, with NLU GPU inference on top. Smart-home devices are sold near cost, so hardware margin is near zero; the economics depend on subscriptions and home-control share.
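The cost estimate is simple enough to check directly. The sketch below reproduces it from the two figures quoted in the text. The constant and function names are illustrative, and it assumes, as the text does, that every query is billed at the same flat STT price.

```python
# Figures quoted in the text; names are illustrative, not from Yandex code.
QUERIES_PER_MONTH = 4_200_000_000   # 4.2B voice queries per month
STT_COST_PER_QUERY_USD = 0.0003     # flat per-query STT price (assumed uniform)


def monthly_stt_cost(queries: int, cost_per_query: float) -> float:
    """Return the monthly STT bill in USD for a flat per-query price."""
    return queries * cost_per_query


baseline_usd = monthly_stt_cost(QUERIES_PER_MONTH, STT_COST_PER_QUERY_USD)
print(f"Baseline STT cost: ${baseline_usd:,.0f}/month")  # $1,260,000/month
```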
Solution
In 2024, Yandex rewrote the Alice stack to be on-device first. The new ARM chip in Station Max 2 (TSMC 5nm, 12 TOPS) runs a local speech recognition model (300MB, "Whisper-tiny on steroids", fine-tuned on 270,000 hours of Russian conversations). Local NLU uses a compact LLaMA-like 1.5B-parameter architecture quantized to INT4. The full stack fits in 1.8GB of RAM.
Smart-home commands (turn on/off, dim, scene, +/-) are processed entirely on-device, with 240ms latency. The cloud is engaged only for complex queries (search, Q&A, generative answers). When the internet is down, Alice keeps working for all home-control commands.
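The local-versus-cloud split described here can be pictured as a small routing decision. The following is a minimal sketch, not Yandex's implementation: the intent names, the `Route` enum, and the `route` function are all hypothetical, and a real system would get the intent from the on-device NLU model rather than from a fixed string.

```python
from enum import Enum


class Route(Enum):
    LOCAL = "local"              # handled fully on the speaker
    CLOUD = "cloud"              # forwarded to the datacenter
    UNAVAILABLE = "unavailable"  # needs the cloud, but the internet is down


# Hypothetical labels for the command classes listed in the text.
LOCAL_INTENTS = frozenset({"turn_on", "turn_off", "dim", "scene", "adjust"})


def route(intent: str, internet_up: bool) -> Route:
    """Pick where a recognized intent is executed."""
    if intent in LOCAL_INTENTS:
        # Home-control commands never leave the device, so they keep
        # working regardless of connectivity.
        return Route.LOCAL
    if internet_up:
        return Route.CLOUD
    return Route.UNAVAILABLE


print(route("turn_off", internet_up=False))  # Route.LOCAL
print(route("search", internet_up=True))     # Route.CLOUD
print(route("search", internet_up=False))    # Route.UNAVAILABLE
```

Checking the local set first means connectivity is consulted only for queries that actually need the cloud.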
The key technical challenge was TTS. The live Alice voice was cloud-optimized and required 8GB of GPU VRAM. The team rebuilt it on the VITS architecture, shrank it to 180MB, and added on-device streaming inference. In blind tests, 87% of listeners could not distinguish the on-device voice from the cloud one.
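A rough back-of-the-envelope budget shows how the quoted model sizes relate to the 1.8GB figure. This is a sketch under stated assumptions: the source does not give a breakdown or say whether TTS counts toward the 1.8GB, sizes are treated as decimal megabytes, and INT4 weights are counted at exactly 4 bits per parameter with no quantization scales, activations, or cache overhead.

```python
MB = 1_000_000  # decimal megabytes (assumption about how the text quotes sizes)


def int4_weight_bytes(num_params: int) -> int:
    """Bytes needed for weights stored at 4 bits per parameter."""
    return num_params * 4 // 8


# Sizes quoted in the text; the grouping into one budget is an assumption.
components = {
    "stt_model": 300 * MB,
    "nlu_weights_int4": int4_weight_bytes(1_500_000_000),
    "tts_model": 180 * MB,
}

total_bytes = sum(components.values())
budget_bytes = 1_800 * MB
headroom_bytes = budget_bytes - total_bytes

for name, size in components.items():
    print(f"{name:18s} {size / MB:7.0f} MB")
print(f"{'total':18s} {total_bytes / MB:7.0f} MB")
print(f"{'headroom':18s} {headroom_bytes / MB:7.0f} MB")
```

Under these assumptions the three models take about 1,230MB, leaving roughly 570MB for activations, caches, and runtime.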
Result
Latency for smart-home commands dropped from 1.8s to 240ms, a 7.5× speedup. Smart-home CSAT rose from 6.8 to 9.1 on a 10-point scale. 99.7% of smart-home queries are now handled locally, which cut STT cost by 91%. That is $1.15M/month in savings on STT alone, plus a similar amount on NLU.
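The headline numbers are consistent with each other, as a quick check shows. The sketch below uses only figures from the text (the $1.26M/month baseline comes from the Problem section); the variable names are illustrative.

```python
CLOUD_LATENCY_MS = 1800       # average round trip before the rewrite
LOCAL_LATENCY_MS = 240        # on-device latency after the rewrite
BASELINE_STT_USD = 1_260_000  # monthly STT cost from the Problem section
STT_COST_REDUCTION = 0.91     # "STT cost down 91%"

speedup = CLOUD_LATENCY_MS / LOCAL_LATENCY_MS
savings_usd = BASELINE_STT_USD * STT_COST_REDUCTION

print(f"Speedup: {speedup:.1f}x")                        # Speedup: 7.5x
print(f"STT savings: ${savings_usd / 1e6:.2f}M/month")   # STT savings: $1.15M/month
```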
Unexpected effect: engagement growth. Users now use voice control 2.3× more often because "you don't have to wait". An active voice user brings the platform 3,700 rubles more per year (via subscription and plus services) than a touch-only user.
The key remaining problem is model updates, which must reach twelve million devices running diverse firmware versions. The team built differential updates: new models roll out by user segment, with rollback if p95 latency regresses. The average update is 48MB and takes 90 seconds over WiFi.
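The segment-by-segment rollout with a latency guard can be sketched as follows. This is an illustration only, not the team's tooling: the `staged_rollout` function, the `measure` callback (standing in for "install the update on this segment and collect latency samples"), the 10% tolerance, and the segment names are all assumptions. It also assumes a regression rolls back only the failing segment and halts the rollout.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


@dataclass
class RolloutResult:
    updated: List[str] = field(default_factory=list)  # segments kept on the new model
    rolled_back_at: Optional[str] = None               # segment that tripped the guard


def staged_rollout(
    segments: Sequence[str],
    baseline_p95_ms: float,
    measure: Callable[[str], Sequence[float]],
    tolerance: float = 1.10,  # hypothetical: allow up to 10% p95 regression
) -> RolloutResult:
    """Update segments in order; stop at the first p95-latency regression."""
    result = RolloutResult()
    for segment in segments:
        observed = p95(measure(segment))
        if observed > baseline_p95_ms * tolerance:
            result.rolled_back_at = segment  # roll this segment back and halt
            return result
        result.updated.append(segment)
    return result


# Hypothetical post-update latency samples (ms) per segment.
samples = {
    "canary": [200, 220, 250],
    "beta": [230, 900, 950],
    "general": [210, 230],
}
outcome = staged_rollout(["canary", "beta", "general"], 240, samples.__getitem__)
print(outcome.updated, outcome.rolled_back_at)  # ['canary'] beta
```

Gating on p95 rather than the mean means a slow tail in one segment stops the rollout even when typical latency still looks healthy.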
Lessons learned
- On-device first changes architecture: latency moves from 'optimization' to 'design feature'.
- TTS is the hardest piece for on-device. Voice is recognizable — users notice degradation instantly.
- Differential OTA updates are critical: 48MB vs 1.8GB monolith is the only way to ship ML to 12M devices.
- Engagement grows from latency reduction more than from new features. 2.3× usage came from 'no wait'.
- Smart-home margin is in subscription ecosystem, not hardware. Retaining a voice user = service retention.