
STMicroelectronics' STM32N6 demonstrated on-device speech recognition without the cloud at 0.2 W

AI-processed from Habr AI; edited by Hamidun News

Microcontrollers with a built-in NPU are moving into territory that previously belonged almost entirely to cloud speech recognition services. An experiment on the STM32N6 showed that arbitrary speech can already be recognized locally on the device: without an internet connection, in near real time, and at a power draw of about 0.215 W.

How the system works

The project author split speech recognition into three blocks: an acoustic model, a decoder, and rescoring. The acoustic model is the heaviest part: it takes the raw audio signal from the microphone and converts it into a sequence of phonemes. The decoder is meant to assemble words from these phonemes, and the rescoring block is meant to double-check the result using context.

At the current stage, the acoustic model, the foundation of the whole system, is already running on the STM32N6. In practice, the device listens to speech in real time, runs it through the NPU, and outputs a stream of phonemes. In the demonstration, recognized words and numbers are displayed at the top, and the "raw" phonemes predicted by the model appear below them.

For now, phonemes are converted to words through hard matching rather than a full-fledged language decoder. This still limits the system, but the fact that the acoustic model runs locally on a microcontroller matters more than the current "wrapper" around it.
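To make the limitation concrete, here is a minimal sketch of what hard matching of a phoneme stream against a word list could look like. The article does not describe the author's actual matching code, so the lexicon, the ARPAbet-style phoneme symbols, and the function name `match_words` are all assumptions made for illustration.

```python
# Hypothetical lexicon: exact phoneme sequences mapped to words.
# The phoneme symbols are ARPAbet-style and chosen only for illustration.
LEXICON = {
    ("W", "AH", "N"): "one",
    ("T", "UW"): "two",
    ("TH", "R", "IY"): "three",
    ("S", "T", "AA", "P"): "stop",
}
MAX_LEN = max(len(key) for key in LEXICON)


def match_words(phonemes):
    """Greedy longest-match lookup: exact matches only, no scoring, no context."""
    words = []
    i = 0
    while i < len(phonemes):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_LEN, len(phonemes) - i), 0, -1):
            word = LEXICON.get(tuple(phonemes[i:i + n]))
            if word is not None:
                words.append(word)
                i += n
                break
        else:
            i += 1  # no lexicon entry starts here: skip one phoneme
    return words


clean = match_words(["W", "AH", "N", "T", "UW", "S", "T", "AA", "P"])
noisy = match_words(["W", "AH", "M"])  # a single wrong phoneme
```

With these inputs, `clean` holds the three expected words while `noisy` is empty: one mispredicted phoneme is enough to lose the word, which is the kind of brittleness a real decoder and rescoring stage are meant to absorb.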

Figures and limitations

The strongest result is power consumption. During active speech recognition, the entire system draws about 215 mW: roughly 160 mW for the NPU and Cortex-M55 core, 45 mW for external Flash and PSRAM memory, and about 10 mW for external pins.
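The breakdown can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the article; the dictionary keys are labels invented for this example.

```python
# Power figures reported in the article, in milliwatts.
power_mw = {
    "npu_and_cortex_m55": 160,
    "external_flash_and_psram": 45,
    "external_pins": 10,
}

total_mw = sum(power_mw.values())

# Share of each consumer in the total, in percent, rounded to one decimal.
shares_pct = {name: round(100 * mw / total_mw, 1) for name, mw in power_mw.items()}

print(total_mw, shares_pct)
```

The three components add up to the reported 215 mW, and the NPU plus core account for roughly three quarters of it, so any further savings would mostly have to come from there.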

Moreover, this figure was measured before any optimization: the core still runs without aggressive sleep modes, and the NPU is loaded at only 10.4%, so there is room to reduce power further. The quality figures also look serious for this class of hardware.

The model contains 8.5 million parameters and lost almost no accuracy after quantization to int8, showing a phoneme error rate (PER) of 5.3% on dev_clean and 14.4% on dev_other on the target device. Inference on the NPU took 52 ms per 500 ms of audio, and full latency was 985 ms. Nearly half of this delay comes not from the hardware but from the "future window" that the model uses to predict phonemes more accurately.

  • Acoustic model size — 8.5 million parameters
  • Power consumption during recognition — about 0.215 W
  • NPU inference time — 52 ms for 500 ms of audio
  • Quality loss after quantization to int8 — less than 0.5%
  • RAM usage — 18%, NPU load — 10.4%
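The timing figures in the list above are related to each other, and a short calculation shows how. This is a sketch based only on the numbers in the article. The weight footprint at the end is an added estimate that assumes one byte per int8 weight and four bytes per float32 weight, and ignores quantization scales and activation buffers.

```python
# Figures reported in the article.
NPU_INFERENCE_MS = 52
AUDIO_CHUNK_MS = 500
TOTAL_LATENCY_MS = 985
PARAMS = 8_500_000

# Fraction of time the NPU is busy if it processes every 500 ms chunk in 52 ms.
npu_load_pct = round(100 * NPU_INFERENCE_MS / AUDIO_CHUNK_MS, 1)

# How many times faster than real time the NPU processes audio.
speedup = round(AUDIO_CHUNK_MS / NPU_INFERENCE_MS, 1)

# Share of the end-to-end latency spent in NPU inference.
npu_share_of_latency_pct = round(100 * NPU_INFERENCE_MS / TOTAL_LATENCY_MS, 1)

# Assumption: 1 byte per int8 weight, 4 bytes per float32 weight.
int8_weights_mb = PARAMS * 1 / 1e6
float32_weights_mb = PARAMS * 4 / 1e6

print(npu_load_pct, speedup, npu_share_of_latency_pct, int8_weights_mb, float32_weights_mb)
```

The computed duty cycle of 10.4% is consistent with the reported NPU load, and NPU inference makes up only about 5% of the 985 ms latency, which fits the article's point that the delay is dominated by factors other than compute.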

It's worth noting a comparison with larger systems. By PER, this model turned out to be comparable to wav2vec 2.0 Base and HuBERT Base, although those are about 11 times larger and not designed to run on microcontrollers. At the same time, the author honestly outlines the project's boundaries: this is not yet a replacement for full-fledged dictation, but rather a local engine for short commands and phrases where autonomy and energy efficiency are critical.

Where the microcontroller will win

The strength of this approach is not universality at any cost, but closing the gap between simple keyword spotting and heavy cloud ASR. Ordinary local voice interfaces require an exact command match, whereas here the device can interpret different formulations of the same request. Instead of one rigid phrase, a user can say "make it warmer," "add about five degrees," or "turn up the temperature," and the system will map all of them to the same action.
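One simple way to picture "different formulations, one action" is a keyword-based intent mapper that runs on the recognized text. The article does not say how the author maps phrases to actions, so the rule table, the action names, and the `interpret` function below are purely hypothetical.

```python
import re

# Hypothetical rules: an action fires if the phrase contains every keyword
# of at least one of its keyword sets.
RULES = {
    "increase_temperature": [
        {"warmer"},
        {"turn", "up", "temperature"},
        {"add", "degrees"},
    ],
    "decrease_temperature": [
        {"cooler"},
        {"turn", "down", "temperature"},
    ],
}


def interpret(phrase):
    """Return the first action whose keywords all occur in the phrase, else None."""
    tokens = set(re.findall(r"[a-z]+", phrase.lower()))
    for action, keyword_sets in RULES.items():
        if any(keywords <= tokens for keywords in keyword_sets):
            return action
    return None


phrases = ["make it warmer", "add about five degrees", "turn up the temperature"]
actions = [interpret(p) for p in phrases]
```

All three example phrases from the article resolve to the same hypothetical `increase_temperature` action, while an unrelated phrase returns `None`. A production system would likely also extract the numeric value ("five degrees") and handle recognition errors.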

This opens up quite practical scenarios: smart homes without sending voice outside, voice input of numbers and parameters at manufacturing facilities, work in warehouses, medical devices, and transport, where the network is unstable or absent entirely. Another plus is room for growth. Currently, the STM32N6 uses only 18% of RAM, and the NPU is utilized at about one-tenth of its capacity.

The next steps are clear: add a phoneme decoder, a language model, and noise suppression. These should turn a convincing technical prototype into a truly useful user interface.

What this means

The STM32N6 does not replace cloud speech recognition, but it shows that some tasks can already be moved to the edge with confidence. Where autonomy, privacy, cost, and low power consumption matter, MCUs with an NPU are starting to look less like an experiment and more like a new practical class of voice interfaces.
