
Repka-Pi 4 gets local text-to-speech with Piper and FastAPI, without a GPU and with autostart


AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

A working setup for local text-to-speech synthesis with the Piper neural network engine has been demonstrated on the Repka-Pi 4. Even with 2 GB of RAM and no GPU, the single-board computer can voice text received over HTTP, start the service automatically when the system boots, and serve external devices.

Why this is interesting

The story here is not about yet another cloud API, but about moving TTS directly onto a compact local computer. The author shows that a modern single-board computer can handle not just simple scripts, but also Russian-language speech synthesis of acceptable quality. This matters wherever privacy, autonomy, and operation without the internet are required: smart home devices, voice kiosks, toys, local assistants, and educational setups can speak without sending text to external services.

It's particularly important to contrast two approaches. Lightweight eSpeak NG requires almost no resources and starts immediately, but sounds too mechanical. Piper, on the other hand, uses a neural network model in ONNX format and delivers a more natural voice even without a graphics accelerator. This makes Repka-Pi 4 not just a board for experiments, but a foundation for real interfaces where synthesis needs to be intelligible and tolerable to the ear, not just formally functional.

What the solution consists of

The setup is assembled from open components that can be deployed locally. The author uses eSpeak NG as the baseline option and Piper TTS for higher-quality sound. Piper is installed in a Python environment, and a Russian-language voice model is then loaded onto the board. Synthesis can be launched either from the command line or directly from Python, with the text passed in as a stream and the resulting audio sent straight to playback.

"This is the voice of a robot from the 80s" is how the article describes the result of eSpeak NG compared to Piper.

  • eSpeak NG — the lightest option for boards with strict resource constraints.
  • Piper TTS — neural network synthesis based on VITS and ONNX with noticeably more natural speech.
  • FastAPI server — HTTP interface for external clients that send text for voicing.
  • Queue and separate thread — a mechanism that prevents blocking the client until synthesis is complete.
  • systemd service — autostart of TTS after powering on the board.

The article separately discusses two modes of Piper operation: through WAV file recording and through streaming output without an intermediate file. The second option is particularly useful for embedded scenarios because it eliminates unnecessary disk operations and accelerates the path from text to sound. The author also shows how to use aplay and sounddevice, and notes that warnings about audio buffer underrun or the lack of a GPU on Repka-Pi 4 do not prevent achieving a result suitable for practical use.
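The two modes can be sketched from Python by driving the Piper command-line tool. This is a minimal illustration, not the author's code: the model file name is hypothetical, the 22050 Hz sample rate is an assumption that should be checked against the voice's config file, and the flag spellings (`--output_file`, `--output_raw`) follow the Piper README and may differ between Piper releases.

```python
import subprocess
from typing import List, Optional

# Assumed sample rate; check the .json config shipped with the voice model.
SAMPLE_RATE = 22050


def piper_command(model_path: str, wav_path: Optional[str] = None) -> List[str]:
    """Build a Piper call: WAV-file mode if wav_path is given, raw PCM on stdout otherwise."""
    cmd = ["piper", "--model", model_path]
    if wav_path is not None:
        cmd += ["--output_file", wav_path]   # flag spelling is an assumption
    else:
        cmd += ["--output_raw"]              # flag spelling is an assumption
    return cmd


def aplay_command(sample_rate: int = SAMPLE_RATE) -> List[str]:
    """Build an aplay call that reads raw 16-bit PCM from stdin."""
    return ["aplay", "-r", str(sample_rate), "-f", "S16_LE", "-t", "raw", "-"]


def say_to_wav(text: str, model_path: str, wav_path: str) -> None:
    """Mode 1: synthesize into a WAV file on disk."""
    subprocess.run(
        piper_command(model_path, wav_path),
        input=(text + "\n").encode("utf-8"),
        check=True,
    )


def say_streaming(text: str, model_path: str) -> None:
    """Mode 2: pipe raw PCM from Piper into aplay without an intermediate file."""
    piper = subprocess.Popen(
        piper_command(model_path), stdin=subprocess.PIPE, stdout=subprocess.PIPE
    )
    player = subprocess.Popen(aplay_command(), stdin=piper.stdout)
    piper.stdout.close()  # lets Piper receive SIGPIPE if aplay exits early
    piper.stdin.write((text + "\n").encode("utf-8"))
    piper.stdin.close()   # end of input, so Piper can finish and exit
    player.wait()
    piper.wait()
```

Calling `say_streaming("Привет", "voice.onnx")` on the board would play the phrase directly, while `say_to_wav` leaves a file that can be inspected or replayed later.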

How the server works

The key part of the project is a speech synthesis server based on FastAPI. It runs on the Repka-Pi 4 itself, listens for HTTP requests, and receives text via the POST /say route. After that, the server does not make the client wait for the entire phrase to be voiced. Instead, the task is placed in a queue, and a separate background thread handles calling Piper, assembling the PCM stream, and outputting audio through sounddevice. For automation systems, this is more convenient than a synchronous call, which would freeze the device's entire logic.
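The queue-and-thread pattern described here can be sketched with the Python standard library alone. The class and method names below are hypothetical, and the `speak` callable stands in for the real Piper synthesis and sounddevice playback, which the article does not reproduce in detail.

```python
import queue
import threading
from typing import Callable


class SpeechWorker:
    """Accepts text immediately and voices it later on a background thread."""

    def __init__(self, speak: Callable[[str], None]) -> None:
        self._speak = speak                      # blocking synthesis + playback
        self._tasks: "queue.Queue[str]" = queue.Queue()
        self._busy = threading.Event()           # set while a phrase is voiced
        threading.Thread(target=self._run, daemon=True).start()

    def enqueue(self, text: str) -> None:
        """Queue a phrase and return at once, without waiting for playback."""
        self._tasks.put(text)

    def status(self) -> dict:
        """Snapshot of the worker state: is it speaking, how many phrases wait."""
        return {"speaking": self._busy.is_set(), "queued": self._tasks.qsize()}

    def wait_idle(self) -> None:
        """Block until every queued phrase has been processed."""
        self._tasks.join()

    def _run(self) -> None:
        while True:
            text = self._tasks.get()
            self._busy.set()
            try:
                self._speak(text)
            except Exception as exc:  # keep the worker alive after a failed phrase
                print(f"synthesis failed: {exc}")
            finally:
                self._busy.clear()
                self._tasks.task_done()
```

In a FastAPI application, a POST /say handler would only call `enqueue(text)` and return a response, so the HTTP client is released long before the audio finishes playing.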

There is also a service route GET /status: through it you can check whether the server is free, whether playback is currently happening, and how many tasks are already in the queue. The model is loaded once when the application starts, so it doesn't need to be initialized on every request. For continuous operation, a systemd unit file is provided: it starts the service after system boot, enables restart on failures, and allows viewing logs through journalctl. According to the author's description, the delay before voicing begins ranges from one to several seconds and depends on the length of the text.
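From the side of an external device, the two routes could be called roughly as follows. This is a sketch under assumptions: the host name and port are hypothetical, and the JSON body shape `{"text": ...}` and the JSON reply are guesses, since the article names only the routes themselves.

```python
import json
import urllib.request

# Hypothetical address of the board; the article does not give host or port.
BASE_URL = "http://repka-pi.local:8000"


def build_say_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare POST /say; the {"text": ...} body shape is an assumption."""
    body = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/say",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_status_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare GET /status."""
    return urllib.request.Request(f"{base_url}/status", method="GET")


def send(request: urllib.request.Request, timeout: float = 5.0) -> dict:
    """Send a prepared request and decode the reply, assumed to be a JSON object."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A client could poll `send(build_status_request())` before queuing a long phrase, or simply call `send(build_say_request("Привет"))` and continue with its own logic.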

What this means

The practical value of this use case is that a local voice interface no longer requires expensive hardware or constant cloud connectivity. Repka-Pi 4 can already be used in home automations, terminals, robots, and educational projects, and with the emergence of more powerful boards, we should expect faster offline TTS, a combination of synthesis with speech recognition, and full-fledged Russian-language assistants working entirely on the device. For the Russian-language DIY market, this is a rare example of how a ready-made stack can quickly be transferred from an article to a working prototype.
