
StepFun Releases StepAudio 2.5 Realtime Voice Model with Roleplay Support

StepFun released the StepAudio 2.5 Realtime voice model with fully customizable personas. The model understands paralinguistics (intonation, emotions)…

AI-processed from MarkTechPost; edited by Hamidun News

StepFun has released StepAudio 2.5 Realtime, its next-generation voice model. The model works end-to-end, generates speech in real time, and adapts its voice to a given scenario through fully customizable personas.

How the Model Works

StepAudio 2.5 Realtime is an integrated voice system that combines speech recognition and speech synthesis. It is accessed through a WebSocket API and supports both Chinese and English. Real-time processing means responses arrive with minimal latency, which is critical for interactive applications and voice assistants.

The key feature of StepAudio is fully customizable personas without retraining. This goes beyond swapping one voice for another: the model adapts its tone, speech style, and even accent to whoever or whatever it is voicing. That matters most for character voicing and for building personalized assistants.
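To make the WebSocket and persona ideas more concrete, here is a minimal Python sketch of how a client might prepare the messages it sends over such a connection. The article does not document the actual StepFun protocol, so everything here is an assumption made for illustration: the `Persona` fields, the `session.configure` and `audio.append` event names, and the choice of 16 kHz 16-bit mono PCM sent in 100 ms chunks.

```python
import base64
import json
from dataclasses import asdict, dataclass
from typing import Iterator


@dataclass
class Persona:
    """Hypothetical persona description; field names are not from StepFun."""
    name: str
    style: str            # free-text description of tone, pace, accent
    language: str = "en"  # the article mentions Chinese and English support


def session_message(persona: Persona) -> str:
    """Serialize a persona into a JSON text frame (assumed schema)."""
    return json.dumps({"type": "session.configure", "persona": asdict(persona)})


def audio_frames(pcm: bytes, chunk_bytes: int = 3200) -> Iterator[str]:
    """Split raw PCM into base64-encoded JSON frames for streaming.

    3200 bytes is 100 ms of audio at 16 kHz, 16-bit, mono
    (16000 samples/s * 2 bytes * 0.1 s); the format itself is an assumption.
    """
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "audio.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

In a real client, each returned string would be sent as a text frame over an open WebSocket connection, with the persona message first and the audio frames after it. Sending small chunks rather than one large buffer is what lets the server start responding before the speaker has finished, which is where the low latency comes from.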

Paralinguistics and Naturalness

The model is trained with a specialized form of reinforcement learning from human feedback (RLHF) to understand paralinguistics: the aspects of speech that go beyond the words themselves, such as intonation, rhythm, emotional coloring, and well-placed pauses. Standard voice systems often sound monotone and unnatural. StepAudio 2.5 Realtime is designed to address this problem by making speech more lively and expressive.

Key features of the model include:

  • Full persona customization without retraining
  • Deep understanding of paralinguistics (intonation, pace, emotions)
  • Real-time synthesis via WebSocket API
  • Support for Chinese and English
  • Special RLHF for roleplay and voicing

Benchmark Results

In April 2026, StepAudio 2.5 Realtime underwent independent testing across five parameters and ranked first in all of them. The headline result is 80.41 points in human evaluation, which suggests that listeners rate its output as very close to natural speech. For understanding paralinguistics, the model scored 82.18 points, an indication that it does more than generate sound and also picks up on the meaning and emotion behind the words. For voice assistants this is critical: they should sound like a real conversation partner, not a robot reading text aloud.

What This Means

StepAudio 2.5 Realtime is a step toward more natural voice systems that compete with OpenAI Voice and ElevenLabs. For developers, it offers a powerful tool for building voice-interface applications with genuine emotional expressiveness.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
