StepFun Releases StepAudio 2.5 Realtime Voice Model with Roleplay Support
StepFun released the StepAudio 2.5 Realtime voice model with fully customizable personas. The model understands paralinguistics (intonation, emotions)…
AI-processed from MarkTechPost; edited by Hamidun News
StepFun has released StepAudio 2.5 Realtime, its next-generation voice model. The model works end-to-end, generates speech in real time, and can adapt its voice to different scenarios through fully customizable personas.
How the Model Works
StepAudio 2.5 Realtime is an integrated voice system that combines speech recognition and speech synthesis. The model is accessed through a WebSocket API and supports both Chinese and English. Because processing happens in real time, responses arrive with minimal latency, which is critical for interactive applications and voice assistants.
The key feature of StepAudio is fully customizable personas that require no retraining. This goes beyond changing the voice: the model adapts its tone, speech style, and even accent depending on who or what it should voice. This is especially important for character voicing and for building personalized assistants.
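As a rough illustration of the WebSocket-based flow described above, the Python sketch below builds the kind of JSON messages a client might send: one that sets up a persona and a series that stream audio in small chunks. The message types (`session.configure`, `audio.append`), the `Persona` fields, and the 16 kHz, 16-bit mono PCM format are all assumptions made for illustration, not StepFun's documented API. The sketch only constructs messages and does not open a connection.

```python
import base64
import json
from dataclasses import dataclass, asdict


@dataclass
class Persona:
    """Hypothetical persona description; the field names are illustrative only."""
    name: str
    style: str
    language: str  # "zh" or "en", the two languages the article mentions


def session_message(persona: Persona) -> str:
    """Serialize a hypothetical session-setup message that carries the persona."""
    return json.dumps({"type": "session.configure", "persona": asdict(persona)})


def audio_messages(pcm: bytes, chunk_bytes: int = 3200):
    """Yield hypothetical audio messages, one per fixed-size chunk of PCM bytes.

    3200 bytes is 100 ms of 16 kHz, 16-bit mono audio (an assumed format).
    """
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "audio.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# Example usage with stand-in data: 8000 zero bytes is 250 ms of silence.
persona = Persona(name="Narrator", style="calm, warm, slow pace", language="en")
print(session_message(persona))
for message in audio_messages(bytes(8000)):
    print(len(message))
```

Sending short chunks instead of a whole recording is what keeps latency low in a streaming setup, since the server can start working before the speaker has finished. In a real client, each string would be sent over the open WebSocket connection in order, using whatever message schema the provider actually specifies.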
Paralinguistics and Naturalness
The model is trained with a specialized form of reinforcement learning from human feedback (RLHF) to handle paralinguistics: the aspects of speech that go beyond the words themselves, such as intonation, rhythm, emotional coloring, and well-placed pauses. Conventional voice systems often sound monotone and unnatural. StepAudio 2.5 Realtime aims to address this by making speech livelier and more expressive.
Key features of the model include:
- Full persona customization without retraining
- Deep understanding of paralinguistics (intonation, pace, emotions)
- Real-time synthesis via WebSocket API
- Support for Chinese and English
- Specialized RLHF for roleplay and character voicing
Benchmark Results
In April 2026, StepAudio 2.5 Realtime was independently tested on five metrics and ranked first on all of them. The headline result is 80.41 points in human evaluation, which suggests that listeners rate its output as close to natural speech. For paralinguistic understanding, the model scored 82.18 points, which indicates that it does more than generate sound: it also picks up on the meaning and emotion behind the words. This matters for voice assistants, which should sound like a real conversation partner, not a robot reading text aloud.
What This Means
StepAudio 2.5 Realtime is a step toward more natural voice systems that can compete with OpenAI Voice and ElevenLabs. For developers, it offers a tool for building voice-interface applications with more convincing emotional expressiveness.