
Stability AI Releases Stable Audio 3 for Fast Music Generation

Stability AI released Stable Audio 3, a set of open models for generating instrumental music and sound effects. The models are trained in three stages, the first of which uses flow matching.

Source: MarkTechPost. Collage: Hamidun News.

Stability AI unveiled Stable Audio 3, a new family of models for generating instrumental music and sound effects. Compared with previous versions, the new models are significantly faster and require fewer computational resources, which puts sound generation within reach of a much broader range of users. The company released open model weights, so developers can run the models themselves and integrate them into their applications.

Quality Accessible on Any Hardware

The main change in the third version is broader access to sound generation. The company released open weights for two model variants, small and medium. The small version runs on a MacBook Pro with an M4 chip and needs no discrete GPU, so a current consumer laptop is sufficient. People without dedicated hardware can therefore generate sound and music on their own devices.

The medium variant requires a graphics card with 8 GB of VRAM, which many consumer GPUs released in the last few years provide. A mid-range card such as the RTX 3060 is enough to run the model locally, without cloud services or monthly subscriptions. Older budget cards with 6 GB of VRAM, such as the GTX 1660, fall below the stated requirement.

Both variants generate stereo audio at a 44.1 kHz sampling rate, the same rate used for CD audio and most music distribution. The output format is therefore suitable for commercial projects, including films, games, podcasts, and music albums.
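To give a sense of scale, the short sketch below counts how many sample values a stereo 44.1 kHz clip contains and how large it is as raw audio. The 16-bit sample depth is an assumption made for illustration; the article states only the sampling rate and the channel count.

```python
# Back-of-the-envelope size of raw stereo audio at 44.1 kHz.
# The 16-bit sample depth is an assumption for illustration; the article
# only states the sampling rate and the channel count.

SAMPLE_RATE_HZ = 44_100   # from the article
CHANNELS = 2              # stereo, from the article
BYTES_PER_SAMPLE = 2      # assumed 16-bit PCM


def sample_values(seconds: float) -> int:
    """Number of individual sample values in a clip of the given length."""
    return round(seconds * SAMPLE_RATE_HZ) * CHANNELS


def pcm_bytes(seconds: float) -> int:
    """Uncompressed size in bytes under the assumed sample depth."""
    return sample_values(seconds) * BYTES_PER_SAMPLE


if __name__ == "__main__":
    for seconds in (5, 30):
        print(seconds, "s:", sample_values(seconds), "values,",
              pcm_bytes(seconds), "bytes")
```

Even a 5-second clip amounts to several hundred thousand values, which helps explain why generation speed and memory use matter for running such models on consumer hardware.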

Engineering Solutions for Speed

The compactness and speed come from an unconventional training pipeline. Stability AI moved away from the traditional approach and used a three-stage process designed to improve sound quality and reduce computational requirements at the same time:

  • Flow matching in the first stage, to train the base model on large audio datasets from various sources
  • Distillation warmup, a model compression technique that preserves quality despite a radical reduction in size
  • Adversarial post-training, a final stage that improves realism and sound quality to a level described as difficult to distinguish from human performance
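For readers unfamiliar with the first stage, here is a minimal sketch of the general flow-matching idea on plain Python lists. It is not Stability AI's implementation: the straight-line path between a noise sample and a data sample is an assumption (one common choice), and every function name, along with the toy DATA_POINT, is hypothetical.

```python
import random

# Minimal flow-matching sketch on plain Python lists.
# Assumptions (not from the article): a straight-line path between a noise
# sample x0 and a data sample x1, and hypothetical helper names throughout.

DATA_POINT = [0.5, -0.25, 1.0]  # toy stand-in for one encoded audio example


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    predicted = velocity_fn(x_t, t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn, x0, steps):
    """Generate by integrating dx/dt = v(x, t) from t=0 to t=1."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def ideal_velocity(x, t):
    """Exact velocity field when the dataset is the single DATA_POINT."""
    return [(d - xi) / (1.0 - t) for d, xi in zip(DATA_POINT, x)]


if __name__ == "__main__":
    noise = [random.gauss(0.0, 1.0) for _ in DATA_POINT]
    print("loss:", flow_matching_loss(ideal_velocity, noise, DATA_POINT, 0.3))
    print("sample:", euler_sample(ideal_velocity, noise, steps=4))
```

In a real system a neural network would take the place of ideal_velocity and be trained to minimize this loss over many noise and data pairs. The steps argument corresponds to the number of network evaluations per generated clip, which is one of the main drivers of generation time.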

This three-stage approach achieved a rare balance between quality and speed. In traditional machine learning, these two requirements often contradict each other: high quality requires large models that run slowly, while speed requires compression that loses quality. Stability AI found the middle ground.

Results Better Than Competitors

On the BBC Sound Effects benchmark, where models are evaluated on 5-second audio clips, Stable Audio 3 medium achieved a FAD (Fréchet Audio Distance) score of 0.369. That is lower than every other openly available model tested in the company's research. The gap between Stable Audio 3 and the nearest competitor is approximately 15-20%, which is a substantial margin for generative models.

For reference, a lower FAD is better. The metric measures how far the statistics of generated audio are from those of real reference recordings, so a lower score means the output is statistically closer to real audio. In other words, on this benchmark Stable Audio 3 outperformed every open model in the comparison, including the company's own previous versions.
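The idea behind the metric can be sketched in a few lines. FAD fits a Gaussian to embeddings of reference audio and another to embeddings of generated audio, then measures the Fréchet distance between the two. The version below is a simplification that assumes diagonal covariances, whereas the full metric uses complete covariance matrices. The embeddings here are plain lists of numbers; the article does not say which embedding model the benchmark uses, and all function names are hypothetical.

```python
import math
from statistics import fmean, pvariance


def column_stats(embeddings):
    """Per-dimension mean and variance of a list of equal-length vectors."""
    columns = list(zip(*embeddings))
    means = [fmean(c) for c in columns]
    variances = [pvariance(c) for c in columns]
    return means, variances


def frechet_distance_diag(reference, generated):
    """Frechet distance between two Gaussians with diagonal covariance.

    Simplification: full FAD uses the complete covariance matrices. With
    diagonal covariances the trace term reduces to the per-dimension sum
    of squared differences between standard deviations.
    """
    mu_r, var_r = column_stats(reference)
    mu_g, var_g = column_stats(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                   for a, b in zip(var_r, var_g))
    return mean_term + cov_term


if __name__ == "__main__":
    reference = [[0.0, 1.0], [2.0, 3.0]]
    shifted = [[v + 1.0 for v in row] for row in reference]
    print(frechet_distance_diag(reference, reference))
    print(frechet_distance_diag(reference, shifted))
```

Identical sets give a distance of zero, and the distance grows as the generated embeddings drift away from the reference ones, which is why lower scores are better.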

What This Means

Sound generation is moving from an experimental niche to a practical working tool. Independent musicians and video creators will be able to generate background music, sound effects, and ambience directly on a laptop, without relying on cloud services or an internet connection. Local generation also improves privacy, since no data is sent to external servers.

For professional studios, this could also reduce the cost of licensing royalty-free music and sound libraries. Instead of purchasing ready-made compositions, developers and content creators will be able to generate unique audio in minutes, saving both money and the time spent searching for suitable music.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.