ACE-Step 1.5 from ACE Studio outperforms Suno v5 and runs music generation locally
ACE-Step 1.5 from ACE Studio and StepFun is a rare case of open-source music generation catching up with commercial AI. The model runs locally with as little…
AI-processed from Habr AI; edited by Hamidun News
ACE-Step 1.5 is billed as the first truly capable open source model for music generation that both runs locally and catches up with closed services in quality. According to the developers and an analysis on Habr, the model outperforms Suno v5 on SongEval, runs on GPUs with as little as 4 GB of VRAM and generates a full track in seconds.
Why this matters
Until now, the AI music market was divided quite simply: users who needed convenient, high-quality results went to Suno, Udio or other closed services. Open source alternatives existed, but usually fell short on quality, speed or hardware requirements. ACE-Step 1.5 attempts to break this pattern. The model was released by ACE Studio and StepFun, and alongside the release they published a paper on arXiv, which is still rare for music generators. According to the official table, ACE-Step 1.5 scores 8.09 on SongEval, and the ACE-Step 1.5 XL version scores 8.12. For comparison, Suno v5 scores 7.87 in the same table.
At the same time, the model shows strong indicators for Lyric Alignment: 8.35 for the base version and 8.42 for XL.
In practice, this means not only a higher overall track rating, but also better vocal-to-text alignment, which remains one of the most difficult tasks for generative music.
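To put the SongEval figures quoted above side by side, the margins can be worked out with a few lines of Python. This is only arithmetic on the three numbers cited from the official table; the dictionary and the `margin_over` helper are illustrative names, not part of any ACE-Step tooling.

```python
# SongEval scores as quoted in the article from the official table.
SONGEVAL = {
    "ACE-Step 1.5": 8.09,
    "ACE-Step 1.5 XL": 8.12,
    "Suno v5": 7.87,
}


def margin_over(baseline: str, scores: dict[str, float]) -> dict[str, float]:
    """Absolute score difference of every other entry relative to the baseline."""
    base = scores[baseline]
    return {
        name: round(value - base, 2)
        for name, value in scores.items()
        if name != baseline
    }


margins = margin_over("Suno v5", SONGEVAL)
for name, delta in margins.items():
    print(f"{name}: {delta:+.2f} vs Suno v5")
```

The leads come out at 0.22 and 0.25 points, so the claim is that the model edges ahead of Suno v5 on this metric rather than winning by a wide margin.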
How the model works
The key idea of ACE-Step is to separate composition from synthesis. In the first stage, a language model takes the user's prompt and turns it into a detailed song plan: genre, tempo, structure of verses and choruses, instruments, lyrics and metadata. The paper describes this module as a kind of composer agent. It doesn't generate sound directly, but relieves the main audio module of having to guess what the user actually wanted. The more precise the plan, the less chaos at the next stage.
In the second stage, a Diffusion Transformer (DiT) takes over. The base version uses a DiT with roughly 2 billion parameters, the XL version 4 billion. It receives the finished plan and synthesizes audio in latent space. Acceleration comes from distillation: instead of the usual 50–100 diffusion steps, the model needs only 4–8. Hence the speed figures: a full track in about 2 seconds on an A100 and under 10 seconds on an RTX 3090. It is precisely this combination of an LM as planner and a DiT as renderer that makes the release noteworthy.
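The plan-then-render split described above can be sketched as a toy in Python. Everything here is a hypothetical stand-in: `SongPlan`, `plan_song` and `render` are invented for illustration and are not the ACE-Step API. The planner returns fixed values instead of calling a language model, and the renderer only counts denoiser passes instead of producing audio. The sketch is meant to show the data flow and why fewer diffusion steps translate into fewer model calls.

```python
from dataclasses import dataclass


@dataclass
class SongPlan:
    """Hypothetical container for what the planner hands to the renderer."""
    genre: str
    tempo_bpm: int
    structure: list[str]
    instruments: list[str]
    lyrics: str


def plan_song(prompt: str) -> SongPlan:
    """Stage 1 stand-in: a real planner would query a language model here."""
    return SongPlan(
        genre="synth-pop",
        tempo_bpm=110,
        structure=["verse", "chorus", "verse", "chorus"],
        instruments=["synth", "bass", "drums"],
        lyrics=f"(lyrics written for: {prompt})",
    )


def render(plan: SongPlan, steps: int) -> int:
    """Stage 2 stand-in: counts denoiser passes instead of synthesizing audio.

    In a real diffusion renderer each pass would be conditioned on the plan.
    """
    calls = 0
    for _ in range(steps):
        calls += 1  # one pass of the diffusion model per step
    return calls


plan = plan_song("upbeat song about summer rain")
full_calls = render(plan, steps=50)      # low end of the usual 50-100 range
distilled_calls = render(plan, steps=8)  # high end of the distilled 4-8 range
print(plan.genre, plan.tempo_bpm, plan.structure)
print(f"{full_calls / distilled_calls:.2f}x fewer denoiser passes")
```

Even the most conservative pairing (50 steps against 8) gives 6.25 times fewer passes, and the most favorable one (100 against 4) gives 25 times fewer. Wall-clock time will not scale exactly with the step count, since the planner and audio decoding also take time.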
What it can do in practice
Beyond regular text-to-music, ACE-Step 1.5 aims to be a general-purpose tool for music work, not just a generator that produces a single track from a description. The project covers the kinds of workflows expected from professional software: you can create a song from scratch, but also edit existing material, regenerate an individual section, rearrange the source or fit an accompaniment to a voice. For an open source system, this is already closer to a full working environment than to a demo.
- Cover generation: rearrangement of an existing composition in a different style
- Repainting: regeneration of separate fragments without rebuilding the entire track
- Vocal-to-BGM: creating accompaniment for ready-made vocals
- LoRA fine-tuning: adjustment to your own style on a small set of songs
- Support for 50+ languages and tracks from 10 seconds to 10 minutes
Another strong argument is hardware requirements. Base mode can work locally with less than 4 GB VRAM, and for heavier configurations, offload options are available. The project supports not only NVIDIA but also Mac on Apple Silicon, AMD and Intel, and local launch comes down to ready-made scripts with a Gradio interface. For musicians, producers and developers, this looks like a real opportunity to experiment without a cloud subscription and without sending materials to an external service.
Where the weak spots are
The developers don't hide the fact that the model has notable limitations. The main problem is unstable results. The same prompt can produce a strong track with one seed and a weak one with another; the authors themselves call this gacha-style behavior. They also list rough vocals that lack nuance, poor performance in some genres such as Chinese rap, unnatural transitions when repainting and overly coarse control over musical parameters. In other words, it is not yet possible to specify a song with precise harmonic logic and get fully predictable results.
Because of this, it's important not to confuse the model with the service. Suno still wins over most users on simplicity: open the site, write a couple of lines, get a song. ACE-Step 1.5 requires installation, a GPU, prompt tuning and tolerance for variability. In return it offers privacy, a local pipeline, no mandatory subscription and the ability to fine-tune through LoRA. For the mass-market user this is not yet a replacement for Suno, but for those who need control over the process, the situation is already changing.
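One common way to live with the seed-to-seed variability mentioned above is to generate several candidates and keep the best one. The sketch below simulates that in Python under loud assumptions: `generate` is a hypothetical stand-in that returns a random quality score derived from the prompt and seed, not a real ACE-Step call, and in practice the score would come from listening or from an automatic metric.

```python
import random


def generate(prompt: str, seed: int) -> float:
    """Hypothetical stand-in for one generation run.

    Returns a simulated quality score in [0, 1) that depends only on the
    prompt and the seed, mimicking how the same prompt varies across seeds.
    """
    rng = random.Random(f"{prompt}|{seed}")
    return rng.random()


def best_of(prompt: str, seeds: list[int]) -> tuple[int, float]:
    """Generate once per seed and keep the highest-scoring result."""
    scored = [(seed, generate(prompt, seed)) for seed in seeds]
    return max(scored, key=lambda pair: pair[1])


prompt = "melancholic piano ballad"
single_seed, single_score = best_of(prompt, [0])
best_seed, best_score = best_of(prompt, list(range(8)))
print(f"seed 0 alone: {single_score:.3f}")
print(f"best of 8 (seed {best_seed}): {best_score:.3f}")
```

Because the best of eight candidates includes seed 0, it can never score lower than seed 0 alone. With generation times measured in seconds, trading a few extra runs for a better pick is a plausible workflow, though the article does not say whether the project automates it.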
What this means
ACE-Step 1.5 shows that music generation is ceasing to be the exclusive domain of closed platforms. If an open source model already outperforms a commercial player on some metrics and runs on consumer hardware, the market will move toward local, customizable and cheaper music AI tools.