Together AI Unveils ATLAS: A Speculator That Accelerates LLM Inference by up to 4x

Q: What is the source?

Originally published on Together AI Blog. Hamidun News processes and adapts the material with AI.

Q: When was it published?

2026-05-21. Reading time: 3 min.

Hamidun News Editorial

Together AI has presented ATLAS (AdapTive-LeArning Speculator System), a speculative decoding system for LLM inference that keeps improving as it is used. According to Together AI, it reaches 500 tokens per second on DeepSeek-V3.1 and 460 on Kimi-K2, nearly 4x faster than an unoptimized baseline, with no manual tuning. The results were measured on NVIDIA HGX B200 hardware using traffic from the Arena Hard benchmark.

What is Speculative Decoding

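Speculative decoding is one of the most effective ways to accelerate text generation in LLMs. In standard decoding, the model generates one token per forward pass, strictly in sequence. With speculative decoding, a smaller, faster speculator (draft model) proposes several tokens ahead, and the main (target) model verifies all of them in parallel in a single forward pass. Tokens that pass verification are kept; at the first rejected token, the target model's own choice is used and drafting restarts from there. Because every emitted token is checked against the target model, the output matches standard decoding (with the standard verification scheme this is guaranteed mathematically), while the number of expensive target passes drops. The gain depends on the acceptance rate α (also called the acceptance coefficient), the share of proposed tokens the target model accepts: the higher α is, the more tokens each target pass yields. In practice, this means lower latency per generated token and faster overall generation. Time-to-first-token is largely unaffected, since it is dominated by processing the prompt.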
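
The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Together AI's implementation: the two "models" are hypothetical toy functions over integer tokens, verification uses the simple greedy variant (accept a drafted token only if it equals the target's own choice), and the lookahead `k` is an assumed parameter. A production system verifies all drafted positions in one batched forward pass and, when sampling, uses a rejection-sampling rule instead of exact matching.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to its greedy next token


def target_model(prefix: List[int]) -> int:
    """Hypothetical 'large' model: next token depends on the last two tokens."""
    return (prefix[-1] * 3 + prefix[-2] + 1) % 10


def draft_model(prefix: List[int]) -> int:
    """Hypothetical speculator: matches the target except right after token 8."""
    if prefix[-1] == 8:
        return 0  # deliberate mistake so that some proposals get rejected
    return target_model(prefix)


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Standard decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens, one after another (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every position. Here this is a Python loop;
        #    a real system scores all k positions in one forward pass.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)
            if proposal[i] != expected:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


if __name__ == "__main__":
    out, passes = speculative_decode(target_model, draft_model, [1, 2], 20)
    print(out == greedy_decode(target_model, [1, 2], 20), passes)
```

The output is identical to plain greedy decoding because every kept token is the target model's own choice; the speculator only changes how many target passes are needed to produce it.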

How ATLAS Differs from Other Solutions

Standard speculators are trained once on general workloads and stay fixed, regardless of where they are deployed. Custom speculators are trained on a company's own data, but they capture only a single point in time. When workloads evolve (codebases grow, traffic patterns change, request distributions shift, new user types emerge), even highly optimized speculators start falling behind. ATLAS takes a different approach: it learns continuously while in use, adapting to live traffic and to the target model's behavior in real time. The longer the service runs, the better ATLAS predicts the target model's next tokens, and the higher the acceptance rate. This creates a positive feedback loop in which each new request serves as a training example that improves the speculator.
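
To make the feedback loop concrete, here is a deliberately simple stand-in: a speculator that is just a bigram table, updated from the target model's verified outputs. The article does not describe how ATLAS actually learns, so the class name `AdaptiveBigramSpeculator`, its methods, and the sample token sequence are all hypothetical. The sketch only shows the general idea that acceptance on a workload can rise once the speculator has seen that workload.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


class AdaptiveBigramSpeculator:
    """Toy speculator: predicts the most frequent follower of the last token."""

    def __init__(self) -> None:
        self.counts: Dict[str, Counter] = defaultdict(Counter)

    def propose(self, last_token: str) -> Optional[str]:
        """Return the most common next token seen so far, or None if unseen."""
        followers = self.counts.get(last_token)
        if not followers:
            return None
        return followers.most_common(1)[0][0]

    def update(self, verified_tokens: List[str]) -> None:
        """Learn from a sequence the target model actually produced."""
        for prev, nxt in zip(verified_tokens, verified_tokens[1:]):
            self.counts[prev][nxt] += 1


def acceptance_rate(spec: AdaptiveBigramSpeculator, tokens: List[str]) -> float:
    """Share of positions where the speculator's guess equals the real token."""
    pairs = list(zip(tokens, tokens[1:]))
    hits = sum(1 for prev, nxt in pairs if spec.propose(prev) == nxt)
    return hits / len(pairs)


if __name__ == "__main__":
    spec = AdaptiveBigramSpeculator()
    # Hypothetical target-model output for one kind of request.
    response = "def add ( a , b ) : return a + b".split()
    before = acceptance_rate(spec, response)  # nothing learned yet
    spec.update(response)                     # the request becomes training data
    after = acceptance_rate(spec, response)   # similar traffic is now predictable
    print(f"acceptance before: {before:.2f}, after: {after:.2f}")
```

A real adaptive speculator would be a neural draft model trained online rather than a lookup table, but the loop has the same shape: serve, observe what the target accepted, update.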

Results in Practice

Together AI reported the following results on NVIDIA HGX B200 hardware:

DeepSeek-V3.1: 500 TPS (tokens per second), 2.65x faster than standard decoding
Kimi-K2-0905: 460 TPS, also well above standard decoding
Comparison with Groq: in fully adapted mode, ATLAS outperforms Groq's specialized inference hardware
Overall: nearly 4x acceleration compared with an unoptimized baseline
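
As a quick sanity check on the DeepSeek-V3.1 figures, the arithmetic below derives the standard-decoding throughput they imply. The baseline value is computed from the two reported numbers (500 TPS and 2.65x); it is a derived estimate, not a figure stated by Together AI, and the function names are illustrative.

```python
def implied_baseline_tps(accelerated_tps: float, speedup: float) -> float:
    """Throughput of the slower system implied by a reported speedup factor."""
    return accelerated_tps / speedup


def ms_per_token(tps: float) -> float:
    """Average time per generated token, in milliseconds."""
    return 1000.0 / tps


if __name__ == "__main__":
    atlas_tps = 500.0        # reported for DeepSeek-V3.1
    reported_speedup = 2.65  # reported versus standard decoding
    baseline_tps = implied_baseline_tps(atlas_tps, reported_speedup)
    print(f"implied standard decoding: {baseline_tps:.1f} TPS")
    print(f"per-token latency: {ms_per_token(baseline_tps):.2f} ms -> "
          f"{ms_per_token(atlas_tps):.2f} ms")
```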

Efficiency comes from balancing two key parameters: the acceptance rate (α), which measures how often the target model accepts the speculator's proposals, and the relative latency (c), the ratio of the speculator's latency to the target model's. A larger speculator tends to raise α but also raises c, so the two pull in opposite directions. ATLAS automatically finds the sweet spot where the speculator runs very fast while its predictions remain accurate enough for a high acceptance rate.
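
The trade-off between α and c can be made quantitative with the idealized model commonly used to analyze speculative decoding. It assumes each drafted token is accepted independently with probability α and that the speculator drafts γ tokens per step; γ and the function names below are assumptions of this sketch and do not appear in the article. Under those assumptions, one target pass yields (1 - α^(γ+1)) / (1 - α) tokens on average and costs γ·c + 1 target-pass equivalents. How ATLAS itself chooses its operating point is not described in the source.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target pass with gamma drafted tokens."""
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft accepted, plus one bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, c: float, gamma: int) -> float:
    """Speedup over standard decoding; one step costs gamma * c + 1 passes."""
    return expected_tokens_per_pass(alpha, gamma) / (gamma * c + 1.0)


def best_lookahead(alpha: float, c: float, max_gamma: int = 16) -> int:
    """Draft length in 1..max_gamma that maximizes the expected speedup."""
    return max(range(1, max_gamma + 1),
               key=lambda g: expected_speedup(alpha, c, g))


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        g = best_lookahead(alpha, c=0.05)
        s = expected_speedup(alpha, 0.05, g)
        print(f"alpha={alpha}: best gamma={g}, expected speedup={s:.2f}x")
```

In this model a more accurate speculator justifies drafting further ahead, while a speculator that is never right (α = 0) makes decoding slower than the baseline because its cost is paid for nothing.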

Integration into Together Turbo

ATLAS is integrated into Together Turbo, Together AI's suite of engineering optimizations for accelerating LLMs. It works alongside Together AI's proprietary speculator and also supports custom speculators. The key difference is that ATLAS requires no manual parameter tuning: users get performance improvements simply by using the platform. This matters most for growing teams whose workloads are not static. During growth, requests come from new types of users, business logic keeps changing, and model requirements shift, so earlier optimizations often become outdated within weeks or months. ATLAS keeps updating itself as this happens.

What This Means

LLM inference acceleration is shifting from a one-time engineering task to a built-in, continuously improving feature of the service. Developers and users get progressively faster responses simply by using the platform, without manual intervention. For startups, agencies, and larger companies, this can lower the cost of serving requests to large models in production.
