Together AI Unveils ATLAS: A Speculator That Accelerates LLM Inference by 4x

Together AI has introduced ATLAS, an adaptive, learning-based speculator that accelerates LLM inference by nearly 4x without manual tuning. The system learns from your workload and adapts to it as it is used. On DeepSeek-V3.1 it reaches 500 tokens per second, which is 2.65x faster than standard decoding and ahead of Groq's specialized hardware.

Khamidun Zhemal

AI monitoring · Together AI Blog

May 23, 2026 · 2 min · updated Jul 12, 2026

AI-processed from Together AI Blog; edited by Hamidun News



Together AI has presented ATLAS (AdapTive-LeArning Speculator System), a technology for accelerating LLM inference that improves automatically with use. The system reaches 500 tokens per second on DeepSeek-V3.1 and 460 on Kimi-K2, nearly 4x faster than an unoptimized baseline and with no manual tuning. The results were obtained on NVIDIA HGX B200 hardware using traffic from the Arena Hard benchmark.

What is Speculative Decoding

Speculative decoding is one of the most effective ways to accelerate text generation in LLMs. In standard decoding, the model generates one token per forward pass, strictly in sequence. With speculative decoding, a faster speculator (draft model) proposes several tokens ahead, and the main (target) model then verifies them all in parallel in a single forward pass. Accepted tokens are kept, and the first rejected token is replaced by a token from the target model, so the output quality remains identical to standard decoding (this is mathematically guaranteed). The speedup depends on how often the speculator guesses correctly: with a high acceptance rate (α), each target pass yields several tokens instead of one. In practice, this means lower per-token latency and faster overall generation.
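The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration, not Together AI's implementation: it assumes greedy decoding (exact-match verification rather than the rejection sampling used when tokens are sampled), and the names `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical stand-ins for real models and serving code.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large target model."""
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical cheap speculator: agrees with the target most of the time."""
    nxt = toy_target(ctx)
    return nxt if ctx[-1] % 4 else (nxt + 1) % 11  # wrong when last token is 0, 4, 8


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it commits."""
    # 1. The speculator proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks every proposed position. A real system scores all
    #    positions in one batched forward pass; we loop here for clarity.
    committed: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            committed.append(expected)  # first mismatch: keep the target's token, stop
            return committed
        committed.append(tok)
        ctx.append(tok)

    # 3. Everything was accepted: the same target pass also yields one extra token.
    committed.append(target(ctx))
    return committed


def generate(target: Model, draft: Model, prompt: List[int],
             n_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_tokens after the prompt; also return the number of target rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[: len(prompt) + n_tokens], rounds
```

Because every committed token is either confirmed or supplied by the target, the result is the same sequence that plain greedy decoding with the target would produce; the gain is that fewer target rounds are needed than tokens generated.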

How ATLAS Differs from Other Solutions

Standard speculators are trained once on general workloads and do not adapt to any particular use case. Specialized (custom) speculators are trained on company-specific data, but they capture the workload only at a single point in time. When workloads evolve (codebases grow, traffic patterns change, request distributions shift, new user types emerge), even highly optimized speculators start falling behind. ATLAS takes a fundamentally different approach. The system learns continuously (continual learning) as it is used, adapting to real traffic and to the target model's behavior in real time. The longer you use the service, the better ATLAS predicts the target model's next tokens, and the higher the acceptance rate. This creates a positive feedback loop: each new request is a training example that improves the speculator.
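The feedback loop can be illustrated with a deliberately simple toy. The article does not describe how ATLAS is trained, so the sketch below is only an assumption-laden analogy: a count-based bigram speculator (the hypothetical `OnlineBigramSpeculator`) that starts empty, treats every token emitted by the target as a training example, and is scored on its guess before it learns from that token.

```python
from collections import Counter, defaultdict
from typing import Dict, List


class OnlineBigramSpeculator:
    """Toy draft model: predicts the most frequent successor seen so far."""

    def __init__(self) -> None:
        self.counts: Dict[int, Counter] = defaultdict(Counter)

    def propose(self, last_token: int) -> int:
        seen = self.counts.get(last_token)
        # -1 means "no guess yet" for a context that was never observed.
        return seen.most_common(1)[0][0] if seen else -1

    def update(self, last_token: int, target_token: int) -> None:
        # Each token the target emits becomes a training example.
        self.counts[last_token][target_token] += 1


def acceptance_by_window(stream: List[int], window: int) -> List[float]:
    """Replay target output while learning online; return acceptance rate per window."""
    spec = OnlineBigramSpeculator()
    rates: List[float] = []
    hits = 0
    for i in range(1, len(stream)):
        prev, actual = stream[i - 1], stream[i]
        if spec.propose(prev) == actual:  # guess first ...
            hits += 1
        spec.update(prev, actual)         # ... then learn from the target's token
        if i % window == 0:
            rates.append(hits / window)
            hits = 0
    return rates


# Hypothetical "traffic": the target keeps emitting the same repeating pattern.
traffic = [2, 7, 0, 1, 4] * 8
rates = acceptance_by_window(traffic, window=10)
```

On this toy traffic the acceptance rate in the first window is lower than in later windows, because the speculator has to see each transition once before it can predict it. A production system would use a learned neural speculator rather than counts, but the direction of the effect is the one the article describes.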

Results in Practice

Together AI reported the following results on NVIDIA HGX B200 hardware with Arena Hard traffic:

DeepSeek-V3.1: 500 TPS (tokens per second), 2.65x faster than standard decoding
Kimi-K2-0905: 460 TPS, also a significant improvement
Comparison with Groq: in fully adapted mode, ATLAS outperforms Groq's specialized hardware
Overall: nearly 4x acceleration compared to an unoptimized baseline

Efficiency comes from balancing two key parameters: the acceptance rate (α), which measures how often the target model agrees with the speculator's proposals, and the relative latency (c), the cost of one speculator step relative to one target model step. ATLAS automatically looks for the sweet spot where the speculator runs very fast while its predictions remain accurate enough for high acceptance.
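The trade-off between α and c can be made concrete with the standard back-of-the-envelope analysis of speculative decoding. The sketch below assumes each drafted token is accepted independently with probability α and introduces a lookahead length `gamma` (the number of tokens drafted per round), which the article does not mention. It is a simplified model, not ATLAS's tuning logic, and the function names are hypothetical.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens committed per target pass when gamma tokens are drafted."""
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, c: float, gamma: int) -> float:
    """Speedup over standard decoding under the independence assumption.

    One round costs gamma draft steps (each c target-steps long) plus one
    target pass, i.e. gamma * c + 1 units of target time.
    """
    return expected_tokens_per_round(alpha, gamma) / (gamma * c + 1)


def best_lookahead(alpha: float, c: float, max_gamma: int = 16) -> int:
    """Lookahead length in 1..max_gamma that maximizes the modeled speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: expected_speedup(alpha, c, g))
```

For example, with α = 0.8, c = 0.05, and a lookahead of 4, `expected_speedup` returns about 2.8. With α = 0 it drops below 1, which is why a fast but inaccurate speculator can make generation slower rather than faster.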

Integration into Together Turbo

ATLAS is integrated into Together Turbo, Together AI's package of engineering solutions for accelerating LLMs. It works alongside the proprietary speculator and supports the use of custom speculators. The key difference is that ATLAS requires no manual parameter tuning: users get automatic performance improvements simply by using the platform. This matters most for growing teams whose workloads are not static. During a growth phase, requests come from different user types, business logic keeps evolving, and model requirements change, so older optimizations often become outdated within weeks or months. ATLAS keeps updating itself as this happens.

What This Means

LLM inference acceleration is shifting from a one-time engineering task to a built-in feature of the service that keeps improving. Developers and users get progressively faster responses simply by using the platform, without any manual intervention. For startups, agencies, and larger companies, this means lower costs for serving requests to large models in production.
