Together AI beat TensorRT-LLM by 31% in benchmarks for code agents
AI-processed from Together AI Blog; edited by Hamidun News
Together AI published the first production-oriented inference benchmarks for coding agents — and the results challenge most of the industry's conventional tests.
Why Standard Benchmarks Are Useless
A classic inference benchmark measures a single user on a dedicated server. The numbers look impressive and reveal nothing about real working conditions. In production, dozens or hundreds of requests compete simultaneously for the same KV-cache, memory bandwidth, and GPU cycles. As traffic increases, time to first token (TTFT) grows. At some point the system becomes unusable in practice, well before it formally fails. Different engines reach this point at very different load levels, and that is exactly what needs to be measured.
Together AI designed the test precisely for this scenario: coding agent load, long context, high concurrency, and zero tolerance for latency degradation.
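One way to make "the point where the system becomes unusable" concrete is to pick a TTFT budget and find the highest concurrency level that still meets it. The sketch below is illustrative only: the function names, the nearest-rank percentile, the 2-second budget, and the sample measurements are all assumptions, not part of Together AI's methodology.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def saturation_point(ttft_by_concurrency, budget_seconds, pct=95):
    """Return the highest concurrency level whose pct-th percentile TTFT
    stays within the budget, scanning upward and stopping at the first
    violation. Returns None if even the lowest level misses the budget."""
    last_ok = None
    for level in sorted(ttft_by_concurrency):
        if percentile(ttft_by_concurrency[level], pct) > budget_seconds:
            break
        last_ok = level
    return last_ok


# Hypothetical TTFT samples in seconds, keyed by concurrent requests.
measurements = {
    1: [0.40, 0.50, 0.45],
    16: [0.90, 1.10, 1.00],
    64: [2.50, 4.00, 3.20],
}
usable_up_to = saturation_point(measurements, budget_seconds=2.0)
```

Comparing this value across engines on the same hardware is one way to express "different engines reach this point at very different load levels" as a single number.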
What Makes Coding Agents a Special Workload
Requests from coding agents carry enormous context: the file being edited, surrounding code, conversation history, and snippets retrieved by vector search. Input length ranged from 45,000 to 200,000 tokens, simulating how a real session grows during development. Average response length was around 450 tokens: the agent writes a function, not a novel.
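A load generator for this kind of traffic could look roughly like the sketch below. It assumes the input range means 45,000 to 200,000 tokens and that context grows linearly over a session. The `AgentRequest` type, the linear growth, and the spread around the 450-token mean are hypothetical choices, since the article does not describe the actual distribution.

```python
import random
from dataclasses import dataclass


@dataclass
class AgentRequest:
    input_tokens: int
    output_tokens: int


def sample_session(turns, rng, min_input=45_000, max_input=200_000,
                   mean_output=450):
    """Simulate one coding session whose context grows turn by turn."""
    requests = []
    for turn in range(turns):
        # Linear growth from min_input to max_input (an assumption).
        frac = turn / (turns - 1) if turns > 1 else 0.0
        input_tokens = round(min_input + frac * (max_input - min_input))
        # Output length varies around the mean; the +/-50% spread is assumed.
        output_tokens = max(1, round(rng.uniform(0.5, 1.5) * mean_output))
        requests.append(AgentRequest(input_tokens, output_tokens))
    return requests


session = sample_session(turns=5, rng=random.Random(0))
```

Seeding the random generator keeps runs reproducible, which matters when the same workload has to be replayed against several engines.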
This type of load creates three problems that standard tests miss:
- TTFT sensitivity. The developer sees a blank screen until the first token arrives. In this interval — between sending and the start of streaming — trust in the tool is lost. Generation speed is secondary: once tokens start flowing, the experience feels fast.
- Concurrent long context. Dozens of developers sending requests of 80,000+ tokens fill the KV-cache simultaneously. The scheduler has less room to maneuver, TTFT climbs, and the system degrades long before it formally fails.
- Prefill-heavy profile. The load falls predominantly on prefill, not decode. Engines optimized for long generation lose their usual advantage.
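The first and third points can be illustrated with a short sketch: timing the gap before the first streamed token, and computing how much of a request's token count sits in the prompt. The helper names and the injectable `clock` parameter are assumptions made for illustration. Any iterator of tokens would do as the stream, and the token share is only a rough proxy for where compute time goes.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Consume a token iterator and return
    (ttft_seconds, total_seconds, token_count).
    ttft_seconds is None if the stream yields nothing."""
    start = clock()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            # Time from sending the request to the first streamed token.
            ttft = clock() - start
        count += 1
    total = clock() - start
    return ttft, total, count


def prompt_token_share(input_tokens, output_tokens):
    """Fraction of a request's tokens that belong to the prompt."""
    return input_tokens / (input_tokens + output_tokens)


# With a 45,000-token prompt and a 450-token answer, about 99% of the
# tokens are handled during prefill.
share = prompt_token_share(45_000, 450)
```

Passing the clock in as a parameter keeps the measurement logic testable without real network calls.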
The test ran on 4× NVIDIA B200 for each engine.
Together Inference Engine Results
Together Inference Engine was compared to TensorRT-LLM and other leading OSS engines on identical hardware. Under production-style coding-agent load, the results were:
- +31% tokens per second (TPS) compared to the nearest OSS competitor
- 2x better TTFT at traffic saturation
- 76% lower cost compared to Claude Opus 4 from Anthropic
- Stable latency under high concurrency — where competitors already degrade
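To read the relative figures in absolute terms, the arithmetic is simple. The baseline values below are hypothetical placeholders, not numbers from the benchmark, which reports only the percentages.

```python
def apply_gain(baseline, gain_pct):
    """Value after a relative increase, e.g. +31% throughput."""
    return baseline * (1 + gain_pct / 100)


def apply_reduction(baseline, reduction_pct):
    """Value after a relative decrease, e.g. 76% lower cost."""
    return baseline * (1 - reduction_pct / 100)


# Hypothetical baselines, chosen only to show the arithmetic.
baseline_tps = 10_000          # tokens per second for the competing engine
baseline_cost = 1.00           # cost of some fixed workload, in dollars
baseline_ttft = 3.0            # seconds at saturation

tps = apply_gain(baseline_tps, 31)          # about 13,100 tokens per second
cost = apply_reduction(baseline_cost, 76)   # about $0.24 for the same workload
ttft = baseline_ttft / 2                    # "2x better" read as half the wait
```

Note that a 76% reduction means paying about a quarter of the original price, and that "2x better TTFT" is interpreted here as halving the wait, which is an assumption about the article's wording.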
The gains came from full-stack optimization: ThunderMLA technology, rewritten custom CUDA kernels, and end-to-end profiling on real traffic.
"Most benchmarks measure a single user on a dedicated server. The numbers look great. They are absolutely useless for reasoning about production," says Together AI's blog.
What This Means
The gap between inference engines shows up precisely under real load; synthetic tests do not reveal it. For teams building AI assistants for developers, the choice of provider directly affects how many users get a fast response at the same time, and how many see a blank screen. Production-quality inference is no longer a technical nuance but a competitive advantage.