
Google Introduces TurboQuant: 6x KV-cache Compression for LLMs Without Accuracy Loss


AI-processed from MarkTechPost; edited by Hamidun News

Google Research has introduced TurboQuant, a KV-cache compression algorithm for large language models that targets one of the key limitations of long context. According to the company, the method cuts KV-cache memory consumption by at least 6x and, in certain configurations, speeds up attention computation by up to 8x, with no quality loss on benchmark tasks.

Why KV-cache slows things down

When an LLM works with long context, it stores intermediate keys and values in the KV-cache to avoid recalculating them for each token. This saves computation but quickly runs into memory limits: the larger the model and the longer the dialogue or document, the more the cache grows. As a result, the bottleneck is not only GPU compute but also data transfer between fast on-chip SRAM and HBM. For inference this is especially painful, because long prompts begin to cost significantly more in both latency and hardware requirements.
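As a rough illustration of how quickly the cache grows, the sketch below counts the bytes needed to hold keys and values for every layer and token. The function name `kv_cache_bytes` and the model shape are hypothetical, chosen only to make the arithmetic concrete; they do not describe any model from the article.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bits=16):
    """Bytes needed to store keys and values for every layer and token."""
    # Factor 2 = one key vector plus one value vector per token, head, and layer.
    values = 2 * layers * kv_heads * head_dim * seq_len * batch
    return values * bits / 8


# Hypothetical model shape (assumption, for illustration only).
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=128_000)

for bits in (16, 4, 3):
    size_gib = kv_cache_bytes(bits=bits, **shape) / 2**30
    print(f"{bits:>2} bits per value: {size_gib:.2f} GiB")
```

The cache size scales linearly with both sequence length and bits per value, which is why lowering the bit width is such a direct lever on long-context memory.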

Google compares the KV-cache to a "high-speed digital cheat sheet" that the model uses instead of repeated computations.

Standard quantization partially solves the problem, but it has a side effect: along with the compressed data, additional quantization constants must be stored. These overhead bits eat into the savings, especially when billions of values are involved in a long context. This is the gap TurboQuant targets: the idea is not just to compress vectors more aggressively, but to remove the overhead that stands in the way of real memory savings.
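To make the overhead concrete, here is a minimal sketch of the average cost per stored value under block-wise quantization. The scheme (a 16-bit scale and a 16-bit zero point shared by each block) and the helper name `effective_bits` are assumptions for illustration, not figures from Google's paper.

```python
def effective_bits(code_bits, block_size, constant_bits):
    """Average bits per stored value once per-block constants are included."""
    return code_bits + constant_bits / block_size


# Assumed scheme: 4-bit codes, with one 16-bit scale and one 16-bit
# zero point (32 constant bits in total) shared by each block of values.
for block_size in (32, 64, 128):
    total = effective_bits(code_bits=4, block_size=block_size, constant_bits=32)
    print(f"block of {block_size}: {total} bits per value")
```

Smaller blocks track the data more closely but pay more for constants, so a nominal 4-bit format can cost noticeably more than 4 bits per value in practice.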

How TurboQuant works

TurboQuant consists of two stages.

First comes PolarQuant, an algorithm that randomly rotates vectors, converts them into a more convenient representation, and then quantizes each coordinate individually. This approach preserves the basic structure of the original data without complex per-block tuning.

The second stage is Quantized Johnson-Lindenstrauss, or QJL. It takes the residual error left after the first stage and encodes it with a single additional bit to eliminate systematic bias in inner product and attention score computation.

In practice, this matters for two reasons. First, TurboQuant remains data-oblivious: it requires no calibration datasets, additional training, or fine-tuning for specific models. Second, the method works in online scenarios, where the cache must be compressed during inference itself rather than in a separate offline pipeline.

Google emphasizes that this approach is useful not only for LLMs but also for vector search, where large arrays of embeddings must be stored and compared quickly and cheaply. TurboQuant is due to be presented at ICLR 2026.
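The two-stage idea can be sketched in a few lines of NumPy. This is a simplified illustration, not Google's implementation. It assumes unit-norm vectors, uses a plain uniform grid in place of whatever quantizer the paper actually uses, and applies a Gaussian sign sketch to the residual in the spirit of QJL. All function names are hypothetical, and storing the residual norm alongside the sign bits is an assumption of this sketch.

```python
import numpy as np


def random_rotation(dim, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # flip column signs; q stays orthogonal


def quantize(x, bits, limit):
    """Stage 1 (simplified): uniform per-coordinate quantizer on a fixed grid."""
    levels = 2 ** bits
    step = 2.0 * limit / levels
    codes = np.clip(np.floor((x + limit) / step), 0, levels - 1).astype(np.int64)
    return codes, step


def dequantize(codes, step, limit):
    """Map each code back to the midpoint of its grid cell."""
    return (codes + 0.5) * step - limit


def sign_sketch(residual, proj):
    """Stage 2 (simplified): one sign bit per projection, plus the residual norm."""
    return np.sign(proj @ residual), np.linalg.norm(residual)


def estimate_inner_product(query, approx, signs, res_norm, proj):
    """<query, approx> plus a sign-based estimate of <query, residual>.

    For Gaussian rows s, E[(s.q) * sign(s.r)] = sqrt(2/pi) * <q, r> / ||r||,
    so the rescaled average below is unbiased over the choice of proj.
    """
    m = proj.shape[0]
    correction = np.sqrt(np.pi / 2.0) * res_norm / m * np.dot(proj @ query, signs)
    return np.dot(query, approx) + correction


rng = np.random.default_rng(0)
dim, bits = 128, 3
key = rng.standard_normal(dim)
key /= np.linalg.norm(key)            # assumption: keys are unit-norm
query = rng.standard_normal(dim)

rot = random_rotation(dim, rng)
limit = 4.0 / np.sqrt(dim)            # fixed grid range, independent of the data
codes, step = quantize(rot @ key, bits, limit)
approx = rot.T @ dequantize(codes, step, limit)

proj = rng.standard_normal((dim, dim))  # one sign bit per coordinate
signs, res_norm = sign_sketch(key - approx, proj)

print("exact   :", np.dot(query, key))
print("stage 1 :", np.dot(query, approx))
print("stage 2 :", estimate_inner_product(query, approx, signs, res_norm, proj))
```

Because the rotation and the grid are fixed in advance, nothing in this sketch depends on calibration data, which is the sense in which such a scheme is data-oblivious. A single estimate from the sign sketch is noisy; the point is that its average over random projections matches the true inner product.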

What results did Google achieve

Google tested TurboQuant on LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, using the open models Gemma and Mistral. According to the company, TurboQuant maintains quality on long-context tasks while significantly shrinking the KV-cache. Google's blog emphasizes 3-bit quantization without quality loss on the tested benchmarks, and the paper's abstract on arXiv separately notes full quality preservation at 3.5 bits per channel, with only slight degradation at 2.5 bits.

Key results:

- at least a 6x reduction in KV-cache memory
- up to 8x faster computation of attention logits on H100 in the 4-bit configuration, compared with unquantized 32-bit keys
- no need for additional training, fine-tuning, or calibration datasets
- strong results in vector search too: TurboQuant outperformed the baseline PQ and RaBitQ methods on recall on the GloVe dataset

Separately, Google is banking on applying the method to search. TurboQuant, PolarQuant, and QJL reduce not only memory usage but also index construction time while maintaining the accuracy of nearest neighbor search. This makes the technology interesting not only for generative models but for any infrastructure that works with huge collections of vectors: from semantic search to recommendation systems and the retrieval layer of AI products.
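For readers unfamiliar with the recall metric used in vector search comparisons, the sketch below shows one common way to compute recall@k: the share of the true nearest neighbors that a search over compressed vectors still returns. The helper names are hypothetical, and the rounding step is only a stand-in for a real quantizer, not TurboQuant, PQ, or RaBitQ.

```python
import numpy as np


def top_k(queries, database, k):
    """Indices of the k database vectors with the largest inner product per query."""
    scores = queries @ database.T
    return np.argsort(-scores, axis=1)[:, :k]


def recall_at_k(exact_ids, approx_ids):
    """Average fraction of exact top-k neighbors also found by the approximate search."""
    hits = sum(len(set(e) & set(a)) for e, a in zip(exact_ids, approx_ids))
    return hits / (exact_ids.shape[0] * exact_ids.shape[1])


rng = np.random.default_rng(0)
database = rng.standard_normal((1000, 32))
queries = rng.standard_normal((10, 32))

# Stand-in for a real quantizer (assumption): snap every coordinate to a 0.5 grid.
coarse = np.round(database * 2) / 2

exact = top_k(queries, database, k=10)
approx = top_k(queries, coarse, k=10)
print("recall@10:", recall_at_k(exact, approx))
```

A recall of 1.0 means the compressed index returns exactly the same neighbors as the uncompressed one; lower values quantify how much accuracy the compression costs.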

What this means

TurboQuant suggests that the next major gains for LLMs may come not from more model parameters but from smarter memory management. If Google's approach holds up in production and appears in popular inference stacks, long context will become cheaper, faster, and more accessible even without hardware upgrades. For developers, this is a chance to fit longer sessions and RAG scenarios within the same GPU budget; for users, it means more stable answers on large documents and long conversations.
