Google Unveils TurboQuant: 3-Bit KV-Cache Compression for LLMs, but the Memory Market Panicked Prematurely
AI-processed from Habr AI; edited by Hamidun News
Google Research presented TurboQuant, an algorithm that promises to compress the KV-cache of language models to 3 bits without retraining and with almost no loss of quality. The claims quickly spooked the memory market, although a real revolution in data centers is still far off.
Why the market reacted
On March 24, 2026, Google Research released material on TurboQuant, and just two days later investors began selling shares of memory makers. SK Hynix lost 6.23%, Samsung dropped 4.8%, and in the US Micron and SanDisk fell by approximately 5% and 8% respectively. The market's logic looked straightforward: if large models can manage with significantly less memory for inference, demand for HBM and DRAM in data centers should also decrease. But this reading was too crude, because it ignored where exactly TurboQuant delivers its benefits.
The algorithm does not target all of a model's memory, only the KV-cache. This is the set of key and value vectors that a transformer stores for each token during text generation so that it does not have to recompute them at every step. On short contexts the KV-cache is a minor cost, but on long ones it becomes the main memory consumer. For large models with context windows of tens or hundreds of thousands of tokens, the cache can occupy tens of gigabytes and become a bottleneck for inference at scale.
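To get a feel for the numbers, here is a rough back-of-the-envelope estimate of KV-cache size. The function name `kv_cache_bytes` and the model shape (32 layers, 8 KV heads, head dimension 128) are assumptions chosen for illustration, not figures from Google's material.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits=16, batch=1):
    """Rough KV-cache size: keys and values for every layer, head, and token."""
    # 2 = one key vector plus one value vector per token, per head, per layer
    values = 2 * layers * kv_heads * head_dim * tokens * batch
    return values * bits // 8

GIB = 2 ** 30

# Hypothetical model shape with a 128K-token context, stored at 16 bits per value.
size_16bit = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=131_072)
print(f"16-bit cache for one request: {size_16bit / GIB:.1f} GiB")  # 16.0 GiB
```

The estimate scales linearly with `tokens`, `batch`, and `bits`, which is why long contexts and many parallel requests make the cache dominate. Real compression schemes also store some per-vector metadata, which this sketch ignores.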
What TurboQuant does
The industry learned long ago how to quantize model weights: GPTQ, AWQ, and other approaches exist for this. The KV-cache is harder because it is produced at run time and is unique to each request. You cannot prepare the data once, calibrate a scheme, and then simply apply it everywhere. You need a method that can quickly compress any new vector on the fly without compromising answer quality on long contexts.
This is precisely the task that TurboQuant attempts to solve. Google's scheme has two stages. First, the PolarQuant step rotates the vector with a random orthogonal matrix to make the distribution of its values more even and predictable. After this, a precomputed optimal quantizer can be applied without any calibration data. Then the QJL step encodes the sign of the residual error with one bit, which reduces systematic bias in dot products. As a result, error does not accumulate noticeably over a long sequence of tokens, and the model better preserves answer quality.
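The two ideas described above can be illustrated with a toy sketch. This is not Google's implementation: the function names (`random_rotation`, `compress`, `decompress`) are hypothetical, a fixed uniform grid stands in for the precomputed optimal quantizer, and the second stage is a simplified stand-in for QJL that stores one sign bit per coordinate plus a single average residual magnitude.

```python
import math
import random

def random_rotation(dim, rng):
    """Random orthogonal matrix built by Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows

def rotate(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def dequantize(codes, step, limit):
    # Each code maps back to the centre of its grid cell.
    return [-limit + (c + 0.5) * step for c in codes]

def compress(vec, rotation, bits=2):
    dim = len(vec)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    rotated = rotate(rotation, [x / norm for x in vec])
    # After rotation, coordinates of a unit vector are roughly N(0, 1/dim),
    # so one fixed grid works for any input (assumption: clip at 3 sigma).
    limit = 3.0 / math.sqrt(dim)
    levels = 2 ** bits
    step = 2 * limit / levels
    codes = [min(levels - 1, max(0, math.floor((x + limit) / step))) for x in rotated]
    coarse = dequantize(codes, step, limit)
    # Simplified second stage: keep only the sign of each residual.
    residual = [r - c for r, c in zip(rotated, coarse)]
    signs = [1 if r >= 0 else -1 for r in residual]
    scale = sum(abs(r) for r in residual) / dim
    return {"codes": codes, "signs": signs, "scale": scale,
            "norm": norm, "step": step, "limit": limit}

def decompress(packed, rotation, use_signs=True):
    approx = dequantize(packed["codes"], packed["step"], packed["limit"])
    if use_signs:
        approx = [a + s * packed["scale"] for a, s in zip(approx, packed["signs"])]
    inverse = list(zip(*rotation))  # transpose inverts an orthogonal matrix
    return [packed["norm"] * x for x in rotate(inverse, approx)]

def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

rng = random.Random(0)
dim = 64
R = random_rotation(dim, rng)
key = [rng.gauss(0.0, 1.0) for _ in range(dim)]
packed = compress(key, R)
err_coarse = sq_error(key, decompress(packed, R, use_signs=False))
err_full = sq_error(key, decompress(packed, R, use_signs=True))
print(f"error without sign bits: {err_coarse:.4f}, with sign bits: {err_full:.4f}")
```

Nothing in `compress` depends on calibration data or on the model: the grid is fixed in advance, and the rotation is what makes a single grid a reasonable fit for arbitrary vectors. The sign stage shows how one extra bit per coordinate can shrink the remaining error.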
- 3-bit representation of KV-cache without model retraining
- Up to 8 times faster attention logit computation on H100 according to the authors
- At least 6 times less VRAM for the KV-cache itself
- No mandatory offline calibration for a specific model
Where the method has limits
The most important caveat is that the technology is not yet ready to serve as an industry standard. The community has already noticed that on small models of up to 3B parameters, aggressive compression to 3 bits can noticeably degrade quality, cause repetitions, and worsen text coherence. For many practical scenarios, a 4-bit mode remains the safer option.
Additionally, Google has only published a blog post and a preprint so far. An official implementation does not yet exist, and as of April 29, 2026, the algorithm is not built into vLLM, llama.cpp, or SGLang.
There is also a scientific dispute over who has priority for the idea. Jianyang Gao, one of the authors of the earlier RaBitQ algorithm, claimed that TurboQuant is too close to the RaBitQ approach and misrepresents that earlier work. The complaints include understating the methodological similarity, questionable criticism of RaBitQ's theory, and a comparison under unequal conditions: TurboQuant was tested on an A100 GPU, while RaBitQ in one benchmark ran as single-threaded Python.
The complaint has already been submitted to the ICLR ethics committee, and Google has not yet provided a public response.
What it means
TurboQuant looks less like a crash for the memory market and more like a strong improvement in one narrow part of LLM inference. If Google releases the code and the method enters standard stacks, long contexts will become cheaper, and running large models on more modest hardware will become more realistic. But right now it is an important research result rather than a ready-made industry revolution.