Google's TurboQuant Algorithm Crashes Memory Manufacturer Stocks After Research Publication
AI-processed from TNW; edited by Hamidun News
Google Research has introduced TurboQuant — a compression algorithm for AI models that reduces key-value cache memory by at least six times without noticeable quality loss. The market reacted immediately: after publication on March 24, 2026, investors began reassessing how much memory the generative AI industry would actually need.
What Google Demonstrated
TurboQuant addresses a narrow but expensive bottleneck in large language model inference: the key-value cache, or KV-cache. The cache stores the attention keys and values of tokens the model has already processed, so they do not have to be recomputed at every generation step. The longer the request, document, or conversation, the larger this cache grows and the more GPU memory it consumes.
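To make the growth concrete, here is a rough sizing sketch. The model shape (32 layers, 8 key-value heads, head dimension 128) is a hypothetical configuration chosen for illustration, not one of the models from Google's tests, and the formula counts only raw keys and values, ignoring any framework overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value=16):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor 2: one key vector and one value vector per token, head, and layer.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return values_per_token * seq_len * bits_per_value // 8


# Hypothetical model shape, used only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB per sequence")
```

For this assumed shape the 16-bit cache comes to 0.50 GiB at 4,096 tokens, 4.00 GiB at 32,768 tokens, and 16.00 GiB at 131,072 tokens: the cost scales linearly with context length and is paid per concurrent sequence.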
According to Google, the new method compresses the KV-cache to 3 bits per value instead of the standard 16 bits and reduces memory consumption by at least six times. This matters as much in practice as in research: freed memory allows serving more concurrent requests on the same hardware, running longer context windows, or using larger models without expanding the accelerator fleet.
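A back-of-the-envelope sketch of the capacity effect follows. The 40 GiB memory budget and the 4 GiB per-sequence cache are hypothetical numbers; the sixfold factor is the figure Google reports and is simply taken as an input here. The estimate ignores model weights, activations, and any metadata the compressed format may need.

```python
GIB = 2**30


def max_concurrent_sequences(budget_bytes, cache_bytes_per_sequence, compression_factor=1):
    """How many sequences fit in the budget if each cache shrinks by an integer factor."""
    # Integer arithmetic avoids floating-point rounding at exact boundaries.
    return (budget_bytes * compression_factor) // cache_bytes_per_sequence


# Hypothetical deployment: 40 GiB left for KV-caches, 4 GiB per uncompressed sequence.
budget = 40 * GIB
per_sequence = 4 * GIB

baseline = max_concurrent_sequences(budget, per_sequence)
compressed = max_concurrent_sequences(budget, per_sequence, compression_factor=6)

print(f"16-bit cache:      {baseline} concurrent sequences")
print(f"6x smaller cache:  {compressed} concurrent sequences")
```

Under these assumptions the same memory holds 60 sequences instead of 10. The same headroom could instead be spent on a context window six times longer per sequence.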
The authors write that TurboQuant requires no retraining or fine-tuning and will be presented at ICLR 2026. The tests used models from the Gemma, Mistral, and Llama families, as well as standard long-context benchmarks.
How the Algorithm Works
TurboQuant is based on a two-stage scheme. First, the PolarQuant method converts vectors to a polar representation, which eliminates the overhead data that normally eats into the gains of traditional quantization. Then QJL is applied: a technique that encodes the residual error with just one additional bit per dimension and reduces distortion in the attention computation.
As a result, most of the bit budget goes toward preserving the semantic meaning of the original data rather than technical overhead.
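The general shape of a "coarse quantizer plus a one-bit residual correction" can be shown with a much simpler stand-in. The sketch below is not PolarQuant or QJL: stage one is a plain uniform scalar quantizer, and stage two stores one sign bit per dimension plus a single shared magnitude. Both are assumptions made purely for illustration, as are all function names.

```python
import random


def uniform_quantize(values, bits):
    """Stage 1 (stand-in): snap each value to one of 2**bits evenly spaced levels."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (2**bits - 1) if hi > lo else 1.0
    return [lo + round((v - lo) / step) * step for v in values]


def sign_refine(values, coarse):
    """Stage 2 (stand-in): correct the residual with one sign bit per dimension."""
    residuals = [v - c for v, c in zip(values, coarse)]
    # One shared magnitude for the whole vector: the mean absolute residual.
    magnitude = sum(abs(r) for r in residuals) / len(residuals)
    signs = [1 if r >= 0 else -1 for r in residuals]
    return [c + s * magnitude for c, s in zip(coarse, signs)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
vector = [random.gauss(0.0, 1.0) for _ in range(64)]

coarse = uniform_quantize(vector, bits=2)
refined = sign_refine(vector, coarse)

print(f"error after stage 1: {mse(vector, coarse):.4f}")
print(f"error after stage 2: {mse(vector, refined):.4f}")
```

With the shared magnitude set to the mean absolute residual, the second stage can never increase the squared error, which is why a single extra bit per dimension is a cheap way to recover accuracy. How TurboQuant actually chooses its representation and correction is described in Google's publication, not here.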
Google calls the KV-cache a "high-speed digital cheat sheet" for the model. The headline claims from the publication:
- KV-cache compression from 16 to 3 bits per value
- at least a sixfold memory reduction
- up to an eightfold speedup of attention computation on Nvidia H100 in 4-bit mode
- no training or fine-tuning required
- applicable not only to LLMs but also to vector search
Google claims that on Needle in a Haystack tasks, TurboQuant maintained perfect results even with sixfold cache compression. On LongBench and ZeroSCROLLS, the method matched or surpassed KIVI, a well-known baseline for KV-cache quantization.
Separately, the company tested TurboQuant for vector search and achieved higher recall without large codebooks or dataset-specific tuning. That makes it directly relevant to search, recommendation, and advertising systems.
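For readers unfamiliar with the metric, recall in vector search measures how many of the true nearest neighbors survive when the search runs on compressed vectors. The sketch below is a minimal illustration: the random data, the crude rounding quantizer, and all function names are assumptions and have nothing to do with Google's test setup or with TurboQuant itself.

```python
import random


def top_k(query, vectors, k):
    """Indices of the k vectors with the largest inner product with the query."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]


def recall_at_k(exact, approx):
    """Fraction of true neighbors that also appear in the approximate results."""
    hits = sum(len(set(e) & set(a)) for e, a in zip(exact, approx))
    return hits / sum(len(e) for e in exact)


def crude_quantize(vec, step=0.5):
    """Stand-in compressor: round every coordinate to a multiple of `step`."""
    return [round(x / step) * step for x in vec]


random.seed(0)
database = [[random.gauss(0, 1) for _ in range(16)] for _ in range(200)]
queries = [[random.gauss(0, 1) for _ in range(16)] for _ in range(20)]
compressed = [crude_quantize(v) for v in database]

K = 10
exact = [top_k(q, database, K) for q in queries]
approx = [top_k(q, compressed, K) for q in queries]
recall = recall_at_k(exact, approx)

print(f"recall@{K} on compressed vectors: {recall:.2f}")
```

A better compressor keeps this number closer to 1.0 at the same bit budget, which is the sense in which the article reports "higher recall".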
Why the Market Reacted
The stock market read the publication less as academic progress than as a signal that demand for memory in AI infrastructure might decline. Within hours of the article's release, Micron shares fell 3%, Western Digital dropped 4.7%, and SanDisk fell 5.7%.
The logic is simple: if a key component of inference suddenly requires significantly less memory, future purchases of HBM, DRAM, and storage may not grow as steadily as investors had projected.
But this does not mean the industry suddenly needs six times less hardware. Memory is just one line item in data center expenses, and models' appetite for computation grows faster than any local optimization can offset. Analysts also caution against drawing overly direct conclusions: compression algorithms existed before and did not collapse overall infrastructure demand.
Computing history more often shows the opposite effect: once resources become cheaper, companies begin building heavier and more massive systems on the same budget.
What This Means
TurboQuant is not a reason to write off memory manufacturers but an early indicator of a new stage in the efficiency race. The winners will be not only those who buy more GPUs but also those who can compress inference further without losing quality. For AI products this is a chance to reduce per-request costs, and for the market it is a reminder that software already influences the valuation of hardware companies.