
Google unveils TurboQuant — an algorithm that compresses AI working memory sixfold


AI-processed from TechCrunch; edited by Hamidun News
Source: TechCrunch. Collage: Hamidun News.

Google announced TurboQuant — a new algorithm for compressing neural network working memory that, according to the company, can reduce memory consumption by up to six times. The announcement immediately sparked a wave of jokes in the tech community: users around the world are comparing the development to Pied Piper — a fictional algorithm from HBO's Silicon Valley series, which became a cult symbol of unfounded technological hype. For now, TurboQuant remains a laboratory experiment: the company has disclosed neither a technical paper, nor public code, nor timelines for commercial deployment.

Why neural network memory is a critical problem

Large language models require enormous amounts of GPU memory. The problem has two dimensions. The first is static: the model's weights themselves. Llama 3.1 with 70 billion parameters occupies about 140 gigabytes at 16-bit precision (two bytes per parameter).

The second dimension is dynamic: the intermediate data the model produces while processing each request. This temporary data is called activations, and its largest component in long-context work is the KV cache: the attention keys and values that every layer stores for every token so that they do not have to be recomputed. When a model processes a 100,000-token document, it must keep those entries in memory for all 100,000 tokens. The volume of this data grows linearly with context length and, with a sufficiently long input or many concurrent requests, can exceed the volume of the weights themselves.
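The linear growth is easy to check with rough arithmetic. The sketch below is a back-of-envelope estimate, not a measurement: the layer count, number of key/value heads and head dimension are assumptions taken from the published Llama 3.1 70B configuration, and 16-bit storage is assumed throughout. None of these figures come from Google's announcement.

```python
# Back-of-envelope KV-cache size for a single sequence.
# Architecture numbers are assumptions based on the published
# Llama 3.1 70B configuration, not on the TurboQuant announcement.
NUM_LAYERS = 80        # transformer layers
NUM_KV_HEADS = 8       # key/value heads (grouped-query attention)
HEAD_DIM = 128         # dimension of each head
BYTES_PER_VALUE = 2    # 16-bit storage


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both a key and a value.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


def weight_bytes(num_params: int = 70_000_000_000) -> int:
    """Bytes needed for the weights at the same 16-bit precision."""
    return num_params * BYTES_PER_VALUE


if __name__ == "__main__":
    print(kv_cache_bytes(100_000) / 1e9)  # 32.768 (GB)
    print(weight_bytes() / 1e9)           # 140.0 (GB)
```

Under these assumptions a single 100,000-token request needs roughly 33 GB of cache on top of the 140 GB of weights, so a handful of such requests served in parallel already outweighs the model itself.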

This is exactly where TurboQuant offers a radical solution.

How TurboQuant works

The algorithm applies quantization, a technique for reducing numerical precision, directly to activations in real time. Quantizing static model weights is well established: 8-bit or 4-bit integers replace 16- or 32-bit floating-point numbers. This works well for weights because they do not change after training, so their value ranges can be measured once in advance. Activations are a different matter. Their values vary from request to request and cannot be known ahead of time, so a fixed, precomputed quantization scheme tends to lose quality. Google claims that TurboQuant solves this with adaptive methods that account for activation statistics on the fly. According to the company, this achieves sixfold compression without significant degradation of answer quality.
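Since Google has not published the method, the following is not TurboQuant. It is a minimal sketch of the simplest kind of on-the-fly quantization: the scale is computed from each incoming row of values rather than fixed in advance. The function names, the 4-bit setting and the 16-bit baseline are illustrative assumptions.

```python
def quantize_row(values, bits=4):
    """Symmetric quantization with a scale derived from this row only.

    Returns (integer codes, scale). Generic absmax scheme, shown for
    illustration; it is NOT the unpublished TurboQuant algorithm.
    """
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit codes
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_row(codes, scale):
    """Approximate reconstruction of the original values."""
    return [c * scale for c in codes]


def compression_ratio(src_bits, dst_bits, group_size, scale_bits=16):
    """Ratio of original size to quantized size, counting one scale per group."""
    return (src_bits * group_size) / (dst_bits * group_size + scale_bits)


if __name__ == "__main__":
    row = [0.5, -1.0, 0.25, 2.0]
    codes, scale = quantize_row(row, bits=4)
    print(codes, dequantize_row(codes, scale))
    print(compression_ratio(16, 4, group_size=128))
```

With a 16-bit baseline, 4-bit codes give a little under fourfold compression once the per-group scale is counted. A sixfold ratio would need fewer than about 2.7 bits per value on average, which is where naive schemes like this one usually start to hurt quality. The announcement does not state its baseline, so that figure is an assumption.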

What confirmation of results would mean

Even more modest practical results — two to three-fold compression — would change the economics of AI infrastructure. The largest cloud providers spend tens of billions of dollars annually on GPU infrastructure to service model requests. A significant portion of these costs is driven by memory requirements during inference.
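A rough capacity estimate shows why even twofold compression matters. Every number in the sketch below is an assumption for illustration: a hypothetical server with 640 GB of total GPU memory, 140 GB of 16-bit weights, and about 32.8 GB of uncompressed cache per 100,000-token request (an estimate for a Llama 3.1 70B-class model at 16 bits). Other memory overheads are ignored.

```python
GB = 1_000_000_000


def sequences_that_fit(total_memory_bytes, weight_bytes,
                       kv_bytes_per_sequence, compression):
    """How many long requests fit at once if the cache shrinks `compression` times.

    `compression` is an integer factor (1 means no compression).
    """
    free = total_memory_bytes - weight_bytes
    if free <= 0:
        return 0
    # Integer arithmetic: free / (kv_bytes_per_sequence / compression), rounded down.
    return (free * compression) // kv_bytes_per_sequence


if __name__ == "__main__":
    total = 640 * GB                 # hypothetical 8 x 80 GB server
    weights = 140 * GB               # 70B parameters at 16 bits
    kv_per_request = 32_768_000_000  # assumed cache for one 100,000-token request
    for factor in (1, 2, 3, 6):
        print(factor, sequences_that_fit(total, weights, kv_per_request, factor))
    # prints 15, 30, 45 and 91 concurrent requests respectively
```

Under these assumptions the same hardware serves twice as many long requests at twofold compression and about six times as many at sixfold.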

Compressing activations would mean more powerful models on the same hardware, lower latency through reduced memory traffic, and the ability to handle long contexts without performance degradation. For edge devices, the consequences are even more significant. Currently, running models at the level of Llama 3.1 70B requires multiple graphics cards or aggressive compromises on precision. TurboQuant could significantly lower this barrier, opening powerful models to laptops and workstations with limited memory.

The Pied Piper phenomenon and what's behind it

The comparison to Pied Piper is more than just a meme. In the series, a fictional startup creates a universal compression algorithm with fantastic characteristics, measured by the "Weissman score," a metric invented for the show. The parallels with TurboQuant are obvious: revolutionary numbers, closed code, absence of independent verification.

The difference is that Google is not a garage startup. The company has a long track record of real efficiency work: multi-query and grouped-query attention, which shrink the KV cache, and knowledge distillation. If TurboQuant passed internal review and was announced publicly, it most likely represents a real result.

The next mandatory step is publication on arXiv and independent reproduction of the results by third-party researchers. Until that moment, TurboQuant remains a promise. If the results are confirmed, the jokes about Pied Piper will become a thing of the past along with the neural network memory problem — and that would be a good outcome.

