
DeepSeek, Google, and Meta: 10 Techniques for LLM KV-Cache Compression to Reduce Inference Memory

KV-cache has long been a bottleneck for running LLMs with long context, and researchers now offer more than a dozen practical ways to compress it. The survey…

AI-processed from MarkTechPost; edited by Hamidun News

KV-cache has evolved from an auxiliary detail into one of the main bottlenecks for production LLM inference. A new survey has compiled 10 techniques that help reduce memory consumption without full model retraining and in many cases significantly accelerate generation.

Where LLMs Hit the Wall

The longer the context and the more concurrent requests a model serves, the faster the KV-cache grows. The cache is the intermediate store of keys and values from the attention mechanism. The survey gives a telling example: a 30-billion-parameter model at batch size 128 with a 1024-token input can occupy up to 180 GB of memory in its KV-cache. Even for a 7B model, the weights take about 14 GB of GPU memory while the cache in the survey's example takes roughly 72 GB, so the generation mechanism itself begins to cost more than storing the parameters.
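The 180 GB figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes standard multi-head attention, FP16 values, and a 30B-class shape of 48 layers with hidden size 7168. These shape numbers are an assumption for illustration, since the article does not name the model, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, hidden_size, seq_len, batch_size, bytes_per_value=2):
    """Bytes needed to cache keys and values under standard multi-head attention."""
    # Factor 2: one key vector and one value vector per token, per layer.
    per_token = 2 * n_layers * hidden_size * bytes_per_value
    return per_token * seq_len * batch_size

# Assumed shape for a 30B-class model: 48 layers, hidden size 7168, FP16 values.
size = kv_cache_bytes(n_layers=48, hidden_size=7168, seq_len=1024, batch_size=128)
print(f"{size / 1e9:.1f} GB")  # 180.4 GB
```

Under these assumptions each token costs about 1.4 MB of cache, and the total scales linearly with both batch size and sequence length, which is why long contexts and large batches hit the limit so quickly.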

Because of this, KV-cache optimization has become a research direction of its own rather than a minor tuning task. Compressing the cache lets you increase batch size, serve more users on the same GPU, and avoid hitting memory limits with long prompts. An important advantage is that many of these methods work directly during inference: the base model doesn't need to be retrained, and the effect is immediately visible in throughput and serving cost.

How It Gets Compressed

Researchers now use several strategies that differ not only in quality but also in where they sit in the pipeline. Some methods discard the least useful tokens, others reduce the precision of the cache representation, and still others change the attention architecture itself. A separate class redistributes memory across layers on the premise that early layers need richer context, while deeper layers can work with fewer keys and values. In practice, the question is no longer a few percent of savings but whether long context can run on the same hardware at all.

  • Token pruning: H2O, StreamingLLM, and SnapKV keep only a portion of states. H2O retains "heavy" tokens with high attention contribution, StreamingLLM keeps the first tokens and a recent window, while SnapKV selects important positions by attention at the end of the prompt.
  • Per-layer budget allocation: PyramidKV and PyramidInfer operate on the assumption that deep layers need less rich context than early ones, so memory is allocated unevenly.
  • Quantization: KIVI, KVQuant, and TurboQuant reduce the precision of KV-cache representation while trying to preserve generation quality.
  • Architectural changes: MQA, GQA, and MLA reduce cache size at the level of the attention scheme itself, rather than on top of an existing model.
  • Low-rank compression: Palu, LoRC, and similar methods cut the hidden dimension of KV tensors rather than sequence length.
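The per-layer budget idea from the list above can be illustrated with a small sketch. It splits a fixed total of cached tokens across layers so that early layers get more than deep ones. The linear schedule, the `ratio` parameter, and the function name `pyramid_budgets` are assumptions made for illustration, not the exact schedule used by PyramidKV or PyramidInfer.

```python
def pyramid_budgets(n_layers, total_budget, ratio=4.0):
    """Split a total token budget across layers, largest share at layer 0.

    Budgets fall linearly so the first layer gets roughly `ratio` times
    as many cached tokens as the last one.
    """
    if n_layers == 1:
        return [total_budget]
    # Linearly decreasing weights from `ratio` (first layer) down to 1 (last layer).
    weights = [ratio - (ratio - 1.0) * i / (n_layers - 1) for i in range(n_layers)]
    scale = total_budget / sum(weights)
    budgets = [int(w * scale) for w in weights]
    # Hand any rounding remainder to the earliest layers.
    for i in range(total_budget - sum(budgets)):
        budgets[i % n_layers] += 1
    return budgets

budgets = pyramid_budgets(n_layers=8, total_budget=4096)
print(budgets)
```

The total memory stays the same as with a uniform split of 512 tokens per layer. Only the distribution changes.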

The simplest to deploy are methods without additional training. H2O finds tokens that collect the bulk of attention and discards weak positions. StreamingLLM keeps the first tokens as "attention anchors" plus a recent window, making it suitable for infinite conversations but risking loss of important mid-context information. SnapKV operates during the prefill stage and selects important positions separately per attention head, so it typically outperforms cruder schemes at the same cache budget.
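The two simplest eviction policies described above can be sketched in a few lines. This is a simplified illustration that works on token indices only: `streaming_keep` and `heavy_hitter_keep` are hypothetical names, and the scores are made-up numbers standing in for accumulated attention. Real implementations operate per layer and per head on the actual cache tensors.

```python
def streaming_keep(seq_len, n_sink=4, window=8):
    """StreamingLLM-style policy: the first `n_sink` tokens plus a recent window."""
    sinks = range(min(n_sink, seq_len))
    recent = range(max(seq_len - window, 0), seq_len)
    return sorted(set(sinks) | set(recent))

def heavy_hitter_keep(attn_scores, budget, recent=2):
    """H2O-style policy: recent tokens plus the highest-scoring older tokens.

    `attn_scores[i]` is the attention mass token i has accumulated so far.
    """
    n = len(attn_scores)
    if budget >= n:
        return list(range(n))
    recent = min(recent, budget)
    recent_ids = list(range(n - recent, n))
    older = range(n - recent)
    # Rank older tokens by accumulated attention, highest first.
    heavy = sorted(older, key=lambda i: attn_scores[i], reverse=True)[: budget - recent]
    return sorted(heavy + recent_ids)

scores = [5.0, 0.1, 0.2, 3.0, 0.1, 0.3, 0.2, 0.4]  # illustrative values only
print(streaming_keep(seq_len=20, n_sink=2, window=4))   # [0, 1, 16, 17, 18, 19]
print(heavy_hitter_keep(scores, budget=4, recent=2))    # [0, 3, 6, 7]
```

The first output shows the mid-context risk directly: tokens 2 through 15 are gone regardless of their content, while the score-based policy keeps token 3 because it attracted attention.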

Attention weight distributions often follow a power law, so removing low-contribution tokens doesn't always severely impact quality.

Who Delivers the Best Gains

In quantization, KIVI, KVQuant, and TurboQuant stand out. KIVI converts the KV-cache to a 2-bit representation without fine-tuning. According to the survey, it cuts peak memory usage by up to 2.6x, counting weights and cache together, and allows batches up to four times larger.
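To make "2-bit representation" concrete, here is a minimal sketch of asymmetric uniform quantization applied to a single group of values. It shows only the basic mechanism: how KIVI forms its groups and stores its metadata is not covered here, and the function names and sample values are assumptions for illustration.

```python
def quantize_group(values, bits=2):
    """Asymmetric uniform quantization of one group of floats."""
    levels = (1 << bits) - 1            # 3 steps, so codes 0..3 for 2-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [c * scale + lo for c in codes]

group = [-1.2, -0.4, 0.1, 0.9, 1.8, -0.8, 0.4, 1.1]  # illustrative values only
codes, scale, zero = quantize_group(group)
restored = dequantize_group(codes, scale, zero)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(codes, f"max error {max_err:.2f}")
```

Each value now needs 2 bits instead of 16, plus one scale and one offset per group. The rounding error is bounded by half a quantization step, so a single outlier that stretches the range makes every other value in the group less precise.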

KVQuant goes further: it uses calibration, mixed precision, and separate outlier handling to maintain quality even on extremely long contexts. The most aggressive result in the material is attributed to TurboQuant from Google Research. This method first evens out the value distribution with a random orthogonal rotation, then corrects the quantization error so that the inner product estimate remains unbiased.

On H100, TurboQuant shows at least 6x memory reduction and up to 8x faster attention at 3-bit precision. For infrastructure teams, this is no longer a local optimization but a bid to become a new serving standard.
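Why rotate before quantizing? The sketch below illustrates only the rotation idea: an orthogonal transform leaves inner products unchanged while spreading a single outlier across all coordinates, which makes the vector easier to quantize. It swaps in a randomized Hadamard transform as a simple stand-in for a random orthogonal rotation, and it does not implement TurboQuant's quantizer or its error-correction step. All names and values are assumptions for illustration.

```python
import math
import random

def hadamard_transform(vec):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [x / norm for x in out]

def random_rotation(vec, signs):
    """Flip signs with a shared random pattern, then apply the Hadamard transform."""
    return hadamard_transform([s * x for s, x in zip(signs, vec)])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(8)]

key = [100.0, 0.5, -0.3, 0.2, 0.1, -0.4, 0.6, -0.2]   # one large outlier channel
query = [0.2, -0.1, 0.4, 0.3, -0.5, 0.1, 0.0, 0.2]

rot_key = random_rotation(key, signs)
rot_query = random_rotation(query, signs)

print(f"inner product before: {dot(key, query):.4f}, after: {dot(rot_key, rot_query):.4f}")
print(f"largest magnitude before: {max(abs(x) for x in key):.1f}, "
      f"after: {max(abs(x) for x in rot_key):.1f}")
```

Because attention scores are inner products between queries and keys, applying the same rotation to both sides changes nothing mathematically, while the flatter value range wastes fewer quantization levels on a single outlier.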

A separate track is changing the model architecture itself. GQA has already become the de facto norm for modern open-weight LLMs: in the Llama 2 release it was used only in the 70B version, while in Llama 3 it covers both 8B and 70B. MLA from DeepSeek goes further: instead of full-sized keys and values, it stores a compressed latent representation per token. The survey notes that MLA let DeepSeek-V2 cut its KV-cache by 93.3% compared to the company's previous dense 67B model.
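The effect of sharing key and value heads is easy to quantify. The sketch below compares per-token cache size for MHA, GQA, and MQA under an assumed 8B-class shape of 32 layers, 32 query heads, head dimension 128, and FP16 values, with 8 KV heads for the GQA case. These numbers are assumptions for illustration, and MLA is left out because its savings depend on latent dimensions the article does not give.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Cache bytes one token adds: a key and a value per KV head, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Assumed 8B-class shape: 32 layers, 32 query heads, head dimension 128, FP16.
layers, query_heads, head_dim = 32, 32, 128

mha = kv_bytes_per_token(layers, n_kv_heads=query_heads, head_dim=head_dim)  # one KV head per query head
gqa = kv_bytes_per_token(layers, n_kv_heads=8, head_dim=head_dim)            # 8 shared KV heads (assumed)
mqa = kv_bytes_per_token(layers, n_kv_heads=1, head_dim=head_dim)            # a single shared KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size // 1024} KiB per token ({mha // size}x reduction)")
```

With these assumptions, GQA shrinks the cache 4x and MQA 32x relative to full multi-head attention. The saving is fixed at model design time, which is why it cannot be bolted onto an existing checkpoint the way eviction or quantization can.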

What It Means

The LLM market is constrained less and less by weight size and more and more by the cost of memory for long context. For teams building inference services, the takeaway is straightforward: gains now come not from one magic technique but from a deliberate choice among eviction, quantization, and architectural changes, matched to specific workloads, SLAs, and GPU budgets.

