KV-Cache
KV-Cache (Key-Value Cache) is a memory buffer that stores the key and value tensors produced by a transformer's attention layers for already-processed tokens, eliminating redundant recomputation during autoregressive generation. It is the primary mechanism that makes per-token LLM generation practical at production latency.
In a transformer's multi-head attention mechanism, every input token is projected, in each layer and attention head, into three vectors: a query (Q), a key (K), and a value (V). During autoregressive generation, where the model emits one token at a time, each new token's query must attend over the keys and values of all previous tokens. Without caching, those K and V tensors would be recomputed from scratch at every generation step, so the total projection work over a generation of n tokens would grow quadratically in n instead of linearly. The KV-cache stores computed K and V tensors in GPU or CPU memory so that each token's K/V pair is computed exactly once and reused in all subsequent generation steps.
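The reuse described above can be checked on a toy example. The following is a minimal sketch, assuming a single attention head in a single layer, with hypothetical 2-dimensional projection weights (W_Q, W_K, W_V) and made-up token embeddings; it compares a decode step that recomputes every K and V against one that appends to a cache, and is not any library's implementation.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                                  # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

# Hypothetical projection weights, chosen only for illustration.
W_Q = [[0.5, -0.1], [0.2, 0.3]]
W_K = [[0.1, 0.4], [-0.3, 0.2]]
W_V = [[0.7, 0.0], [0.1, 0.6]]

def step_without_cache(token_embeddings):
    """Recompute K and V for every token seen so far, then attend."""
    keys = [matvec(W_K, x) for x in token_embeddings]
    values = [matvec(W_V, x) for x in token_embeddings]
    q = matvec(W_Q, token_embeddings[-1])
    return attend(q, keys, values)

class KVCache:
    """Holds the K and V vectors of all tokens processed so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x):
        """Project only the newest token, append its K/V, attend over the cache."""
        self.keys.append(matvec(W_K, x))
        self.values.append(matvec(W_V, x))
        return attend(matvec(W_Q, x), self.keys, self.values)

embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]
cache = KVCache()
for t in range(1, len(embeddings) + 1):
    cached = cache.step(embeddings[t - 1])
    recomputed = step_without_cache(embeddings[:t])
    # Both paths produce the same attention output at every step.
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached, recomputed))
```

The cached path performs one K and one V projection per step, while the uncached path performs t of each at step t. In a multi-layer model the same reuse holds because causal masking keeps the hidden states of earlier tokens unchanged when later tokens are added.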
At the start of a generation request, the model executes a "prefill" phase: the entire input prompt is processed in a single parallel forward pass and the resulting K and V tensors for every prompt token are written into the cache. The subsequent "decode" phase processes one new token per step; each step computes the new token's K/V entry, appends it to the cache, and reads the full cache to compute attention. Cache size in bytes is 2 × num_layers × num_key_value_heads × head_dim × sequence_length × batch_size × bytes_per_element, where the factor of 2 covers keys and values. This can reach tens of gigabytes for long contexts on large models, making GPU memory capacity a primary constraint in production serving.
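The size relationship can be turned into a quick estimate once the factor of 2 for storing both keys and values and the element width are made explicit. The sketch below assumes a hypothetical model shape (32 layers, 8 key-value heads, head dimension 128) and a 16-bit cache; the function name and parameters are illustrative, not taken from any serving framework.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Bytes needed to hold K and V for every layer, KV head, and token."""
    # The leading 2 accounts for storing both a key and a value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

GIB = 1024 ** 3

# Hypothetical model shape with a 16-bit (2-byte) cache.
per_sequence = kv_cache_bytes(32, 8, 128, seq_len=32_768)
print(per_sequence / GIB)   # 4.0

batch_of_8 = kv_cache_bytes(32, 8, 128, seq_len=32_768, batch_size=8)
print(batch_of_8 / GIB)     # 32.0
```

Under these assumptions a single 32K-token sequence needs 4 GiB of cache, and eight such sequences served concurrently need 32 GiB, before counting model weights.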
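Without a KV-cache, every decode step would reprocess the entire sequence of n tokens, costing O(n) work in the projection and feed-forward layers and O(n²) in attention. With the cache, the projection and feed-forward work per step is O(1) in sequence length, and attention costs O(n) because the single new query is compared against n cached keys. Decode steps are usually limited by the memory bandwidth needed to read the weights and the cache, so per-token latency grows only gradually with the number of tokens already generated rather than staying strictly constant. The KV-cache also enables advanced serving optimizations. vLLM's PagedAttention (2023) applies the idea of virtual-memory paging to KV-cache storage, allocating the cache in fixed-size blocks to largely eliminate fragmentation and support larger concurrent batch sizes. Speculative decoding, in which a small draft model proposes several tokens that a larger target model verifies in one forward pass, relies on each model's KV-cache to keep verification cheap and discards the cache entries of rejected tokens, increasing effective throughput.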
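The paging idea can be illustrated with a toy block allocator. This is a minimal sketch of block-table bookkeeping only, assuming a hypothetical PagedKVAllocator class with made-up block counts; it is not vLLM's implementation and stores no tensors.

```python
class PagedKVAllocator:
    """Toy allocator that hands out fixed-size cache blocks on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.lengths = {}        # sequence id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:   # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")      # sequence "a" takes blocks 0 and 1
alloc.append_token("b")          # sequence "b" takes block 2
print(alloc.block_tables)        # {'a': [0, 1], 'b': [2]}
alloc.free("a")
print(alloc.free_blocks)         # [3, 0, 1]
```

Because blocks are assigned only when a sequence actually grows into them, each sequence wastes at most block_size - 1 slots, and blocks released by a finished sequence can be reused by any other sequence regardless of where they sit in memory.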
As of 2026, KV-cache management is a central focus of LLM inference engineering. Quantizing KV-cache tensors to 8-bit formats such as FP8 or INT8, and in some stacks to 4 bits (independently of weight quantization), is widely supported by production serving stacks including vLLM, TensorRT-LLM, and SGLang. Relative to a 16-bit cache, 8-bit storage roughly halves KV-cache memory and 4-bit storage roughly quarters it. FlashAttention 2 and 3 reduce the memory traffic of attention by computing it in tiles without materializing the full attention score matrix. Research into cache offloading (spilling tensors to CPU or NVMe when GPU memory is exhausted) and cache compression (Infini-Attention, H2O token eviction) is active, targeting deployments where context length exceeds available GPU memory.
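The memory saving from cache quantization can be seen in a small example. The sketch below assumes symmetric absmax quantization with one scale per vector, which is only one of several possible schemes and is not claimed to match what any particular serving stack does; the key vector is made up.

```python
def quantize_int8(vector):
    """Symmetric absmax quantization of one vector to int8 codes plus a scale."""
    absmax = max(abs(x) for x in vector)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate floating-point values from int8 codes."""
    return [c * scale for c in codes]

key = [0.82, -1.27, 0.05, 0.40]   # illustrative key vector
codes, scale = quantize_int8(key)
print(codes)                       # [82, -127, 5, 40]

restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key, restored))
# Rounding to the nearest code bounds the error by half a quantization step.
assert max_error <= scale / 2 + 1e-12
```

Each element is stored in one byte instead of two, so the cache shrinks to roughly half its 16-bit size; the per-vector scale adds a small overhead, which is why the saving is approximate rather than exact.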