Prompt Caching
Prompt caching is an API and serving technique that stores the computed KV-cache state for a shared prompt prefix — such as a system prompt or a large document — and reuses it across multiple separate requests, eliminating redundant computation and reducing both time-to-first-token and API cost.
Prompt caching extends the KV-cache concept from within a single generation call to across multiple API requests from different users or sessions. When the same text prefix — typically a lengthy system prompt, a retrieved knowledge document, or a reference codebase — appears at the start of many requests, the transformer does not need to re-process that prefix for each new request. Instead, the K/V tensors computed for the cached prefix are stored server-side and reattached to incoming requests as if the prefill step for the prefix had already completed, allowing the model to proceed directly to generating the response.
Implementation differs by provider. Anthropic introduced explicit prompt caching in its Claude API in 2024: developers annotate cache breakpoints in request JSON using a cache_control field; the first request that encounters a breakpoint computes and stores the prefix tensors, and subsequent requests arriving within the cache TTL (5 minutes by default, longer for higher-tier accounts) reuse them. Anthropic charges approximately 10% of the normal input-token price for cache-hit tokens, with a small one-time write fee. OpenAI's API introduced automatic prefix caching in late 2024, transparently reusing the longest matching prefix in the server-side cache without requiring markup. Google's Gemini API introduced "context caching" in 2024 with an explicit TTL parameter and per-second storage costs for very large cached contexts.
For applications where a large, stable context is shared across many requests — customer service bots with detailed product knowledge bases, coding assistants with a full repository loaded, RAG pipelines with large retrieved passages — prompt caching reduces input processing costs by 60–90% and cuts time-to-first-token latency substantially. A 20,000-token system prompt reused across thousands of daily requests would otherwise consume substantial compute on every call; with caching, it is processed once per cache lifetime regardless of request volume.
As of 2026, prompt caching is a standard production feature across all major cloud AI APIs. At the infrastructure layer, local inference frameworks implement the same concept without application-level markup: vLLM's prefix caching shares KV tensors across requests with identical prefixes, and SGLang's RadixAttention (2024) organizes cached prefixes as a radix tree to maximize reuse across partially overlapping prompts, achieving substantial throughput gains in agentic and RAG workloads where prompt structure is highly regular across requests.