In one line: During autoregressive generation, previously-computed K/V matrices are cached and reused so we don't recompute the whole prefix for every new token. But the cache grows linearly with context length — it's the dominant memory and bandwidth consumer in long-context inference.
## What it is
When generating the N-th token:
- Only compute Q/K/V for the new token itself;
- For the previous N-1 tokens, read K/V from the cache;
- Run attention with the new token's Q against all N keys (the N-1 cached ones plus its own), then weight the corresponding values.
Without cache → recompute everything for every token → O(N²) per step → O(N³) total.
With cache → O(N) per step → O(N²) total.
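A minimal sketch of one cached decode step, single attention head, batch 1, in plain PyTorch; the function name and shapes are illustrative, not any library's actual API:

```python
import torch
import torch.nn.functional as F

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One autoregressive step with a KV cache (single head, batch 1).

    x_new:   (1, d_model)   hidden state of the newly generated token
    k_cache: (t, head_dim)  keys of the t tokens processed so far
    v_cache: (t, head_dim)  values of the t tokens processed so far
    """
    # Only the new token's Q/K/V are computed.
    q = x_new @ W_q    # (1, head_dim)
    k = x_new @ W_k    # (1, head_dim)
    v = x_new @ W_v    # (1, head_dim)

    # Append the new K/V; the old entries are reused, never recomputed.
    k_cache = torch.cat([k_cache, k], dim=0)   # (t+1, head_dim)
    v_cache = torch.cat([v_cache, v], dim=0)   # (t+1, head_dim)

    # One query against the whole cached prefix: O(N) work per step.
    scores = (q @ k_cache.T) / k_cache.shape[-1] ** 0.5   # (1, t+1)
    out = F.softmax(scores, dim=-1) @ v_cache              # (1, head_dim)
    return out, k_cache, v_cache
```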
## Analogy
The KV cache is like a meeting transcript: when a new speaker chimes in, only that one statement needs to be written down; everything else is already on file, so nobody has to repeat what they said.
## Size estimation
KV cache bytes ≈
2 (K + V)
× num_layers
× num_kv_heads // KV heads, not query heads, under GQA / MQA
× head_dim
× seq_len
× dtype_bytes // bf16 = 2, fp8 = 1, int4 = 0.5
LLaMA-3 70B, 8K context, bf16:
2 × 80 × 8 × 128 × 8192 × 2 ≈ 2.7 GB
At batch 16 → 43 GB — the KV cache alone needs its own GPU.
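The same estimate as a tiny (hypothetical) helper, with the LLaMA-3 70B numbers from above plugged in:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   dtype_bytes=2, batch_size=1):
    """2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes, times the batch."""
    per_seq = 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes
    return per_seq * batch_size

# LLaMA-3 70B: 80 layers, 8 KV heads (GQA), head_dim 128, 8K context, bf16
print(kv_cache_bytes(80, 8, 128, 8192) / 1e9)                 # ~2.7 GB per sequence
print(kv_cache_bytes(80, 8, 128, 8192, batch_size=16) / 1e9)  # ~43 GB at batch 16
```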
## Practical notes
- Throughput and latency are different things. Slow prefill = long TTFT (time to first token); slow decode = low TPS (tokens per second). Different fixes.
- vLLM / TensorRT-LLM / SGLang already implement PagedAttention + continuous batching + prefix cache. Hand-rolled inference will always lose.
- For very long contexts (>32K) the KV cache dominates: consider KV quantisation / sliding-window attention / RoPE extrapolation + hybrid local-global attention.
- Long system prompts: turn on the prefix cache so N users share one KV copy (see the sketch after this list).
- Batch inference OOMs: lower `max_seq_len` or `max_num_seqs` instead of allowing unlimited batching.
- batch=1 doesn't save much KV per request: each sequence's cache scales with `seq_len`, so a single long context can still be huge.
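As a hedged, concrete example, these knobs map roughly onto vLLM's engine arguments; the argument names below reflect recent vLLM releases and may differ in your version:

```python
from vllm import LLM, SamplingParams

# Illustrative settings only; check your vLLM version's docs for exact argument names.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    enable_prefix_caching=True,   # shared system prompt -> one KV copy for N users
    max_model_len=8192,           # cap per-request context, bounds per-sequence KV
    max_num_seqs=16,              # cap concurrent sequences, bounds total KV
    gpu_memory_utilization=0.90,  # fraction of VRAM reserved for weights + KV blocks
)

outputs = llm.generate(
    ["Summarise the KV cache in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```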
## Easy confusions
| Phase | Characteristics |
|---|---|
| Prefill (processing the prompt) | Compute-bound; can run with large batches |
| Decode (generating token by token) | Bandwidth-bound; **most of the optimisation surface lives here** |
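A rough back-of-envelope for why decode is bandwidth-bound, using assumed illustrative numbers (roughly 140 GB of bf16 weights for a 70B model, the ~2.7 GB KV cache from above, and roughly 3.3 TB/s of HBM bandwidth on an H100-class GPU):

```python
# At batch 1, every decode step streams the weights plus the KV cache from HBM,
# so tokens/s is bounded by bandwidth / bytes moved per token.
weights_gb = 140.0            # 70B params x 2 bytes (bf16); assumption for illustration
kv_cache_gb = 2.7             # from the estimate above (8K context, one sequence)
hbm_bandwidth_gbs = 3300.0    # ~H100-class HBM; adjust for your hardware

tokens_per_s = hbm_bandwidth_gbs / (weights_gb + kv_cache_gb)
print(f"~{tokens_per_s:.0f} tokens/s upper bound at batch 1")   # a few tens of tokens/s
```

Batching amortises the weight reads across requests, which is why throughput keeps improving with batch size until the KV cache itself fills the GPU.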