
KV Cache (the inference performance bottleneck)

Why long contexts get slower and pricier — the KV cache keeps growing.

Tags: Inference · KV · Memory
Key Idea

In one line: During autoregressive generation, previously-computed K/V matrices are cached and reused so we don't recompute the whole prefix for every new token. But the cache grows linearly with context length — it's the dominant memory and bandwidth consumer in long-context inference.

What it is

When generating the N-th token:

  • Only compute Q/K/V for the new token itself;
  • For the previous N-1 tokens, read K/V from the cache;
  • Run attention with the new Q against all cached K.

Without cache → recompute everything for every token → O(N²) per step → O(N³) total.
With cache → O(N) per step → O(N²) total.
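
A minimal sketch of one cached decode step, in PyTorch (single head, no batching; the tensor names are illustrative, not from any particular library):

import torch

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One autoregressive step with a KV cache.

    x_new:   (1, d_model)  hidden state of the newest token
    k_cache: (t, d_head)   keys of the t tokens seen so far
    v_cache: (t, d_head)   values of the t tokens seen so far
    """
    # Q/K/V are computed only for the new token...
    q = x_new @ W_q                              # (1, d_head)
    k = x_new @ W_k                              # (1, d_head)
    v = x_new @ W_v                              # (1, d_head)

    # ...and the new K/V are appended to the cache.
    k_cache = torch.cat([k_cache, k], dim=0)     # (t+1, d_head)
    v_cache = torch.cat([v_cache, v], dim=0)

    # The single new Q attends over all cached K/V:
    # O(t) per step instead of recomputing the whole prefix.
    scores = (q @ k_cache.T) / k_cache.shape[-1] ** 0.5   # (1, t+1)
    attn = torch.softmax(scores, dim=-1)
    out = attn @ v_cache                                  # (1, d_head)
    return out, k_cache, v_cache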

Analogy

The KV cache is the meeting transcript: when a new speaker chimes in, only that one statement needs writing down; for everything earlier you just consult the transcript, rather than making everyone repeat themselves.

Size estimation

KV cache bytes ≈
   2 (K + V)
 × num_layers
 × num_kv_heads        // post-GQA / MQA
 × head_dim
 × seq_len
 × dtype_bytes         // bf16 = 2, fp8 = 1, int4 = 0.5

LLaMA-3 70B, 8K context, bf16:

2 × 80 × 8 × 128 × 8192 × 2  ≈  2.7 GB

At batch 16 → 43 GB — the KV cache alone needs its own GPU.
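
The same arithmetic as a tiny calculator (the shapes below are LLaMA-3 70B's published config: 80 layers, 8 KV heads after GQA, head_dim 128; swap in your own model's numbers):

def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, dtype_bytes, batch=1):
    # The leading 2 accounts for storing both K and V.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes * batch

per_seq = kv_cache_bytes(80, 8, 128, 8192, 2)        # bf16 = 2 bytes
print(per_seq / 1e9)                                 # ≈ 2.7 GB
print(kv_cache_bytes(80, 8, 128, 8192, 2, 16) / 1e9) # ≈ 43 GB at batch 16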

Key concepts

Prefill / Decode
Prefill computes all prompt tokens at once (compute-bound). Decode emits one token at a time (bandwidth-bound).
PagedAttention
vLLM slices the cache into 16-token pages to avoid fragmentation from uneven sequence lengths.
Continuous batching
New requests join the running batch immediately instead of waiting for a batch boundary, which can roughly double throughput.
Prefix cache
Shared system prompts → reuse their KV → skip that part of prefill.
KV quantisation
int8 / int4 KV further shrinks long-context VRAM (at mild quality loss).
Offloading
Long-context KV lives in CPU RAM / NVMe and is swapped back to the GPU on demand (slower, but it fits).

Practical notes

  • Throughput and latency are different things. Slow prefill = long TTFT (time to first token); slow decode = low TPS (tokens per second). They need different fixes.
  • vLLM / TensorRT-LLM / SGLang already implement PagedAttention + continuous batching + prefix cache. Hand-rolled inference loops will almost always lose to them.
  • For very long contexts (>32K) the KV cache dominates: consider KV quantisation / sliding-window attention / RoPE extrapolation + hybrid local-global attention.
  • Long system prompts: turn on the prefix cache so N users share one cached prefix (see the sketch after this list).
  • Batch inference OOMs: lower max_seq_len or max_num_seqs instead of batching without limits.
  • Dropping to batch=1 doesn't shrink a long context's KV: each sequence's cache scales with its seq_len, so a single very long request can still blow the memory budget.
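
As a concrete illustration, these knobs map onto vLLM engine arguments roughly as below (parameter names as in recent vLLM releases; verify against your version's docs):

from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    max_model_len=8192,          # caps per-sequence KV growth
    max_num_seqs=16,             # caps batch-wide KV footprint (OOM guard)
    enable_prefix_caching=True,  # shared system prompts reuse one KV prefix
    kv_cache_dtype="fp8",        # quantised KV: roughly 2x more context per GB
)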

Easy confusions

Prefill
Processes the whole prompt at once.
Compute-bound; runs well with large batches.
Decode
Emits one token at a time.
Bandwidth-bound: most of the optimisation surface lives here.
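
Why decode is bandwidth-bound, in one back-of-the-envelope line (the 3.35 TB/s figure is H100-SXM HBM bandwidth, used here as an assumption; a single-GPU idealisation, since real 70B serving shards weights across GPUs):

# At batch 1, every decode step streams all weights from HBM once.
weight_bytes = 70e9 * 2               # 70B params in bf16 ≈ 140 GB per token
hbm_bandwidth = 3.35e12               # H100 SXM ≈ 3.35 TB/s
print(hbm_bandwidth / weight_bytes)   # ≈ 24 tokens/s ceiling, before KV reads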
