Key Idea
In one line: a lightweight draft model guesses K tokens at once; the target model verifies all K in a single parallel pass. The agreed prefix is accepted in one go; at the first disagreement, fall back to the target's choice. The output distribution is mathematically equivalent, at 2–3× the speed.
What it is
- Traditional decoding: one big-model call per token.
- Speculative decoding:
  - The draft model proposes K (e.g. 5) tokens at once.
  - The target model runs one forward pass (incremental KV cache) → true distributions at all 5 positions.
  - Each position is accepted if the draft token falls in the acceptance set; otherwise it is replaced with the target's own sample.
  - On average: 2-4 tokens per big-model call.
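The steps above can be sketched for the greedy case, where a proposed token is accepted iff it equals the target's argmax at that position. The `draft_next` / `target_next` callables are toy stand-ins, not a real model API; a real implementation verifies all K positions in one batched forward pass rather than one at a time.

```python
from typing import Callable, List

def greedy_spec_decode_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],   # small model: argmax next token
    target_next: Callable[[List[int]], int],  # big model: argmax next token
    k: int = 5,
) -> List[int]:
    """One speculative step: draft proposes k tokens, target verifies them.

    Returns the tokens actually emitted: the accepted prefix plus one
    target token, so between 1 and k + 1 tokens per big-model "pass".
    """
    # 1) Draft proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2) Target verifies each position against its own greedy choice.
    emitted = []
    ctx = list(prefix)
    for t in proposed:
        correct = target_next(ctx)
        if t == correct:          # accept: draft agreed with target
            emitted.append(t)
            ctx.append(t)
        else:                     # reject: emit the target's token, stop
            emitted.append(correct)
            break
    else:
        # All k accepted: the same verification pass yields one bonus token.
        emitted.append(target_next(ctx))
    return emitted
```

With toy "models" that agree at every position, one step emits k + 1 tokens; with a draft that diverges at some position, the step stops at the first mismatch and emits the target's correction there.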
Analogy
The draft model is the intern who writes the first draft of the minutes; the big model is the manager who reviews 5 paragraphs at once — most pass through; only the rejected ones get rewritten. Bulk review beats sentence-by-sentence dictation.
Key concepts
Draft model
A small fast model (often a smaller member of the same family, e.g. Llama-3 8B drafting for 70B).
Verification step
Forward the K candidate tokens through the target once to **get the true distribution at all K positions**.
Acceptance rate
Fraction of positions where the draft's token is accepted by the target; 50-80% is common.
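The acceptance rate translates directly into speedup. Under the simplifying assumption (used in the original speculative-sampling analysis) that each draft token is accepted independently with probability α, the expected number of tokens emitted per target pass with draft length K is a geometric series; the helper name below is mine, not from a library:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """E[tokens per target forward pass] = (1 - alpha**(k+1)) / (1 - alpha).

    alpha: per-token acceptance probability (i.i.d. assumption)
    k:     number of drafted tokens per pass
    """
    if alpha >= 1.0:
        return float(k + 1)   # everything accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

For example, alpha = 0.7 with K = 5 gives about 2.94 tokens per big-model call, consistent with the "2-4 tokens" range quoted above.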
Lossless
The sampling distribution is strictly equivalent to the target's, guaranteed by modified rejection sampling.
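The losslessness guarantee rests on a modified rejection-sampling rule. In the notation assumed here (not fixed by the notes above), $q$ is the draft distribution and $p$ the target distribution at a given position:

```latex
Accept the draft token $x \sim q$ with probability
\[
  \min\!\left(1, \frac{p(x)}{q(x)}\right);
\]
on rejection, resample from the residual distribution
\[
  p'(x) \;=\; \frac{\max\bigl(0,\, p(x) - q(x)\bigr)}{\sum_{x'} \max\bigl(0,\, p(x') - q(x')\bigr)}.
\]
```

One can check that the accepted-or-resampled token is then distributed exactly according to $p$, which is why the scheme is lossless.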
Medusa / Eagle / Lookahead (draft-less variants)
Predict future tokens from the model itself, via extra decoding heads (Medusa, EAGLE) or n-gram lookahead; no separate draft model.
Self-speculative
Use early layers of the same model as the draft; full model as verifier.
How it works
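A self-contained sketch of the lossless accept/reject step over explicit token distributions. This is a toy over a small vocabulary, not a real model API; it implements the standard rule: accept $x \sim q$ with probability $\min(1, p(x)/q(x))$, else resample from the normalized positive part of $p - q$.

```python
import random

def verify_one(p, q, x, rng):
    """Lossless accept/reject for one position.

    p: target distribution (list of probabilities over the vocab)
    q: draft distribution that the candidate token x was sampled from
    Returns the emitted token, which is distributed exactly as p.
    """
    # Accept the draft token with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Reject: sample from the residual (p - q)+ renormalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    if z == 0.0:
        return x  # p == q pointwise, so rejection cannot actually occur
    return rng.choices(range(len(p)), weights=residual)[0]
```

A quick sanity check: if you repeatedly draw `x` from `q` and pass it through `verify_one`, the empirical distribution of the output converges to `p`, regardless of how different `q` is.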
Practical notes
- High acceptance hinges on similarity. Draft and target should share family / RLHF lineage.
- Right-size the draft. Too large → it isn't faster; typically 1/10 to 1/30 the size of the target.
- High temperature / divergent sampling → acceptance drops, speedup shrinks. Greedy / low-temperature gets the best speedup.
- vLLM / TGI / TensorRT-LLM all ship speculative decoding, with both draft-model and Medusa-style variants.
- In serving: multiple requests can share one draft model, saving further compute.
- Don't expect quality gains. Spec decode is mathematically identical to the target's normal decode.
Easy confusions
| | Speculative decoding | Quantization / Distillation |
| --- | --- | --- |
| Weights | **Doesn't touch weights**; pure inference speedup | **Modifies weights** (precision / size) |
| Output | Mathematically equivalent: **same output** | Differs slightly from the original model |