Speculative Decoding

Small model 'drafts' tokens; big model verifies in one pass — 2–3× speedup, mathematically lossless.

Key Idea

In one line: A lightweight draft model guesses K tokens at once; the target model verifies all K in parallel. Accept the agreed prefix in one go; on disagreement, fall back to the target's choice. Mathematically equivalent output distribution, 2–3× faster.

What it is

Traditional decode: one big-model call per token.

Spec decode:
  Draft model proposes K tokens (e.g. 5) at once
  Target model: one forward pass (incremental KV) → true distributions at all K positions
  Accept each position where the draft token passes the rejection-sampling test; otherwise replace it with a sample from the target (details under "How it works" below)
  Average: 2–4 tokens per big-model call

Analogy

The draft model is the intern who writes the first draft of the minutes; the big model is the manager who reviews 5 paragraphs at once — most pass through; only the rejected ones get rewritten. Bulk review beats sentence-by-sentence dictation.

Key concepts

Draft model
A small, fast model (often a smaller member of the same family, e.g. Llama-3 8B drafting for 70B).
Verification
Forward the K candidate tokens through the target once, **getting the true distribution at all K positions**.
Acceptance rate
Fraction of draft tokens the target accepts; 50–80% is common.
Lossless
Sampling distribution strictly equivalent, guaranteed by the rejection-sampling math (see the sketch after this list).
Medusa / Eagle / Lookahead (draft-less variants)
Predict future tokens from the model's own hidden states (extra decoding heads or n-gram lookahead); no separate draft model.
Self-speculative
Use the early layers of the same model as the draft, and the full model as verifier.
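
To make the lossless guarantee concrete, here is a minimal sketch (in Python/NumPy, an illustrative choice) of the accept/resample rule from the speculative sampling literature; `p_target` and `p_draft` are the two models' next-token distributions at one position, and `x` is the draft's proposed token id.

```python
import numpy as np

def accept_or_resample(p_target, p_draft, x, rng):
    # Accept the draft's token x with probability min(1, p_target[x] / p_draft[x]).
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x, True
    # On rejection, resample from the residual max(0, p_target - p_draft),
    # renormalized. Marginalizing over x ~ p_draft, the accepted mass plus
    # the residual mass reconstructs p_target exactly -- hence "lossless".
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False
```

Note that the rule never needs the draft to be good, only known: a weak draft just gets rejected more often, costing speed, never correctness.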

How it works
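
A toy end-to-end round, reusing `accept_or_resample` from the sketch above. `target` and `draft` are stand-ins for real models: any callables mapping a token-id prefix to a next-token distribution (a hypothetical interface for illustration). A production engine would batch step 2 into a single forward pass over the incremental KV cache rather than looping.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(target, draft, prefix, K=5):
    # 1) Draft proposes K tokens autoregressively (cheap).
    ctx, proposal, q = list(prefix), [], []
    for _ in range(K):
        dist = draft(ctx)
        x = int(rng.choice(len(dist), p=dist))
        proposal.append(x); q.append(dist); ctx.append(x)
    # 2) Target distributions at all K+1 positions. A real engine gets
    #    these from one forward pass; this toy loops per position.
    p = [target(list(prefix) + proposal[:i]) for i in range(K + 1)]
    # 3) Verify left to right; the first rejection ends the round.
    out = []
    for i, x in enumerate(proposal):
        tok, accepted = accept_or_resample(p[i], q[i], x, rng)
        out.append(tok)
        if not accepted:
            return out
    # All K accepted: take a free "bonus" token from the last position.
    out.append(int(rng.choice(len(p[K]), p=p[K])))
    return out
```

Each round returns between 1 and K+1 tokens, all distributed exactly as if the target had decoded them one at a time; the speedup comes from replacing up to K+1 sequential big-model calls with one.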

Practical notes

  • High acceptance hinges on similarity. Draft and target should share a family / RLHF lineage.
  • Right-size the draft. Too large and it isn't faster; typically 1/10 to 1/30 of the target's size.
  • High temperature / divergent sampling → acceptance drops and the speedup shrinks (see the calculation after this list). Greedy / low-temperature decoding gets the best speedup.
  • vLLM / TGI / TensorRT-LLM all ship speculative decoding, with a separate draft model or Medusa-style heads as options.
  • In serving, multiple requests can share one draft model, saving further compute.
  • Don't expect quality gains. Spec decode is mathematically identical to the target's normal decode.
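
A back-of-the-envelope calculation of how acceptance rate drives the numbers above, under the simplifying assumption (from the original speculative decoding analysis) that each draft token is accepted independently with probability `alpha`:

```python
def tokens_per_target_call(alpha, K=5):
    # Expected tokens emitted per verification pass (accepted prefix plus
    # the resampled-or-bonus token): 1 + alpha + ... + alpha**K, a geometric sum.
    return (1 - alpha ** (K + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.8):
    print(f"alpha={alpha}: {tokens_per_target_call(alpha):.2f} tokens/call")
# alpha=0.5: 1.97 / alpha=0.7: 2.94 / alpha=0.8: 3.69 -- the source of the
# "2-4 tokens per big-model call" figure above.
```

As alpha falls toward zero the yield falls toward one token per call, which is why the similarity bullet above dominates everything else.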

Easy confusions

| | Speculative Decoding | Quantization / Distillation |
| --- | --- | --- |
| Weights | **Untouched**; pure inference speedup | **Modified** (precision / size) |
| Output | Mathematically equivalent, **same output** | Differs slightly |
