
Knowledge Distillation

Teach a small model to mimic a big model's output distribution — squeeze 70B-class capability into 7B.

Distillation · Compression
Key Idea

In one line: Distillation has a small model (the student) imitate a large model (the teacher) — not just the hard label but the full probability distribution. At the same parameter count, distilled models are far stronger than ones trained from scratch.

What it is#

Traditional:  label "cat" → student loss (one right answer)
KD:           teacher output [0.7 cat, 0.2 leopard, 0.05 tiger...] → student imitates the distribution

Learning the full distribution = learning inter-class relationships (cat and leopard similar > cat and aeroplane) — far more information than a hard label.
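Concretely, the soft targets are just a temperature-scaled softmax over the teacher's logits. A minimal NumPy sketch, with hypothetical logits over the classes [cat, leopard, tiger, aeroplane]:

```python
import numpy as np

def soft_targets(logits, T=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for [cat, leopard, tiger, aeroplane]
teacher_logits = [5.0, 3.8, 2.5, -2.0]

hard = soft_targets(teacher_logits, T=0.01)  # near one-hot: just "cat"
soft = soft_targets(teacher_logits, T=4.0)   # exposes cat > leopard > tiger >> aeroplane
```

At very low temperature the distribution collapses to the hard label; at higher temperature the inter-class structure (cat and leopard are similar, aeroplane is not) becomes visible to the student.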

Analogy#


Traditional training = rote memorisation: "this question's answer is B".
Distillation = the teacher walks through the reasoning: B is most likely, A is plausible but missing detail, C/D are wrong — the student learns the reasoning path.

Key concepts#

Teacher / Student
The teacher is typically a larger, stronger, already-trained model; the student is the one we train.
Soft targets
The teacher's full softmax probabilities (with temperature). Much more informative than hard labels.
Temperature
softmax(z / T); T > 1 smooths the distribution, exposing inter-class differences.
On-policy / Off-policy
Train the student on its own generations (on-policy, usually better) vs. on a fixed teacher-generated dataset (off-policy).
Sequence distillation
Match the student's whole-sequence distribution to the teacher's, rather than using a token-by-token loss.
Self-distillation
The teacher is an earlier version of the student; iterative refinement.

Three mainstream approaches#

Classic KD (Hinton 2015)
T² · KL(teacher || student) + α · CE(student, hard label).
Sequence-level KD
Teacher generates large volumes of data; student does SFT on it. Most common.
On-policy distillation (DistillD)
Student emits tokens, teacher provides feedback. Better quality but expensive.
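The classic KD objective can be sketched directly. A toy NumPy version for a single example; `alpha` and `T` here are hypothetical hyperparameter choices, not canonical values:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Classic KD: T^2 * KL(teacher || student) + alpha * CE(student, hard label)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))       # KL(teacher || student)
    ce = -np.log(softmax(student_logits)[hard_label])    # hard-label cross-entropy
    return T**2 * kl + alpha * ce
```

The T² factor compensates for the 1/T² scaling that temperature introduces into the soft-target gradients, keeping the two loss terms on comparable scales.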

How it works#

Only the student's parameters are updated; the teacher is frozen.
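A minimal sketch of that asymmetry: the teacher's distribution is computed once and treated as a constant, and gradient steps move only the student's logits. This toy loop uses the standard identity that the gradient of KL(teacher ‖ student) with respect to the student logits is p_student − p_teacher; the logits and learning rate are made up for illustration:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

# Frozen teacher: computed once, no gradient ever flows into it
teacher_probs = softmax([2.0, 1.0, 0.5, -1.0])

student_logits = np.zeros(4)   # untrained student: uniform distribution
lr = 1.0

for _ in range(300):
    p_s = softmax(student_logits)
    grad = p_s - teacher_probs     # d KL(teacher || student) / d student_logits
    student_logits -= lr * grad    # update the student's parameters only
```

After the loop the student's distribution closely matches the frozen teacher's; in a real setup the gradient would flow through the student network's weights rather than raw logits.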

Practical notes#

  • Data beats loss tricks. 99% of practical distillation = teacher generates data → student SFT. "Soft-label KL" matters less than producing 1M high-quality samples.
  • Pick a teacher per domain. General Q&A: GPT-4 / Claude / DeepSeek-V3; vertical domains: domain-expert models.
  • Distil chains of thought. Have the teacher emit reasoning traces, student learns the whole CoT. Phi / Orca series follow this approach.
  • Rejection sampling. Teacher generates many candidates → auto-validate / score → keep only the best to train the student.
  • Student ceiling. No matter how much you distil, you can't exceed the teacher. A 1.5B will not "be" 70B, but it can approach it on specific sub-tasks.
  • License caveat. Many commercial APIs forbid using their outputs to train your own model. Open-weight models (Qwen / DeepSeek / Llama) are looser but still check the terms.
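The rejection-sampling recipe above amounts to a small filter over teacher outputs. A sketch in plain Python; the `generate`/`score` callables and the threshold are hypothetical stand-ins for a real teacher API and validator:

```python
def rejection_sample(prompt, generate, score, n=8, threshold=0.9):
    """Draw n teacher candidates, auto-score them, keep only those that pass."""
    candidates = [generate(prompt) for _ in range(n)]
    return [c for c in candidates if score(prompt, c) >= threshold]

# Toy stand-ins: a "teacher" that sometimes answers wrong, and a scorer
# that checks the arithmetic answer exactly.
answers = iter(["3", "5", "4", "4", "3", "5", "4", "3"])
def toy_generate(prompt):
    return next(answers)
def toy_score(prompt, answer):
    return 1.0 if answer == "4" else 0.0

kept = rejection_sample("2 + 2 = ?", toy_generate, toy_score)
# Only the correct candidates survive into the student's SFT dataset.
```

The key design choice is that the validator is cheap and automatic (unit tests, exact-match answers, a reward model), so you can afford to over-generate and discard most candidates.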

Easy confusions#

Distillation
**Architecture change** (a smaller model).
Requires training.
Quantization
**Precision change** (lower-bit numerics, cheaper inference).
Mostly training-free.

Further reading#