In one line: Distillation has a small model (the student) imitate a large model (the teacher) — not just the hard label but the full probability distribution. At the same parameter count, distilled models are far stronger than ones trained from scratch.
What it is#
Traditional: hard label "cat" → cross-entropy loss with one right answer
KD: teacher output [0.7 cat, 0.2 leopard, 0.05 tiger...] → student imitates the distribution
Learning the full distribution = learning inter-class relationships (cat is closer to leopard than to aeroplane): far more information than a hard label.
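A tiny numeric sketch (plain Python; the class names and teacher logits are invented for illustration) of what that extra signal looks like:

```python
import math

classes = ["cat", "leopard", "tiger", "aeroplane"]
teacher_logits = [5.0, 3.5, 2.5, -2.0]  # hypothetical teacher scores

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

hard_label = [1.0, 0.0, 0.0, 0.0]         # one-hot: "cat", nothing else
soft_t1 = softmax(teacher_logits, T=1.0)  # teacher's raw view
soft_t4 = softmax(teacher_logits, T=4.0)  # softened: relations stand out

for name, p1, p4 in zip(classes, soft_t1, soft_t4):
    print(f"{name:10s}  T=1: {p1:.3f}   T=4: {p4:.3f}")
# leopard/tiger get non-trivial mass while aeroplane stays near zero;
# that ordering is exactly what a one-hot label cannot convey.
```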
Analogy#
Traditional training = rote memorisation: "this question's answer is B".
Distillation = the teacher walks through the reasoning: B is most likely, A is plausible but missing detail, C/D are wrong — the student learns the reasoning path.
Key concepts#
Three mainstream approaches#
- Classic KD (Hinton 2015)
- T²·KL(teacher || student) at temperature T, plus α·CE(student, label); a PyTorch sketch follows this list.
- Sequence-level KD
- Teacher generates large volumes of data; student does SFT on it. Most common.
- On-policy distillation (e.g., GKD / MiniLLM)
- Student emits tokens, teacher provides feedback. Better quality but expensive.
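A minimal PyTorch sketch of the classic KD loss above; the function name and the defaults `T=2.0` and `alpha=0.5` are illustrative choices, not canonical values.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft part: F.kl_div(input=log-probs, target=probs) computes
    # KL(target || input), i.e. KL(teacher || student), as in classic KD.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T² keeps gradient magnitudes comparable across temperatures
    # Hard part: ordinary cross-entropy against the ground-truth label.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * hard + (1 - alpha) * soft
```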
How it works#
Only the student's parameters are updated; the teacher is frozen.
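A sketch of one training step under that setup, assuming `teacher` and `student` are `torch.nn.Module` classifiers, `dataloader` yields `(inputs, labels)` batches, and `kd_loss` is the function sketched earlier:

```python
import torch

teacher.eval()                      # freeze: no dropout, no BN updates
for p in teacher.parameters():
    p.requires_grad_(False)         # belt and braces: no teacher grads

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

for inputs, labels in dataloader:
    with torch.no_grad():           # teacher forward is inference-only
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)
    loss = kd_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()                 # gradients flow into the student only
    optimizer.step()
```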
Practical notes#
- Data beats loss tricks. 99% of practical distillation = teacher generates data → student SFT. "Soft-label KL" matters less than producing 1M high-quality samples.
- Pick a teacher per domain. General Q&A: GPT-4 / Claude / DeepSeek-V3; vertical domains: domain-expert models.
- Distil chains of thought. Have the teacher emit reasoning traces, student learns the whole CoT. Phi / Orca series follow this approach.
- Rejection sampling. Teacher generates many candidates → auto-validate / score → keep only the best to train the student; a pipeline sketch follows this list.
- Student ceiling. No matter how much you distil, you can't exceed the teacher. A 1.5B will not "be" 70B, but it can approach it on specific sub-tasks.
- License caveat. Many commercial APIs forbid using their outputs to train your own model. Open-weight models (Qwen / DeepSeek / Llama) are looser but still check the terms.
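A hedged sketch of the teacher-generates → filter → student-SFT pipeline; `teacher_generate` and `auto_validate` are hypothetical stand-ins for your teacher API call and your domain checker (unit tests, a verifier, a reward model):

```python
import json

def teacher_generate(prompt: str, n: int = 8) -> list[str]:
    """Hypothetical: ask the teacher for n candidate completions,
    ideally including the full chain of thought."""
    raise NotImplementedError

def auto_validate(prompt: str, answer: str) -> float:
    """Hypothetical: return a quality score for one candidate."""
    raise NotImplementedError

def build_sft_dataset(prompts: list[str], path: str, threshold: float = 0.9) -> None:
    """Rejection sampling: score every candidate, keep only the best one
    per prompt, and only if it clears the quality bar."""
    with open(path, "w") as f:
        for prompt in prompts:
            scored = [(auto_validate(prompt, c), c) for c in teacher_generate(prompt)]
            score, best = max(scored)
            if score >= threshold:
                f.write(json.dumps({"prompt": prompt, "completion": best}) + "\n")
```

The resulting JSONL is ordinary SFT data for the student; no KL term or teacher logits are needed at training time.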
Easy confusions#
Distillation requires training: the student's weights are updated on the teacher's outputs.
Quantization (the method it is most often confused with) is mostly training-free: it compresses an existing model's weights in place.