In one line: DPO collapses "learn a reward → optimise with PPO" into one step, fitting "chosen is better than rejected" with a single classification loss. Simple to implement and stable to train — the most popular alignment method after RLHF.
What it is#
The dataset looks like:

```json
{
  "prompt": "Explain what the GFW is",
  "chosen": "The GFW is ...(neutral, thorough, objective)",
  "rejected": "The GFW is ...(emotional, biased)"
}
```

DPO trains on this directly: push the chosen response's probability (relative to the reference model) above the rejected one's, and the bigger that margin, the lower the loss.
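If you keep the pairs as a plain list of dicts, turning them into a trainable dataset is a one-liner. A minimal sketch, assuming the Hugging Face `datasets` library; the column names follow the example above and are also the defaults TRL's DPOTrainer looks for:

```python
# Minimal sketch: turn raw preference pairs into a Hugging Face Dataset.
# Column names (prompt / chosen / rejected) mirror the JSON example above.
from datasets import Dataset

pairs = [
    {
        "prompt": "Explain what the GFW is",
        "chosen": "The GFW is ...(neutral, thorough, objective)",
        "rejected": "The GFW is ...(emotional, biased)",
    },
    # ... more preference pairs
]

train_dataset = Dataset.from_list(pairs)
```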
Analogy#
RLHF = train a judge → have the contestant compete round after round → the judge scores each attempt → iterate: many moving parts, and training can easily go off the rails.
DPO = hand the contestant pairs of example answers labelled "this one is better than that one" and skip the judge entirely: far less hassle.
Key concepts#
How it works#
Mathematically, DPO is equivalent to maximum-likelihood estimation under a Bradley–Terry preference model whose implicit reward is r(x, y) = β log(π_θ(y|x) / π_ref(y|x)); β controls how tightly the policy stays tied to the reference.
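Written out, the objective from the DPO paper is a single logistic loss on the margin between the chosen and rejected log-ratios (σ is the sigmoid, y_w the chosen and y_l the rejected response):

```latex
% DPO objective: logistic loss on the implicit-reward margin
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```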
Practical notes#
- Quality > quantity. 1k high-quality preference pairs beat 10k noisy ones. Use GPT-4 / Claude as judge for automated comparisons.
- Start from SFT. DPO on a base model that hasn't been SFT'd performs poorly. SFT first, then DPO.
- Tune β. Common range 0.1–0.5. Train one epoch with small steps and watch the reward margin (chosen vs rejected) move; the sketch after this list shows how that margin is computed.
- Overfit warning. Tiny preference set + many epochs → model becomes "spiky" and emits weird answers.
- Production loop: SFT → DPO → safety eval → ship / iterate.
- Small models too. LoRA + DPO on a 7B with a single 24 GB GPU is feasible; TRL / axolotl ship it out of the box (a hedged TRL sketch follows below).
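For the β / reward-margin note above, here is a from-scratch sketch of the per-pair loss and the margin you would monitor. It assumes you have already summed the token log-probs of each full response under the policy and the frozen reference model; the variable names are illustrative, not from any library.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Inputs are (batch,) tensors of summed sequence log-probs."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards       # the "reward margin" to watch
    loss = -F.logsigmoid(margin).mean()              # binary classification on the margin
    return loss, margin.mean()
```

A healthy run shows the margin rising steadily; a margin that balloons while held-out quality drops is the overfitting failure mode flagged above.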
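And a minimal LoRA + DPO loop with TRL. Argument names have shifted across TRL releases (e.g. `processing_class` vs the older `tokenizer`), and the model name is a placeholder, so treat this as a starting point rather than a copy-paste recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

model_name = "your-7b-sft-checkpoint"   # placeholder: start from a model that has been SFT'd
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

args = DPOConfig(
    output_dir="dpo-out",
    beta=0.1,                        # common range 0.1-0.5
    num_train_epochs=1,              # small preference sets overfit fast
    per_device_train_batch_size=2,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                  # with a PEFT adapter, TRL uses the frozen base weights as the reference
    args=args,
    train_dataset=train_dataset,     # prompt / chosen / rejected pairs, as built earlier
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```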
Easy confusions#
- RLHF (PPO): powerful, but unstable to train.
- DPO: simple and stable; quality close to PPO.