In one line: RLHF = Reinforcement Learning from Human Feedback. Have humans compare two answers, train a "preference scorer" (the Reward Model), then use RL to push the LLM towards the answers humans prefer. This was the key step that made ChatGPT "suddenly usable".
What it is#
Three stages:
- SFT — Fine-tune the pretrained base model on human demonstration data so that it "answers properly."
- Train a Reward Model (RM) — Sample multiple answers per question, have humans rank them, then train a small model that can score any answer.
- PPO / DPO fine-tuning — Optimise the model to maximise the reward (PPO does this with RL; DPO works on the preference pairs directly), while not drifting too far from the SFT model (KL constraint).
After all this the model is "polite, on-topic, knows its limits" — the industry approach to "alignment".
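To make stages 2 and 3 concrete, here is a minimal PyTorch sketch (function names and the `kl_coef` value are illustrative, not taken from any particular library): the pairwise Bradley-Terry loss that trains the reward model, and the KL-shaped reward that PPO then maximises so the policy stays close to the SFT model.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    """Stage 2: pairwise (Bradley-Terry) loss. Pushes the scalar score of the
    human-preferred answer above the score of the rejected answer."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def kl_shaped_reward(rm_reward, policy_logp, ref_logp, kl_coef=0.1):
    """Stage 3: the quantity PPO maximises. The RM reward minus a KL penalty,
    so the policy cannot drift arbitrarily far from the frozen SFT model."""
    approx_kl = policy_logp - ref_logp   # per-token log-ratio vs. the SFT reference
    return rm_reward - kl_coef * approx_kl

# Toy usage: RM scores for three (chosen, rejected) answer pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
print(reward_model_loss(chosen, rejected))   # smaller when chosen consistently beats rejected
```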
Analogy#
SFT = teach a child how to answer.
RLHF = the teacher scores two essays: "this one is better than that one" — the child gradually learns the teacher's grading standards, rather than memorising answers.
Key concepts#
- Reward Model (RM): a small model trained on human rankings that can score any answer.
- KL constraint: keeps the fine-tuned model from drifting too far from the SFT model.
- PPO: the RL algorithm used in the original InstructGPT/ChatGPT recipe; complex, expensive, fragile.
- DPO: optimises directly on preference pairs, skipping the explicit reward model and RL loop; today's default.
How it works#
In production today, DPO has become the de-facto standard: it skips the explicit reward model and the RL sampling loop, so its engineering complexity is an order of magnitude lower than PPO's.
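As a sketch of why DPO is lighter: the preference pairs are optimised directly, with nothing to sample and no separate reward model to train. A minimal PyTorch version of the loss (argument names are illustrative; each argument is the summed log-probability of a whole answer under the policy or the frozen SFT reference):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: prefer the chosen answer over the rejected one, measured as
    log-prob ratios against the frozen SFT reference. beta plays the role
    of the KL coefficient in PPO-based RLHF."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # how much the policy upweights the good answer
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # and the bad one
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage: summed answer log-probs for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-10.0, -12.0]), torch.tensor([-11.5, -12.2]),
                torch.tensor([-10.5, -12.1]), torch.tensor([-11.0, -12.3]))
print(loss)
```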
Practical notes#
- App engineers rarely do RLHF, unless you're building a base model or working on safety alignment. Most use cases are fine with open-source aligned models (Llama-Instruct / Qwen-Chat).
- If you must, pick DPO. 5–20k preference samples, single-GPU hours. Simpler than PPO.
- Preference data can be "semi-AI": have GPT-4 judge 9k samples, then human-review 1k. This is usually the best quality/cost trade-off (record format sketched after this list).
- Beware "over-alignment". The model refuses everything. In your preference data, include "reasonable refusal vs. over-refusal" pairs.
- Why ChatGPT beats the base model. A base model is a "text monster" that just continues text; RLHF turns it into a usable product.
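If you do collect preference data yourself, each record is just a (prompt, chosen, rejected) triple. A hypothetical JSONL dump might look like the sketch below; the field names follow the common "prompt/chosen/rejected" convention, but check what your trainer actually expects. The second pair is the kind of "reasonable refusal vs. over-refusal" example mentioned above.

```python
import json

records = [
    {   # ordinary quality pair: a correct answer preferred over a wrong one
        "prompt": "Explain what the KL constraint does in RLHF.",
        "chosen": "It penalises the policy for drifting too far from the SFT model, keeping outputs fluent while the reward is optimised.",
        "rejected": "The KL constraint is a GPU memory setting.",
    },
    {   # anti-over-alignment pair: a reasonable answer beats a blanket refusal
        "prompt": "How do I check my own website for SQL injection?",
        "chosen": "Since it is your own site, you can run an automated scanner against a staging copy and review any flagged queries.",
        "rejected": "I can't help with anything related to SQL injection.",
    },
]

with open("preferences.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```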
Easy confusions#
- SFT vs RLHF: SFT only imitates demonstrations and can't express "A is better than B"; RLHF learns from comparisons and suppresses "meh" answers.
- DPO vs PPO: DPO is simple, stable, cheap, and the current default; PPO is complex, expensive, and fragile.
Further reading#
- Pre-training → SFT → RLHF — the modern three-stage LLM training pipeline
- Hallucination — RLHF reduces hallucination as a side effect
- Paper: "Training language models to follow instructions with human feedback" (InstructGPT, 2022)