RLHF (Reinforcement Learning from Human Feedback)

Align the model with human preferences — make it not only able to answer, but answer in a way humans want.

Training · Alignment · RLHF
Key Idea

In one line: RLHF = Reinforcement Learning from Human Feedback. Have humans compare two answers, train a "preference scorer" (Reward Model), then use RL to push the LLM in the human-preferred direction. This was the key step that made ChatGPT "suddenly usable".

What it is

Three stages:

  1. SFT — Fine-tune the pretrained base model on human demonstration data, so it "answers properly."
  2. Train a Reward Model (RM) — Sample multiple answers per question, have humans rank them, then train a small model that can score any answer (a minimal loss sketch appears just below).
  3. PPO / DPO fine-tuning — Use RL to make the model maximise the reward, while not drifting too far from the SFT model (KL constraint).

After all this the model is "polite, on-topic, knows its limits" — the industry approach to "alignment".
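
To make stage 2 concrete, here is a minimal sketch (an illustration, not code from this article) of the pairwise Bradley-Terry loss that reward models are typically trained with: it simply pushes the score of the human-preferred answer above the score of the rejected one.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss commonly used to train reward models.

    chosen_scores / rejected_scores: shape (batch,), the scalar score the RM
    assigns to the human-preferred and to the rejected answer for the same
    prompt. The loss shrinks as score(chosen) - score(rejected) grows.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with made-up scores; in practice the scores come from a small
# transformer with a scalar head run on each (question, answer) pair.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.5, 0.9, -1.0])
print(reward_model_loss(chosen, rejected))  # a single scalar loss value
```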

Analogy

SFT = teach a child how to answer.
RLHF = the teacher scores two essays: "this one is better than that one" — the child gradually learns the teacher's grading standards, rather than memorising answers.

Key concepts

  • Reward Model (RM) — a classifier / regressor: input (question, answer), output a score.
  • Preference data — paired samples (chosen, rejected); tens of thousands, human-labelled or LLM-judged.
  • PPO (Proximal Policy Optimization) — the classic RLHF algorithm. Complex and expensive: it runs four models simultaneously (policy, frozen reference, reward model, value model). The KL-penalised reward it maximises is sketched after this list.
  • DPO (Direct Preference Optimization) — proposed in 2023: train the LLM directly on preference data; no separate RM or PPO needed.
  • RLAIF (RL from AI feedback) — use a strong model as judge instead of humans; major cost savings.
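
To put the "runs four models" point and the KL constraint from stage 3 in concrete terms, here is a hedged sketch of the reward shaping commonly used in PPO-style RLHF: the reward model's score minus a penalty for drifting away from the frozen SFT / reference model. All names and the beta value are illustrative.

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """KL-penalised reward typically maximised in PPO-style RLHF (a sketch).

    rm_score:        (batch,) scalar reward-model score for each sampled response.
    policy_logprobs: (batch, seq_len) per-token log-probs under the current policy.
    ref_logprobs:    (batch, seq_len) per-token log-probs under the frozen SFT model.
    beta:            strength of the KL penalty that keeps the policy near the SFT model.
    """
    # Crude per-sequence KL estimate: sum over generated tokens of (log pi - log pi_ref).
    kl_estimate = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # PPO then maximises this quantity: a high RM score, but without drifting
    # too far from the reference model.
    return rm_score - beta * kl_estimate
```

The fourth model in a full PPO setup is the value (critic) network used for advantage estimation; it is omitted here for brevity.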

How it works

In production today, DPO has become the de facto standard — its engineering complexity is an order of magnitude lower than PPO's.
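
For illustration, a minimal sketch of the DPO loss (following Rafailov et al., 2023; not code from this article). It needs only the policy being trained and a frozen reference model (no reward model, no sampling loop), which is exactly where the smaller engineering footprint comes from.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss, sketched.

    Each *_logps tensor has shape (batch,): the summed log-probability of the
    chosen / rejected answer under the policy being trained or under the
    frozen reference (SFT) model. beta controls how far the policy may drift.
    """
    # Implicit "reward" of each answer: how much more likely the policy makes
    # it compared with the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen answer's implicit reward above the rejected answer's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```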

Practical notes

  • App engineers rarely do RLHF themselves, unless they're building a base model or working on safety alignment. Most use cases are fine with open-source aligned models (Llama-Instruct / Qwen-Chat).
  • If you must, pick DPO. 5–20k preference samples, single-GPU hours. Simpler than PPO.
  • Preference data can be "semi-AI". Have GPT-4 judge 9k samples, then human-review 1k — best quality/cost trade-off.
  • Beware "over-alignment", where the model refuses everything. In your preference data, include "reasonable refusal vs. over-refusal" pairs (see the second example after this list).
  • Why ChatGPT beats the base model. A base model is a "text monster" that just continues text; RLHF turns it into a usable product.
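
For reference, preference data for DPO-style training is usually just a flat list of (prompt, chosen, rejected) records; the field names below follow a common convention (for example, Hugging Face TRL's preference datasets use these column names), and the examples themselves are invented.

```python
# Each record pairs one prompt with a preferred ("chosen") and a worse
# ("rejected") answer. The second record shows the "reasonable refusal vs.
# over-refusal" pairing mentioned above, which helps avoid over-alignment.
preference_examples = [
    {
        "prompt": "Is it safe to mix bleach and ammonia?",
        "chosen": "No. Mixing them releases toxic chloramine gas; never combine them.",
        "rejected": "Sure, mixing cleaners usually makes them stronger.",
    },
    {
        "prompt": "Please write my entire thesis for me by tomorrow.",
        "chosen": "I can't write the whole thesis for you, but I can help you outline it and draft one section.",
        "rejected": "I cannot help with anything related to academic work.",
    },
]
```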

Easy confusions

| Term | Core idea | Note |
| --- | --- | --- |
| RLHF | **Compare good vs. bad** — learn scoring standards. | Suppresses "meh" answers. |
| SFT | **Imitate demonstrations** — learn specific answers. | Can't express "A is better than B". |
| DPO | **Train directly on preference data.** | Simple, stable, cheap. Current default. |
| PPO | **Train RM + online sampling + RL.** | Complex, expensive, fragile. |

Further reading

  • Pre-training → SFT → RLHF — the modern three-stage LLM training pipeline
  • Hallucination — RLHF reduces hallucination as a side effect
  • Paper: "Training language models to follow instructions with human feedback" (InstructGPT, 2022)