ArcLibrary

SFT (Supervised Fine-Tuning)

The most direct way to teach a model a specific task — a small set of high-quality input/expected-output pairs.

Training · Fine-tuning · SFT
Key Idea

In one line: SFT = Supervised Fine-Tuning: prepare a small set of {user input, expected answer} pairs and use the same "next-token prediction" objective to teach the model to answer in your format. It's the most direct way to make a base model "speak the way you want."

What it is#

The data looks like:

{"messages":[{"role":"user","content":"What is an LLM?"},
             {"role":"assistant","content":"An LLM is a large language model..."}]}
{"messages":[{"role":"user","content":"Help me request a refund"},
             {"role":"assistant","content":"Sure — please share your order number..."}]}

The model is trained for a few more epochs on these dialogues; loss is computed only on the assistant's reply. A few thousand high-quality conversations are enough to turn a base model into "customer-support voice", "medical voice", "brand persona", etc.
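The loss mask described above can be sketched in a few lines. This is a minimal illustration, not a real pipeline: the `tokenize` function is a hypothetical whitespace stub standing in for the model's actual tokenizer and chat template, and `IGNORE_INDEX = -100` follows the convention used by common training frameworks.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def tokenize(text):
    """Stand-in tokenizer: one fake 'token id' per whitespace-separated word."""
    return [hash(w) % 50000 for w in text.split()]


def build_example(messages):
    """Turn a chat record into (input_ids, labels) with user tokens masked."""
    input_ids, labels = [], []
    for msg in messages:
        ids = tokenize(msg["content"])
        input_ids.extend(ids)
        if msg["role"] == "assistant":
            labels.extend(ids)                         # learn these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))   # excluded from the loss
    return input_ids, labels


record = {"messages": [
    {"role": "user", "content": "What is an LLM?"},
    {"role": "assistant", "content": "An LLM is a large language model."},
]}
input_ids, labels = build_example(record["messages"])
# The 4 user tokens get label -100; only the 7 assistant tokens carry real labels.
```

Everything flows through the model, but gradients only come from the assistant's reply.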

Analogy#


Pre-training = read the whole library; learns language.
SFT = show it a stack of "exemplary essays" to imitate — it learns the pattern "this kind of question gets this kind of answer".
You're not making it know more — you're making it answer in your style.

Key concepts#

Instruction tuning
SFT's earlier name: instruction/answer pairs that teach the model to follow instructions.
Quality > quantity
1k handpicked examples ≫ 100k filler. Demonstrated by the LIMA paper.
Loss mask
Compute loss only on assistant tokens; ignore user input.
Catastrophic forgetting
Over-fine-tuning erodes general ability. Mix in general data or limit epochs.

How it works#

Technically identical to pre-training (next-token prediction); only the data scale + loss mask differ.
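Concretely, the objective is plain next-token cross-entropy with masked positions skipped. A pure-Python sketch with toy probabilities (a real implementation would use a framework loss such as cross-entropy with an ignore index, not hand-rolled math):

```python
import math


def masked_next_token_loss(token_probs, labels, ignore_index=-100):
    """Average negative log-likelihood over unmasked positions only.

    token_probs[t] is the model's predicted probability of the correct
    next token at step t; labels marks which positions count.
    """
    losses = [-math.log(p)
              for p, y in zip(token_probs, labels)
              if y != ignore_index]
    return sum(losses) / len(losses)


probs = [0.9, 0.5, 0.8, 0.25]    # P(correct token) at each position
labels = [-100, -100, 7, 42]     # first two positions are user tokens
loss = masked_next_token_loss(probs, labels)
```

Only the last two positions contribute: the loss averages -log(0.8) and -log(0.25), so the user turn has no effect on the gradient.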

Practical notes#

  • Data quality is 90% of the work. A week polishing 1,000 examples beats a month assembling 100k noisy ones.
  • Default to LoRA. Unless you have a big GPU cluster, full-parameter SFT is too expensive. LoRA fine-tunes in hours on a single GPU.
  • Lower learning rate than pre-training. Start 1e-5 ~ 5e-5. Crank it up and you'll lobotomise the base model.
  • Mix in general data to prevent forgetting. Add 1:1 generic dialogue or the model will only answer your task and break everything else.
  • Try prompt + few-shot first. If a prompt fixes it, don't SFT. SFT toolchains are complex; prompts iterate by editing one line.
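The "mix in general data" note can be sketched as a simple interleaving step. `task_data` and `general_data` are hypothetical lists of chat records; the 1:1 default matches the ratio suggested above.

```python
import random


def mix_datasets(task_data, general_data, ratio=1.0, seed=0):
    """Blend task examples with general dialogue at roughly `ratio`:1.

    Sampling and shuffling are seeded so the mix is reproducible
    across training runs.
    """
    n_general = min(int(len(task_data) * ratio), len(general_data))
    rng = random.Random(seed)
    mixed = task_data + rng.sample(general_data, n_general)
    rng.shuffle(mixed)
    return mixed


task = [{"task": i} for i in range(10)]
general = [{"general": i} for i in range(100)]
train_set = mix_datasets(task, general)  # 10 task + 10 general examples
```

Tuning `ratio` down (e.g. 0.5) favors the task voice; tuning it up better preserves general ability.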

Easy confusions#

SFT
**Imitate good answers**: teach the model "answer this way."
RLHF
**Compare good vs. bad**: use preference data to **avoid bad answers**.
SFT
Bake knowledge / style **into the weights**; updating means retraining.
RAG
Stuff knowledge **into the prompt at runtime**; updating means editing the corpus.

Further reading#

  • Pre-training — the step before SFT
  • LoRA — SFT's low-cost implementation
  • RLHF — SFT's next step: alignment