In one line: Pre-training = train the model on trillions of tokens of text with one simple task: predict the next token. This single step yields grammar, world knowledge, and reasoning patterns, and is the foundation everything else (fine-tuning, alignment) sits on.
What it is#
Data: web pages + books + papers + code + conversations + … (post-cleaning, ~10–15 TB of text).
Task: for any text, predict the next token:
Input: "The cat sat on the"
Model predicts: "mat" (highest probability)
Correct: "mat" ← low loss (the loss is −log p("mat"); it reaches 0 only if the model is perfectly certain)
After trillions of these predictions, the model has compressed into its weights "how language works + how the world works" — that's the post-pre-training "base model".
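The prediction step above can be sketched numerically. This is a toy illustration with a made-up four-word vocabulary and made-up logits, assuming the model has just read "The cat sat on the"; it only shows how softmax probabilities and the cross-entropy loss relate:

```python
import math

# Hypothetical vocabulary and raw model scores (logits) for the
# context "The cat sat on the". Numbers are invented for illustration.
vocab = ["mat", "roof", "dog", "table"]
logits = [4.0, 1.5, 0.2, 1.0]

# Softmax turns logits into a probability distribution over the vocab.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# The training loss is cross-entropy: -log(probability of the true token).
# It is small when the model is confident and right, never exactly zero.
target = "mat"
loss = -math.log(probs[vocab.index(target)])

print(f"p('mat') = {probs[0]:.3f}, loss = {loss:.3f}")
```

With these numbers the model puts ~87% of its probability on "mat", so the loss is small but nonzero; a wrong or uncertain prediction would produce a much larger loss, and gradients push the weights to reduce it.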
Analogy#
Pre-training is like letting a child read the entire library — nobody quizzes them; they just keep guessing the next sentence. After enough books, they pick up grammar, common sense, rhetoric.
SFT / RLHF afterwards is "teach them to answer politely".
How it works#
Pre-training is a one-shot, very expensive offline engineering effort: tens of thousands of GPU-months, millions to hundreds of millions of dollars per run.
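Stripped of the scale, the mechanism is simple: slide over the corpus and, at every position, learn from one (context → next token) pair. The counting "model" below is a minimal stdlib sketch of that idea on a tiny invented corpus; real pre-training replaces the counts with gradient descent on a neural network over trillions of tokens:

```python
from collections import Counter, defaultdict

# Tiny invented corpus; real runs use ~10-15 TB of cleaned text.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Training": for every token, count which token follows it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1  # one next-token observation

def predict_next(token: str) -> str:
    """Return the most frequently observed next token."""
    return follows[token].most_common(1)[0][0]

print(predict_next("sat"))  # "on" was seen after "sat" both times
```

The expensive part of real pre-training is not the loop structure but its scale: a deep network instead of a count table, and trillions of these one-token updates.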
Practical notes (application view)#
- App engineers don't do pre-training. 99.9% of practitioners never need to train from scratch — open-weights base / chat models + fine-tuning is enough.
- Understand "data sets the ceiling". Whether a model knows Rust or can answer medical questions depends on how much of that data was in pre-training. RAG / SFT can supplement, but cannot create what isn't there.
- Base vs Chat. Open-weights repos usually publish both. Base is for your own SFT; Chat is ready to use. Mixing them up makes for surprises.
- Watch the "T tokens" number. "How many trillion tokens" predicts capability better than "how many billion parameters" (Chinchilla / Llama 3 papers).
- Don't teach grammar in the prompt. Pre-training has language nailed — prompts should teach task format, not language.
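The Chinchilla result behind the "watch the T tokens" note gives a handy rule of thumb: a compute-optimal run uses roughly 20 training tokens per parameter. A quick back-of-envelope for a hypothetical 70B-parameter model:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter
# for a compute-optimal run. Hypothetical 70B-parameter model:
params = 70e9
optimal_tokens = 20 * params

print(f"{optimal_tokens / 1e12:.1f}T tokens")  # 1.4T tokens
```

Note that recent open models deliberately overshoot this (Llama 3 trained on 15T+ tokens) because extra training tokens buy a smaller, cheaper-to-serve model at a given capability level.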
Easy confusions#
| Stage | What it does |
| --- | --- |
| Pre-training | Learns language + world knowledge. Most expensive, rarest. |
| SFT | Teaches task format / style. |
| Continued pre-training | Adds specialised language / domain data. |
Further reading#
- LLM — the deliverable of pre-training
- SFT — the fine-tuning stage that follows
- RLHF — turn the model into "something that talks"
- Emergent Abilities — a side-effect of scale at pre-training