ArcLibrary

Pre-training

Train from scratch on massive unlabeled data so the model learns language — the source of everything an LLM can do.

Key Idea

In one line: Pre-training = on trillions of tokens of text, train the model on one simple task — predict the next token. This single step yields grammar, world knowledge, reasoning frameworks, and is the foundation everything else (fine-tuning, alignment) sits on.

What it is#

Data: web pages + books + papers + code + conversations + … (post-cleaning, ~10–15 TB of text).

Task: for any text, predict the next token:

Input:  "The cat sat on the"
Model predicts: "mat" (highest probability)
Correct: "mat"  ← low loss (the loss is the negative log-probability of "mat", so it reaches 0 only if the model assigns it probability 1)

After trillions of these predictions, the model has compressed into its weights "how language works + how the world works" — that's the post-pre-training "base model".
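The objective above can be sketched in a few lines. The vocabulary and probabilities here are invented for illustration (a real model scores a vocabulary of ~100K tokens), but the loss computation is the standard cross-entropy used in pre-training:

```python
import math

# Toy next-token prediction step: the model outputs a probability
# distribution over the vocabulary, and the training loss for this
# position is the cross-entropy, i.e. the negative log-probability
# assigned to the token that actually comes next.
context = "The cat sat on the"
predicted = {"mat": 0.62, "floor": 0.21, "roof": 0.09, "moon": 0.08}

next_token = "mat"                       # the token that actually follows
loss = -math.log(predicted[next_token])  # cross-entropy at this position

# Even a *correct* top-1 prediction has nonzero loss: loss -> 0 only
# as the probability of the right token -> 1.
print(f"p({next_token}) = {predicted[next_token]:.2f}, loss = {loss:.3f}")
```

Summing this loss over trillions of positions, and nudging the weights to reduce it, is the entire training signal.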

Analogy#


Pre-training is like letting a child read the entire library — nobody quizzes them; they just keep guessing the next sentence. After enough books, they pick up grammar, common sense, rhetoric.
SFT / RLHF afterwards is "teach them to answer politely".

Key concepts#

Next-token prediction
The only training objective. A simple task + massive data = emergent capability.
Tokens trained
Llama 3 ≈ 15T, DeepSeek-V3 ≈ 14.8T; the order of magnitude directly determines the capability ceiling.
Training compute
Measured in FLOPs. Chinchilla rule of thumb: ~20 tokens per parameter for compute-optimal training.
Data mixture
The ratio of web / code / math / multilingual / long-document data; every lab's secret sauce.
Base model
The "raw" post-pre-training artifact: it continues text but doesn't "reply" nicely, so it still needs SFT / RLHF.
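The Chinchilla rule of thumb and the token counts above can be tied together with a quick calculation. This sketch uses the ~20 tokens-per-parameter heuristic and the standard ≈ 6·N·D approximation for training FLOPs; both are rough rules, not any lab's exact numbers:

```python
def chinchilla_tokens(n_params: float) -> float:
    """Compute-optimal token count: ~20 tokens per parameter (rule of thumb)."""
    return 20 * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    """Common approximation: training FLOPs ≈ 6 × parameters × tokens."""
    return 6 * n_params * n_tokens

n = 70e9                    # a 70B-parameter model
d = chinchilla_tokens(n)    # ≈ 1.4e12 tokens (1.4T)
flops = train_flops(n, d)   # ≈ 5.9e23 FLOPs
print(f"tokens: {d:.2e}, FLOPs: {flops:.2e}")
```

Note that Llama 3's ~15T tokens is roughly 10× the Chinchilla-optimal count for a 70B model: labs deliberately "over-train" smaller models because a smaller, better-trained model is cheaper to serve.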

How it works#

Pre-training is a one-off, extremely expensive offline engineering effort: tens of thousands of GPU-months and millions to hundreds of millions of dollars per run.
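The "tens of thousands of GPU-months" figure can be sanity-checked with a back-of-envelope estimate. The model size, accelerator throughput, and utilization below are illustrative assumptions, not any lab's published numbers:

```python
n_params = 405e9   # a ~400B-parameter model (illustrative)
n_tokens = 15e12   # ~15T training tokens
flops = 6 * n_params * n_tokens      # ≈ 3.6e25 FLOPs, via the 6·N·D rule

peak = 1e15        # assumed ~1 PFLOP/s peak per accelerator (BF16)
mfu = 0.4          # assumed 40% model FLOPs utilization in practice

gpu_seconds = flops / (peak * mfu)
gpu_months = gpu_seconds / (3600 * 24 * 30)
print(f"{gpu_months:,.0f} GPU-months")  # on the order of tens of thousands
```

Divide by your cluster size to get wall-clock time: e.g. 16,000 GPUs running for a couple of months lands in the same ballpark.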

Practical notes (application view)#

  • App engineers don't do pre-training. 99.9% of practitioners never need to train from scratch — open-weights base / chat models + fine-tuning is enough.
  • Understand "data sets the ceiling". Whether a model knows Rust or can answer medical questions depends on how much of that data was in pre-training. RAG / SFT can supplement, but cannot create what isn't there.
  • Base vs Chat. Open-weights repos usually publish both. Base is for your own SFT; Chat is ready to use. Confusing the two causes surprises: a base model will continue your prompt instead of answering it.
  • Watch the "T tokens" number. "How many trillion tokens" predicts capability better than "how many billion parameters" (Chinchilla / Llama 3 papers).
  • Don't teach grammar in the prompt. Pre-training has language nailed — prompts should teach task format, not language.

Easy confusions#

| Stage | Data | What it does |
| --- | --- | --- |
| Pre-training | Massive **unlabeled** data; **starts from random weights** | Learns language + world knowledge; most expensive, rarest |
| SFT / Fine-tuning | Small amount of **labeled** data | Teaches task format / style |
| Continued pre-training | **Continues pre-training** an existing base | Adds specialised language / domain data |

Further reading#

  • LLM — the deliverable of pre-training
  • SFT — the fine-tuning stage that follows
  • RLHF — turn the model into "something that talks"
  • Emergent Abilities — a side-effect of scale at pre-training