ArcLibrary

Pre-training

Train from scratch on massive unlabeled data so the model learns language — the source of everything an LLM can do.

Key Idea

In one line: Pre-training = on trillions of tokens of text, train the model on one simple task — predict the next token. This single step yields grammar, world knowledge, reasoning frameworks, and is the foundation everything else (fine-tuning, alignment) sits on.

What it is#

Data: web pages + books + papers + code + conversations + … (post-cleaning, ~10–15 TB of text).

Task: for any text, predict the next token:

Input:  "The cat sat on the"
Model predicts: "mat" (highest probability)
Correct: "mat"  ← low loss (the loss is the negative log-probability of "mat", so it reaches 0 only if the model assigns it probability 1)

After trillions of these predictions, the model has compressed into its weights "how language works + how the world works" — that's the post-pre-training "base model".
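The objective above can be sketched in a few lines. The vocabulary and probabilities here are invented for illustration (a real model scores a vocabulary of ~100K tokens), but the loss computation is the standard cross-entropy used in pre-training:

```python
import math

# Toy next-token prediction step: the model outputs a probability
# distribution over the vocabulary, and the training loss for this
# position is the cross-entropy, i.e. the negative log-probability
# assigned to the token that actually comes next.
context = "The cat sat on the"
predicted = {"mat": 0.62, "floor": 0.21, "roof": 0.09, "moon": 0.08}

next_token = "mat"                       # the token that actually follows
loss = -math.log(predicted[next_token])  # cross-entropy at this position

# Even a *correct* top-1 prediction has nonzero loss: loss -> 0 only
# as the probability of the right token -> 1.
print(f"p({next_token}) = {predicted[next_token]:.2f}, loss = {loss:.3f}")
```

Summing this loss over trillions of positions, and nudging the weights to reduce it, is the entire training signal.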

Analogy#


Pre-training is like letting a child read the entire library — nobody quizzes them; they just keep guessing the next sentence. After enough books, they pick up grammar, common sense, rhetoric.
SFT / RLHF afterwards is "teach them to answer politely".

Key concepts#

Next-token prediction
The only training objective. A simple task + massive data = emergent capability.
Tokens trained
Llama 3 ≈ 15T, DeepSeek-V3 ≈ 14.8T; the order of magnitude directly determines the capability ceiling.
Training compute
Measured in FLOPs. Chinchilla rule of thumb: ~20 tokens per parameter for compute-optimal training.
Data mixture
The ratio of web / code / math / multilingual / long-document data; every lab's secret sauce.
Base model
The "raw" post-pre-training artifact: it continues text but doesn't "reply" nicely, so it still needs SFT / RLHF.
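The Chinchilla rule of thumb and the token counts above can be tied together with a quick calculation. This sketch uses the ~20 tokens-per-parameter heuristic and the standard ≈ 6·N·D approximation for training FLOPs; both are rough rules, not any lab's exact numbers:

```python
def chinchilla_tokens(n_params: float) -> float:
    """Compute-optimal token count: ~20 tokens per parameter (rule of thumb)."""
    return 20 * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    """Common approximation: training FLOPs ≈ 6 × parameters × tokens."""
    return 6 * n_params * n_tokens

n = 70e9                    # a 70B-parameter model
d = chinchilla_tokens(n)    # ≈ 1.4e12 tokens (1.4T)
flops = train_flops(n, d)   # ≈ 5.9e23 FLOPs
print(f"tokens: {d:.2e}, FLOPs: {flops:.2e}")
```

Note that Llama 3's ~15T tokens is roughly 10× the Chinchilla-optimal count for a 70B model: labs deliberately "over-train" smaller models because a smaller, better-trained model is cheaper to serve.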

How it works#

Pre-training is a one-off, extremely expensive offline engineering effort: tens of thousands of GPU-months and millions to hundreds of millions of dollars per run.
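The "tens of thousands of GPU-months" figure can be sanity-checked with a back-of-envelope estimate. The model size, accelerator throughput, and utilization below are illustrative assumptions, not any lab's published numbers:

```python
n_params = 405e9   # a ~400B-parameter model (illustrative)
n_tokens = 15e12   # ~15T training tokens
flops = 6 * n_params * n_tokens      # ≈ 3.6e25 FLOPs, via the 6·N·D rule

peak = 1e15        # assumed ~1 PFLOP/s peak per accelerator (BF16)
mfu = 0.4          # assumed 40% model FLOPs utilization in practice

gpu_seconds = flops / (peak * mfu)
gpu_months = gpu_seconds / (3600 * 24 * 30)
print(f"{gpu_months:,.0f} GPU-months")  # on the order of tens of thousands
```

Divide by your cluster size to get wall-clock time: e.g. 16,000 GPUs running for a couple of months lands in the same ballpark.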

Practical notes (application view)#

  • App engineers don't do pre-training. 99.9% of practitioners never need to train from scratch — open-weights base / chat models + fine-tuning is enough.
  • Understand "data sets the ceiling". Whether a model knows Rust or can answer medical questions depends on how much of that data was in pre-training. RAG / SFT can supplement, but cannot create what isn't there.
  • Base vs Chat. Open-weights repos usually publish both. Base is for your own SFT; Chat is ready to use. Confusing the two causes surprises: a base model will continue your prompt instead of answering it.
  • Watch the "T tokens" number. "How many trillion tokens" predicts capability better than "how many billion parameters" (Chinchilla / Llama 3 papers).
  • Don't teach grammar in the prompt. Pre-training has language nailed — prompts should teach task format, not language.

Easy confusions#

| Stage | Data | What it does |
| --- | --- | --- |
| Pre-training | Massive **unlabeled** data; **starts from random weights** | Learns language + world knowledge; most expensive, rarest |
| SFT / Fine-tuning | Small amount of **labeled** data | Teaches task format / style |
| Continued pre-training | **Continues pre-training** an existing base | Adds specialised language / domain data |

Further reading#

  • LLM — the deliverable of pre-training
  • SFT — the fine-tuning stage that follows
  • RLHF — turn the model into "something that talks"
  • Emergent Abilities — a side-effect of scale at pre-training