In one line: Emergent Abilities = when parameters / data / compute cross a critical threshold, the model's performance on certain tasks jumps from near-zero to working. Not smooth growth, but a discontinuity. It's the most-discussed scaling phenomenon beyond the Scaling Laws themselves.
What it is#
Classic examples: "spell a word backwards", "3-digit addition", "Chain-of-Thought reasoning".
| Model size | Pass rate (illustrative) |
|------------|--------------------------|
| 1B         | 0%                       |
| 10B        | 1%                       |
| 70B        | 15% ← critical point     |
| 175B+      | 70%+                     |
Not a smooth climb from 0% to 70%: at a certain scale the model simply "gets it". The GPT-3 paper (Brown et al., 2020) was the first to record this kind of jump systematically; Wei et al. (2022) later named the phenomenon "emergent abilities".
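A minimal sketch of how a jump like this gets measured, assuming a text-in / text-out `generate` call (a hypothetical stand-in for whatever inference API you use): sample 3-digit addition problems, score by exact match, sweep model sizes.

```python
import random

def addition_pass_rate(generate, n_samples=200, seed=0):
    """Exact-match pass rate on 3-digit addition (the classic probe)."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        a, b = rng.randint(100, 999), rng.randint(100, 999)
        # `generate` is a hypothetical model call: prompt in, text out.
        answer = generate(f"Q: What is {a} + {b}?\nA:")
        correct += answer.strip() == str(a + b)
    return correct / n_samples

# Sweep sizes and watch for the jump (`load_model` is hypothetical too):
# for size in ["1B", "10B", "70B", "175B"]:
#     print(size, addition_pass_rate(load_model(size).generate))
```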
Analogy#
Like boiling water: at 99 °C it's still liquid, but add a tiny bit more energy and it's steam, a completely different state. Past a threshold the model "understands" a kind of structure that smaller models can't be coaxed into.
Key concepts#
Typical emergent capabilities#
Capabilities like the examples above (multi-digit arithmetic, word manipulation, Chain-of-Thought reasoning) tend to stabilise only on big models. That's why, if a small model can't do something, you should try a bigger model before tuning the prompt.
Practical notes (application view)#
- Try a larger model first. Spending a week tuning prompts on a 1B model is less effective than spending 10 minutes with GPT-4 / Claude / DeepSeek-V3.
- CoT barely works on small models. "Think step by step" is mostly noise below ~7B; either go bigger, or distil reasoning traces from a larger model and SFT on them.
- Don't conclude "this model is no good" too quickly. Often the prompt isn't quite there and the scale is right at the threshold. Combine all three levers: a better prompt, a bigger model, and few-shot examples.
- The "emergence is illusion" debate. In practice we care whether the task works, not the continuity of the metric — the concept helps you not blindly hope a small model gets there.
Easy confusions#
Emergent abilities vs. Scaling Laws: Scaling Laws describe the smooth, predictable improvement of loss with scale; emergence describes sudden task-level jumps in accuracy. They describe different dimensions, and loss can fall smoothly while a task's pass rate jumps.
Further reading#
- LLM — relationship of size and capability
- CoT — a canonical emergent ability
- Pre-training — the source of emergence: training scale