
Parameters

What 7B / 72B actually mean — the most direct yardstick of model size and capability.

Key Idea

In one line: parameter count is the total number of tunable knobs inside the model. More knobs mean the model can store finer-grained language patterns, but VRAM, inference cost, and training difficulty rise linearly or super-linearly.

What it is#

Everything a model learned during training is compressed into a set of "weight numbers." Every number is one parameter.

  • 7B = 7 billion parameters
  • 72B = 72 billion
  • 671B (DeepSeek V3) = 671 billion

Heuristic: parameter count = number of brain cells the model has.
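
To make "parameter" concrete, here is a minimal counting sketch (assuming PyTorch is available; the layer sizes are only illustrative, borrowed roughly from a 7B-class model's hidden dimensions):

```python
# Count the parameters in a tiny two-layer model; every counted number is one "parameter".
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(4096, 11008),  # weight 4096*11008 + bias 11008
    nn.Linear(11008, 4096),  # weight 11008*4096 + bias 4096
)
total = sum(p.numel() for p in block.parameters())
print(f"{total:,} parameters")  # 90,192,640 tunable numbers in just two layers
```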

Analogy#


3B is a pocket dictionary; 70B is a whole bookshelf; a hundred-billion-parameter model is a national library. More books mean finer lookups, but the shelves also take more space and cost more.

Key concepts#

  • Active parameters: the parameters actually used during inference. MoE models have huge totals but activate only a fraction per call.
  • Numerical precision: FP16 / FP8 / INT4, etc. Whether each param takes 2 bytes (FP16) or 0.5 bytes (INT4) decides VRAM usage.
  • Training compute: training cost ≈ Parameters × Tokens. Double the scale and the cost more than doubles.
  • Scaling laws: Chinchilla suggests roughly 20B training tokens per 1B parameters to count as "fully trained."

How to estimate VRAM#

VRAM to load a model (rough):

VRAM ≈ Parameters × bytes per parameter

| Precision | Bytes per param | 70B model |
|---|---|---|
| FP32 (training) | 4 B | ~280 GB |
| FP16 / BF16 | 2 B | ~140 GB |
| INT8 | 1 B | ~70 GB |
| INT4 (quantised) | 0.5 B | ~35 GB |

Add KV cache and activations and the real number is 20–50% higher.
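
A small calculator tying the formula, the table, and the overhead note together (a sketch; the 1.3× overhead factor is a mid-range assumption, not a fixed constant):

```python
# VRAM ≈ parameters × bytes per parameter, plus 20-50% for KV cache and activations.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billion: float, precision: str, overhead: float = 1.3) -> float:
    weights_gb = params_billion * BYTES_PER_PARAM[precision]  # 1B params at 1 byte ≈ 1 GB
    return weights_gb * overhead

print(f"70B @ fp16: ~{estimate_vram_gb(70, 'fp16'):.0f} GB")  # ~182 GB, multi-GPU territory
print(f"70B @ int4: ~{estimate_vram_gb(70, 'int4'):.0f} GB")  # ~46 GB, quantisation makes it tractable
print(f"7B  @ int4: ~{estimate_vram_gb(7, 'int4'):.0f} GB")   # ~5 GB, fine on one consumer GPU
```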

Practical notes#

  • Self-hosting at home: a 24 GB consumer GPU (e.g. an RTX 4090) comfortably runs 7B–13B quantised; a 48 GB card such as an A6000 reaches the 30B–70B quantised range.
  • Picking an API model: larger ≠ better for your task. Start small, scale up only if needed. Saves money and latency.
  • MoE (Mixture of Experts): DeepSeek V3 / Mixtral etc. have a huge total but activate only a slice per call, so they run faster than a dense model of equivalent quality, though VRAM still scales with total params (see the sketch after this list).
  • Quantisation is the value champion: INT4 typically loses ≤ 5% quality and drops VRAM to 1/4 — the default for local inference.
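
For the MoE point above, a minimal sketch; the DeepSeek V3 numbers (671B total, roughly 37B active per token) are as publicly reported and should be treated as approximate:

```python
# MoE trade-off: VRAM follows TOTAL parameters, per-token compute follows ACTIVE parameters.
def moe_profile(total_b: float, active_b: float, bytes_per_param: float = 0.5):
    vram_gb = total_b * bytes_per_param      # weights only, INT4-style quantisation
    return vram_gb, active_b / total_b

vram, active_share = moe_profile(671, 37)
print(f"~{vram:.0f} GB of weights, but only {active_share:.0%} of parameters active per token")
# ~336 GB of weights, but only 6% of parameters active per token
```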

Easy confusions#

  • Parameters: the model's weight count, its **structural capacity**. Decides what it "learned" and how much VRAM it eats.
  • Context window: how many tokens are visible per inference call, its **runtime capacity**. Independent of parameter count.
  • "More params = smarter" is a **common myth**. In reality, **data quality, training duration, and alignment** matter just as much: Llama-3 70B often loses to Claude Sonnet (whose size is undisclosed).

Further reading#