
MoE (Mixture of Experts)

Scale up parameters with 'sparse activation' so cost stays sane — the technique behind DeepSeek and Mixtral, and reportedly GPT-4.

Key Idea

In one line: MoE replaces a single FFN with N parallel expert FFNs; each token is routed to only a few of them (e.g. top-2). Total params are large → more knowledge; active params are small → cheap inference.

What it is#

A regular Transformer has one big FFN per layer. MoE swaps that for "router + multiple experts":

Token → Router → pick top-K experts → those experts compute → weighted sum → output

DeepSeek-V3 has 671B total params but activates only ~37B per token — runs like a 37B model.
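To make the mechanism concrete, here is a minimal sketch of such a layer in PyTorch. It is an illustration, not any particular model's implementation; the `Expert` module, the layer sizes, and the choice to softmax over only the selected experts are assumptions for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One expert = an ordinary FFN sub-network."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    """Router + N expert FFNs; each token runs through only top_k of them."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # the learned gate
        self.experts = nn.ModuleList(Expert(d_model, d_ff) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                            # x: (tokens, d_model)
        logits = self.router(x)                      # (tokens, n_experts)
        top_logits, top_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(top_logits, dim=-1)      # renormalised over the winners
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (top_idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue                             # no tokens routed here
            out[rows] += weights[rows, slots, None] * expert(x[rows])
        return out

layer = MoELayer()
y = layer(torch.randn(16, 512))  # 16 tokens in → (16, 512) out; only 2 of 8 experts ran per token
```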

Analogy#


A dense model = every question asked of every expert in the room — expensive, and most experts contribute nothing.
MoE = the registrar routes: math questions to math experts, writing questions to writing experts — more people total (broader knowledge), fewer asked per question (lower cost).

Key concepts#

Expert
An FFN sub-network. A layer typically has 8 / 64 / 256 experts.
Router / Gate
A small learned network that outputs a score per expert.
Top-K
Usually top-2 — only the two highest-scored experts participate.
Load-balance loss
An auxiliary loss that prevents all tokens from collapsing onto one expert (a minimal sketch follows this list).
Shared expert
In designs like DeepSeek's, a small set of experts is always active, carrying general knowledge.
EP / TP
Expert Parallelism puts different experts on different GPUs. Communication cost is MoE training's hardest problem.
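The load-balance loss deserves a closer look, since it is what keeps routing from collapsing. Below is a sketch of one common formulation, the Switch-Transformer-style auxiliary loss: per expert, multiply the fraction of routing assignments it actually received by the mean router probability it was given, then sum. Tensor names and shapes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits: torch.Tensor, top_idx: torch.Tensor) -> torch.Tensor:
    """Switch-Transformer-style auxiliary load-balancing loss.

    router_logits: (tokens, n_experts) raw router scores
    top_idx:       (tokens, top_k) expert indices each token was routed to
    """
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)             # (tokens, n_experts)
    # f_i: fraction of routing slots that went to expert i
    assignment = F.one_hot(top_idx, n_experts).float()   # (tokens, top_k, n_experts)
    f = assignment.sum(dim=(0, 1)) / top_idx.numel()
    # P_i: mean router probability mass placed on expert i
    p = probs.mean(dim=0)
    # Scaled so perfectly uniform routing gives a loss of exactly 1
    return n_experts * torch.sum(f * p)
```

In training, this term is added to the language-modelling loss with a small coefficient (0.01 in the Switch Transformer paper), so balancing never dominates the main objective.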

How it works#

For each token, the router scores every expert; the top-K experts run, and their outputs are combined as a weighted sum, where the weights are the router's softmax scores (in many implementations renormalised over just the selected experts).
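A tiny worked example of that weighting, with made-up router scores. Note that whether the softmax is taken before or after the top-K cut varies by implementation; here it is taken after, over the two survivors, which is equivalent to softmax-then-renormalise.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 0.5, 1.4, -1.0])  # router logits for one token, 4 experts
top_vals, top_idx = scores.topk(2)            # experts 0 and 2 win
weights = F.softmax(top_vals, dim=-1)         # tensor([0.6457, 0.3543])
# output = 0.6457 * expert_0(x) + 0.3543 * expert_2(x)
```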

Practical notes#

  • Total params ≠ inference cost. Read model cards for total vs active params. Mixtral 8x7B → 47B total / ~13B active.
  • VRAM still scales with total. All experts must be in VRAM ("active" doesn't mean "saves RAM"). MoE inference needs lots of memory, not lots of compute.
  • Router jitter. Similar contexts may route differently → outputs slightly unstable. Common tricks: higher top-k, temperature annealing.
  • Fine-tuning gotchas. Naive SFT can break the router. Freeze the router and only train the experts, or apply LoRA to the experts (see the sketch after this list).
  • Distributed training. Experts on different GPUs → all-to-all communication dominates; Megatron-LM and DeepSeek's framework include heavy optimisations.
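For the fine-tuning bullet, here is a sketch of the freeze-the-router approach. The substring filter is an assumption; router parameters go by different names in different checkpoints (Mixtral's, for instance, live under `block_sparse_moe.gate`), so inspect `model.named_parameters()` first.

```python
import torch.nn as nn

def freeze_router(model: nn.Module) -> None:
    """Disable gradients on router/gate parameters so SFT leaves routing intact.

    The name filter below is a guess; check your checkpoint's actual
    parameter names before relying on it.
    """
    for name, param in model.named_parameters():
        if "gate" in name or "router" in name:
            param.requires_grad = False
```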

Easy confusions#

Dense model
Every token uses all parameters. Simple, but compute-heavy.
MoE model
Every token uses only a subset. Memory-heavy, but compute-light.
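To put rough numbers on that trade-off, a toy parameter count for a single MoE layer. All sizes are invented for the example; a gated FFN would have a third matrix, and attention weights are ignored.

```python
# Toy accounting: memory scales with all experts, compute with only top-k.
d_model, d_ff = 4096, 14336
n_experts, top_k = 8, 2

expert_params = 2 * d_model * d_ff    # up- and down-projection, biases ignored
total = n_experts * expert_params     # must all sit in VRAM
active = top_k * expert_params        # what one token actually multiplies through

print(f"total:  {total / 1e9:.2f}B")  # 0.94B
print(f"active: {active / 1e9:.2f}B") # 0.23B
```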

Further reading#