Key Idea
In one line: there are three families of LLM evaluation: automatic benchmarks (MMLU / GSM8K / HumanEval), human head-to-head arenas (Chatbot Arena), and LLM-as-Judge (using a strong model as the evaluator). Any one of them alone is incomplete; read them in combination.
Three families
- Static benchmarks
  - MMLU, GSM8K, HumanEval, CMMLU, MATH. Objective, but easy to game and easy to contaminate.
- Arena / human eval
  - Users blind-test two models and vote. Chatbot Arena's Elo ranking is among the most authoritative signals today.
- LLM-as-Judge
  - GPT-4 / Claude scores the answers. MT-Bench and AlpacaEval use this.
- Plus: targeted task tests
  - Tool use (BFCL), long context (LongBench, RULER), multilingual (CMMLU), safety (Anthropic HH), etc.
Analogy
Benchmark = a college entrance exam: objective scores that are easy to compare, but also easy to game.
Arena = a debate with audience voting: close to real use, but slow and expensive.
LLM-Judge = having another straight-A student do the grading: fast, but biased (prefers long, verbose answers in a familiar style).
Key concepts
Pass@k
The probability that at least one of k sampled completions passes. The standard metric for HumanEval / MBPP.
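For concreteness, here is a minimal sketch of the unbiased estimator from the HumanEval paper (Chen et al., 2021): draw n completions per problem, count the c correct ones, and compute pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # fewer than k wrong samples, so every k-subset contains a correct one
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Average pass_at_k over all problems in the benchmark to get the reported score.
```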
Elo (chess-style rating)
Chatbot Arena uses it to rank models, updating scores after each head-to-head vote.
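The per-vote update behind an Elo-style ranking is small enough to sketch. The K-factor of 32 and the 400-point scale below are the classic chess defaults, not Arena's exact settings; Chatbot Arena's published methodology now fits a Bradley-Terry model over all votes, but the intuition is the same.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a single A-vs-B comparison.

    score_a: 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    Returns the updated (rating_a, rating_b).
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: a 1200-rated model loses a vote to a 1000-rated one.
# elo_update(1200, 1000, score_a=0.0)  # -> approximately (1175.7, 1024.3)
```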
Judge bias
GPT-4 tends to prefer longer answers, models from its own family, and whichever answer is shown first. Mitigate with position swapping.
Contamination
Test items leak into the training data, inflating scores. Both HELM and MMLU have been affected.
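A common, if crude, contamination check is verbatim n-gram overlap: flag an eval item if any long-enough n-gram from it also appears in a candidate corpus. A rough sketch, where the 8-token window is an arbitrary choice here rather than any benchmark's official threshold:

```python
def ngrams(text: str, n: int = 8) -> set:
    """All n-token windows of the text, lowercased and whitespace-tokenized."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 0))}

def looks_contaminated(eval_item: str, corpus_docs: list, n: int = 8) -> bool:
    """True if any n-gram of the eval item appears verbatim in any corpus document."""
    grams = ngrams(eval_item, n)
    return any(grams & ngrams(doc, n) for doc in corpus_docs)
```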
Holistic evaluation
HELM and the Open LLM Leaderboard evaluate models across many dimensions and tasks rather than on a single test.
End-to-end task eval
SWE-bench (real GitHub issues), AgentBench, etc.; closer to actual utility.
Popular leaderboards
Reading any single leaderboard in isolation is misleading; cross-checking several is far more reliable.
Practical notes
- Model-selection workflow:
  - Use Arena Elo plus the Open LLM Leaderboard to narrow the candidate list;
  - Run your own test set, then a small-traffic A/B test;
  - Have your ops / product people eyeball the outputs; don't trust automatic scores alone.
- Custom eval set. Write 50–200 questions matching your domain (including edge cases, adversarial prompts, and persona violations). Re-run them on every model upgrade.
- Anti-contamination. Write your own items, embed private canary signatures, and rotate the set periodically.
- LLM-Judge countermeasures. Position swap (A first vs. B first) plus multi-judge voting (GPT-4o + Claude + DeepSeek); see the sketch after this list.
- Capability ≠ usability. An MMLU score of 90 doesn't mean the model is pleasant to use. "Useful" usually comes from RLHF, output style, and speed, not raw knowledge.
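To make position swapping plus multi-judge voting concrete, here is a rough sketch. `judge_once` is a hypothetical placeholder for a call to one judge model's API (not a real library function); it is assumed to return "A", "B", or "tie" for the pair of answers it was shown, in the order it was shown them.

```python
from collections import Counter

def judge_once(judge_model: str, question: str, first: str, second: str) -> str:
    """Hypothetical: ask one judge model which of the two answers (shown in this
    order) is better, and parse the verdict into "A", "B", or "tie"."""
    raise NotImplementedError("wire this up to the judge API you actually use")

def debiased_verdict(judges, question, answer_a, answer_b) -> str:
    """Position swap + multi-judge voting.

    Each judge sees both orderings (A first, then B first). Its vote counts only
    if the two verdicts agree after mapping the swapped one back to A/B terms;
    otherwise that judge abstains (position bias detected). Majority vote wins.
    """
    votes = []
    for judge_model in judges:
        v_ab = judge_once(judge_model, question, answer_a, answer_b)  # A shown first
        v_ba = judge_once(judge_model, question, answer_b, answer_a)  # B shown first
        unswapped = {"A": "B", "B": "A", "tie": "tie"}[v_ba]          # back to A/B terms
        if v_ab == unswapped:
            votes.append(v_ab)
    if not votes:
        return "tie"  # every judge flipped with position; treat as undecided
    return Counter(votes).most_common(1)[0][0]

# Usage (judge names are placeholders for whatever models you run):
# debiased_verdict(["gpt-4o", "claude", "deepseek"], question, answer_a, answer_b)
```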
Easy confusions

| Approach | Strength | Weakness |
| --- | --- | --- |
| Automatic benchmarks | Reproducible, cheap | Easy to game and contaminate |
| Human / arena | Closer to real use | Slow, expensive, noisy |