
Model Evaluation (Benchmarks & LLM-as-Judge)

How to measure 'good' — exam-style tests, head-to-head arenas, and judges.

Key Idea

In one line: There are three families of LLM evaluation: automatic benchmarks (MMLU / GSM8K / HumanEval), human head-to-head / arenas (Chatbot Arena), and LLM-as-Judge (use a strong model as evaluator). Any one alone is incomplete — read them in combination.

Three families#

Static Benchmarks
MMLU, GSM8K, HumanEval, CMMLU, MATH. Objective but easy to cheat / contaminate.
Arena / Human eval
Users blind-test two models → vote. Chatbot Arena's Elo rating is among the most authoritative rankings today (a per-match Elo update is sketched after this list).
LLM-as-Judge
GPT-4 / Claude scores answers. MT-Bench, AlpacaEval use this.
Targeted task tests (orthogonal to the three families above)
Tool use (BFCL), long context (LongBench, RULER), multilingual (CMMLU), safety (Anthropic HH), etc.
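To make the Arena entry concrete, below is a minimal per-match Elo update in Python. The k-factor and starting ratings are illustrative choices, not Chatbot Arena's actual parameters (the live leaderboard now fits a Bradley-Terry-style model over all votes rather than updating match by match), but the intuition is the same.

```python
# Minimal sketch of a per-match Elo update (illustrative parameters).

def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """score_a: 1.0 if A wins the blind vote, 0.0 if A loses, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# Example: both models start at 1000 and model A wins one vote.
print(update_elo(1000.0, 1000.0, score_a=1.0))  # -> (1016.0, 984.0)
```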

Analogy#


Benchmark = a college entrance exam: objective scores, easy to compare; easy to game.
Arena = a debate with audience voting: close to real use, but slow and expensive.
LLM-Judge = having another straight-A student do the grading: fast, but biased (prefers long, verbose answers in a familiar style).

Key concepts#

Pass@k
Probability that at least one of k samples passes. The standard metric for HumanEval / MBPP (estimator sketched after this list).
Elo (chess-style rating)
Arena uses it to rank models, updating scores after each match.
Judge bias
GPT-4 prefers longer answers, its own model family, and the answer placed first. Mitigate with position swapping.
Contamination
Test items leak into training data → inflated scores. HELM and MMLU have both seen this.
Holistic evaluation
HELM and the Open LLM Leaderboard evaluate across many tasks and dimensions instead of leaning on a single benchmark.
End-to-end / real-task eval
SWE-bench (real GitHub issues), AgentBench, etc.; closer to actual utility.
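As a concrete companion to the Pass@k entry above, here is a small sketch of the unbiased estimator popularized by the HumanEval paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that a budget of k draws contains at least one pass. The numbers in the example are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them passed."""
    if n - c < k:
        # Fewer than k failing samples exist, so any set of k draws must contain a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example (made-up numbers): 20 samples generated for a problem, 3 passed the tests.
print(pass_at_k(n=20, c=3, k=5))  # ~0.60 chance that a best-of-5 run contains a pass

# The benchmark score is this value averaged over every problem in the suite.
```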

Reading any single leaderboard can mislead; cross-checking several is what you can rely on.

Practical notes#

  • Model-selection workflow:
    1. Use Arena Elo + the Open LLM Leaderboard to narrow candidates;
    2. Run your own test set as a small-traffic A/B;
    3. Have your ops / product people eyeball the outputs; don't trust automatic scores alone.
  • Custom eval set. Write 50–200 questions matching your domain (including edge cases / adversarial / persona violations). Re-run on every model upgrade.
  • Anti-contamination. Write your own questions; add private signature items (canaries) to detect leakage; rotate periodically.
  • LLM-Judge counters. Position swap (A first vs B first) + multi-judge voting (GPT-4o + Claude + DeepSeek); sketched after this list.
  • Capability ≠ usability. MMLU 90 doesn't mean it's nice to use. "Useful" is often RLHF + output style + speed, not raw knowledge.
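The position-swap and multi-judge counters from the notes above reduce to a few lines of control logic. In the sketch below, the prompt template and the `judge` callables are placeholders you would wire to your own judge models (GPT-4o, Claude, DeepSeek, ...); only the swapping and voting logic is the point.

```python
from collections import Counter
from typing import Callable, List

JUDGE_PROMPT = (
    "Question:\n{question}\n\n"
    "Answer A:\n{first}\n\nAnswer B:\n{second}\n\n"
    "Which answer is better? Reply with exactly 'A' or 'B'."
)

def vote(judge: Callable[[str], str], question: str, ans_1: str, ans_2: str) -> str:
    """Ask one judge twice with the answer order swapped; return '1', '2', or 'tie'."""
    first = judge(JUDGE_PROMPT.format(question=question, first=ans_1, second=ans_2))
    swapped = judge(JUDGE_PROMPT.format(question=question, first=ans_2, second=ans_1))
    # The same underlying answer must win in both orders; otherwise score a tie.
    # This is what cancels out the "prefers whichever answer comes first" bias.
    if first == "A" and swapped == "B":
        return "1"
    if first == "B" and swapped == "A":
        return "2"
    return "tie"

def compare(judges: List[Callable[[str], str]], question: str, ans_1: str, ans_2: str) -> str:
    """Majority vote across several judge models; no majority means a tie."""
    tally = Counter(vote(j, question, ans_1, ans_2) for j in judges)
    winner, count = tally.most_common(1)[0]
    return winner if count > len(judges) / 2 else "tie"
```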

Easy confusions#

Automatic benchmarks
Pros: reproducible, cheap. Cons: easy to game and contaminate.
Human / arena
Pros: closer to real use. Cons: slow, expensive, noisy.

Further reading#