
Model Evaluation (Benchmarks & LLM-as-Judge)

How to measure 'good' — exam-style tests, head-to-head arenas, and judges.

Key Idea

In one line: There are three families of LLM evaluation: automatic benchmarks (MMLU / GSM8K / HumanEval), human head-to-head / arenas (Chatbot Arena), and LLM-as-Judge (use a strong model as evaluator). Any one alone is incomplete — read them in combination.

Three families#

Static Benchmarks
MMLU, GSM8K, HumanEval, CMMLU, MATH. Objective but easy to cheat / contaminate.
Arena / Human eval
Users blind-test two models → vote. Chatbot Arena's Elo rating is among the most authoritative rankings today (a per-match Elo update is sketched after this list).
LLM-as-Judge
GPT-4 / Claude scores answers. MT-Bench, AlpacaEval use this.
Targeted task tests (orthogonal to the three families above)
Tool use (BFCL), long context (LongBench, RULER), multilingual (CMMLU), safety (Anthropic HH), etc.
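To make the Arena entry concrete, below is a minimal per-match Elo update in Python. The k-factor and starting ratings are illustrative choices, not Chatbot Arena's actual parameters (the live leaderboard now fits a Bradley-Terry-style model over all votes rather than updating match by match), but the intuition is the same.

```python
# Minimal sketch of a per-match Elo update (illustrative parameters).

def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """score_a: 1.0 if A wins the blind vote, 0.0 if A loses, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# Example: both models start at 1000 and model A wins one vote.
print(update_elo(1000.0, 1000.0, score_a=1.0))  # -> (1016.0, 984.0)
```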

Analogy#


Benchmark = a college entrance exam: objective scores, easy to compare; easy to game.
Arena = a debate with audience voting: close to real use, but slow and expensive.
LLM-Judge = having another straight-A student do the grading: fast, but biased (prefers long, verbose answers in a familiar style).

Key concepts#

Pass@k
Probability that at least one of k samples passes. The standard metric for HumanEval / MBPP (estimator sketched after this list).
Elo (chess-style rating)
Arena uses it to rank models, updating scores after each match.
Judge bias
GPT-4 prefers longer answers, its own model family, and the answer placed first. Mitigate with position swapping.
Contamination
Test items leak into training data → inflated scores. HELM and MMLU have both seen this.
Holistic evaluation
HELM and the Open LLM Leaderboard evaluate across many tasks and dimensions instead of leaning on a single benchmark.
End-to-end / real-task eval
SWE-bench (real GitHub issues), AgentBench, etc.; closer to actual utility.
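As a concrete companion to the Pass@k entry above, here is a small sketch of the unbiased estimator popularized by the HumanEval paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that a budget of k draws contains at least one pass. The numbers in the example are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them passed."""
    if n - c < k:
        # Fewer than k failing samples exist, so any set of k draws must contain a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example (made-up numbers): 20 samples generated for a problem, 3 passed the tests.
print(pass_at_k(n=20, c=3, k=5))  # ~0.60 chance that a best-of-5 run contains a pass

# The benchmark score is this value averaged over every problem in the suite.
```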

Reading any single leaderboard can mislead; cross-checking several is what you can rely on.

Practical notes#

  • Model-selection workflow:
    1. Use Arena Elo + the Open LLM Leaderboard to narrow candidates;
    2. Run your own test set as a small-traffic A/B;
    3. Have your ops / product people eyeball the outputs; don't trust automatic scores alone.
  • Custom eval set. Write 50–200 questions matching your domain (including edge cases / adversarial / persona violations). Re-run on every model upgrade.
  • Anti-contamination. Write your own questions; add private signature items (canaries) to detect leakage; rotate periodically.
  • LLM-Judge counters. Position swap (A first vs B first) + multi-judge voting (GPT-4o + Claude + DeepSeek); sketched after this list.
  • Capability ≠ usability. MMLU 90 doesn't mean it's nice to use. "Useful" is often RLHF + output style + speed, not raw knowledge.
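The position-swap and multi-judge counters from the notes above reduce to a few lines of control logic. In the sketch below, the prompt template and the `judge` callables are placeholders you would wire to your own judge models (GPT-4o, Claude, DeepSeek, ...); only the swapping and voting logic is the point.

```python
from collections import Counter
from typing import Callable, List

JUDGE_PROMPT = (
    "Question:\n{question}\n\n"
    "Answer A:\n{first}\n\nAnswer B:\n{second}\n\n"
    "Which answer is better? Reply with exactly 'A' or 'B'."
)

def vote(judge: Callable[[str], str], question: str, ans_1: str, ans_2: str) -> str:
    """Ask one judge twice with the answer order swapped; return '1', '2', or 'tie'."""
    first = judge(JUDGE_PROMPT.format(question=question, first=ans_1, second=ans_2))
    swapped = judge(JUDGE_PROMPT.format(question=question, first=ans_2, second=ans_1))
    # The same underlying answer must win in both orders; otherwise score a tie.
    # This is what cancels out the "prefers whichever answer comes first" bias.
    if first == "A" and swapped == "B":
        return "1"
    if first == "B" and swapped == "A":
        return "2"
    return "tie"

def compare(judges: List[Callable[[str], str]], question: str, ans_1: str, ans_2: str) -> str:
    """Majority vote across several judge models; no majority means a tie."""
    tally = Counter(vote(j, question, ans_1, ans_2) for j in judges)
    winner, count = tally.most_common(1)[0]
    return winner if count > len(judges) / 2 else "tie"
```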

Easy confusions#

Automatic benchmarks
Pros: reproducible, cheap. Cons: easy to game and contaminate.
Human / arena
Pros: closer to real use. Cons: slow, expensive, noisy.

Further reading#