Core · Key Idea
In one line: measure with indicators (SLI), commit to a target (SLO) like 99.9 %, and the rest becomes error budget. When the budget runs out, freeze new feature launches and fix reliability — Google SRE's way of resolving the dev-vs-ops tension.
Key concepts
SLI (Service Level Indicator)
**The metric itself**, e.g. 'HTTP success rate' or 'p99 latency' (checked against a threshold such as 300 ms).
SLO (Service Level Objective)
**Internal goal**, e.g. 'over a rolling 30-day window, SLI success ≥ 99.9 %'.
SLA (Service Level Agreement)
**External contract**: missing it costs money (penalties or refunds). Set the SLA looser than the SLO to leave a cushion.
Error Budget
100 % minus the SLO. A 99.9 % SLO over 30 days leaves 0.1 % × 43,200 minutes = 43.2 minutes of allowable downtime.
Burn Rate
How fast the budget is being consumed, relative to the pace that would use it up exactly at the end of the window (burn rate 1). Burning a week's budget in an hour → 'fast burn' alert.
Customer-perceived
The SLI should reflect user experience (e.g. success rate measured at the load-balancer edge), not the health of internal components.
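The error-budget and burn-rate arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `error_budget_minutes`, `burn_rate`, and `hours_to_exhaust` are hypothetical, and the budget is treated as minutes of full downtime over a fixed-length window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime allowed by an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    A value of 1.0 means the budget lasts exactly one full window.
    """
    return observed_error_ratio / (1.0 - slo)


def hours_to_exhaust(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the given burn rate is sustained."""
    return window_days * 24 / rate


if __name__ == "__main__":
    # 99.9 % over 30 days: roughly 43.2 minutes of budget.
    print(round(error_budget_minutes(0.999), 1))
    # 1.44 % of requests failing against a 99.9 % SLO: roughly a 14.4x burn.
    rate = burn_rate(0.0144, 0.999)
    print(round(rate, 1))
    # At that pace a 30-day budget lasts about 50 hours.
    print(round(hours_to_exhaust(rate)))
```

Expressing burn rate as a multiple rather than a raw error ratio means the same alert thresholds can be reused across services with different SLOs.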
Classic SLI types
- Availability / success rate
  - good_events / total_events, e.g. ratio of non-5xx responses.
- Latency
  - Fraction of requests served faster than a threshold, or P50 / P95 / P99 latency below a threshold. Use percentiles, not averages.
- Correctness
  - Fraction of correct outputs (e.g. order totals without errors).
- Freshness
  - Fraction of data updated within the last X minutes (search indexes, caches).
- Throughput
  - RPS / QPS meeting business needs; usually a capacity metric, not an SLI.
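As a rough sketch, the availability and latency SLIs in the list above can be computed from raw request records as good_events / total_events. The `Request` record, the function names, and the sample data are hypothetical. Treating an empty window as fully healthy, and counting every request (including failed ones) in the latency ratio, are assumptions that a real setup may choose differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request


def availability_sli(requests: list[Request]) -> float:
    """good_events / total_events, where 'good' means a non-5xx response."""
    if not requests:
        return 1.0  # assumption: no traffic means no failures
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests served faster than threshold_ms."""
    if not requests:
        return 1.0  # assumption: no traffic means no slow requests
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


if __name__ == "__main__":
    sample = [
        Request(200, 120.0),
        Request(200, 250.0),
        Request(503, 40.0),   # server error: bad for availability
        Request(200, 900.0),  # slow: bad for latency
        Request(404, 800.0),  # client error: not a 5xx, but slow
    ]
    print(availability_sli(sample))    # 4 of 5 requests are non-5xx
    print(latency_sli(sample, 300.0))  # 3 of 5 requests are under 300 ms
```

Both functions return a ratio between 0 and 1, so either can be compared directly against an SLO target such as 0.999.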
How to use it
Practical notes
- Start with one SLO per critical user journey — don't roll out 50 at once.
- Use a 30-day rolling window — shorter is noisy, longer is sluggish.
- Multi-window, multi-burn-rate alerts (the approach recommended in the Google SRE Workbook): a common four-rule setup is 14.4× burn over 1h, 6× burn over 6h, 3× burn over 1d, and 1× burn over 3d, each confirmed by a shorter window (5m, 30m, 2h, 6h). Few false positives, and slow burns are not missed.
- SLI ≠ monitor everything: "all health checks green" ≠ "users are happy". Always measure from the user's perspective.
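- Error-budget policy: write it down, e.g. "remaining budget < X % → freeze launches". Otherwise nobody enforces it.
- Dependency transparency: your SLO can't exceed that of a hard dependency. Serial availabilities multiply, so a service that needs a 99 % dependency on every request cannot deliver 99.9 % unless it adds fallbacks, caching, or retries.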
- Composite SLO: weighted or worst-case combination of sub-SLIs — for complex journeys (login → browse → checkout).
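The multi-window, multi-burn-rate idea from the notes above can be sketched as follows. A rule fires only when both its long window and its short confirmation window exceed the burn-rate threshold, which is what keeps false positives low. The long/short window pairings and severities in `RULES` are an assumption based on a commonly used four-rule layout, and `BurnRule`, `firing_rules`, and the sample error ratios are hypothetical names and data for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    long_window: str   # window that detects the burn
    short_window: str  # shorter window confirming the burn is still ongoing
    threshold: float   # burn-rate multiple that triggers the rule
    severity: str


# Assumed pairing of windows and thresholds; adjust to your own policy.
RULES = [
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("1d", "2h", 3.0, "ticket"),
    BurnRule("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio as a multiple of the ratio the SLO allows."""
    return error_ratio / (1.0 - slo)


def firing_rules(error_ratios: dict[str, float], slo: float) -> list[BurnRule]:
    """Return the rules whose long AND short windows both exceed the threshold.

    error_ratios maps a window label (e.g. "1h") to the error ratio
    observed over that window.
    """
    fired = []
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo)
        short_burn = burn_rate(error_ratios[rule.short_window], slo)
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            fired.append(rule)
    return fired


if __name__ == "__main__":
    # An incident that has already recovered: the short windows are clean,
    # so the fast rules stay quiet and only the slow 3d/6h rule still fires.
    recovered = {
        "5m": 0.0, "30m": 0.0, "1h": 0.02, "2h": 0.0005,
        "6h": 0.007, "1d": 0.004, "3d": 0.002,
    }
    for rule in firing_rules(recovered, slo=0.999):
        print(rule.severity, rule.long_window, rule.short_window)
```

In practice the per-window error ratios would come from the monitoring system's recorded SLI series rather than a hand-written dictionary.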
Easy confusions
SLO (internal)
Engineering team's **self-commitment**.
Slightly tighter — leaves a cushion.
SLA (external)
Legal / financial liability.
Looser than SLO — **never promise customers your absolute limit**.