ArcLibrary

SLO / SLI / Error budget

Core SRE idea — reliability isn't 'as high as possible', it's 'just enough'.

SLO · SLI · SRE · Reliability
Core · Key Idea

In one line: measure with indicators (SLI), commit to a target (SLO) like 99.9 %, and the rest becomes error budget. When the budget runs out, freeze new feature launches and fix reliability — Google SRE's way of resolving the dev-vs-ops tension.

Key concepts

SLI (Service Level Indicator)
**The metric itself** — e.g. 'HTTP success rate', 'p99 latency < 300ms'.
SLO (Service Level Objective)
**Internal goal** — e.g. 'over a rolling 30-day window, SLI success ≥ 99.9 %'.
SLA (Service Level Agreement)
**External contract**: missing it costs money / refunds. Keep the SLA target lower (looser) than the SLO so there is a cushion.
Error Budget
100 % minus the SLO. A 99.9 % SLO over 30 days allows 0.1 % × 43,200 minutes = 43.2 minutes of downtime.
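Burn Rate
How fast the budget is being consumed, expressed as a multiple of the pace that would spend exactly the whole budget over the SLO window (1× = on pace). Burning a week's worth of budget in an hour → 'fast burn' alert.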
Customer-perceived
SLI should reflect user experience (e.g. LB-edge success rate), not internal components.
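The error-budget and burn-rate definitions above reduce to a few lines of arithmetic. The following is a minimal sketch, not a standard API: the function names are made up for illustration, and it assumes the SLO is given as a fraction (0.999 for 99.9 %) over a 30-day window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """Observed error ratio as a multiple of the budgeted error ratio.

    1.0 means the budget is being spent exactly on pace for the window.
    """
    return observed_error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if this burn rate is sustained."""
    return window_days * 24 / rate


print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(burn_rate(0.0144, 0.999), 1))      # 14.4
print(round(hours_to_exhaustion(14.4), 1))     # 50.0
```

Read this way, a 14.4× burn rate means a 30-day budget would be gone in about 50 hours, which is why high burn rates are treated as urgent.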

Classic SLI types

Availability / success rate
good_events / total_events — e.g. ratio of non-5xx responses.
Latency
Fraction of requests served faster than a threshold, or equivalently a percentile (P50 / P95 / P99) kept below that threshold. Use percentiles, not averages.
Correctness
Fraction of correct outputs (e.g. order totals without errors).
Freshness
Fraction of data updated within the last X minutes (search indexes, caches).
Throughput
RPS / QPS meeting business needs — usually a capacity metric, not an SLI.
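The availability and latency SLIs above are both "good events over total events" ratios. The sketch below shows one way to compute them from request records. It assumes a hypothetical `Request` record with a status code and latency, treats any non-5xx response as good, and uses a 300 ms threshold. All of these are illustrative choices, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to serve the request


def availability_sli(requests: Sequence[Request]) -> float:
    """good_events / total_events, where good = non-5xx response."""
    if not requests:
        raise ValueError("SLI is undefined with no traffic")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 300.0) -> float:
    """Fraction of requests served faster than threshold_ms."""
    if not requests:
        raise ValueError("SLI is undefined with no traffic")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120), Request(200, 450), Request(503, 30),
    Request(404, 80), Request(200, 310),
]
print(availability_sli(sample))  # 0.8
print(latency_sli(sample))       # 0.6
```

Note that the fast 503 counts as "fast" here. Whether failed requests should be excluded from the latency SLI is a design decision worth making explicitly.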

How to use it

Practical notes

  • Start with one SLO per critical user journey — don't roll out 50 at once.
  • Use a 30-day rolling window — shorter is noisy, longer is sluggish.
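  • Multi-window, multi-burn-rate alerts (recommended in the Google SRE Workbook): a common four-rule set pairs each long window with a short confirmation window: 14.4× over 1h (and 5m), 6× over 6h (and 30m), 3× over 1d (and 2h), 1× over 3d (and 6h). This gives few false positives and still catches slow burns.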
  • SLI ≠ monitor everything: "all health checks green" ≠ "users are happy". Always measure from the user's perspective.
  • Error-budget policy: write it down (e.g. remaining budget < X % → freeze launches). Otherwise nobody enforces it.
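  • Dependency transparency: your SLO can't exceed that of a hard dependency. A service that is itself 99.9 % available but needs a 99 % dependency on every request tops out near 0.999 × 0.99 ≈ 98.9 %, unless retries, caching, or fallbacks mask the dependency's failures.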
  • Composite SLO: weighted or worst-case combination of sub-SLIs — for complex journeys (login → browse → checkout).
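The multi-window, multi-burn-rate idea from the notes above can be sketched as a small rule evaluator. This is an illustration under stated assumptions: the window pairings follow the commonly published four-rule table, the page/ticket severities are a typical but not mandatory split, and `error_ratio` is a hypothetical callback standing in for whatever metrics query returns the error ratio over a given window.

```python
from typing import Callable, List, NamedTuple


class BurnRule(NamedTuple):
    long_window: str
    short_window: str
    threshold: float  # burn-rate multiple that must be exceeded
    severity: str


# Assumed rule table; tune the windows and thresholds to your own SLO.
RULES = [
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("1d", "2h", 3.0, "ticket"),
    BurnRule("3d", "6h", 1.0, "ticket"),
]


def firing_rules(error_ratio: Callable[[str], float], slo: float) -> List[BurnRule]:
    """Return rules whose long AND short windows both exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so alerts clear quickly after recovery.
    """
    budget = 1.0 - slo
    fired = []
    for rule in RULES:
        long_burn = error_ratio(rule.long_window) / budget
        short_burn = error_ratio(rule.short_window) / budget
        if long_burn > rule.threshold and short_burn > rule.threshold:
            fired.append(rule)
    return fired


# Made-up error ratios per window, simulating a sharp recent spike.
observed = {"5m": 0.02, "30m": 0.02, "1h": 0.016, "2h": 0.01,
            "6h": 0.004, "1d": 0.002, "3d": 0.0008}
fired = firing_rules(lambda window: observed[window], slo=0.999)
print([(r.long_window, r.severity) for r in fired])  # [('1h', 'page')]
```

In this made-up scenario only the fastest rule fires: the spike is severe over the last hour but has not yet consumed enough budget to trip the slower rules.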

Easy confusions

SLO (internal)
Engineering team's **self-commitment**.
Slightly tighter — leaves a cushion.
SLA (external)
Legal / financial liability.
Looser than SLO — **never promise customers your absolute limit**.

Further reading