Key Idea
In one line: always measure before you change anything. Brendan Gregg's USE method (Utilization / Saturation / Errors) plus his 60-second checklist find 80 % of bottlenecks in 5 minutes.
60-second checklist

```bash
uptime                # load 1/5/15 min; > #cores = backlog
dmesg | tail          # OOM? disk errors?
vmstat 1 5            # r col = CPU run queue, si/so = swap, wa = I/O wait
mpstat -P ALL 1       # per-CPU; a single saturated core = lock contention?
pidstat 1             # which process eats CPU
iostat -xz 1          # %util, await
free -m               # mem / cache / swap
sar -n DEV 1 5        # NIC throughput
sar -n TCP,ETCP 1 5   # TCP retransmits / segments
top                   # overview (or htop)
```
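To run the whole checklist in one pass and keep the output for later comparison, a small wrapper like the sketch below works; the script name, sample counts, and log path are my own choices, not part of Gregg's checklist.

```bash
#!/usr/bin/env bash
# checklist.sh -- hypothetical helper: run the 60-second checklist
# once and save all output to a single timestamped log for diffing.
out="perf-$(hostname)-$(date +%Y%m%d-%H%M%S).log"

for cmd in \
    "uptime" \
    "dmesg | tail" \
    "vmstat 1 5" \
    "mpstat -P ALL 1 5" \
    "pidstat 1 5" \
    "iostat -xz 1 5" \
    "free -m" \
    "sar -n DEV 1 5" \
    "sar -n TCP,ETCP 1 5"
do
    echo "===== $cmd =====" >>"$out"
    bash -c "$cmd" >>"$out" 2>&1
done
echo "wrote $out"
```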
Analogy
Performance tuning is like a doctor's exam: take temperature, blood pressure, blood test (USE metrics) before prescribing pills (tweaking knobs).
USE method

Utilization
Percent of time the resource is busy, e.g. CPU at 80 %, disk at 70 %.
Saturation
Work queued waiting for the resource: run queue length, I/O wait, TCP listen backlog overflow.
Errors
Error event counts: packet drops, I/O errors, OOM kills.
Apply each dimension to: CPU, memory, disk, network.
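As a minimal sketch of applying USE to one resource, the snippet below pulls a CPU snapshot from standard tools; the use_cpu.sh name and the exact fields read are my assumptions, not part of the method.

```bash
#!/usr/bin/env bash
# use_cpu.sh -- hypothetical sketch: one U/S/E snapshot for the CPU.

# Utilization: 100 - idle%, from the second vmstat sample
# (field 15 of vmstat's output is "id").
idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')
echo "Utilization: $((100 - idle))% busy"

# Saturation: 1-minute load average vs. core count.
echo "Saturation: load $(cut -d' ' -f1 /proc/loadavg) on $(nproc) cores"

# Errors: machine-check events logged by the kernel, if any.
mce=$(dmesg 2>/dev/null | grep -ci "machine check")
echo "Errors: $mce machine-check lines in dmesg"
```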
Quick reference

| Resource | Dimension   | Tools                                        |
|----------|-------------|----------------------------------------------|
| CPU      | Utilization | top, mpstat                                  |
| CPU      | Saturation  | uptime (loadavg), vmstat r                   |
| CPU      | Errors      | perf stat (cache misses / branch misses)     |
| Memory   | Utilization | free, /proc/meminfo                          |
| Memory   | Saturation  | vmstat si/so (swap), dmesg OOM               |
| Disk     | Utilization | iostat -x %util                              |
| Disk     | Saturation  | iostat await, vmstat wa                      |
| Disk     | Errors      | dmesg, smartctl                              |
| Network  | Utilization | sar -n DEV, nload                            |
| Network  | Saturation  | ss -ti (cwnd, rwnd), netstat -s retransmits  |
| Network  | Errors      | ip -s link, ethtool -S                       |
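As a worked example for the last row, per-interface error counters can be read as below; eth0 is a placeholder interface name, and which ethtool -S counters exist depends on the NIC driver.

```bash
# Kernel-level RX/TX packet, error, and drop counters
ip -s link show dev eth0
# Driver/hardware counters; names vary by NIC, so grep loosely
ethtool -S eth0 | grep -Ei 'err|drop|miss'
```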
Flame graphs + perf

```bash
# Sample all CPUs at 99 Hz for 30 s, capturing call stacks
sudo perf record -F 99 -ag -- sleep 30
# Collapse the stacks, then render the SVG
sudo perf script | inferno-collapse-perf | inferno-flamegraph > flame.svg
```

In a flame graph, horizontal width is the share of samples and the vertical axis is the call stack, so you can see at a glance where CPU time goes.
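inferno-flamegraph comes from the Rust inferno crate (cargo install inferno); if you don't have it, Brendan Gregg's original FlameGraph Perl scripts produce the same result:

```bash
git clone https://github.com/brendangregg/FlameGraph
sudo perf script | ./FlameGraph/stackcollapse-perf.pl \
                 | ./FlameGraph/flamegraph.pl > flame.svg
```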
How it works

Always re-measure after a change; otherwise you may have "fixed" something unrelated.
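One lightweight way to make re-measuring a habit is to snapshot the same metrics before and after each change and diff them; the file names here are arbitrary.

```bash
# Before the change
vmstat 1 5      > before-vmstat.txt
iostat -xz 1 5  > before-iostat.txt

# ...apply exactly one change...

# After the change
vmstat 1 5      > after-vmstat.txt
iostat -xz 1 5  > after-iostat.txt
diff before-vmstat.txt after-vmstat.txt
```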
Practical notes

- Two flavors of "slow": low throughput vs. high per-request latency. Identify which one you have before picking tools.
- Don't tune knobs first — check the app, DB indexes, and cache hit rate; most bottlenecks are in application code, not the kernel.
- High CPU isn't always bad — a batch job maxing out the CPU is good; long wait times are bad.
- Network: watch retransmissions and the TCP state distribution with ss -tan and netstat -s. Retrans > 1 % → check the link (see the sketch after this list).
- Swap cautiously — a server that starts using swap is usually beginning to slide; an OOM kill is more controllable.
- eBPF tools: bcc / bpftrace are the modern weapons (execsnoop / opensnoop / biosnoop / tcptop).
- Baselines matter — collect metrics while things are normal so you have something to compare against when they aren't.
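For the network bullet above, a rough since-boot retransmission rate can be computed from netstat -s counters; the awk matching below is my own sketch, and the counter wording can vary slightly between net-tools versions.

```bash
netstat -s | awk '
    /segments sent out/  { sent    = $1 }
    /segments retransmi/ { retrans = $1 }   # "segments retransmitted"
    END { if (sent > 0)
            printf "TCP retransmit rate: %.2f%%\n", 100 * retrans / sent }
'
```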
Easy confusions

- High load average: includes processes **waiting for I/O** (on Linux, uninterruptible sleep counts toward load). The CPU might be idle while the disk is stuck.
- High CPU utilization: the CPU is actually executing code. Use perf / flame graphs to find the hot spots.
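A quick way to tell the two apart at the shell; counting D-state (uninterruptible sleep) processes is my own heuristic, not a standard check.

```bash
uptime                                   # high load?
mpstat 1 1 | tail -2                     # but %idle also high?
ps -eo state,pid,comm | awk '$1 == "D"'  # then look for tasks blocked on I/O
```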