Key Idea
In one line: always measure before you change anything. Brendan Gregg's USE method (Utilization / Saturation / Errors) plus his 60-second checklist find 80 % of bottlenecks in 5 minutes.
60-second checklist

```bash
uptime                # load 1/5/15 min; > #cores = backlog
dmesg | tail          # OOM? disk errors?
vmstat 1 5            # r col = CPU run queue, si/so = swap, wa = I/O wait
mpstat -P ALL 1       # per-CPU; a single saturated core = lock contention?
pidstat 1             # which process eats CPU
iostat -xz 1          # %util, await
free -m               # mem / cache / swap
sar -n DEV 1 5        # NIC throughput
sar -n TCP,ETCP 1 5   # TCP retransmits / segments
top                   # overview (or htop)
```
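To run the whole checklist in one pass and keep the output for later comparison, a small wrapper like the sketch below works; the script name, sample counts, and log path are my own choices, not part of Gregg's checklist.

```bash
#!/usr/bin/env bash
# checklist.sh -- hypothetical helper: run the 60-second checklist
# once and save all output to a single timestamped log for diffing.
out="perf-$(hostname)-$(date +%Y%m%d-%H%M%S).log"

for cmd in \
    "uptime" \
    "dmesg | tail" \
    "vmstat 1 5" \
    "mpstat -P ALL 1 5" \
    "pidstat 1 5" \
    "iostat -xz 1 5" \
    "free -m" \
    "sar -n DEV 1 5" \
    "sar -n TCP,ETCP 1 5"
do
    echo "===== $cmd =====" >>"$out"
    bash -c "$cmd" >>"$out" 2>&1
done
echo "wrote $out"
```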
Analogy
Performance tuning is like a doctor's exam: take temperature, blood pressure, blood test (USE metrics) before prescribing pills (tweaking knobs).
USE method

Utilization
Percent of time the resource is busy, e.g. CPU at 80 %, disk at 70 %.
Saturation
Work queued waiting for the resource: run queue length, I/O wait, TCP listen backlog overflow.
Errors
Error event counts: packet drops, I/O errors, OOM kills.
Apply each dimension to: CPU, memory, disk, network.
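As a minimal sketch of applying USE to one resource, the snippet below pulls a CPU snapshot from standard tools; the use_cpu.sh name and the exact fields read are my assumptions, not part of the method.

```bash
#!/usr/bin/env bash
# use_cpu.sh -- hypothetical sketch: one U/S/E snapshot for the CPU.

# Utilization: 100 - idle%, from the second vmstat sample
# (field 15 of vmstat's output is "id").
idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')
echo "Utilization: $((100 - idle))% busy"

# Saturation: 1-minute load average vs. core count.
echo "Saturation: load $(cut -d' ' -f1 /proc/loadavg) on $(nproc) cores"

# Errors: machine-check events logged by the kernel, if any.
mce=$(dmesg 2>/dev/null | grep -ci "machine check")
echo "Errors: $mce machine-check lines in dmesg"
```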
Quick reference

| Resource | Dimension   | Tools                                        |
|----------|-------------|----------------------------------------------|
| CPU      | Utilization | top, mpstat                                  |
| CPU      | Saturation  | uptime (loadavg), vmstat r                   |
| CPU      | Errors      | perf stat (cache misses / branch misses)     |
| Memory   | Utilization | free, /proc/meminfo                          |
| Memory   | Saturation  | vmstat si/so (swap), dmesg OOM               |
| Disk     | Utilization | iostat -x %util                              |
| Disk     | Saturation  | iostat await, vmstat wa                      |
| Disk     | Errors      | dmesg, smartctl                              |
| Network  | Utilization | sar -n DEV, nload                            |
| Network  | Saturation  | ss -ti (cwnd, rwnd), netstat -s retransmits  |
| Network  | Errors      | ip -s link, ethtool -S                       |
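As a worked example for the last row, per-interface error counters can be read as below; eth0 is a placeholder interface name, and which ethtool -S counters exist depends on the NIC driver.

```bash
# Kernel-level RX/TX packet, error, and drop counters
ip -s link show dev eth0
# Driver/hardware counters; names vary by NIC, so grep loosely
ethtool -S eth0 | grep -Ei 'err|drop|miss'
```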
Flame graphs + perf

```bash
# Sample all CPUs at 99 Hz for 30 s, capturing call stacks
sudo perf record -F 99 -ag -- sleep 30
# Collapse the stacks, then render the SVG
sudo perf script | inferno-collapse-perf | inferno-flamegraph > flame.svg
```

In a flame graph, horizontal width is the share of samples and the vertical axis is the call stack, so you can see at a glance where CPU time goes.
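inferno-flamegraph comes from the Rust inferno crate (cargo install inferno); if you don't have it, Brendan Gregg's original FlameGraph Perl scripts produce the same result:

```bash
git clone https://github.com/brendangregg/FlameGraph
sudo perf script | ./FlameGraph/stackcollapse-perf.pl \
                 | ./FlameGraph/flamegraph.pl > flame.svg
```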
How it works

Always re-measure after a change; otherwise you may have "fixed" something unrelated.
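One lightweight way to make re-measuring a habit is to snapshot the same metrics before and after each change and diff them; the file names here are arbitrary.

```bash
# Before the change
vmstat 1 5      > before-vmstat.txt
iostat -xz 1 5  > before-iostat.txt

# ...apply exactly one change...

# After the change
vmstat 1 5      > after-vmstat.txt
iostat -xz 1 5  > after-iostat.txt
diff before-vmstat.txt after-vmstat.txt
```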
Practical notes

- Two flavors of "slow": low throughput vs. high per-request latency. Identify which one you have before picking tools.
- Don't tune knobs first — check the app, DB indexes, and cache hit rate; most bottlenecks are in application code, not the kernel.
- High CPU isn't always bad — a batch job maxing out the CPU is good; long wait times are bad.
- Network: watch retransmissions and the TCP state distribution with ss -tan and netstat -s. Retrans > 1 % → check the link (see the sketch after this list).
- Swap cautiously — a server that starts using swap is usually beginning to slide; an OOM kill is more controllable.
- eBPF tools: bcc / bpftrace are the modern weapons (execsnoop / opensnoop / biosnoop / tcptop).
- Baselines matter — collect metrics while things are normal so you have something to compare against when they aren't.
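For the network bullet above, a rough since-boot retransmission rate can be computed from netstat -s counters; the awk matching below is my own sketch, and the counter wording can vary slightly between net-tools versions.

```bash
netstat -s | awk '
    /segments sent out/  { sent    = $1 }
    /segments retransmi/ { retrans = $1 }   # "segments retransmitted"
    END { if (sent > 0)
            printf "TCP retransmit rate: %.2f%%\n", 100 * retrans / sent }
'
```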
Easy confusions

- High load average: includes processes **waiting for I/O** (on Linux, uninterruptible sleep counts toward load). The CPU might be idle while the disk is stuck.
- High CPU utilization: the CPU is actually executing code. Use perf / flame graphs to find the hot spots.
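A quick way to tell the two apart at the shell; counting D-state (uninterruptible sleep) processes is my own heuristic, not a standard check.

```bash
uptime                                   # high load?
mpstat 1 1 | tail -2                     # but %idle also high?
ps -eo state,pid,comm | awk '$1 == "D"'  # then look for tasks blocked on I/O
```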