Key Idea
In one line: a backup's only purpose is restore. Keep 3 copies on 2 media types with 1 off-site, and drill restores regularly; if any one of these is missing, it does not count as a backup.
The 3-2-1 rule
3 copies: original + local copy + offsite copy
2 media types: local disk + object storage / tape
1 offsite: another DC / region / cloud
Add: at least 1 immutable copy (against ransomware; S3 Object Lock / WORM).
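The rule above can be checked mechanically against an inventory of copies. The following is a minimal sketch, not a standard tool: the `BackupCopy` type, the `check_3_2_1` function, and the site and medium names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    # Hypothetical inventory record, assumed for this sketch.
    name: str
    medium: str              # e.g. "disk", "object-storage", "tape"
    site: str                # data center, region, or cloud name
    immutable: bool = False  # True if protected by WORM / Object Lock


def check_3_2_1(copies, primary_site):
    """Return the list of violated rules; an empty list means all rules hold."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.medium for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.site != primary_site for c in copies):
        problems.append("no off-site copy")
    if not any(c.immutable for c in copies):
        problems.append("no immutable copy")
    return problems


copies = [
    BackupCopy("original", "disk", "dc-a"),
    BackupCopy("local-backup", "disk", "dc-a"),
    BackupCopy("offsite-backup", "object-storage", "region-b", immutable=True),
]
print(check_3_2_1(copies, primary_site="dc-a"))  # prints []
```

Dropping the off-site copy from the inventory makes all four checks fail at once, which is why that single copy usually carries the second medium and the immutability as well.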
Analogy
Having no backup is like carrying your only key in your pocket: lose it and you can't get home. A backup is like keeping several keys in a safe: lose one and you still have the others. Spreading them across locations means a single fire doesn't take them all.
Key concepts
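RTO (Recovery Time Objective)
The maximum acceptable time to be back online after an incident.
RPO (Recovery Point Objective)
The maximum tolerable data loss, measured as a window of time before the incident.
Full / Incremental / Differential
A full backup is large but restores on its own. An incremental is small, but a restore must apply the last full plus every incremental since. A differential (all changes since the last full) is the middle ground: a restore needs only the full plus the latest differential.
PITR (Point-in-Time Recovery)
Restore to any point in time within the retained log window (for databases, by replaying the WAL on top of a base backup).
Drill (DR Drill)
Periodically restore a backup into a test environment and run end-to-end checks against it; only then does it count as an **effective backup**.
WORM / Object Lock (Immutability)
Object storage that cannot be deleted or modified for a set period, which **resists ransomware and accidental deletion**.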
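The difference between incremental and differential restores is easiest to see by listing which backups a restore has to read. This is a minimal sketch under stated assumptions: the `Backup` type and `restore_chain` function are hypothetical, days are plain integers, and a schedule uses either incrementals or differentials after a full, never a mix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int
    kind: str  # "full", "incr" (changes since previous backup), or "diff" (changes since last full)


def restore_chain(backups, target_day):
    """Return the backups to apply, in order, to restore the state as of target_day."""
    usable = sorted((b for b in backups if b.day <= target_day), key=lambda b: b.day)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target day")
    base = fulls[-1]
    later = [b for b in usable if b.day > base.day]
    diffs = [b for b in later if b.kind == "diff"]
    incrs = [b for b in later if b.kind == "incr"]
    if diffs and incrs:
        raise ValueError("mixed schedules are out of scope for this sketch")
    if diffs:
        # The latest differential already contains every change since the full.
        return [base, diffs[-1]]
    # Incrementals must all be applied, oldest first.
    return [base, *incrs]


incremental_schedule = [Backup(0, "full")] + [Backup(d, "incr") for d in (1, 2, 3)]
differential_schedule = [Backup(0, "full")] + [Backup(d, "diff") for d in (1, 2, 3)]

print([b.day for b in restore_chain(incremental_schedule, 3)])   # prints [0, 1, 2, 3]
print([b.day for b in restore_chain(differential_schedule, 3)])  # prints [0, 3]
```

The incremental chain grows with every day since the last full, and one missing or corrupt link breaks the restore; the differential chain always has length two, at the price of larger daily backups.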
Typical data types and approaches
- PostgreSQL / MySQL: pg_basebackup + WAL streaming (PostgreSQL) or mysqlbackup + binlog (MySQL), shipped to S3. PITR is required.
- Redis: RDB snapshots + AOF. Use both in production.
- Object storage: cross-region replication (CRR) or cross-cloud replication, e.g. S3 → R2, OSS → COS.
- K8s config + PVC: Velero backs up Kubernetes API objects (the cluster state held in etcd) plus volume snapshots.
- Code: Git and the image registry already act as distributed copies; additionally mirror repositories, e.g. GitHub → self-hosted Gitea / GitLab.
- Whole VMs: scheduled cloud snapshots + an off-site copy.
How it works
The monthly full-restore drill is the most-skipped step and the most important one.
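A drill is only useful if it fails loudly. The sketch below shows one possible shape for an automated drill; `run_drill` and its `restore`, `health_check`, and `alert` callables are hypothetical placeholders for whatever restore tooling and alerting is actually in use, and the RTO value is an assumed example.

```python
import time


def run_drill(restore, health_check, alert, rto_seconds):
    """Run one restore drill; return True only if restore and checks pass within the RTO."""
    started = time.monotonic()
    try:
        restore()                 # hypothetical: restores the latest backup into a test env
        healthy = health_check()  # hypothetical: end-to-end check against the restored data
    except Exception as exc:
        alert(f"restore drill failed: {exc}")
        return False
    elapsed = time.monotonic() - started
    if not healthy:
        alert("restore drill failed: health check did not pass")
        return False
    if elapsed > rto_seconds:
        alert(f"restore drill exceeded RTO: {elapsed:.0f}s > {rto_seconds}s")
        return False
    return True


def broken_restore():
    raise RuntimeError("backup archive is corrupt")


alerts = []
print(run_drill(lambda: None, lambda: True, alerts.append, rto_seconds=3600))    # prints True
print(run_drill(broken_restore, lambda: True, alerts.append, rto_seconds=3600))  # prints False
print(alerts)  # prints ['restore drill failed: backup archive is corrupt']
```

Timing the drill against the RTO matters as much as the pass/fail result: a restore that succeeds but takes longer than the target is still a failed drill.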
Practical notes
- Write down RTO / RPO targets first, then derive the plan from them. "1-hour recovery, 5-minute loss" and "3-day recovery, 1-day loss" can differ in cost by roughly 10×.
- Encrypt backups: S3 server-side encryption (SSE) plus keys you manage yourself (KMS); encrypt off-site copies too.
- Lifecycle policies: keep dailies for 30 days, monthlies for 12 months, and yearlies for 5 years, so that retention cost is tiered.
- Automate drills: monthly restore to a test env + health check; alert on failure.
- Separate delete authority: backup system credentials are isolated from production credentials so a compromised admin can't wipe backups.
- Monitor the backups themselves: job failures, upload errors, and cross-region lag should all raise Prometheus alerts.
- Don't keep backups only in the same account / cluster: if that account is compromised, the data and the backups are lost together.
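The tiered retention in the lifecycle bullet above (30 days of dailies, 12 months of monthlies, 5 years of yearlies) can be expressed as a single keep-or-expire decision. This is a sketch under assumptions that are not in the original: the monthly copy is taken to be the backup made on the 1st of each month, the yearly copy the one made on 1 January, and months and years are approximated as 365-day blocks. The `is_retained` name is hypothetical.

```python
from datetime import date


def is_retained(backup_day, today):
    """Decide whether a daily backup taken on backup_day is still kept on today."""
    age_days = (today - backup_day).days
    if age_days < 0:
        raise ValueError("backup_day is in the future")
    if age_days <= 30:
        return True  # daily tier: everything from the last 30 days
    if backup_day.day == 1 and age_days <= 365:
        return True  # monthly tier (assumed: first-of-month backups, about 12 months)
    if backup_day.month == 1 and backup_day.day == 1 and age_days <= 5 * 365:
        return True  # yearly tier (assumed: 1 January backups, about 5 years)
    return False


today = date(2024, 6, 15)
print(is_retained(date(2024, 6, 1), today))   # prints True  (daily tier)
print(is_retained(date(2024, 5, 10), today))  # prints False (older than 30 days, not a monthly)
print(is_retained(date(2024, 1, 1), today))   # prints True  (monthly tier)
print(is_retained(date(2021, 1, 1), today))   # prints True  (yearly tier)
print(is_retained(date(2018, 1, 1), today))   # prints False (older than 5 years)
```

In practice the same tiers are usually configured in the storage system's own lifecycle rules rather than in application code; the function is only meant to make the policy's arithmetic explicit.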
Easy confusions
Backup
Protects against **data loss / logical mistakes**.
Historical copies, time travel.
High Availability (HA)
Protects against **instance / DC outage**.
Real-time sync — **mistakes propagate instantly**.
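The distinction can be shown with a toy model. This is only an illustration with in-memory dictionaries standing in for a primary, a replica, and a backup store; none of the names correspond to a real system.

```python
import copy

primary = {"orders": [1, 2, 3]}

# Backup: a historical copy taken at a point in time.
snapshots = {"monday": copy.deepcopy(primary)}

# HA: a replica that mirrors the primary's current state.
replica = copy.deepcopy(primary)

# An accidental delete on the primary...
primary["orders"].clear()
# ...is faithfully replicated, because replication copies every change.
replica = copy.deepcopy(primary)

print(replica["orders"])              # prints []
print(snapshots["monday"]["orders"])  # prints [1, 2, 3]
```

The replica keeps the service up if the primary fails, but only the snapshot can undo the mistake, which is why the two are complements rather than substitutes.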