Backup & Restore (3-2-1 rule)

Key Idea

In one line: a backup's only purpose is to be restored. Keep 3 copies on 2 media types with 1 off-site, and drill restores regularly; if any one of these is missing, you do not have a working backup.

The 3-2-1 rule

3 copies: original + local copy + offsite copy
2 media types: local disk + object storage / tape
1 offsite: another DC / region / cloud

In addition: keep at least 1 immutable copy as a defense against ransomware (e.g. S3 Object Lock / WORM).
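The rule above can be written as a small checklist function. This is only a minimal sketch: the `Copy` record, the `check_321` helper, and the location names are hypothetical and do not come from any backup tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Copy:
    name: str
    media: str              # e.g. "disk", "object-storage", "tape"
    location: str           # e.g. "dc-1", "region-b" (illustrative names)
    immutable: bool = False

def check_321(copies, primary_location):
    """Return the list of violated rules; an empty list means the plan passes."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.location != primary_location for c in copies):
        problems.append("no offsite copy")
    if not any(c.immutable for c in copies):
        problems.append("no immutable copy")
    return problems

plan = [
    Copy("original", "disk", "dc-1"),
    Copy("local-copy", "disk", "dc-1"),
    Copy("offsite-copy", "object-storage", "region-b", immutable=True),
]
print(check_321(plan, primary_location="dc-1"))
```

A check like this could run in CI against an inventory of backup targets, so that a plan which silently drops to two copies is flagged early.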

Analogy

Having no backup is like carrying your only key in your pocket: lose it and you can't get home. Having backups is like keeping several keys in safes: lose one and you still have the others, and spreading them across locations means one fire can't take them all.

Key concepts

RTO (Recovery Time Objective)

How long to be back online after an incident.

RPO (Recovery Point Objective)

How much data loss can be tolerated, expressed as a window of time (for example, "at most the last 5 minutes of writes").
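To make RPO concrete: the worst-case loss of a schedule is roughly the gap between two consecutive backups plus the time a backup takes to land safely. The sketch below checks a schedule against a target under that simplifying assumption; `worst_case_data_loss`, `meets_rpo`, and the `upload_delay` parameter are hypothetical names used for illustration.

```python
from datetime import timedelta

def worst_case_data_loss(backup_interval, upload_delay=timedelta(0)):
    # Data written right after a backup starts stays unprotected until
    # the next backup has been taken and fully uploaded.
    return backup_interval + upload_delay

def meets_rpo(backup_interval, rpo, upload_delay=timedelta(0)):
    return worst_case_data_loss(backup_interval, upload_delay) <= rpo

# A nightly dump cannot satisfy a 5-minute RPO.
print(meets_rpo(timedelta(hours=24), rpo=timedelta(minutes=5)))

# Shipping logs every minute with a 30-second upload delay can.
print(meets_rpo(timedelta(minutes=1), rpo=timedelta(minutes=5),
                upload_delay=timedelta(seconds=30)))
```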

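Full / Incremental / Differential

A full backup is large but restores on its own. An incremental is small, but a restore must stack the full backup plus every incremental taken since. A differential (all changes since the last full) is the middle ground: a restore needs only the full plus the latest differential.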
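The difference shows up most clearly in which backups a restore has to read. The sketch below computes that chain, assuming an incremental holds changes since the previous backup of any kind and a differential holds changes since the last full; real tools may define these slightly differently, and `restore_chain` is a hypothetical helper.

```python
def restore_chain(kinds, target):
    """Return the indices of the backups needed to restore backup `target`.

    `kinds` is a chronological list of "full", "incr", or "diff".
    """
    chain = []
    i = target
    while True:
        chain.append(i)
        if kinds[i] == "full":
            break
        if kinds[i] == "diff":
            # A differential depends only on the most recent full backup.
            i = max((j for j in range(i) if kinds[j] == "full"), default=-1)
        else:
            # An incremental depends on the backup taken just before it.
            i -= 1
        if i < 0:
            raise ValueError("no full backup precedes the target")
    return chain[::-1]

print(restore_chain(["full", "incr", "incr", "incr"], target=3))
print(restore_chain(["full", "diff", "diff", "diff"], target=3))
```

With incrementals the chain grows with every backup since the last full, so a single damaged link breaks the restore; with differentials the chain stays at two entries.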

PITR (Point-in-Time Recovery)

Restore to any point in time by replaying the database's write-ahead log (WAL) on top of a base backup.
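The idea behind log replay can be shown with a toy key-value store: start from a base snapshot and re-apply logged operations up to the target time. This is an illustrative model only, not PostgreSQL's actual WAL format; `restore_to_point` and the tuple layout of the log are assumptions made for the example.

```python
def restore_to_point(base_snapshot, log, target_time):
    """Rebuild state as of `target_time` from a snapshot plus an ordered log.

    `log` is a list of (timestamp, op, key, value) tuples sorted by timestamp.
    """
    state = dict(base_snapshot)
    for ts, op, key, value in log:
        if ts > target_time:
            break                      # stop just before the unwanted changes
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

base = {"a": 1}
log = [
    (10, "set", "b", 2),
    (20, "delete", "a", None),         # the mistake we want to undo
    (30, "set", "b", 3),
]
print(restore_to_point(base, log, target_time=15))
print(restore_to_point(base, log, target_time=25))
```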

DR Drill

Periodically restore a backup to a test environment and run it end-to-end. Only then does it count as an **effective backup**.
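One small piece of a drill is easy to automate: restore the archive into a scratch directory and compare the restored files against known checksums. The sketch below does only that and assumes a zip archive with a checksum manifest; a real drill would also start the application against the restored data. `drill` and `sha256_of` are hypothetical helpers.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path

def sha256_of(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def drill(archive_path, manifest):
    """Restore into a scratch dir; return files that are missing or differ."""
    failures = []
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(scratch)
        for rel_path, digest in manifest.items():
            restored = Path(scratch) / rel_path
            if not restored.is_file() or sha256_of(restored) != digest:
                failures.append(rel_path)
    return failures

# Demo: build a tiny backup, then drill it.
with tempfile.TemporaryDirectory() as work:
    src = Path(work) / "data.txt"
    src.write_text("orders table dump")
    archive = Path(work) / "backup.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.write(src, arcname="data.txt")
    manifest = {"data.txt": sha256_of(src)}
    print(drill(archive, manifest))    # an empty list means the drill passed
```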

WORM / Object Lock (Immutability)

Objects in storage that can't be deleted or modified for a set retention period, which **resists ransomware and accidental deletion**.
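The retention behaviour can be modelled with a toy in-memory store that rejects overwrites and deletes until the retention period has passed. This is an illustrative model only: `WormStore` is hypothetical, time is a plain number of days, and real object-lock implementations such as S3 Object Lock protect individual object versions, which this sketch does not model.

```python
class WormStore:
    """Toy write-once store: objects are locked until `retain_until`."""

    def __init__(self):
        self._objects = {}             # key -> (data, retain_until)

    def _locked(self, key, now):
        return key in self._objects and now < self._objects[key][1]

    def put(self, key, data, now, retain_for):
        if self._locked(key, now):
            raise PermissionError(f"{key} is locked, cannot overwrite")
        self._objects[key] = (data, now + retain_for)

    def delete(self, key, now):
        if self._locked(key, now):
            raise PermissionError(f"{key} is locked, cannot delete")
        del self._objects[key]

    def get(self, key):
        return self._objects[key][0]

store = WormStore()
store.put("db-2024-06-01.dump", b"...", now=0, retain_for=30)
try:
    store.delete("db-2024-06-01.dump", now=10)   # attacker or typo on day 10
except PermissionError as err:
    print(err)
store.delete("db-2024-06-01.dump", now=31)       # allowed after retention ends
```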

Typical data types and approaches

PostgreSQL / MySQL: pg_basebackup + WAL streaming for PostgreSQL, mysqlbackup + binlog for MySQL, both shipped to S3. PITR is required.
Redis: RDB snapshots + AOF. Use both in production.
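Object storage: Cross-region replication (CRR) or cross-cloud replication, e.g. S3 → R2, OSS → COS.
K8s config + PVC: Velero backs up Kubernetes API objects (it reads them through the API server rather than from etcd directly) + volume snapshots.
Code: git is itself a distributed backup, and the image registry holds the built images; additionally mirror GitHub to a self-hosted Gitea / GitLab.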
Whole VMs: Scheduled cloud snapshots + offsite copy.

How it works

The monthly full-restore drill is the most often skipped step and the most important one.

Practical notes

Write down RTO / RPO targets first, then derive the plan. "1-hour recovery, 5-minute loss" and "3-day recovery, 1-day loss" can differ in cost by roughly an order of magnitude.
Encrypt backups: S3 server-side encryption (SSE) with customer-managed keys (KMS); encrypt offsite copies too.
Lifecycle policies: keep daily backups for 30 days, monthly for 12 months, yearly for 5 years, so that storage cost is tiered.
Automate drills: restore to a test environment every month, run a health check, and alert on failure.
Separate delete authority: keep backup system credentials isolated from production credentials so a compromised admin account can't wipe the backups.
Monitor the backups themselves: job failures, upload errors, cross-region lag. Put all of them in Prometheus alerts.
Don't keep backups only in the same account / cluster: if the account is compromised, the data and the backups are gone together.
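The lifecycle policy in the notes above (daily for 30 days, monthly for 12 months, yearly for 5 years) can be sketched as a keep-or-expire decision. The sketch assumes the backup taken on the 1st of a month is the monthly one and the backup taken on 1 January is the yearly one, and approximates 12 months as 365 days and 5 years as 1825 days; `should_keep` is a hypothetical helper, not a storage provider's lifecycle API.

```python
from datetime import date

def should_keep(backup_day, today):
    """Tiered retention: daily 30 d, monthly 12 mo, yearly 5 yr."""
    age = (today - backup_day).days
    if age < 0:
        raise ValueError("backup is dated in the future")
    if age <= 30:
        return True                                   # daily tier
    if backup_day.day == 1 and age <= 365:
        return True                                   # monthly tier
    if backup_day.month == 1 and backup_day.day == 1 and age <= 5 * 365:
        return True                                   # yearly tier
    return False

today = date(2024, 6, 15)
for day in (date(2024, 6, 1), date(2024, 4, 10), date(2024, 1, 1),
            date(2023, 3, 1), date(2021, 1, 1), date(2018, 1, 1)):
    print(day, "keep" if should_keep(day, today) else "expire")
```

In practice the same tiers are usually expressed as lifecycle rules on the storage bucket; a function like this is mainly useful for testing that the rules match the written policy.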

Easy confusions

Backup

Protects against **data loss / logical mistakes**.
Historical copies, time travel.

High Availability (HA)

Protects against **instance / DC outage**.
Real-time sync, so **mistakes propagate instantly**.
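The contrast can be shown with three dictionaries standing in for a primary, its HA replica, and an earlier backup snapshot. This is a toy illustration with hypothetical names, not a model of any particular replication system.

```python
primary = {"users": 100}
replica = dict(primary)        # HA replica, kept in sync on every write
backup = dict(primary)         # snapshot taken before the mistake

def delete_everywhere(key):
    # Replication faithfully applies the same write to every live node.
    for node in (primary, replica):
        node.pop(key, None)

delete_everywhere("users")     # the accidental delete

print("users" in replica)      # the replica lost the data as well
print(backup["users"])         # the backup still holds the old value

primary.update(backup)         # only the backup can bring the data back
```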