ArcLibrary

Backup & Restore (3-2-1 rule)

An untested backup isn't a backup — the bottom line for data safety.

Backup · Restore · Disaster
Key Idea

In one line: a backup's only purpose is restore. You need 3 copies, 2 media types, 1 off-site copy, and regularly drilled restores; miss any one of them and you don't really have a backup.

The 3-2-1 rule

3 copies: original + local copy + offsite copy
2 media types: local disk + object storage / tape
1 offsite: another DC / region / cloud

Add: at least 1 immutable copy (against ransomware; S3 Object Lock / WORM).
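
The rule above is easy to check mechanically against an inventory of copies. The following is a minimal sketch, not a definitive tool: the `Copy` record, the `check_321` function, and the site and media labels are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str                # hypothetical label, e.g. "original"
    media: str               # e.g. "disk", "object-storage", "tape"
    site: str                # e.g. a DC, region, or cloud account
    immutable: bool = False  # True if protected by WORM / Object Lock


def check_321(copies: list[Copy], primary_site: str) -> list[str]:
    """Return the violated rules; an empty list means the set passes."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.site != primary_site for c in copies):
        problems.append("no offsite copy")
    if not any(c.immutable for c in copies):
        problems.append("no immutable copy")
    return problems


copies = [
    Copy("original", "disk", "dc-a"),
    Copy("local-copy", "disk", "dc-a"),
    Copy("offsite-copy", "object-storage", "region-b", immutable=True),
]
print(check_321(copies, primary_site="dc-a"))  # prints []
```

Returning the list of violations, rather than a single boolean, makes the output usable directly as an alert message.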

Analogy


Having no backup is like carrying your only key in your pocket: lose it and you can't get home. Backups are like several keys kept in safes: lose one and you still have the others, and spreading them across locations means a single fire can't take them all.

Key concepts

RTO (Recovery Time Objective)
How long it may take to be back online after an incident.
RPO (Recovery Point Objective)
How much data loss can be tolerated, expressed as a window of time before the incident.
Full / Incremental / Differential
A full backup is large. An incremental is small, but a restore has to stack the last full plus every incremental after it. A differential holds all changes since the last full, so it sits in the middle.
PITR (Point-in-Time Recovery)
Restore to any point in time (e.g. by replaying the database WAL).
Drill (DR Drill)
Periodically restore a backup to a test environment and run it end-to-end; only then is it an **effective backup**.
WORM / Object Lock (Immutability)
Object storage that can't be deleted or modified for a set period; **resists ransomware + accidental delete**.
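
To make the full / incremental / differential trade-off concrete, here is a small sketch that works out which backups a restore has to apply. It assumes each incremental holds changes since the previous backup of any kind and each differential holds changes since the last full; real tools differ on these details. The `Backup` record and `restore_chain` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int   # day number the backup was taken (illustrative timeline)
    kind: str  # "full", "incremental", or "differential"


def restore_chain(backups: list[Backup], target_day: int) -> list[Backup]:
    """Return the backups to apply, in order, to restore to target_day."""
    usable = sorted((b for b in backups if b.day <= target_day),
                    key=lambda b: b.day)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target day")
    base = fulls[-1]
    after = [b for b in usable if b.day > base.day]
    # A differential holds every change since the last full, so the latest
    # one replaces all earlier incrementals and differentials.
    diffs = [b for b in after if b.kind == "differential"]
    chain = [base]
    start = base.day
    if diffs:
        chain.append(diffs[-1])
        start = diffs[-1].day
    # Every incremental after the starting point must be applied in order.
    chain += [b for b in after if b.kind == "incremental" and b.day > start]
    return chain


incremental_week = [Backup(0, "full")] + [Backup(d, "incremental") for d in (1, 2, 3)]
differential_week = [Backup(0, "full")] + [Backup(d, "differential") for d in (1, 2, 3)]
print(len(restore_chain(incremental_week, 3)))   # prints 4
print(len(restore_chain(differential_week, 3)))  # prints 2
```

The incremental schedule needs four pieces to restore day 3, while the differential schedule needs only two. This is the restore-time cost that the smaller incremental backups pay for.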

Typical data types and approaches

PostgreSQL / MySQL
pg_basebackup + WAL streaming (PostgreSQL) / mysqlbackup + binlog (MySQL) → S3. PITR is required.
Redis
RDB snapshots + AOF. Use both in production.
Object storage
Cross-region replication (CRR) or cross-cloud replication. S3 → R2, OSS → COS, etc.
K8s config + PVC
Velero backs up Kubernetes API objects (the cluster state held in etcd) + volume snapshots.
Code
Git is itself a distributed backup, and the image registry holds the built images; additionally mirror GitHub → self-hosted Gitea / GitLab.
Whole VMs
Scheduled cloud snapshots + an offsite copy.
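
As an illustration of the PostgreSQL row, the sketch below builds a `pg_basebackup` invocation from Python. The host, user, and target directory are placeholder assumptions, and `basebackup_cmd` / `take_basebackup` are hypothetical helper names. It covers only the base backup; continuous WAL archiving for PITR and the upload to S3 are separate steps not shown here.

```python
import subprocess


def basebackup_cmd(host: str, user: str, target_dir: str) -> list[str]:
    """Build a pg_basebackup invocation for a compressed tar base backup."""
    return [
        "pg_basebackup",
        f"--host={host}",
        f"--username={user}",
        f"--pgdata={target_dir}",  # output directory for the backup
        "--format=tar",            # tar archives instead of a plain data dir
        "--gzip",                  # compress the tar output
        "--wal-method=stream",     # stream WAL generated during the backup
        "--checkpoint=fast",       # request an immediate checkpoint
    ]


def take_basebackup(host: str, user: str, target_dir: str) -> None:
    # check=True raises CalledProcessError on a non-zero exit, so a failed
    # backup cannot pass silently.
    subprocess.run(basebackup_cmd(host, user, target_dir), check=True)


# Placeholder values; printing the command does not run it.
cmd = basebackup_cmd("db.internal.example", "replicator", "/backups/base/2024-01-01")
print(" ".join(cmd))
```

Keeping the command builder separate from the function that runs it makes the invocation easy to review and test without a database at hand.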

How it works

The monthly full-restore drill is the most-skipped and the most important step.
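
A drill can be reduced to three things: restore, verify, and time the run against the RTO target. The sketch below shows that shape under stated assumptions: `run_drill` and `DrillResult` are hypothetical names, and the `restore` and `health_check` callables stand in for whatever restore job and end-to-end check a real environment would use.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DrillResult:
    healthy: bool           # did the restored system pass the health check?
    elapsed_seconds: float  # measured restore-and-verify time
    within_rto: bool        # did the run finish inside the RTO target?

    @property
    def passed(self) -> bool:
        return self.healthy and self.within_rto


def run_drill(restore: Callable[[], None],
              health_check: Callable[[], bool],
              rto_seconds: float) -> DrillResult:
    """Restore into a test environment, verify it, and time the whole run."""
    started = time.monotonic()
    try:
        restore()
        healthy = health_check()
    except Exception:
        # Any failure during restore or verification is a failed drill.
        healthy = False
    elapsed = time.monotonic() - started
    return DrillResult(healthy, elapsed, elapsed <= rto_seconds)


# Stand-ins for a real restore job and end-to-end check.
result = run_drill(restore=lambda: None, health_check=lambda: True, rto_seconds=3600)
print(result.passed)  # prints True
```

Measuring elapsed time in the drill itself means the RTO figure is checked against an observed restore, not an estimate.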

Practical notes

  • Write down RTO / RPO targets first, then derive the plan. "1-hour recovery, 5-minute loss" and "3-day recovery, 1-day loss" can differ in cost by roughly 10×.
  • Encrypt backups: S3 SSE with customer-managed keys (KMS); encrypt offsite copies too.
  • Lifecycle policies: keep dailies for 30 days, monthlies for 12 months, yearlies for 5 years, so cost is tiered by age.
  • Automate drills: restore to a test env every month, run a health check, and alert on failure.
  • Separate delete authority: keep backup-system credentials isolated from production credentials so a compromised admin can't wipe backups.
  • Monitor the backups themselves: job failures, upload errors, cross-region lag, all wired into Prometheus alerts.
  • Don't keep backups only in the same account / cluster: if the account is compromised, data and backups are both gone.
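
The lifecycle bullet above (dailies for 30 days, monthlies for 12 months, yearlies for 5 years) can be sketched as a retention predicate. This is an illustration under assumptions the list does not state: the first-of-month backup is taken as the monthly copy, the 1 January backup as the yearly copy, and months and years are approximated in days. The `keep` function is hypothetical; in practice the same policy is often expressed as storage lifecycle rules.

```python
from datetime import date


def keep(backup_day: date, today: date) -> bool:
    """Decide whether a daily backup taken on backup_day is still retained."""
    age_days = (today - backup_day).days
    if age_days < 0:
        raise ValueError("backup is dated in the future")
    if age_days <= 30:
        return True  # every daily for 30 days
    if backup_day.day == 1 and age_days <= 365:
        return True  # first-of-month copy for about 12 months
    if backup_day.month == 1 and backup_day.day == 1 and age_days <= 5 * 365:
        return True  # first-of-year copy for about 5 years
    return False


today = date(2024, 6, 15)
print(keep(date(2024, 6, 1), today))   # prints True  (recent daily)
print(keep(date(2024, 4, 10), today))  # prints False (old daily, not a monthly)
print(keep(date(2024, 3, 1), today))   # prints True  (monthly)
print(keep(date(2021, 1, 1), today))   # prints True  (yearly)
print(keep(date(2018, 1, 1), today))   # prints False (older than 5 years)
```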

Easy confusions

Backup
Protects against **data loss / logical mistakes**.
Keeps historical copies, so you can go back in time.
High Availability (HA)
Protects against **instance / DC outage**.
Syncs in real time, so **mistakes propagate instantly**; HA does not replace backups.

Further reading