Key Idea
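In one line: Prometheus scrapes metrics and stores them in its TSDB, Alertmanager routes the alerts, and Grafana visualizes. The kube-prometheus-stack Helm chart installs the whole stack, including node-exporter and kube-state-metrics, and scrapes the cAdvisor metrics built into the kubelet, covering hosts, K8s objects, and application containers.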
What it is#
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace

You get:
- Prometheus — metrics TSDB
- Alertmanager — alert grouping / silencing / routing
- node-exporter — host CPU / memory / disk / network
- kube-state-metrics — K8s object state
- Grafana — visualization + curated dashboards
Analogy#
Analogy
Prometheus = the accountant collecting and tallying; Grafana = the wall display turning ledgers into charts; Alertmanager = the front-desk secretary — calls / pings / messages you when numbers go off.
Key concepts#
ServiceMonitor / PodMonitor (scrape declarations)
Prometheus Operator CRDs that express "scrape /metrics on these Pods / Services" as K8s objects.
Recording Rule
Periodically pre-computes an expensive PromQL expression into a new series, so dashboards load fast.
Alert Rule
Fires when its PromQL expression returns a non-empty result continuously for the `for:` duration (e.g. `for: 5m`); until then the alert is pending.
Dashboard
Grafana JSON. Import official / community ones from grafana.com.
Datasource
Prometheus / Loki / Tempo / MySQL: Grafana is the unified observability pane.
Long-term Storage
Prometheus keeps 15 days of data locally by default. Mimir / Thanos / VictoriaMetrics add long-term retention and multi-cluster federation.
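The scrape-declaration, recording-rule, and alert-rule concepts above can be sketched as two Operator objects. This is a minimal illustration, not a drop-in config: the app name `demo-app`, the port name `http-metrics`, the metric `http_requests_total` with its `status` label, and the 5% threshold are all hypothetical. The `release: monitoring` label assumes the Helm release name used in the install command, since the chart's Prometheus by default only picks up objects carrying its release label.

```yaml
# ServiceMonitor: tells the Operator which Services to scrape.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: demo-app                 # hypothetical app
  namespace: monitoring
  labels:
    release: monitoring          # matches the Helm release name
spec:
  namespaceSelector:
    matchNames: [default]        # namespace where the Service lives (assumed)
  selector:
    matchLabels:
      app: demo-app              # label on the target Service (assumed)
  endpoints:
    - port: http-metrics         # named Service port (assumed)
      path: /metrics
      interval: 30s
---
# PrometheusRule: one recording rule and one alert rule in the same group.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: demo-app-rules
  namespace: monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: demo-app.rules
      rules:
        # Recording rule: pre-compute the per-job request rate.
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))
        # Alert rule: pending while the expression is non-empty,
        # firing once it has stayed non-empty for 5 minutes.
        - alert: DemoAppHighErrorRate
          expr: |
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
              /
            sum by (job) (rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "More than 5% of requests to {{ $labels.job }} are failing"
```

Keeping the `severity` label on every alert is what lets Alertmanager route critical and warning alerts differently later on.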
How it works#
Practical notes#
- kube-prometheus-stack is the de-facto default: install via Helm and you immediately get dashboards such as `Kubernetes / Compute Resources / Node` and `Kubernetes / API server`.
- Alert severity levels: critical (pages on-call) / warning (visible but no page) / info. Don't make everything critical; that leads to alert fatigue.
- Inhibition rules: when `Prometheus down` fires, suppress the downstream `targets unreachable` alerts so one outage doesn't fire dozens of alerts.
- Silence (maintenance windows): create a silence in Alertmanager before maintenance.
- Persist data: Prometheus needs a PVC so pod restarts don't lose data; Grafana defaults to SQLite, so switch to Postgres for HA.
- Grafana SSO: OAuth / OIDC for unified login.
- Add Loki + Tempo as datasources: logs + traces in the same Grafana — metric → log → trace jumps in one click.
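Several of the notes above (persistence, severity routing, inhibition) map onto Helm values for the chart. The fragment below is a rough sketch of a `values.yaml`, assuming a default StorageClass exists. The storage sizes, the receiver names `default` and `oncall-pager`, and the alert name `PrometheusDown` are hypothetical placeholders. `TargetDown` is used here as the "targets unreachable" alert.

```yaml
prometheus:
  prometheusSpec:
    retention: 15d               # set explicitly rather than relying on a default
    storageSpec:
      volumeClaimTemplate:       # gives Prometheus a PVC so restarts keep data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi      # hypothetical size

grafana:
  persistence:
    enabled: true                # persists Grafana's local SQLite database
    size: 10Gi                   # hypothetical size

alertmanager:
  config:
    route:
      receiver: default          # warning / info land here, no page
      group_by: [alertname, namespace]
      routes:
        - matchers: ['severity = "critical"']
          receiver: oncall-pager # only critical alerts page on-call
    inhibit_rules:
      # While the source alert fires, matching target alerts are muted.
      # Add an `equal: [...]` list to restrict this to alerts sharing labels.
      - source_matchers: ['alertname = "PrometheusDown"']   # hypothetical alert
        target_matchers: ['alertname = "TargetDown"']
    receivers:
      - name: default
      - name: oncall-pager       # attach a real integration here
```

It would be applied with something like `helm upgrade monitoring prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml`. Helm replaces lists rather than merging them, so the routes, receivers, and inhibit rules shown here take the place of the chart's defaults instead of adding to them.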
Easy confusions#
Prometheus native storage
Local disk, **single node**.
Short retention (default 15 days).
Mimir / Thanos / VictoriaMetrics
Object storage, **horizontally scalable + multi-cluster**.
Retention from months to years.
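One common way to bridge the two sides is to keep Prometheus as a short-retention local store and forward samples to the long-term backend via remote write. A hedged sketch of the chart values, assuming a Mimir deployment reachable at the hypothetical in-cluster URL below and a hypothetical cluster name:

```yaml
prometheus:
  prometheusSpec:
    externalLabels:
      cluster: prod-eu-1         # hypothetical; distinguishes clusters in the backend
    remoteWrite:
      # Mimir accepts remote write on /api/v1/push; the host is an assumption.
      - url: http://mimir-gateway.mimir.svc/api/v1/push
```

Other backends expose different paths (for example `/api/v1/receive` on Thanos Receive and `/api/v1/write` on VictoriaMetrics), so only the `url` would change.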