Prometheus + Grafana stack

Key Idea

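In one line: Prometheus scrapes metrics and stores them in its TSDB, Alertmanager routes the alerts, and Grafana visualizes. The kube-prometheus-stack Helm chart installs the whole stack, including node-exporter and kube-state-metrics, and scrapes the cAdvisor metrics built into the kubelet, so hosts, Kubernetes objects, and containers are covered out of the box; application metrics are added with ServiceMonitors.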

What it is

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace

You get:

Prometheus — metrics TSDB
Alertmanager — alert grouping / silencing / routing
node-exporter — host CPU / memory / disk / network
kube-state-metrics — K8s object state
Grafana — visualization + curated dashboards

Analogy

Prometheus is the accountant who collects and tallies the numbers; Grafana is the wall display that turns the ledgers into charts; Alertmanager is the front-desk secretary who calls, pings, or messages you when the numbers go off.

Key concepts

ServiceMonitor / PodMonitor (scrape declarations)

Prometheus Operator CRDs that express "scrape /metrics on these Pods or Services" as Kubernetes objects.
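As an illustration, a ServiceMonitor for an application might look like the sketch below. The application name `my-app`, its namespace, and the port name `http-metrics` are hypothetical. The `release: monitoring` label assumes the Helm release name from the install command above and the chart's default behavior of selecting only ServiceMonitors that carry the release label.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                 # hypothetical application name
  namespace: monitoring
  labels:
    release: monitoring        # assumed: matches the Helm release name
spec:
  namespaceSelector:
    matchNames:
      - default                # namespace where the Service lives (assumed)
  selector:
    matchLabels:
      app: my-app              # must match the labels on the Service
  endpoints:
    - port: http-metrics       # name of the Service port (hypothetical)
      path: /metrics
      interval: 30s
```

A PodMonitor has the same shape but selects Pods directly and lists `podMetricsEndpoints` instead of `endpoints`.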

Recording rule

Periodically pre-computes an expensive PromQL expression and stores the result as a new series, so dashboards that query the new series load quickly.
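A minimal sketch of a recording rule, written as a Prometheus Operator PrometheusRule object. The metric `http_requests_total`, the rule name, and the `release: monitoring` label are assumptions for illustration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-recording       # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring        # assumed: matches the Helm release name
spec:
  groups:
    - name: my-app.recording
      interval: 1m             # how often the expression is evaluated
      rules:
        # Store the per-job request rate as a new, cheap-to-query series.
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))
```

A dashboard panel can then query `job:http_requests:rate5m` directly instead of re-evaluating the `rate()` aggregation on every refresh.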

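Alert rule

A PromQL expression that returns a non-empty result puts the alert in the pending state; if the result stays non-empty for the whole `for` duration (for example `for: 5m`), the alert fires.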
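The sketch below shows one possible alert rule of this shape. The metric `http_requests_total`, its `code` label, the 5% threshold, and the alert name are all hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts          # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring        # assumed: matches the Helm release name
spec:
  groups:
    - name: my-app.alerts
      rules:
        - alert: HighErrorRate
          # Non-empty only for jobs whose 5xx ratio exceeds 5%.
          expr: |
            sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
              / sum by (job) (rate(http_requests_total[5m])) > 0.05
          for: 5m              # must hold for 5 minutes before firing
          labels:
            severity: warning
          annotations:
            summary: "High 5xx ratio for job {{ $labels.job }}"
```

The `severity` label is what Alertmanager routes on, so it is worth setting deliberately on every rule.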

Dashboard

A Grafana dashboard is a JSON document. Official and community dashboards can be imported from grafana.com.

Datasource

Grafana queries Prometheus, Loki, Tempo, MySQL, and other backends, which makes it the single pane for observability.
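Datasources can be declared in a Grafana provisioning file rather than added by hand. The following is a sketch only; the hostnames and ports are placeholders, not values from any real deployment.

```yaml
# Grafana datasource provisioning file (all URLs are hypothetical).
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.example:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.example:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo.example:3200
```

With kube-prometheus-stack the Prometheus datasource is normally provisioned by the chart already, so in practice only the extra entries would need to be added.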

Long-term storage

Prometheus keeps 15 days of data on local disk by default. Mimir, Thanos, and VictoriaMetrics add long-term retention and multi-cluster federation.
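Local retention and disk size can be set through the chart's Helm values. The fragment below is a sketch; the 30-day retention and the sizes are arbitrary example numbers, and the cluster's default StorageClass is assumed.

```yaml
# values.yaml fragment for kube-prometheus-stack (example numbers only)
prometheus:
  prometheusSpec:
    retention: 30d             # time-based retention
    retentionSize: 45GB        # size cap, kept below the volume size
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi    # uses the default StorageClass (assumed)
```

It could be applied with `helm upgrade monitoring prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml`.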

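How it works

Exporters and instrumented applications expose metrics over HTTP; Prometheus scrapes them on an interval and stores the samples in its local TSDB. Rule evaluation produces recorded series and alerts, firing alerts are sent to Alertmanager for grouping and routing, and Grafana queries Prometheus to render dashboards.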

Practical notes

kube-prometheus-stack is the de-facto default — install via Helm and you immediately get dashboards: Kubernetes / Compute Resources / Node, Kubernetes / API server, etc.
Alert severity levels: critical (pages on-call) / warning (visible but no page) / info. Don't make everything critical; that leads to alert fatigue.
Inhibition rules: when a root-cause alert is firing (for example, a node or a whole cluster is down), suppress the dependent "target unreachable" alerts so one outage doesn't fire dozens of alerts.
Silence (maintenance windows): create a silence in Alertmanager before planned maintenance.
Persist data: Prometheus needs a PVC so pod restarts don't lose data; Grafana defaults to SQLite, so switch to Postgres for HA.
Grafana SSO: OAuth / OIDC for unified login.
Add Loki + Tempo as datasources: logs + traces in the same Grafana — metric → log → trace jumps in one click.
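The severity and inhibition notes above can be sketched as an Alertmanager configuration fragment. The receiver names are hypothetical and left empty here; a real setup would attach pager, chat, or email integrations to them.

```yaml
# Alertmanager config sketch (receiver names are hypothetical).
route:
  receiver: default
  group_by: ["alertname", "namespace"]
  routes:
    - matchers:
        - 'severity = "critical"'
      receiver: pager          # only critical alerts page on-call
    - matchers:
        - 'severity = "info"'
      receiver: "null"         # info alerts are dropped
inhibit_rules:
  # While a critical alert fires, mute warning/info alerts that
  # share the same namespace and alert name.
  - source_matchers:
      - 'severity = "critical"'
    target_matchers:
      - 'severity =~ "warning|info"'
    equal: ["namespace", "alertname"]
receivers:
  - name: default
  - name: pager
  - name: "null"
```

With kube-prometheus-stack this configuration would typically live under `alertmanager.config` in the Helm values rather than in a standalone file.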

Easy confusions

Prometheus native storage

Local disk, **single node**.
Short retention (default 15 days).

Mimir / Thanos / VictoriaMetrics

Object storage (Mimir, Thanos) or clustered storage nodes (VictoriaMetrics), **horizontally scalable + multi-cluster**.
Retention from months to years.
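One common way to connect the two sides is to keep the local Prometheus for short-term data and forward samples to the long-term backend with remote write. The fragment below is a sketch assuming a Mimir-style push endpoint; the hostname is a placeholder, and Thanos is more often attached with a sidecar instead.

```yaml
# values.yaml fragment for kube-prometheus-stack (URL is hypothetical)
prometheus:
  prometheusSpec:
    retention: 15d             # local data stays short-lived
    remoteWrite:
      - url: http://mimir.example:8080/api/v1/push
```

Grafana would then query the long-term backend as an additional Prometheus-compatible datasource for historical ranges.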