Key Idea
In one line: at any meaningful scale, logs must be centralized — collect → parse → index → query → alert. Two mainstream paths: ELK / OpenSearch (full-text index, powerful but expensive) and Loki (label index, cheap and good enough).
What it is
[App stdout] ──┐
[journal]    ──┼──> [collector agent] ──> [central store] ──> [query UI]
[file logs]  ──┘    Promtail/Vector/      Loki/ES             Grafana/Kibana
                    Fluent Bit
Apps logging to stdout is the cloud-native default; agents on each node ship logs to the central store.
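A minimal sketch of the app side of that pipeline, assuming a Python service and the standard `logging` module. The names `JsonLineFormatter` and `build_logger` are illustrative, not part of any library. Each record is written to stdout as one JSON object per line, so the node agent can ship it without regex parsing.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            # trace_id is optional; it is passed through `extra=` by the caller.
            "trace_id": getattr(record, "trace_id", None),
            "msg": record.getMessage(),
        }
        return json.dumps(entry, ensure_ascii=False)


def build_logger(name: str = "app", stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stdout by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("order created", extra={"trace_id": "abc"})
```

Writing to stdout keeps the app unaware of where logs end up; swapping Promtail for Vector or Fluent Bit needs no code change.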
Analogy
Single-machine logs = each household keeps their own diary — finding something means knocking on every door and reading. Log aggregation = a village library — every household automatically deposits their diary, and you search in one place.
Key concepts
Structured logs
JSON lines, not plain text: `{level, ts, trace_id, msg, ...}`. Otherwise parsing is regex hell.
Trace ID
Unique ID threading one request through the system. Key to log↔trace correlation.
Collector agent
Promtail / Vector / Fluent Bit / OpenTelemetry Collector.
Inverted index
ES indexes every token: **full-text search anywhere**, but costly in disk / RAM.
Label index (Loki)
Only labels are indexed (service / pod / level); the body isn't. Cheap to store and to query by label; full-text search means scanning the matching streams.
Retention
Hot data 7–30 days + cold archive (S3 / OSS) for months.
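The difference between the two index styles can be sketched with a toy in-memory model. This is an illustration of the idea only, not how Elasticsearch or Loki are implemented; the sample `LOGS` and all function names are made up for the example.

```python
from collections import defaultdict

# Each entry: (labels, body). Sample data is hypothetical.
LOGS = [
    ({"service": "checkout", "level": "error"}, "payment timeout for order 42"),
    ({"service": "checkout", "level": "info"}, "order 42 created"),
    ({"service": "search", "level": "error"}, "index timeout"),
]


def build_inverted_index(logs):
    """ES-style: map every token of every body to the log ids containing it."""
    index = defaultdict(set)
    for log_id, (_labels, body) in enumerate(logs):
        for token in body.lower().split():
            index[token].add(log_id)
    return index


def build_label_index(logs):
    """Loki-style: map each (label, value) pair to log ids; bodies stay unindexed."""
    index = defaultdict(set)
    for log_id, (labels, _body) in enumerate(logs):
        for pair in labels.items():
            index[pair].add(log_id)
    return index


def label_query(logs, label_index, selector, needle):
    """Narrow by labels through the index, then scan the remaining bodies."""
    candidate_sets = [label_index.get(pair, set()) for pair in selector.items()]
    if candidate_sets:
        candidates = set.intersection(*candidate_sets)
    else:
        candidates = set(range(len(logs)))
    return sorted(i for i in candidates if needle in logs[i][1])


inverted = build_inverted_index(LOGS)
labels = build_label_index(LOGS)

print(sorted(inverted["timeout"]))  # [0, 2]: direct lookup, no scan
print(label_query(LOGS, labels, {"service": "checkout"}, "timeout"))  # [0]
print(len(inverted), len(labels))  # 7 4: the token index grows with the text
```

Even on three lines the inverted index holds more entries than the label index; that gap is where the disk and RAM cost comes from, and the body scan in `label_query` is the price paid for avoiding it.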
How it works
The de-facto K8s combo: Promtail/Fluent Bit + Loki + Grafana.
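To make the agent-to-store hop concrete, here is a sketch of the JSON body that Loki's HTTP push endpoint (`/loki/api/v1/push`) accepts: a list of streams, each identified by a label set and carrying `[timestamp_ns, line]` pairs. The helper name `build_loki_push_payload` and the label values are assumptions for illustration; in a real cluster the agent builds and sends this, not the application. The block only constructs the payload and does no network I/O.

```python
import json
import time


def build_loki_push_payload(labels: dict, lines: list, now_ns=None) -> dict:
    """Group log lines into one Loki stream identified by its label set."""
    base_ns = time.time_ns() if now_ns is None else now_ns
    # Each entry is [timestamp in nanoseconds as a string, raw log line].
    values = [[str(base_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


payload = build_loki_push_payload(
    {"service": "checkout", "level": "info", "env": "prod"},
    ['{"level":"info","trace_id":"abc","msg":"order created"}'],
    now_ns=1_700_000_000_000_000_000,  # fixed timestamp to keep the example reproducible
)
print(json.dumps(payload, indent=2))
```

On the query side, Grafana would then select the stream by label and filter the body, for example with the LogQL expression `{service="checkout"} |= "abc"`.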
Practical notes
- Standardize JSON logs: {"ts":"...","level":"info","trace_id":"abc","msg":"...","fields":{...}}.
- Labels for dimensions, body for the message: service / pod / level / env are good labels; user_id / order_id go into the message for search.
- Don't dump all K8s annotations into labels: the Loki / ES index will explode.
- Sample high-volume logs: debug shouldn't reach production, or sample at 1 / 100.
- PII masking: replace phone / ID / token with *** at the agent.
- Trace ↔ Log jumps: logs carry trace_id; Grafana lets you click into Tempo to see the trace, and vice versa.
- Disk protection: log retention + log rotation, so a node doesn't fill its disk.
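The masking and sampling notes above can be sketched as a small processing stage. In practice this logic lives in the agent's own pipeline configuration (Promtail, Vector, or Fluent Bit); the Python below only illustrates the behavior. The regex patterns and the function names `mask_pii`, `keep_line`, and `process` are hypothetical, and real masking rules depend on your data and locale.

```python
import re
import zlib

# Hypothetical patterns: an 11-digit phone number and a `token=...` parameter.
PHONE_RE = re.compile(r"\b\d{11}\b")
TOKEN_RE = re.compile(r"(token=)[^\s&]+")


def mask_pii(line: str) -> str:
    """Replace phone numbers and token values with ***."""
    line = PHONE_RE.sub("***", line)
    return TOKEN_RE.sub(r"\1***", line)


def keep_line(level: str, line: str, rate: int = 100) -> bool:
    """Keep every non-debug line; keep roughly 1 in `rate` debug lines."""
    if level != "debug":
        return True
    # Deterministic hash: the same line always gets the same decision.
    return zlib.crc32(line.encode("utf-8")) % rate == 0


def process(records, rate: int = 100):
    """Yield (level, masked_line) for the records that survive sampling."""
    for level, line in records:
        if keep_line(level, line, rate):
            yield level, mask_pii(line)


print(mask_pii("call 13800138000 token=abc123 ok"))  # call *** token=*** ok
```

Hashing the line content is one choice; hashing `trace_id` instead would keep or drop all debug lines of a request together, which tends to make sampled traces easier to follow.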
Picking a stack
- ELK / OpenSearch: powerful full-text search and complex queries. Heavy, expensive, ops-intensive.
- Loki + Grafana: cheap label index, smooth K8s UX. Weaker full-text search.
- ClickHouse / Doris: SQL analytics over massive structured logs. Needs schemas.
- Datadog / cloud SaaS: easy mode, billed per GB, so it gets pricey at scale.
Easy confusions
Logs
Events / text — **detailed, large volume**.
Best for "what happened".
Metrics
Numbers / time series — **small after aggregation**.
Best for "what's the current state".