
Log Aggregation Architecture

From 'ssh + grep' to 'cluster-wide search + correlated traces' — log architecture evolution.

Key Idea

In one line: at any meaningful scale, logs must be centralized — collect → parse → index → query → alert. Two mainstream paths: ELK / OpenSearch (full-text index, powerful but expensive) and Loki (label index, cheap and good enough).

What it is#

[App stdout] ──┐
[journal]    ──┼──> [collector agent] ──> [central store] ──> [query UI]
[file logs]  ──┘    Promtail/Vector/      Loki/ES             Grafana/Kibana
                    Fluent Bit

Apps logging to stdout is the cloud-native default; agents on each node ship logs to the central store.
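The "app logs JSON to stdout" half of the diagram can be sketched in a few lines. This is a minimal illustration, not any particular library's API; the field names mirror the JSON-lines schema recommended later in this article:

```python
import json
import sys
import time
import uuid

def log(level, msg, trace_id=None, **fields):
    """Emit one JSON line to stdout -- the cloud-native default sink.
    A per-node agent (Promtail / Fluent Bit) tails it from there."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": level,
        "trace_id": trace_id or uuid.uuid4().hex,  # threads one request end to end
        "msg": msg,
        "fields": fields,
    }
    sys.stdout.write(json.dumps(record) + "\n")
    return record

log("info", "order created", trace_id="abc123", order_id=42)
```

Because every line is self-describing JSON, the collector agent needs no per-app regex to parse it.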

Analogy#

打个比方 · Analogy

Single-machine logs = each household keeps their own diary — finding something means knocking on every door and reading. Log aggregation = a village library — every household automatically deposits their diary, and you search in one place.

Key concepts#

Structured logs
JSON lines, not plain text: `{level, ts, trace_id, msg, ...}`. Otherwise parsing is regex hell.
Trace ID
Unique ID threading one request through the system. Key to log↔trace correlation.
Collector agent
A per-node (or sidecar) process that tails logs and ships them to the store: Promtail / Vector / Fluent Bit / OpenTelemetry Collector.
Inverted index
ES indexes every token — **full-text search anywhere**, but costly in disk / RAM.
Label index (Loki)
Only labels are indexed (service / pod / level); body isn't. Cheap to query; full-text means scanning.
Retention
Hot data 7–30 days + cold archive (S3 / OSS) for months.
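The inverted-index vs. label-index trade-off above can be made concrete with a toy model of Loki's storage: only the label set is indexed, so selecting streams is cheap, while any full-text match is a scan over the selected streams' bodies. This is an illustrative sketch of the idea, not Loki's actual implementation:

```python
from collections import defaultdict

# Toy model: label set -> stream of raw log lines. Only labels are indexed.
index = defaultdict(list)

def push(labels: dict, line: str):
    """Append a line to the stream identified by its label set."""
    index[tuple(sorted(labels.items()))].append(line)

def query(selector: dict, needle: str = ""):
    """Cheap part: pick streams whose labels match the selector (index lookup).
    Expensive part: grep `needle` through every line of those streams."""
    hits = []
    for labels, lines in index.items():
        if all((k, v) in labels for k, v in selector.items()):
            hits += [line for line in lines if needle in line]
    return hits

push({"service": "checkout", "level": "error"}, "payment failed trace_id=abc")
push({"service": "checkout", "level": "info"}, "payment ok trace_id=def")
query({"service": "checkout"}, "trace_id=abc")  # -> ["payment failed trace_id=abc"]
```

This is also why label cardinality matters: every distinct label combination creates a new stream, so high-cardinality values (user IDs, request IDs) belong in the body, not the labels.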

How it works#

The de-facto K8s combo: Promtail or Fluent Bit runs as a DaemonSet on every node, tails container logs, attaches pod/namespace labels from Kubernetes metadata, and pushes to Loki; Grafana queries Loki with LogQL.
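A typical LogQL query in Grafana mirrors the two-phase model: stream selection by labels first, then a body scan. The label names here (`service`, `level`) are assumptions about your setup; use whatever labels your agent attaches:

```logql
{service="checkout", level="error"} |= "trace_id=abc123"
```

The `{...}` selector is resolved against the label index; the `|=` line filter then scans only the matching streams for the substring.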

Practical notes#

  • Standardize JSON logs: {"ts":"...","level":"info","trace_id":"abc","msg":"...","fields":{...}}.
  • Labels for dimensions, body for the message: service / pod / level / env are good labels; user_id / order_id go into the message for search.
  • Don't dump all K8s annotations into labels — Loki / ES index will explode.
  • Sample high-volume logs — debug shouldn't reach production, or sample at 1 / 100.
  • PII masking: replace phone / ID / token with *** at the agent.
  • Trace ↔ Log jumps: logs carry trace_id; Grafana lets you click into Tempo to see the trace, and vice versa.
  • Disk protection: log retention + log rotation, so a node doesn't fill its disk.
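The sampling and PII-masking notes above usually land in a per-record hook at the agent (Vector and Fluent Bit both support scripted transforms). A minimal sketch of such a hook; the regexes and `sample_rate` default are illustrative assumptions, not production-grade PII detection:

```python
import random
import re

# Hypothetical patterns -- real deployments need locale-aware PII rules.
PHONE = re.compile(r"\b\d{3}[-.]?\d{3,4}[-.]?\d{4}\b")
TOKEN = re.compile(r"\b(tok|key|secret)_[A-Za-z0-9]+\b")

def process(record: dict, sample_rate: float = 0.01, rng=random.random):
    """Drop most debug lines, then mask PII before the record leaves the node.
    Returns the (possibly rewritten) record, or None if sampled out."""
    if record.get("level") == "debug" and rng() >= sample_rate:
        return None  # sampled out: never reaches the central store
    msg = record.get("msg", "")
    msg = PHONE.sub("***", msg)
    msg = TOKEN.sub("***", msg)
    return {**record, "msg": msg}

process({"level": "info", "msg": "user 555-123-4567 logged in"})
# -> {'level': 'info', 'msg': 'user *** logged in'}
```

Doing this at the agent, rather than in the store, means sensitive data never crosses the network and sampled-out volume is never billed or indexed.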

Picking a stack#

ELK / OpenSearch
Powerful full-text search / complex queries. Heavy, expensive, ops-intensive.
Loki + Grafana
Cheap label index, smooth K8s UX. Weaker full-text search.
ClickHouse / Doris
Massive structured-log SQL analytics. Needs schemas.
Datadog / cloud SaaS
Easy mode, billed per GB — gets pricey at scale.

Easy confusions#

Logs
Events / text — **detailed, large volume**.
Best for "what happened".
Metrics
Numbers / time series — **small after aggregation**.
Best for "what's the current state".

Further reading#