
Log Aggregation Architecture

From 'ssh + grep' to 'cluster-wide search + correlated traces' — log architecture evolution.

Key Idea

In one line: at any meaningful scale, logs must be centralized — collect → parse → index → query → alert. Two mainstream paths: ELK / OpenSearch (full-text index, powerful but expensive) and Loki (label index, cheap and good enough).

What it is#

[App stdout] ──┐
[journal]    ──┼──> [collector agent] ──> [central store] ──> [query UI]
[file logs]  ──┘    Promtail/Vector/      Loki/ES             Grafana/Kibana
                    Fluent Bit

Apps logging to stdout is the cloud-native default; agents on each node ship logs to the central store.
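The "app logs JSON to stdout" half of the diagram can be sketched in a few lines. This is a minimal illustration, not any particular library's API; the field names mirror the JSON-lines schema recommended later in this article:

```python
import json
import sys
import time
import uuid

def log(level, msg, trace_id=None, **fields):
    """Emit one JSON line to stdout -- the cloud-native default sink.
    A per-node agent (Promtail / Fluent Bit) tails it from there."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": level,
        "trace_id": trace_id or uuid.uuid4().hex,  # threads one request end to end
        "msg": msg,
        "fields": fields,
    }
    sys.stdout.write(json.dumps(record) + "\n")
    return record

log("info", "order created", trace_id="abc123", order_id=42)
```

Because every line is self-describing JSON, the collector agent needs no per-app regex to parse it.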

Analogy#

打个比方 · Analogy

Single-machine logs = each household keeps their own diary — finding something means knocking on every door and reading. Log aggregation = a village library — every household automatically deposits their diary, and you search in one place.

Key concepts#

Structured logs
JSON lines, not plain text: `{level, ts, trace_id, msg, ...}`. Otherwise parsing is regex hell.
Trace ID
Unique ID threading one request through the system. Key to log↔trace correlation.
Collector agent
A per-node (or sidecar) process that tails logs and ships them to the store: Promtail / Vector / Fluent Bit / OpenTelemetry Collector.
Inverted index
ES indexes every token — **full-text search anywhere**, but costly in disk / RAM.
Label index (Loki)
Only labels are indexed (service / pod / level); body isn't. Cheap to query; full-text means scanning.
Retention
Hot data 7–30 days + cold archive (S3 / OSS) for months.
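The inverted-index vs. label-index trade-off above can be made concrete with a toy model of Loki's storage: only the label set is indexed, so selecting streams is cheap, while any full-text match is a scan over the selected streams' bodies. This is an illustrative sketch of the idea, not Loki's actual implementation:

```python
from collections import defaultdict

# Toy model: label set -> stream of raw log lines. Only labels are indexed.
index = defaultdict(list)

def push(labels: dict, line: str):
    """Append a line to the stream identified by its label set."""
    index[tuple(sorted(labels.items()))].append(line)

def query(selector: dict, needle: str = ""):
    """Cheap part: pick streams whose labels match the selector (index lookup).
    Expensive part: grep `needle` through every line of those streams."""
    hits = []
    for labels, lines in index.items():
        if all((k, v) in labels for k, v in selector.items()):
            hits += [line for line in lines if needle in line]
    return hits

push({"service": "checkout", "level": "error"}, "payment failed trace_id=abc")
push({"service": "checkout", "level": "info"}, "payment ok trace_id=def")
query({"service": "checkout"}, "trace_id=abc")  # -> ["payment failed trace_id=abc"]
```

This is also why label cardinality matters: every distinct label combination creates a new stream, so high-cardinality values (user IDs, request IDs) belong in the body, not the labels.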

How it works#

The de-facto K8s combo: Promtail or Fluent Bit runs as a DaemonSet on every node, tails container logs, attaches pod/namespace labels from Kubernetes metadata, and pushes to Loki; Grafana queries Loki with LogQL.
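A typical LogQL query in Grafana mirrors the two-phase model: stream selection by labels first, then a body scan. The label names here (`service`, `level`) are assumptions about your setup; use whatever labels your agent attaches:

```logql
{service="checkout", level="error"} |= "trace_id=abc123"
```

The `{...}` selector is resolved against the label index; the `|=` line filter then scans only the matching streams for the substring.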

Practical notes#

  • Standardize JSON logs: {"ts":"...","level":"info","trace_id":"abc","msg":"...","fields":{...}}.
  • Labels for dimensions, body for the message: service / pod / level / env are good labels; user_id / order_id go into the message for search.
  • Don't dump all K8s annotations into labels — Loki / ES index will explode.
  • Sample high-volume logs — debug shouldn't reach production, or sample at 1 / 100.
  • PII masking: replace phone / ID / token with *** at the agent.
  • Trace ↔ Log jumps: logs carry trace_id; Grafana lets you click into Tempo to see the trace, and vice versa.
  • Disk protection: log retention + log rotation, so a node doesn't fill its disk.
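The sampling and PII-masking notes above usually land in a per-record hook at the agent (Vector and Fluent Bit both support scripted transforms). A minimal sketch of such a hook; the regexes and `sample_rate` default are illustrative assumptions, not production-grade PII detection:

```python
import random
import re

# Hypothetical patterns -- real deployments need locale-aware PII rules.
PHONE = re.compile(r"\b\d{3}[-.]?\d{3,4}[-.]?\d{4}\b")
TOKEN = re.compile(r"\b(tok|key|secret)_[A-Za-z0-9]+\b")

def process(record: dict, sample_rate: float = 0.01, rng=random.random):
    """Drop most debug lines, then mask PII before the record leaves the node.
    Returns the (possibly rewritten) record, or None if sampled out."""
    if record.get("level") == "debug" and rng() >= sample_rate:
        return None  # sampled out: never reaches the central store
    msg = record.get("msg", "")
    msg = PHONE.sub("***", msg)
    msg = TOKEN.sub("***", msg)
    return {**record, "msg": msg}

process({"level": "info", "msg": "user 555-123-4567 logged in"})
# -> {'level': 'info', 'msg': 'user *** logged in'}
```

Doing this at the agent, rather than in the store, means sensitive data never crosses the network and sampled-out volume is never billed or indexed.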

Picking a stack#

ELK / OpenSearch
Powerful full-text search / complex queries. Heavy, expensive, ops-intensive.
Loki + Grafana
Cheap label index, smooth K8s UX. Weaker full-text search.
ClickHouse / Doris
Massive structured-log SQL analytics. Needs schemas.
Datadog / cloud SaaS
Easy mode, billed per GB — gets pricey at scale.

Easy confusions#

Logs
Events / text — **detailed, large volume**.
Best for "what happened".
Metrics
Numbers / time series — **small after aggregation**.
Best for "what's the current state".

Further reading#