ArcLibrary

Transformer & Attention

The architecture behind modern LLMs — 'attention' lets the model see how every token in the context relates to every other.

Key Idea

In one line: the Transformer is the 2017 neural-network architecture that uses Self-Attention to let every token in a sequence "glance at every other token" and weigh how much each of them matters to it. Today's GPT / Claude / Gemini / DeepSeek are all variants of it.

What it is#

Before Transformer there was RNN/LSTM — sentences had to be fed one token at a time, and long-range dependencies degraded. Transformer feeds the whole sequence at once, and each token uses Attention to figure out "which other tokens matter to me":

"The cat sat on the mat because it was tired"

When computing 'it':
  - the:  0.02
  - cat:  0.83  ← high
  - sat:  0.05
  - mat:  0.04
  - tired: 0.06

The model directly sees that "it" refers to "cat"; long-range dependencies no longer decay.
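To make the mechanism concrete, here is a minimal self-attention sketch in NumPy. The embeddings and projection matrices are random stand-ins rather than trained weights; the point is the shape of the computation, not the numbers.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) token embeddings -> (seq_len, d_model) outputs."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # project every token to query / key / value
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how strongly each token "looks at" every other
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax rows, like the 0.83-for-"cat" example above
    return weights @ V                               # each token's output = weighted sum of all values

rng = np.random.default_rng(0)
seq_len, d_model = 10, 16                            # 10 tokens, 16-dim toy embeddings
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)        # (10, 16): one updated vector per token
```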

Analogy#


RNN = relay race — information passes baton-by-baton; the further it travels the more is lost.
Transformer = round-table meeting — all words sit at the table at once; each word looks around and decides whose voice to listen to.

Key concepts#

Self-Attention
Each token compares its query against every token's key, then takes a similarity-weighted sum of the values. Core formula: softmax(QKᵀ/√d)·V.
Multi-Head Attention
N attention heads run in parallel; each head learns a different relation (syntax / semantics / coreference). See the sketch after this list.
Positional Encoding
Attention itself is order-blind; positional information has to be injected (absolute / relative / RoPE).
FFN (Feed-Forward Layer)
An MLP after attention; most of the model's capacity lives here, and MoE models replace it with multiple expert FFNs.
Decoder-Only
The simplified variant used by GPT: autoregressive generation only, no encoder.
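A hedged sketch of the Multi-Head idea: d_model is split into n_heads slices, each slice attends independently, and an output projection mixes the heads back together. All sizes here are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Split d_model into n_heads slices, attend per head, concatenate, project."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    def split_heads(W):                                   # (seq, d_model) -> (n_heads, seq, d_head)
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split_heads(W_q), split_heads(W_k), split_heads(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # one attention map per head
    heads = softmax(scores) @ V                           # (n_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                   # output projection mixes the heads

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 32))
W_q, W_k, W_v, W_o = (rng.normal(size=(32, 32)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=4).shape)   # (10, 32)
```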

How it works#

Each Transformer block = Attention → FFN, with each sublayer wrapped in a residual connection and layer norm; the blocks are stacked N times, and GPT-4-class models have 100+ layers.
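A minimal sketch of one such block, assuming the pre-norm residual wiring used by most modern decoder-only models (LayerNorm, then the sublayer, then add the input back). Single-head attention and no causal mask, purely to keep the code short.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def transformer_block(x, W_q, W_k, W_v, W_o, W_1, W_2):
    """One block = attention sublayer + FFN sublayer, each with a residual connection."""
    h = layer_norm(x)                                  # attention sublayer (causal mask omitted)
    Q, K, V = h @ W_q, h @ W_k, h @ W_v
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = x + attn @ W_o                                 # residual: add attention output back
    h = layer_norm(x)                                  # FFN sublayer: where most parameters live
    x = x + np.maximum(h @ W_1, 0.0) @ W_2             # 2-layer MLP with ReLU, plus residual
    return x

rng = np.random.default_rng(0)
d_model, d_ff = 32, 128
x = rng.normal(size=(10, d_model)) * 0.1
weights = [rng.normal(size=s) * 0.1
           for s in [(d_model, d_model)] * 4 + [(d_model, d_ff), (d_ff, d_model)]]
for _ in range(4):                                     # stacking the block N times is the whole model
    x = transformer_block(x, *weights)
print(x.shape)                                         # (10, 32)
```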

Practical notes (application-engineer view)#

  • You don't need to re-implement it, but understand the bottleneck. Attention's compute and memory grow quadratically with sequence length, so doubling the context window roughly quadruples both, which is why long contexts are expensive.
  • Most optimisations target Attention. FlashAttention (IO-optimised), KV Cache (reuse keys/values during inference), GQA / MQA (groups of query heads share a smaller set of KV heads); virtually all inference speedups come from here.
  • Positional encoding controls extrapolation. Relative variants like RoPE / ALiBi let models "train at 4K, run at 128K".
  • Decoder-only is dominant. BERT-style encoder-only is used for retrieval / classification; 99% of generative applications are decoder-only.
  • VRAM math: params × bytes_per_param for the weights, plus the KV cache ≈ 2 × batch × seq_len × layers × kv_heads × head_dim × bytes_per_element (the 2 covers K and V). For long contexts the KV cache often eats more VRAM than the parameters; see the quick calculation after this list.
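To make the VRAM bullet concrete, a back-of-the-envelope calculation. The config below (roughly a 7B-class model with grouped-query attention, fp16 weights and cache) is illustrative, not any specific released model.

```python
params          = 7e9          # illustrative 7B-class model
bytes_per_param = 2            # fp16 / bf16 weights
layers, kv_heads, head_dim = 32, 8, 128     # hypothetical shape with GQA
batch, seq_len  = 1, 128_000   # a single long-context request
bytes_per_elem  = 2            # fp16 KV cache

weight_bytes = params * bytes_per_param
kv_bytes = 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem   # 2 = K and V

print(f"weights : {weight_bytes / 2**30:.1f} GiB")   # ~13 GiB
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")       # ~15.6 GiB, and it grows linearly with seq_len
```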

Easy confusions#

Encoder-Only
BERT — bidirectional view of context.
Strong on understanding / classification / retrieval.
Decoder-Only
GPT — left-to-right generation only.
Strong on open-ended generation and chat.

Further reading#

  • LLM — the "black box" view from the app side
  • Context Window — Attention's physical cap
  • Quantization — how to actually run a Transformer
  • Must-read paper: "Attention Is All You Need" (2017)