Key Idea
In one line: The context window is the total number of Tokens the model can "see" in a single inference. System prompt, chat history, current question, and the model's reply all count against it; their sum must not exceed the cap, and anything over is truncated.
What it is
Every model has a fixed context-window size. Common numbers:
| Model | Context window |
|---|---|
| GPT-3.5 | 16K |
| GPT-4o | 128K |
| Claude 3.7 / Sonnet 4.5 | 200K |
| Gemini 2.5 Pro | 1M – 2M |
| Qwen3-Long | 1M |
It is input + output, not input only.
Analogy
The context window is the sticky note in front of the model — it can only read what fits on that sticky note. What fits, the model "remembers"; what doesn't fit is as if it never existed.
Key concepts
- **Input Tokens (Prompt):** system prompt + history + user question + tool results.
- **Output Tokens (Completion):** what the model generates this turn (answer + tool-call JSON).
- **Hard limit:** input + output ≤ context window. Going over throws an error or truncates.
- **Effective recall:** the cap is not "all usable"; middle sections of a long context tend to be ignored, a phenomenon known as "Lost in the Middle".
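To make the input side concrete, here is a minimal counting sketch, assuming the tiktoken library and the o200k_base encoding that GPT-4o uses. Real APIs add a few Tokens of per-message overhead, so treat the result as an estimate, not an exact bill:

```python
# pip install tiktoken
import tiktoken

# o200k_base is the encoding GPT-4o uses; other models use other encodings.
enc = tiktoken.get_encoding("o200k_base")

def count_input_tokens(messages: list[dict]) -> int:
    """Rough count of the prompt side: system + history + current question."""
    return sum(len(enc.encode(m["content"])) for m in messages)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarise Hamlet in two sentences."},
]
print(count_input_tokens(messages))  # a small two-digit number of Tokens
```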
How it works
With a 128K window (e.g. GPT-4o), if the system prompt and history already consume 120K Tokens, the model can generate at most 8K Tokens in reply.
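That arithmetic as a sketch: before each call, clamp the requested output length to whatever the prompt leaves over.

```python
CONTEXT_WINDOW = 128_000  # e.g. GPT-4o, from the table above

def output_budget(input_tokens: int, window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the reply once the prompt side is accounted for."""
    budget = window - input_tokens
    if budget <= 0:
        raise ValueError(f"prompt alone ({input_tokens} Tokens) exceeds the window")
    return budget

print(output_budget(120_000))  # 8000: the most the model can say this turn
```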
Practical notes
- Track the full bill. For long conversations, count system + history + current question + expected output together; if the total overflows, the call fails.
- Compress history. Summarise older messages instead of hard-deleting them (see the sketch after this list).
- Put important content at the ends. Attention is strongest in the first and last few hundred Tokens; the middle tends to "go missing".
- Longer is not always better. Bigger contexts mean slower, more expensive calls and lower effective recall. If RAG can solve it, don't brute-force the context.
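A minimal sketch of the compress-history idea: keep the system prompt and the most recent turns verbatim, and fold older turns into one summary message. `fit_history` and `summarize` are hypothetical helpers (in practice `summarize` would be a cheap model call), and `count_tokens` can be the counter from the earlier sketch:

```python
def summarize(messages: list[dict]) -> str:
    # Hypothetical placeholder: in practice, a cheap model call that
    # condenses old turns into a short paragraph.
    return " / ".join(m["content"][:40] for m in messages)

def fit_history(messages: list[dict], count_tokens, budget: int,
                keep_recent: int = 4) -> list[dict]:
    """Keep the system prompt and the last `keep_recent` messages verbatim;
    fold everything older into a single summary message."""
    system, rest = messages[0], messages[1:]
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    compressed = [system]
    if old:
        compressed.append({"role": "system",
                           "content": "Summary of earlier turns: " + summarize(old)})
    compressed += recent
    if count_tokens(compressed) > budget:
        raise ValueError("even the compressed history exceeds the budget")
    return compressed
```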
Easy confusions
**Context Window (short-term)**
- The Token cap for this single inference.
- Disappears entirely when the session ends.
**Long-term Memory**
- Persists across sessions (vector DB / database).
- Pulled into context **on demand** via mechanisms like RAG.
**Context Window**
- How many Tokens it sees at once: **runtime** capacity.
**Parameters**
- The number of model weights: **structural** capacity.
- **Independent** of context size.
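To illustrate "pulled into context on demand", here is a toy retrieval sketch. The word-overlap scoring is a stand-in for illustration only; a real setup would embed the query and memories with an embedding model and rank by cosine similarity in a vector DB:

```python
import re

# Toy long-term store. In practice these would live in a vector DB.
MEMORIES = [
    "User prefers metric units.",
    "User's project is a Flask API.",
    "User's dog is called Miso.",
]

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def recall(query: str, k: int = 1) -> list[str]:
    """Pull the k most relevant memories into the prompt, on demand."""
    ranked = sorted(
        MEMORIES,
        key=lambda m: len(words(query) & words(m)) / len(words(query) | words(m)),
        reverse=True,
    )
    return ranked[:k]

# Only the relevant memory enters the context window this turn.
print(recall("What units should I use?"))  # ['User prefers metric units.']
```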
Further reading
- Token — the unit of measurement
- Short-term Memory — managing context across turns
- RAG — the standard way to break the context limit