ArcLibrary

Short-term Memory

Managing context across turns — the engineering trade-off between sliding windows and summarisation.

Memory · Context

Key Idea

In one line: Short-term memory = the most recent few turns of the current conversation. The model itself has no concept of "session" — every turn we splice the whole chat history back into the prompt, making the model "think" it remembers. Once total length exceeds the context window, you must decide what to keep and what to drop.

What it is

Every LLM call really sends:

[
  {"role": "system", "content": "You are an assistant"},
  {"role": "user", "content": "My name is Mike"},
  {"role": "assistant", "content": "Hello Mike"},
  {"role": "user", "content": "What's my name?"}    // ← current question
]

The model "remembers the name" purely because earlier messages are still in the prompt. Once the total length exceeds the window, older messages get cut and the model "forgets".
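This splicing can be sketched in a few lines. A minimal, hypothetical `build_prompt` helper (not any real SDK's API) that reassembles the full history on every turn:

```python
# Minimal sketch: every request re-splices the whole chat history.
# build_prompt is a hypothetical helper, not a real library function.
def build_prompt(history, user_msg, system="You are an assistant"):
    """Return the full message list the model actually sees this turn."""
    return (
        [{"role": "system", "content": system}]
        + history
        + [{"role": "user", "content": user_msg}]
    )

history = [
    {"role": "user", "content": "My name is Mike"},
    {"role": "assistant", "content": "Hello Mike"},
]
prompt = build_prompt(history, "What's my name?")
# The model sees 4 messages: system + 2 past turns + the current question.
```

After each reply, you append both the user message and the assistant message back onto `history` — that loop is the entirety of "short-term memory".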

Analogy


The model's "brain" = a finite whiteboard. Every turn we copy old notes back onto it, then add the current question.
When the board fills up, we have to erase old stuff — short-term memory strategy is "which bits to erase."

Key concepts

Sliding window
Keep only the last N turns; drop anything older — the simplest strategy.
Summarisation
Compress old turns into a summary and splice it back into the system prompt — saves tokens.
Token budget
Usable = context window − system prompt − tool definitions − output reserve.
Pin messages
Important facts (user name, preferences) are always retained, never trimmed.

How it works

In production, almost everyone uses a hybrid: pin key facts + summarise the middle + keep the last 5–10 turns verbatim.
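The hybrid strategy can be sketched as follows. `summarize` stands in for an LLM call and is stubbed out here; `build_context` and its parameters are hypothetical names for illustration:

```python
# Sketch of the hybrid strategy: pinned facts + a summary of the middle
# + the last N turns kept verbatim.
def summarize(messages):
    # Stub: in practice this is an LLM call that emits a few bullet points.
    return "Summary of %d earlier messages" % len(messages)

def build_context(pins, history, last_n=6):
    """Assemble system prompt (pins + summary) plus the recent turns."""
    recent = history[-last_n:]   # kept word-for-word
    older = history[:-last_n]    # compressed into a summary

    system = "You are an assistant.\n"
    if pins:
        system += "Known facts:\n" + "\n".join("- " + p for p in pins) + "\n"
    if older:
        system += "Earlier conversation: " + summarize(older)

    return [{"role": "system", "content": system}] + recent
```

Pins survive forever, the middle shrinks to a few lines, and only the tail stays verbatim — which is exactly the "which bits to erase" decision from the whiteboard analogy.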

Practical notes

  • Track total tokens. Context window is not free — system + history + tools + output reserve all count.
  • Give summaries a format. Have the model "emit 3 lines of markdown bullets capturing key facts" — far more stable than "summarise freely."
  • Extract important facts immediately. When the user says "I'm allergic to peanuts", extract it on the spot into the system prompt — don't rely on future summarisation.
  • Don't store raw tool output. Multi-KB JSON instantly devours context. Summarise before feeding back.
  • For long tasks, use LangGraph or another library with checkpointing. Hand-rolling history splicing is a bug magnet.
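Tying the first note to the sliding-window concept: a small sketch of trimming history to fit a budget, oldest messages first. `count_tokens` here is a crude word-count stand-in — a real implementation would use the model's tokenizer:

```python
# Sketch: trim history to a token budget, dropping oldest turns first.
def count_tokens(msg):
    # Crude stand-in: word count. Use the model's real tokenizer in practice.
    return len(msg["content"].split())

def trim_to_budget(history, budget):
    """Keep the newest messages whose combined token count fits the budget."""
    kept, total = [], 0
    for msg in reversed(history):      # walk newest → oldest
        total += count_tokens(msg)
        if total > budget:
            break                      # everything older is dropped
        kept.append(msg)
    return list(reversed(kept))        # restore chronological order
```

Note that this walks from the newest message backwards — trimming oldest-first from the front risks an off-by-one that silently drops the current question.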

Easy confusions

Short-term vs. long-term memory:

  • Short-term — **within the current session.** Just the messages in the prompt.
  • Long-term — **across sessions, persisted.** Stored in a DB / vector store; pulled back next session.

Short-term memory vs. context window:

  • Short-term memory — **application strategy**: how to fit the conversation into the window.
  • Context window — **model attribute**: how big the window itself is.

Further reading