In one line: RAG = Retrieval-Augmented Generation. First fetch relevant snippets from your own corpus, then splice them into the prompt so the model answers from sources. It is the industry-default way to give an LLM private knowledge, fresh data, or huge documents.
## What it is
An LLM's built-in knowledge has a cutoff date and contains nothing about your private data. RAG does not touch the model weights; it drops source material into the context on the fly:
User: "What's our Q3 refund policy?"
[retrieve] Pull the 3 most relevant snippets from the company knowledge base
[assemble] Stuff snippets + question into the prompt
[generate] Model writes the answer grounded in the snippets, with citations
Every answer is grounded in the real material it was just shown — update the knowledge base and the answer follows immediately.
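A minimal sketch of that loop in Python. The corpus, the word-overlap `search`, and the prompt wording are all toy stand-ins (a real system would use a vector database and an LLM client), but the retrieve → assemble → generate shape is the same:

```python
# Toy in-memory corpus; in production these snippets live in a vector DB.
CORPUS = [
    "Refunds within 30 days are issued to the original payment method.",
    "Q3 policy update: refunds over $500 need manager approval.",
    "Shipping is free on orders above $50.",
]

def search(query: str, k: int = 3) -> list[str]:
    """[retrieve] Toy relevance: rank snippets by word overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda s: len(words & set(s.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(question: str, snippets: list[str]) -> str:
    """[assemble] Stuff numbered snippets + the question into one prompt."""
    context = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, 1))
    return (
        "Answer using ONLY the snippets below and cite their numbers.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

# [generate] Hand the assembled prompt to your LLM client of choice.
print(build_prompt("What's our Q3 refund policy?", search("Q3 refund policy", k=2)))
```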
## Analogy
LLM without RAG = closed-book exam: the model relies on memory and gets things wrong.
RAG = open-book exam: flip to the relevant page first, copy the key bits, then write in your own words. Accuracy improves dramatically because every answer is grounded in the page.
## How it works
Index once (offline), query every time (online).
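A sketch of that split, with a deliberately crude letter-frequency `embed` standing in for a real embedding model and a plain list standing in for the vector database; only the offline/online boundary is the point:

```python
import math

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model (API call or local model);
    # letter frequencies are enough to demonstrate the flow.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Index once (offline): chunk the corpus, embed every chunk, store the vectors.
chunks = ["Refunds within 30 days.", "Q3: refunds over $500 need approval."]
index = [(chunk, embed(chunk)) for chunk in chunks]  # a vector DB in real life

# Query every time (online): embed the question, return the nearest chunk.
query_vec = embed("What is the refund approval rule?")
best = max(index, key=lambda item: cosine(query_vec, item[1]))
print(best[0])
```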
## Practical notes
- Recall matters more than generation. The large majority of bad RAG answers trace back to "we never retrieved the right snippet". Tune retrieval first (chunking + embedding model + rerank), then tune the prompt.
- Chunks must carry context. A flat 500-character cut loses titles and section info. Prepending "document name + section heading" to each chunk usually delivers a huge quality jump (a minimal sketch follows this list).
- Hybrid retrieval. Run vector search and BM25 keyword search together, then rerank; currently the most reliable recipe (see the fusion sketch after this list).
- Force citations. Tell the model that every claim must carry a [^1]-style marker referencing the snippet number, and to refuse to answer when no snippet supports the claim; this sharply cuts hallucination rates (a prompt template follows this list).
- Top-K is not "bigger is better". K = 3–5 is usually optimal. Too high a K stuffs noise into the prompt and dilutes the correct answer.
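On carrying context: a one-function sketch of the prepending trick (the names `doc_title` and `section` are illustrative, not a library API):

```python
def contextualize(chunk: str, doc_title: str, section: str) -> str:
    # A bare 500-character cut loses its surroundings; prefixing the
    # document name and section heading keeps the chunk self-describing.
    return f"{doc_title} > {section}\n{chunk}"

print(contextualize(
    "Refunds over $500 need manager approval.",
    doc_title="Customer Policy Handbook",
    section="Q3 Refund Policy",
))
```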
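On hybrid retrieval: the two result lists are commonly merged with reciprocal rank fusion (RRF), which needs no score calibration across the two systems. A self-contained sketch with made-up document IDs:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: each list contributes 1 / (k + rank) per doc,
    # so agreement between vector and keyword search floats a doc to the top.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]  # from the embedding index
bm25_hits = ["doc1", "doc9", "doc3"]    # from the keyword index
print(rrf([vector_hits, bm25_hits]))    # doc1 first: ranked well by both
```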
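On forcing citations: one possible prompt template (the wording is illustrative; adapt it to your model):

```python
CITATION_PROMPT = """\
Answer the question using ONLY the numbered snippets below.
- Mark every claim with [^n], where n is the supporting snippet's number.
- If no snippet supports an answer, reply: "Not found in the provided sources."

{snippets}

Question: {question}
"""

print(CITATION_PROMPT.format(
    snippets="[1] Q3 policy update: refunds over $500 need manager approval.",
    question="Who approves large refunds?",
))
```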
## Easy confusions

| Often confused with | That approach | RAG |
| --- | --- | --- |
| Fine-tuning | Expensive and hard to refresh. | Content updates instantly; weights stay frozen. |
| Long context (stuff the whole corpus into the prompt) | Expensive, slow, and effective attention degrades. | Cheap, fast, and scales out. |
## Further reading
- Embeddings — the "semantic fingerprint" RAG runs on
- Vector Database — the storage layer for embeddings
- Chunking — the slicing strategy that caps RAG quality
- LangChain / LlamaIndex — mainstream RAG frameworks