## Key Idea
In one line: Chunking = slicing a long document into hundred-to-thousand-character pieces. RAG retrieves and feeds the model in chunks — so how you split, how big, with what overlap essentially caps RAG quality.
## What it is
A 50-page PDF can't be tossed into a vector store wholesale: a single embedding of the whole file is unfocused, and the full text won't fit back into the model's context window at generation time. Common practice:
```text
Source: 50-page PDF (~50,000 chars)
        ↓ chunking
chunk_001: piece 1, 500 chars
chunk_002: piece 2, 500 chars (overlaps chunk_001 by 50 chars)
...
chunk_111: piece 111, 500 chars (a 450-char step over 50,000 chars gives ~111 chunks)
```
Each chunk is embedded, retrieved, and fed to the model on its own.
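As a sketch, fixed-length chunking with the numbers from the diagram above (500-char window, 50-char overlap) can be as simple as the following; `chunk_text` is an illustrative helper, not a specific library's API:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Slide a size-char window forward by (size - overlap) chars."""
    assert 0 <= overlap < size, "overlap must be smaller than chunk size"
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        # Everything up to start + overlap is already covered by the
        # previous chunk; stop once no new content remains.
        if start > 0 and start + overlap >= len(text):
            break
        chunks.append(text[start:start + size])
    return chunks
```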
## Analogy
Nobody memorises an encyclopedia cover-to-cover to look something up.
You flip to "the most relevant pages" — chunking pre-cuts the book into "pages", each with one semantic fingerprint, and you find pages by fingerprint.
## Key concepts
**Chunk size**
Typically 200–800 chars. Too small loses context; too large dilutes the topic.
**Overlap**
Adjacent chunks share some characters (10–20%) to avoid cutting topics in half.
**Splitter (splitting strategy)**
By chars / tokens / sentences / paragraphs / markdown headings…
**Metadata**
Attach doc_id / section title / page number to each chunk so you can trace provenance.
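To make these concrete, here is what a chunk record might look like in code; the `Chunk` class and its field names are illustrative, not any particular library's schema:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str                 # the ~200-800 char body that gets embedded
    doc_id: str               # provenance: which source document
    section: str              # nearest heading, e.g. "Refund Policy"
    page: int | None = None   # page number, if the source is paginated

chunk = Chunk(
    text="Refunds are processed within 14 days...",
    doc_id="handbook-v2.pdf",
    section="Refund Policy",
    page=12,
)
```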
## Three mainstream strategies
- Fixed length: simplest; suits short, uniform content such as flashcards and forum Q&A.
- Structural: for documents with a clear heading hierarchy (books, API docs); the preferred default, since it respects natural semantic boundaries.
- Semantic chunking: cut wherever the embedding distance between adjacent sentences jumps suddenly; best quality, also most expensive (a sketch follows this list).
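A minimal sketch of the semantic strategy, assuming an `embed` callable that maps a list of sentences to unit-normalised vectors (any sentence-embedding model would do) and a distance threshold you would tune per model; `semantic_chunks` is an illustrative name:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.3) -> list[str]:
    """Cut between adjacent sentences whose embedding distance jumps past threshold."""
    if not sentences:
        return []
    vecs = embed(sentences)  # assumed shape: (n_sentences, dim), unit-normalised
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine distance between neighbours; a spike marks a topic shift.
        dist = 1.0 - float(np.dot(vecs[i - 1], vecs[i]))
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```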
## Practical notes
- Cut by structure first, then by length. H1/H2/H3, lists, tables are natural boundaries — don't break them.
- Tune chunk size with the embedding model. Most embedding models top out around 512 tokens; longer gets truncated.
- Prefix ancestor headings to each chunk, e.g. `# Chapter 3 - ## Refund Policy - chunk text`. Helps retrieval and generation alike (see the sketch after this list).
- Don't chunk small docs. Under ~2,000 chars, keep the whole doc as a single chunk: less cutting, less noise.
- Two-granularity indexing. Index sentence-level chunks for recall, then expand each hit to its surrounding paragraph / section before feeding the model.
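Putting "cut by structure first" and "prefix ancestor headings" together, here is a hedged sketch of a structural splitter; `split_markdown` is an illustrative name, not a library function, and it assumes well-nested `#`/`##`/`###` headings (text before the first heading is dropped for brevity):

```python
import re

def split_markdown(md: str) -> list[str]:
    """Split at #/##/### headings; prefix each chunk with its heading path."""
    heading = re.compile(r"^(#{1,3})\s+(.*)$", re.MULTILINE)
    matches = list(heading.finditer(md))
    stack: list[str] = []   # ancestor headings of the current position
    chunks: list[str] = []
    for i, m in enumerate(matches):
        level, title = len(m.group(1)), m.group(2).strip()
        del stack[level - 1:]     # pop headings at this level or deeper
        stack.append(title)
        end = matches[i + 1].start() if i + 1 < len(matches) else len(md)
        body = md[m.end():end].strip()
        if body:                  # skip headings with no text of their own
            chunks.append(" - ".join(stack) + "\n" + body)
    return chunks

doc = "# Chapter 3\n## Refund Policy\nRefunds are processed within 14 days.\n"
print(split_markdown(doc)[0])
# Chapter 3 - Refund Policy
# Refunds are processed within 14 days.
```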
## Easy confusions
**Chunking**
**App-layer**: split long text into "passages" (hundreds of chars).
**Tokenizing**
**Model-layer**: split text into sub-word "tokens" that the model actually consumes.
Different abstraction layer entirely.
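A quick way to feel the difference, assuming OpenAI's tiktoken package is installed: the same passage has one character length (what chunking budgets against) and a much smaller token count (what the model consumes):

```python
import tiktoken  # assumption: the tiktoken package is available

enc = tiktoken.get_encoding("cl100k_base")
passage = "Refunds are processed within 14 days of the return being received."

print(len(passage))              # chunking operates on this: character length
print(len(enc.encode(passage)))  # tokenizing operates on this: sub-word ids
```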
## Further reading
- RAG — what chunking ultimately serves
- Embeddings — each chunk gets embedded
- Vector Database — where the chunks live