Token

The smallest piece of text an AI handles — and the unit you get billed in.

Key Idea

In one line: A Token is the LLM's smallest "building block" of text — not a character, not a word, but a fragment of characters pre-baked into the model's vocabulary. Reading, writing, billing, and length limits all happen in Tokens.

What it is

An LLM does not consume human text directly. It first splits the text into a sequence of Tokens, then maps each Token to an integer ID. Common tokenizers (BPE, SentencePiece) keep frequent character strings whole and slice rare ones into pieces (see the sketch after this list), so:

  • English: 1 Token ≈ 0.75 words
  • Chinese: 1 Token ≈ 1–2 characters
  • Code / emoji / rare characters: a single character may become multiple Tokens
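
A quick way to see these ratios yourself is OpenAI's tiktoken library (a minimal sketch; the counts are for the cl100k_base encoding and will differ for other models):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4 / GPT-3.5-turbo.
enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The weather is really nice today.",
    "Chinese": "今天天气真好",
    "Emoji": "🚀🔥👍",
}

for label, text in samples.items():
    ids = enc.encode(text)
    # Decode each ID individually to see where the splits fall.
    # Partial UTF-8 sequences (common for Chinese/emoji) show up as �.
    pieces = [enc.decode([i]) for i in ids]
    print(f"{label}: {len(text)} chars -> {len(ids)} tokens {pieces}")
```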

Analogy

Your brain reads "today's weather is great" word by word. The model reads it as "today / 's / weather / is / great", where each chunk is a Token.

Key concepts

  • **Token ID** (vocabulary index): every Token has a unique integer in the vocab (GPT's vocab is ~100K tokens).
  • **BPE / WordPiece** (splitting algorithm): keeps frequent strings whole and breaks rare ones down, balancing **compression and full coverage**.
  • **Special Tokens** (special markers): <bos> / <eos> / <pad> / <system>, marking sequence start, sequence end, padding, and chat roles.
  • **Token billing** (pricing model): APIs charge per input + output Token, so long contexts get expensive fast.
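
To make Token IDs and special tokens concrete, here is a small probe with tiktoken (note: cl100k_base uses <|endoftext|> rather than <bos>/<eos>; the exact names and IDs are vocabulary-specific):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Vocabulary size: every Token ID is an integer in [0, n_vocab).
print(enc.n_vocab)  # 100277 for cl100k_base

# A word the vocabulary does not store whole gets sliced into pieces.
ids = enc.encode("tokenization")
print([enc.decode_single_token_bytes(i) for i in ids])  # e.g. b'token', b'ization'

# Special tokens must be explicitly allowed when encoding raw text.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
```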

How it works

The tokenizer is trained and frozen before the model itself is trained: the vocabulary is learned once from a large corpus, typically by repeatedly merging the most frequent adjacent character pairs (BPE), and every user of the model then shares that same fixed vocabulary.
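
A toy sketch of that BPE training loop (illustrative only; real tokenizers such as tiktoken or SentencePiece operate on bytes over far larger corpora):

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    # Start with each word as a sequence of single characters.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere: frequent strings congeal into one Token.
        merged = best[0] + best[1]
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

# Frequent substrings ("lo", "low") get merged into whole Tokens;
# rare words stay sliced into smaller pieces.
print(bpe_train(["low", "lower", "lowest", "newest"], num_merges=4))
```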

Practical notes

  • Estimating Token count: for English, roughly words × 1.3; for Chinese, characters × 1.5. Good enough to ballpark API cost (see the helper after this list).
  • Saving Tokens: trim filler in prompts; compress tabular data into JSON/CSV; summarise long documents before feeding them in.
  • Watch out for rare characters: emoji, traditional-Chinese rare glyphs, obscure code symbols often each become several Tokens — lengths blow up easily.
  • Tokenizers are not interchangeable: GPT-4's tokenizer differs from Claude's / Qwen's / Llama's. The same paragraph can vary by 30% in Token count.
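
A rough estimator matching those heuristics, plus a check of the last point using two of OpenAI's own encodings (a sketch; classifying text as Chinese by CJK character share is a simplification, and comparing against Claude / Qwen / Llama would require their own tokenizer libraries):

```python
import tiktoken

def estimate_tokens(text: str) -> float:
    """Ballpark Token count: words x 1.3 for English, chars x 1.5 for Chinese."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    if cjk > len(text) / 2:              # mostly Chinese
        return len(text) * 1.5
    return len(text.split()) * 1.3       # mostly English

text = "Large language models read and write text as Tokens."
print("estimate:", estimate_tokens(text))

# The same text tokenizes differently under different vocabularies.
for name in ("cl100k_base", "o200k_base"):   # GPT-4 vs GPT-4o encodings
    enc = tiktoken.get_encoding(name)
    print(name, "->", len(enc.encode(text)))
```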

Easy confusions

  • **Token (LLM view)**: a **pre-defined string fragment** in the model's vocabulary. Could be half a word, a whole word, or several words.
  • **Word / Character (human view)**: the smallest semantic unit of natural language. Has **no one-to-one mapping** to the model's vocabulary.
Count your own Tokens

OpenAI's tokenizer page lets you paste text and see the split. In Python, tiktoken computes it in one line.
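
For example (assuming tiktoken's encoding_for_model helper to pick the right vocabulary; swap in your target model name):

```python
import tiktoken

# One line: count the Tokens GPT-4 would see for a given string.
print(len(tiktoken.encoding_for_model("gpt-4").encode("Paste your text here")))
```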

Further reading

  • Context Window — how many Tokens you can fit at once
  • Parameters — model size vs Token throughput
  • Chunking — slicing long documents into Token-friendly pieces for RAG