Key Idea
In one line: vLLM is a GPU-targeted, high-throughput LLM inference engine. PagedAttention makes the KV cache as memory-efficient as paged virtual memory, and continuous batching keeps throughput maxed. It is the default choice for serving production OpenAI-compatible APIs.
Start serving in one line#

```bash
pip install vllm

# Single-GPU 7B
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8000

# Multi-GPU tensor parallel
... --tensor-parallel-size 2

# Quantised
... --quantization awq
```

The API is fully OpenAI-compatible (`/v1/chat/completions`, `/v1/completions`, `/v1/models`); point the openai SDK's `base_url` at the server and you're good.
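To illustrate the OpenAI-compatible endpoint, here is a minimal stdlib-only sketch that builds a chat-completions request against the server launched above. The base URL and model name assume the launch command shown earlier; actually sending the request requires the server to be running.

```python
import json
from urllib import request

# Endpoint of the server launched above; model must match --model.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(prompt: str, model: str = "Qwen/Qwen2.5-7B-Instruct"):
    """Build an OpenAI-style /v1/chat/completions request (not yet sent)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    return request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# To actually send (requires the server to be running):
# with request.urlopen(build_chat_request("Hello")) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Because the request shape is the standard OpenAI one, the official `openai` SDK works unchanged once its `base_url` is set to `http://localhost:8000/v1`.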
Analogy#
HuggingFace transformers' built-in generate() is the home printer: it works, but queues up at scale.
vLLM is the commercial press: imposition, batching, pipelined — ten thousand pages an hour without breaking a sweat.
Key concepts#
- **PagedAttention** — KV cache split into uniform 16-token blocks, allocated on demand → near-zero memory fragmentation.
- **Continuous batching** — new requests can join an in-flight batch between decode steps — 5–20× throughput vs static batching.
- **Tensor parallel** — slice each layer's weights across GPUs (`--tensor-parallel-size 2/4/8`). Best with NVLink / same NUMA node.
- **Pipeline parallel** — different layers on different GPUs; for very large models or multi-node setups.
- **Quantisation (AWQ / GPTQ / FP8)** — vLLM loads several quantised formats directly.
- **Prefix cache** — shared prefixes (e.g. system prompts) are computed once (`--enable-prefix-caching`).
- **Speculative decoding** — built-in draft-model / Medusa support.
- **LoRA hot-swap** — load multiple LoRA adapters at runtime and route per request.
Practical notes#
- `--max-model-len`: tune for VRAM; defaults to the model's full context length and may OOM.
- `--gpu-memory-utilization`: defaults to 0.9; lower it (e.g. to 0.45) if sharing a GPU between models.
- Quant choice: AWQ usually best quality; GPTQ broadest compatibility; FP8 only on H100/H200-class GPUs.
- Multi-LoRA: `--enable-lora --lora-modules name1=path1 name2=path2`; a per-request `model: name1` selects an adapter.
- Speculative decoding: pair with `--speculative-model`; typical speedup 1.5–2×.
- K8s deployment: KubeRay or LWS (LeaderWorkerSet) for multi-node tensor parallel.
- Alternatives: TensorRT-LLM (NVIDIA), SGLang (richer routing), TGI (HuggingFace), llama.cpp (CPU).
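Putting several of these flags together, a combined launch for a shared GPU with multi-LoRA and prefix caching might look like the sketch below. The adapter names and paths (`sql-lora`, `chat-lora`) are placeholders, not real artifacts.

```shell
# Hypothetical combined launch: capped context, shared GPU,
# prefix caching, and two hot-swappable LoRA adapters.
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.45 \
  --enable-prefix-caching \
  --enable-lora \
  --lora-modules sql-lora=/path/to/sql_adapter chat-lora=/path/to/chat_adapter
```

With this running, a request whose body sets `"model": "sql-lora"` is routed through that adapter; `"model": "Qwen/Qwen2.5-7B-Instruct"` hits the base model.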
Easy confusions#
- **vLLM** — GPU serving, **high-concurrency production**; OpenAI-compatible API.
- **Ollama / llama.cpp** — CPU + GPU general purpose, **single-machine convenience**; can run 70B, but throughput is lower.