
vLLM

The de-facto standard for high-performance LLM serving — PagedAttention + Continuous Batching.

Key Idea

In one line: vLLM is a GPU-targeted high-throughput LLM inference engine. PagedAttention makes the KV cache as memory-efficient as paged virtual memory; Continuous Batching keeps throughput maxed. The default for production OpenAI-compatible APIs.

Start serving in one line#

pip install vllm
 
# Single-GPU 7B
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8000
 
# Multi-GPU tensor parallel
... --tensor-parallel-size 2
 
# Quantised
... --quantization awq

The API is fully OpenAI-compatible (/v1/chat/completions, /v1/completions, /v1/models): point the openai SDK's base_url at the server and you're good, as in the sketch below.
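
A minimal client against the server started above, using the official openai Python SDK (this assumes the server is listening on localhost:8000 and serving Qwen/Qwen2.5-7B-Instruct; the api_key is a dummy value):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # dummy key

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```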

Analogy#


HuggingFace transformers' built-in generate() is the home printer: fine for one job, but a queue forms at scale.
vLLM is the commercial press: imposition, batching, and pipelining turn out ten thousand pages an hour without breaking a sweat.

Key concepts#

  • PagedAttention (paged KV cache): the KV cache is split into fixed 16-token blocks allocated on demand, so fragmentation is near zero (see the arithmetic sketch after this list).
  • Continuous batching: new requests join an in-flight batch instead of waiting for it to drain; 5–20× the throughput of naive static batching (a toy scheduler is sketched under "How it works" below).
  • Tensor parallel: slice each layer across GPUs with --tensor-parallel-size 2/4/8; best on the same NUMA node / over NVLink.
  • Pipeline parallel: put different layers on different GPUs; for very large models or multi-node setups.
  • Quantisation (AWQ / GPTQ / FP8): vLLM loads several quantised formats directly.
  • Prefix cache: shared prefixes (e.g. system prompts) are computed once; enable with --enable-prefix-caching.
  • Speculative decoding: built-in draft-model / Medusa support.
  • LoRA hot-swap: load multiple LoRA adapters at runtime and route to them per request.
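
To make the fragmentation point concrete, here is a back-of-the-envelope sketch with toy numbers (the 16-token block size is vLLM's default; the 4096-token context is an assumed --max-model-len): a naive cache reserves the full context window per sequence, while a paged cache allocates only the blocks a sequence has actually filled.

```python
import math

BLOCK_SIZE = 16        # tokens per KV-cache block (vLLM default)
MAX_MODEL_LEN = 4096   # assumed --max-model-len for this example

def paged_blocks(tokens_in_sequence: int) -> int:
    """Blocks actually allocated for a sequence under paging."""
    return math.ceil(tokens_in_sequence / BLOCK_SIZE)

# A naive cache preallocates the whole context window per sequence.
naive_blocks = MAX_MODEL_LEN // BLOCK_SIZE

for tokens in (37, 512, 3900):
    paged = paged_blocks(tokens)
    saved = 100 * (1 - paged / naive_blocks)
    print(f"{tokens:>5} tokens: {paged:>4} blocks paged vs {naive_blocks} preallocated ({saved:.0f}% saved)")
```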

How it works#
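
PagedAttention decides where each sequence's KV blocks live; continuous batching decides which sequences share each decode step. The toy scheduler below is a schematic with made-up request lengths, not vLLM's actual scheduler: finished sequences leave the batch mid-flight and waiting requests immediately take the freed slots, so the GPU never idles behind the longest request.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    steps_left: int  # decode steps remaining (stand-in for tokens to generate)

def continuous_batching(waiting: deque, max_batch_size: int = 4) -> None:
    """Toy scheduler: refill the running batch at every decode step."""
    running: list = []
    step = 0
    while waiting or running:
        # Admit new requests mid-flight until the batch is full.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step for every sequence currently in the batch.
        for r in running:
            r.steps_left -= 1
        finished = [r.rid for r in running if r.steps_left == 0]
        running = [r for r in running if r.steps_left > 0]
        step += 1
        if finished:
            print(f"step {step}: {finished} done, freed slots refill next step")

continuous_batching(deque(Request(f"req{i}", n) for i, n in enumerate((3, 8, 2, 6, 4))))
```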

Practical notes#

  • --max-model-len: tune for VRAM; defaults to the model's full context and may OOM.
  • --gpu-memory-utilization: default 0.9; lower to 0.45 if sharing a GPU between models.
  • Quant choice: AWQ usually best quality; GPTQ broadest compatibility; FP8 only on H100/H200-class.
  • Multi-LoRA: --enable-lora --lora-modules name1=path1 name2=path2; setting model: name1 in a request selects that adapter (example after this list).
  • Speculative decoding: pair with --speculative-model; typical speedup 1.5–2×.
  • K8s deployment: KubeRay or LWS (LeaderWorkerSet) for multi-node tensor parallel.
  • Alternatives: TensorRT-LLM (NVIDIA), SGLang (richer routing), TGI (HuggingFace), llama.cpp (CPU).
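
For multi-LoRA serving, a request picks an adapter simply by naming it in the model field. In this sketch, name1 is the adapter name registered via --lora-modules above; the prompt and max_tokens are arbitrary.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # dummy key

# With --enable-lora --lora-modules name1=path1 name2=path2 on the server,
# requests whose `model` matches a registered adapter name are routed to that LoRA;
# using the base model name skips the adapters entirely.
resp = client.completions.create(
    model="name1",            # adapter registered on the server
    prompt="Summarise the ticket below:\n...",
    max_tokens=64,
)
print(resp.choices[0].text)
```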

Easy confusions#

  • vLLM: GPU serving, **high-concurrency production**; OpenAI-compatible API.
  • Ollama / llama.cpp: CPU + GPU general purpose, **single-machine convenience**; can run 70B, but throughput is lower.

Further reading#