
vLLM

The de-facto standard for high-performance LLM serving — PagedAttention + Continuous Batching.

Key Idea

In one line: vLLM is a GPU-targeted high-throughput LLM inference engine. PagedAttention makes the KV cache as memory-efficient as paged virtual memory; Continuous Batching keeps throughput maxed. The default for production OpenAI-compatible APIs.

Start serving in one line#

pip install vllm
 
# Single-GPU 7B
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8000
 
# Multi-GPU tensor parallel
... --tensor-parallel-size 2
 
# Quantised
... --quantization awq

The API is fully OpenAI-compatible (/v1/chat/completions, /v1/completions, /v1/models): point the openai SDK's base_url at the server and you're good, as in the sketch below.
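
A minimal client against the server started above, using the official openai Python SDK (this assumes the server is listening on localhost:8000 and serving Qwen/Qwen2.5-7B-Instruct; the api_key is a dummy value):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # dummy key

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```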

Analogy#


HuggingFace transformers' built-in generate() is the home printer: fine for one job, but a queue forms at scale.
vLLM is the commercial press: imposition, batching, and pipelining turn out ten thousand pages an hour without breaking a sweat.

Key concepts#

  • PagedAttention (paged KV cache): the KV cache is split into fixed 16-token blocks allocated on demand, so fragmentation is near zero (see the arithmetic sketch after this list).
  • Continuous batching: new requests join an in-flight batch instead of waiting for it to drain; 5–20× the throughput of naive static batching (a toy scheduler is sketched under "How it works" below).
  • Tensor parallel: slice each layer across GPUs with --tensor-parallel-size 2/4/8; best on the same NUMA node / over NVLink.
  • Pipeline parallel: put different layers on different GPUs; for very large models or multi-node setups.
  • Quantisation (AWQ / GPTQ / FP8): vLLM loads several quantised formats directly.
  • Prefix cache: shared prefixes (e.g. system prompts) are computed once; enable with --enable-prefix-caching.
  • Speculative decoding: built-in draft-model / Medusa support.
  • LoRA hot-swap: load multiple LoRA adapters at runtime and route to them per request.
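
To make the fragmentation point concrete, here is a back-of-the-envelope sketch with toy numbers (the 16-token block size is vLLM's default; the 4096-token context is an assumed --max-model-len): a naive cache reserves the full context window per sequence, while a paged cache allocates only the blocks a sequence has actually filled.

```python
import math

BLOCK_SIZE = 16        # tokens per KV-cache block (vLLM default)
MAX_MODEL_LEN = 4096   # assumed --max-model-len for this example

def paged_blocks(tokens_in_sequence: int) -> int:
    """Blocks actually allocated for a sequence under paging."""
    return math.ceil(tokens_in_sequence / BLOCK_SIZE)

# A naive cache preallocates the whole context window per sequence.
naive_blocks = MAX_MODEL_LEN // BLOCK_SIZE

for tokens in (37, 512, 3900):
    paged = paged_blocks(tokens)
    saved = 100 * (1 - paged / naive_blocks)
    print(f"{tokens:>5} tokens: {paged:>4} blocks paged vs {naive_blocks} preallocated ({saved:.0f}% saved)")
```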

How it works#
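
PagedAttention decides where each sequence's KV blocks live; continuous batching decides which sequences share each decode step. The toy scheduler below is a schematic with made-up request lengths, not vLLM's actual scheduler: finished sequences leave the batch mid-flight and waiting requests immediately take the freed slots, so the GPU never idles behind the longest request.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    steps_left: int  # decode steps remaining (stand-in for tokens to generate)

def continuous_batching(waiting: deque, max_batch_size: int = 4) -> None:
    """Toy scheduler: refill the running batch at every decode step."""
    running: list = []
    step = 0
    while waiting or running:
        # Admit new requests mid-flight until the batch is full.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step for every sequence currently in the batch.
        for r in running:
            r.steps_left -= 1
        finished = [r.rid for r in running if r.steps_left == 0]
        running = [r for r in running if r.steps_left > 0]
        step += 1
        if finished:
            print(f"step {step}: {finished} done, freed slots refill next step")

continuous_batching(deque(Request(f"req{i}", n) for i, n in enumerate((3, 8, 2, 6, 4))))
```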

Practical notes#

  • --max-model-len: tune for VRAM; defaults to the model's full context and may OOM.
  • --gpu-memory-utilization: default 0.9; lower to 0.45 if sharing a GPU between models.
  • Quant choice: AWQ usually best quality; GPTQ broadest compatibility; FP8 only on H100/H200-class.
  • Multi-LoRA: --enable-lora --lora-modules name1=path1 name2=path2; setting model: name1 in a request selects that adapter (example after this list).
  • Speculative decoding: pair with --speculative-model; typical speedup 1.5–2×.
  • K8s deployment: KubeRay or LWS (LeaderWorkerSet) for multi-node tensor parallel.
  • Alternatives: TensorRT-LLM (NVIDIA), SGLang (richer routing), TGI (HuggingFace), llama.cpp (CPU).
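
For multi-LoRA serving, a request picks an adapter simply by naming it in the model field. In this sketch, name1 is the adapter name registered via --lora-modules above; the prompt and max_tokens are arbitrary.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # dummy key

# With --enable-lora --lora-modules name1=path1 name2=path2 on the server,
# requests whose `model` matches a registered adapter name are routed to that LoRA;
# using the base model name skips the adapters entirely.
resp = client.completions.create(
    model="name1",            # adapter registered on the server
    prompt="Summarise the ticket below:\n...",
    max_tokens=64,
)
print(resp.choices[0].text)
```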

Easy confusions#

  • vLLM: GPU serving, **high-concurrency production**; OpenAI-compatible API.
  • Ollama / llama.cpp: CPU + GPU general purpose, **single-machine convenience**; can run 70B, but throughput is lower.

Further reading#