## Key Idea
In one line: Local inference = download model weights and run them on your own machine, no cloud API. With quantisation + efficient inference engines (llama.cpp / Ollama / vLLM), 8B fits a laptop, 70B-quantised fits a desktop — data privacy, zero API cost, fully offline.
## What it is
The full pipeline:
1. Pick an open-weights model (Llama 3 / Qwen / DeepSeek)
2. Download a quantised version (.gguf / .safetensors)
3. Load it in an inference engine (Ollama / llama.cpp / vLLM)
4. Hit it via an OpenAI-compatible API → done
`ollama run llama3`: one command and you have a local chat-LLM API.
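Step 4 of the pipeline can be sketched with only the standard library. This assumes an Ollama server on its default port (11434), which exposes an OpenAI-compatible endpoint under `/v1`; the request construction is the interesting part, so the actual network call is left as a comment:

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

def chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("llama3", "Say hello in five words.")
print(req.full_url)  # http://localhost:11434/v1/chat/completions

# To actually send it (requires a running `ollama serve`):
#   with urllib.request.urlopen(req) as r:
#       print(json.load(r)["choices"][0]["message"]["content"])
```

The same request body works against any engine in the table below that speaks the OpenAI wire format (Ollama, vLLM, llama.cpp's server mode).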
## Analogy
- API call = eating out at a restaurant — convenient but you pay each time and the menu is fixed.
- Local inference = cooking at home — more setup, but data never leaves the kitchen, you can pick any flavour, and long-term it's cheaper.
## Key concepts
**GGUF format**
llama.cpp's container format for quantised models. Universal across CPU and GPU.
**Inference engine**
llama.cpp / vLLM / SGLang / TGI / MLX — the program that actually runs the model.
**Throughput (tokens/sec)**
The key local-inference metric. Consumer GPU at 7B INT4 ≈ 50 tok/s.
**VRAM footprint**
= quantised weights + KV cache. KV cache balloons with long contexts.
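The footprint formula can be made concrete with a few lines of arithmetic. The dimensions below (32 layers, 8 KV heads with GQA, head dim 128) are assumed Llama-3-8B-style values for illustration; the point is that weights are fixed while the KV cache grows linearly with context length:

```python
def vram_gb(params_b: float, bits: int, n_layers: int, n_kv_heads: int,
            head_dim: int, ctx_len: int, kv_bytes: int = 2) -> float:
    """Quantised weights + fp16 KV cache, in GB (ignores runtime overhead)."""
    weights = params_b * 1e9 * bits / 8
    # KV cache: 2 tensors (K and V) per layer, per token
    kv = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * ctx_len
    return (weights + kv) / 1e9

# Assumed 8B-class dims: 32 layers, 8 KV heads, head_dim 128
print(round(vram_gb(8, 4, 32, 8, 128, 8192), 1))   # 5.1  (4 GB weights + ~1 GB KV at 8k)
print(round(vram_gb(8, 4, 32, 8, 128, 65536), 1))  # 12.6 (KV cache balloons at 64k)
```

At 64k context the KV cache alone outweighs the INT4 weights, which is why long-context serving is memory-bound even for small models.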
## Stack comparison
| Tool | Best for | Features |
|---|---|---|
| Ollama | Individuals / small teams | Single-command, OpenAI-compatible API |
| llama.cpp | Maximum control / embedded | C++, broadest hardware coverage, GGUF standard |
| LM Studio | Desktop GUI | Graphical, beginner-friendly |
| vLLM / SGLang | Production-grade serving | High throughput, PagedAttention, batching |
| MLX / Core ML | Apple Silicon | Native acceleration on M-series chips |
## How it works
Application code can stay identical to the cloud-API version: just change `OPENAI_BASE_URL` to point at localhost.
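A minimal sketch of that switch, assuming Ollama's default endpoint. The official `openai` SDK reads the same environment variable, so the application code itself needs no changes (SDK usage shown as a comment):

```python
import os

def resolve_base_url() -> str:
    """Cloud by default; any OpenAI-compatible local server via one env var."""
    return os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")

os.environ["OPENAI_BASE_URL"] = "http://localhost:11434/v1"  # Ollama's default
print(resolve_base_url())  # http://localhost:11434/v1

# With the official SDK the rest of the app is unchanged (sketch):
#   from openai import OpenAI
#   client = OpenAI(base_url=resolve_base_url(), api_key="ollama")  # key unused locally
#   client.chat.completions.create(model="llama3", messages=[...])
```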
## Practical notes
- Beginners: use Ollama. `ollama run llama3.1` or `ollama run qwen2.5` gets you from zero to chatting in minutes.
- VRAM rule of thumb: quantised model size + 1–2 GB KV cache + system overhead. 8 GB → 7B INT4; 12 GB → 13B; 24 GB → 70B INT4 (just barely).
- Production: use vLLM. With PagedAttention KV caching and continuous batching, throughput is an order of magnitude higher than Ollama's.
- MoE models cut compute, not weights: Mixtral 8x7B still has to hold all ~47B parameters, but only ~13B are active per token, so it decodes at the speed of a much smaller dense model.
- Always try quantised first. fp16 70B needs 4×A100; INT4 70B fits on 1–2 4090s or a Mac Studio.
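The quantised-first advice above is just arithmetic on bytes per weight; a quick sketch of raw weight storage at each precision (GGUF adds a small metadata overhead on top):

```python
def weight_gb(params_b: float, bits: float) -> float:
    """Raw weight storage in GB: parameters (billions) x bits per weight / 8."""
    return params_b * bits / 8

for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit: {weight_gb(70, bits):.0f} GB")
# 70B @ 16-bit: 140 GB  -> multiple 80 GB A100s
# 70B @ 8-bit: 70 GB
# 70B @ 4-bit: 35 GB    -> 1-2x 24 GB consumer GPUs or one Mac Studio
```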
## Easy confusions
**Local inference vs cloud API**
- Local inference: data **never leaves the machine**; one-time hardware cost, zero per-call fee.
- Cloud API: **compute outsourced** to the cloud; pay-as-you-go, with a very high ceiling.

**Local inference vs local training**
- Local inference: forward pass only, so VRAM pressure is **mostly KV cache**.
- Local training: weights + gradients + optimizer states, roughly **3–5× the VRAM**.
## Further reading
- Quantization — what makes local inference feasible
- LLM — the application-layer view
- Tools: Ollama, LM Studio, llama.cpp, vLLM, MLX