## Key Idea
In one line: Local inference = download model weights and run them on your own machine, no cloud API. With quantisation + efficient inference engines (llama.cpp / Ollama / vLLM), 8B fits a laptop, 70B-quantised fits a desktop — data privacy, zero API cost, fully offline.
## What it is
The full pipeline:
1. Pick an open-weights model (Llama 3 / Qwen / DeepSeek)
2. Download a quantised version (.gguf / .safetensors)
3. Load it in an inference engine (Ollama / llama.cpp / vLLM)
4. Hit it via an OpenAI-compatible API → done
`ollama run llama3`: one command and you have a local chat-LLM API.
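Step 4 of the pipeline can be sketched with only the standard library. This assumes an Ollama server on its default port (11434), which exposes an OpenAI-compatible endpoint under `/v1`; the request construction is the interesting part, so the actual network call is left as a comment:

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

def chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("llama3", "Say hello in five words.")
print(req.full_url)  # http://localhost:11434/v1/chat/completions

# To actually send it (requires a running `ollama serve`):
#   with urllib.request.urlopen(req) as r:
#       print(json.load(r)["choices"][0]["message"]["content"])
```

The same request body works against any engine in the table below that speaks the OpenAI wire format (Ollama, vLLM, llama.cpp's server mode).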
## Analogy
- API call = eating out at a restaurant — convenient but you pay each time and the menu is fixed.
- Local inference = cooking at home — more setup, but data never leaves the kitchen, you can pick any flavour, and long-term it's cheaper.
## Key concepts
**GGUF format**
llama.cpp's container format for quantised models. Universal across CPU and GPU.
**Inference engine**
llama.cpp / vLLM / SGLang / TGI / MLX — the program that actually runs the model.
**Throughput (tokens/sec)**
The key local-inference metric. Consumer GPU at 7B INT4 ≈ 50 tok/s.
**VRAM footprint**
= quantised weights + KV cache. KV cache balloons with long contexts.
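The footprint formula can be made concrete with a few lines of arithmetic. The dimensions below (32 layers, 8 KV heads with GQA, head dim 128) are assumed Llama-3-8B-style values for illustration; the point is that weights are fixed while the KV cache grows linearly with context length:

```python
def vram_gb(params_b: float, bits: int, n_layers: int, n_kv_heads: int,
            head_dim: int, ctx_len: int, kv_bytes: int = 2) -> float:
    """Quantised weights + fp16 KV cache, in GB (ignores runtime overhead)."""
    weights = params_b * 1e9 * bits / 8
    # KV cache: 2 tensors (K and V) per layer, per token
    kv = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * ctx_len
    return (weights + kv) / 1e9

# Assumed 8B-class dims: 32 layers, 8 KV heads, head_dim 128
print(round(vram_gb(8, 4, 32, 8, 128, 8192), 1))   # 5.1  (4 GB weights + ~1 GB KV at 8k)
print(round(vram_gb(8, 4, 32, 8, 128, 65536), 1))  # 12.6 (KV cache balloons at 64k)
```

At 64k context the KV cache alone outweighs the INT4 weights, which is why long-context serving is memory-bound even for small models.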
## Stack comparison
| Tool | Best for | Features |
|---|---|---|
| Ollama | Individuals / small teams | Single-command, OpenAI-compatible API |
| llama.cpp | Maximum control / embedded | C++, broadest hardware coverage, GGUF standard |
| LM Studio | Desktop GUI | Graphical, beginner-friendly |
| vLLM / SGLang | Production-grade serving | High throughput, PagedAttention, batching |
| MLX / Core ML | Apple Silicon | Native acceleration on M-series chips |
## How it works
Application code can stay identical to the cloud-API version: just change `OPENAI_BASE_URL` to point at localhost.
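A minimal sketch of that switch, assuming Ollama's default endpoint. The official `openai` SDK reads the same environment variable, so the application code itself needs no changes (SDK usage shown as a comment):

```python
import os

def resolve_base_url() -> str:
    """Cloud by default; any OpenAI-compatible local server via one env var."""
    return os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")

os.environ["OPENAI_BASE_URL"] = "http://localhost:11434/v1"  # Ollama's default
print(resolve_base_url())  # http://localhost:11434/v1

# With the official SDK the rest of the app is unchanged (sketch):
#   from openai import OpenAI
#   client = OpenAI(base_url=resolve_base_url(), api_key="ollama")  # key unused locally
#   client.chat.completions.create(model="llama3", messages=[...])
```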
## Practical notes
- Beginners: use Ollama. `ollama run llama3.1` or `ollama run qwen2.5` gets you from zero to chatting in minutes.
- VRAM rule of thumb: quantised model size + 1–2 GB KV cache + system overhead. 8 GB → 7B INT4; 12 GB → 13B; 24 GB → 70B INT4 (just barely).
- Production: use vLLM. With PagedAttention KV caching and continuous batching, throughput is an order of magnitude higher than Ollama's.
- MoE models cut compute, not weights: Mixtral 8x7B still has to hold all ~47B parameters, but only ~13B are active per token, so it decodes at the speed of a much smaller dense model.
- Always try quantised first. fp16 70B needs 4×A100; INT4 70B fits on 1–2 4090s or a Mac Studio.
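The quantised-first advice above is just arithmetic on bytes per weight; a quick sketch of raw weight storage at each precision (GGUF adds a small metadata overhead on top):

```python
def weight_gb(params_b: float, bits: float) -> float:
    """Raw weight storage in GB: parameters (billions) x bits per weight / 8."""
    return params_b * bits / 8

for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit: {weight_gb(70, bits):.0f} GB")
# 70B @ 16-bit: 140 GB  -> multiple 80 GB A100s
# 70B @ 8-bit: 70 GB
# 70B @ 4-bit: 35 GB    -> 1-2x 24 GB consumer GPUs or one Mac Studio
```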
## Easy confusions
**Local inference vs cloud API**
- Local inference: data **never leaves the machine**; one-time hardware cost, zero per-call fee.
- Cloud API: **compute outsourced** to the cloud; pay-as-you-go, with a very high ceiling.

**Local inference vs local training**
- Local inference: forward pass only, so VRAM pressure is **mostly KV cache**.
- Local training: weights + gradients + optimizer states, roughly **3–5× the VRAM**.
## Further reading
- Quantization — what makes local inference feasible
- LLM — the application-layer view
- Tools: Ollama, LM Studio, llama.cpp, vLLM, MLX