
Local Inference

Run a large model on your laptop / server, no cloud needed.

Inference · Local · GGUF
Key Idea

In one line: Local inference = download model weights and run them on your own machine, no cloud API. With quantisation + efficient inference engines (llama.cpp / Ollama / vLLM), 8B fits a laptop, 70B-quantised fits a desktop — data privacy, zero API cost, fully offline.

What it is#

The full pipeline:

1. Pick an open-weights model (Llama 3 / Qwen / DeepSeek)
2. Download a quantised version (.gguf / .safetensors)
3. Load it in an inference engine (Ollama / llama.cpp / vLLM)
4. Hit it via an OpenAI-compatible API → done

ollama run llama3 — one command and you have a local chat-LLM API.
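For steps 2–4 without Ollama, here is a minimal sketch using the llama-cpp-python bindings; the GGUF path is a placeholder for whatever quantised file you downloaded, and it assumes `pip install llama-cpp-python`.

```python
# Minimal sketch: load a quantised GGUF and chat with it locally.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path to your GGUF
    n_ctx=4096,        # context window to allocate the KV cache for
    n_gpu_layers=-1,   # offload every layer to the GPU if one is available
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}]
)
print(reply["choices"][0]["message"]["content"])
```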

Analogy#

  • API call = eating out at a restaurant — convenient but you pay each time and the menu is fixed.
  • Local inference = cooking at home — more setup, but data never leaves the kitchen, you can pick any flavour, and long-term it's cheaper.

Key concepts#

GGUF format
llama.cpp's container format for quantised models. Universal across CPU and GPU.
Inference engine
llama.cpp / vLLM / SGLang / TGI / MLX — the program that actually runs the model.
Tokens/sec (throughput)
The key local-inference metric. Consumer GPU at 7B INT4 ≈ 50 tok/s.
VRAM footprint
= quantised weights + KV cache. The KV cache balloons with long contexts.
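
To see why the KV cache balloons with long contexts, here is a back-of-the-envelope calculation; the layer and head figures below are roughly those of a Llama-3-8B-class model with grouped-query attention and an fp16 cache, used purely for illustration.

```python
# Rough KV-cache size for an assumed Llama-3-8B-style config (32 layers, GQA, 8 KV heads).
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2  # fp16 cache

# K and V are stored per layer, per KV head, per token.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 128 KiB/token

for ctx in (2_048, 8_192, 32_768):
    print(f"{ctx:>6} tokens -> {ctx * kv_bytes_per_token / 2**30:.2f} GiB of KV cache")
# 2k ≈ 0.25 GiB, 8k ≈ 1 GiB, 32k ≈ 4 GiB: context length, not weights, drives the growth.
```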

Stack comparison#

| Tool | Best for | Features |
| --- | --- | --- |
| Ollama | Individuals / small teams | Single-command, OpenAI-compatible API |
| llama.cpp | Maximum control / embedded | C++, broadest hardware coverage, GGUF standard |
| LM Studio | Desktop GUI | Graphical, beginner-friendly |
| vLLM / SGLang | Production-grade serving | High throughput, PagedAttention, batching |
| MLX / Core ML | Apple Silicon | Native acceleration on M-series chips |
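
For the production end of the table, a minimal sketch of vLLM's offline batched-generation API; the model id is a placeholder and a CUDA GPU with enough VRAM is assumed. Serving deployments usually run vLLM's OpenAI-compatible server instead, which the same client code shown in the next section can talk to.

```python
# Sketch: vLLM offline batch generation; batching is where its throughput advantage shows.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")        # placeholder model id
params = SamplingParams(max_tokens=128, temperature=0.7)

prompts = ["Summarise PagedAttention in one sentence."] * 32  # a batch of requests
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```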

How it works#

Application code can stay identical to the cloud-API version — just change OPENAI_BASE_URL to point to localhost.
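
As a sketch with the official `openai` Python client pointed at Ollama's OpenAI-compatible endpoint (default port 11434); the API key is required by the client but ignored by the local server.

```python
# Same client code as for the cloud; only the base URL (and a dummy key) change.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # any non-empty string works locally
)

resp = client.chat.completions.create(
    model="llama3",  # any model you have pulled with `ollama pull`
    messages=[{"role": "user", "content": "Say hi from my laptop."}],
)
print(resp.choices[0].message.content)
```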

Practical notes#

  • Beginners use Ollama. ollama run llama3.1 or ollama run qwen2.5 gets you from zero to chatting in minutes.
  • VRAM rule of thumb: quantised model size + 1–2 GB KV cache + system overhead (see the estimator sketch after this list). 8 GB → 7B INT4; 12 GB → 13B INT4; 24 GB → ~30B INT4. A 70B at INT4 needs roughly 40 GB, so it takes two 24 GB cards or a unified-memory Mac.
  • Production uses vLLM. Throughput is an order of magnitude higher than Ollama, with paged KV cache + continuous batching.
  • MoE models ease the compute pressure, not the VRAM pressure: Mixtral 8x7B still has to hold ~47B weights, but only ~13B are active per token, so it decodes at roughly 13B speed with 47B-dense-class quality.
  • Always try quantised first. fp16 70B needs ~140 GB of VRAM (e.g. 4×A100 40 GB); INT4 70B fits on two 4090s or a Mac Studio.
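
A rough estimator for the rule of thumb above; the constants (effective bits per weight for a Q4-class quant, flat allowances for KV cache and runtime overhead) are illustrative assumptions, not exact figures for any particular engine.

```python
# Rough VRAM estimate: quantised weights + KV-cache allowance + runtime overhead.
def vram_gb(params_b: float, bits_per_weight: float = 4.5,
            kv_cache_gb: float = 1.5, overhead_gb: float = 1.0) -> float:
    weights_gb = params_b * bits_per_weight / 8  # billions of params -> GB at that bit width
    return weights_gb + kv_cache_gb + overhead_gb

for size in (7, 13, 34, 70):
    print(f"{size}B at ~4.5 bits/weight: ~{vram_gb(size):.0f} GB")
# ~6, ~10, ~22, ~42 GB: matches the 8 / 12 / 24 GB tiers above, and shows why a 70B
# needs two 24 GB cards or a high-memory Mac rather than a single consumer GPU.
```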

Easy confusions#

Local inference vs. cloud API
  • Local inference: data **never leaves the machine**; one-time hardware cost, zero per-call fee.
  • Cloud API: **compute outsourced** to the cloud; pay-as-you-go, very high ceiling.

Local inference vs. local training
  • Local inference: forward pass only, so VRAM pressure is **mostly KV cache**.
  • Local training: weights + gradients + optimizer states, roughly **3–5× the VRAM**.
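
Where the multiplier comes from, roughly: training keeps gradients and optimizer state resident alongside the weights. A per-parameter tally, assuming fp16 weights and gradients with fp32 Adam moments and ignoring activation memory:

```python
# Bytes per parameter, inference vs. naive Adam fine-tuning (activations excluded).
inference_fp16 = 2                # weights only
train_adam     = 2 + 2 + 4 + 4    # weights + gradients + Adam first/second moments
print(train_adam / inference_fp16)  # 6.0x on parameter state alone, before activations,
                                    # which is why training sits several times above inference
```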

Further reading#

  • Quantization — what makes local inference feasible
  • LLM — the application-layer view
  • Tools: Ollama, LM Studio, llama.cpp, vLLM, MLX