Key Idea
In one line: Ollama wraps llama.cpp in a Docker-style CLI: `ollama run llama3.2` and you're up. Automatic quantised downloads, a built-in API server, and a model registry make it a fit for personal / local dev without GPU configuration headaches.
Cheatsheet#
# Install: one-liner on macOS/Linux; installer on Windows
curl -fsSL https://ollama.com/install.sh | sh
# Run
ollama run qwen2.5:7b
ollama run llama3.2
ollama run deepseek-r1:8b
# List / remove
ollama list
ollama rm qwen2.5:7b
# Expose API (default http://localhost:11434)
ollama serve
The API is OpenAI v1 compatible (set `OLLAMA_HOST=0.0.0.0:11434` to listen beyond localhost, then hit `/v1/chat/completions`); mainstream frontends and tools connect directly.
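As a quick check that the server and the OpenAI-compatible route are up, a minimal sketch (binding to all interfaces and the model/prompt are illustrative choices; assumes `llama3.2` has already been pulled):

```bash
# Start the server; bind to 0.0.0.0 only if you actually want LAN access
OLLAMA_HOST=0.0.0.0:11434 ollama serve &

# OpenAI-compatible chat completion against the local endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "One-sentence summary of GGUF?"}]
      }'
```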
Analogy#
Running LLMs locally used to mean installing a GPU + compiling CUDA + finding GGUF quants: doable, high friction.
Ollama is the App Store: search the model → click install → use instantly.
Key concepts#
Modelfile
Dockerfile-style: `FROM <base>` + `SYSTEM <prompt>` + `PARAMETER <name> <value>` (e.g. temperature); see the sketch after this list.
Tag
qwen2.5:7b, qwen2.5:14b-instruct-q4_0; after the colon: size / quantisation / variant.
GGUF (quant format)
llama.cpp's efficient inference format — supports q4 / q5 / q8 / fp16.
GPU offload
The `num_gpu` option (`PARAMETER num_gpu` in a Modelfile, or an API option) sets how many layers go on the GPU; the rest stay in RAM.
Context length
`PARAMETER num_ctx 32768`; beyond the model's native context, RoPE extrapolation is needed.
Embeddings
Local embedding generation via the embeddings API (`/api/embed`) with models such as `nomic-embed-text`.
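To tie Modelfile, PARAMETER, and context length together, a minimal sketch (the name `my-llama`, the system prompt, and the parameter values are illustrative, not prescribed by Ollama):

```bash
# Write a Dockerfile-style Modelfile, then register it as a new local model
cat > Modelfile <<'EOF'
FROM llama3.2
SYSTEM "You are a concise technical assistant."
PARAMETER temperature 0.4
PARAMETER num_ctx 32768
EOF

ollama create my-llama -f Modelfile   # build the custom model
ollama run my-llama                   # chat with it
```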
How it works#
Practical notes#
- Pick a model. Mac M-series 16 GB runs 7–8B Q4 smoothly; 32 GB handles 14B; 70B needs 64 GB+ or aggressive quantisation.
- Custom system prompt. In a Modelfile: `SYSTEM "..."` + `PARAMETER temperature 0.4`, then `ollama create my-llama -f Modelfile` (as in the sketch above).
- API is localhost-only by default. For LAN access: `OLLAMA_HOST=0.0.0.0:11434 ollama serve` plus firewall rules.
- GPU not engaged? `ollama ps` shows whether a loaded model sits on GPU or CPU; cross-check with nvidia-smi / Activity Monitor.
- Hooking up Cherry Studio / LobeChat / Open WebUI: just point them at the OpenAI-compatible endpoint.
- Embeddings. Pull an embedding model such as `nomic-embed-text` and call the embeddings API (sketch after this list): local RAG without a hosted embedding API.
- Production alternative. High concurrency → vLLM / TGI; Ollama is best for single-user / personal / edge use.
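A minimal embeddings sketch over the REST API, assuming the default port and the `nomic-embed-text` model (names are illustrative; older Ollama versions expose `/api/embeddings` with a `prompt` field instead):

```bash
# Pull an embedding model once, then request vectors from the local API
ollama pull nomic-embed-text

curl http://localhost:11434/api/embed \
  -d '{"model": "nomic-embed-text", "input": "Ollama runs models locally."}'
```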
Easy confusions#
Ollama
CLI + background service.
Frontends / tools can hit it directly.
LM Studio
Desktop GUI.
Best for purely interactive use.