
Ollama

One command to run an LLM on your machine — the 'Docker for LLMs' for personal / developer use.

Tags: Ollama · Local · llama.cpp
Key Idea

In one line: Ollama wraps llama.cpp in a Docker-style CLI; `ollama run llama3.2` and you're up. Auto-downloaded quants, an API server, a model registry — personal / local dev without GPU configuration headaches.

Cheatsheet#

# Install: one-liner on macOS/Linux; installer on Windows
curl -fsSL https://ollama.com/install.sh | sh
 
# Run
ollama run qwen2.5:7b
ollama run llama3.2
ollama run deepseek-r1:8b
 
# List / remove
ollama list
ollama rm qwen2.5:7b
 
# Expose API (default http://localhost:11434)
ollama serve

The API is OpenAI v1 compatible (set OLLAMA_HOST=0.0.0.0:11434 to expose it, then hit /v1/chat/completions); mainstream frontends and tools connect directly.
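
For example, a minimal chat request against that endpoint (assumes llama3.2 has already been pulled; no API key is needed locally):

# One-shot chat completion via the OpenAI-compatible route
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Say hi in five words."}]
  }'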

Analogy#


Running LLMs locally used to mean installing a GPU + compiling CUDA + hunting down GGUF quants: doable, but high friction.
Ollama is the App Store: search for the model → click install → use it instantly.

Key concepts#

Modelfile
Dockerfile-style: FROM <base> + SYSTEM <prompt> + PARAMETER <name> <value>...
Tag
qwen2.5:7b, qwen2.5:14b-instruct-q4_0; after the colon: size / quantisation / variant.
GGUF
llama.cpp's quantised model format for efficient inference; ships as q4 / q5 / q8 / fp16 variants.
GPU offload
PARAMETER num_gpu <layers> sets how many layers go on the GPU; the rest stay in RAM.
Context length
`PARAMETER num_ctx 32768`; beyond the model's native context, RoPE extrapolation is needed.
Embeddings
Local embedding generation via the /api/embed endpoint (or the OpenAI-compatible /v1/embeddings).

How it works#
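
Under the hood, ollama serve is a long-running server that loads GGUF weights through llama.cpp and exposes the REST API on port 11434; the CLI (ollama run / pull / ps) is just a client of that server, and models are fetched as content-addressed layers, Docker-style. A minimal sketch of that flow against the documented native endpoints, with llama3.2 as the example model:

# 1. Server (the installer usually registers this as a background service)
ollama serve
 
# 2. The CLI and the HTTP API are two clients of the same server
ollama pull llama3.2
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'
 
# 3. Inspect what is loaded and how much of it sits on the GPU
ollama ps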

Practical notes#

  • Pick a model. Mac M-series 16 GB runs 7–8B Q4 smoothly; 32 GB handles 14B; 70B needs 64 GB+ or aggressive quantisation.
  • Custom system prompt. Inside a Modelfile: SYSTEM "..." + PARAMETER temperature 0.4, then ollama create my-llama -f Modelfile (full sketch after this list).
  • API is localhost-only by default. For LAN access: OLLAMA_HOST=0.0.0.0:11434 ollama serve + firewall rules.
  • GPU not engaged? ollama ps shows the CPU/GPU split per loaded model; cross-check with nvidia-smi / Activity Monitor.
  • Hooking up Cherry Studio / LobeChat / Open WebUI: just point at the OpenAI-compatible endpoint.
  • Embeddings. POST to /api/embed with an embedding model such as nomic-embed-text (sketch after this list); local RAG without a hosted embedding API.
  • Production alternative. High concurrency → vLLM / TGI; Ollama is best for single-user / personal / edge.
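
A sketch of the custom-model flow from the notes above (the name my-llama and the parameter values are illustrative):

# Dockerfile-style Modelfile: base model + system prompt + parameters
cat > Modelfile <<'EOF'
FROM llama3.2
SYSTEM "You are a terse assistant. Answer in one sentence."
PARAMETER temperature 0.4
PARAMETER num_ctx 8192
EOF
 
ollama create my-llama -f Modelfile
ollama run my-llama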
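
And the embeddings call (assumes the nomic-embed-text model; any embedding model from the registry works the same way):

# Pull an embedding model, then request vectors over the native API
ollama pull nomic-embed-text
curl http://localhost:11434/api/embed \
  -d '{"model": "nomic-embed-text", "input": "The sky is blue."}'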

Easy confusions#

Ollama
CLI + background service; frontends / tools can hit its API directly.
LM Studio
Desktop GUI; best for purely interactive use.

Further reading#