In one line: Quantisation = drop weight / activation precision from fp16 / fp32 down to int8 / int4 or lower. A 70B model takes 140 GB at fp16 but only ~35 GB at int4 — VRAM down 75%, inference 2–4× faster, quality nearly intact. The key technique to fit big models onto local / edge devices.
What it is#
Weights are originally floating-point; quantisation stores each one as a small integer plus a shared scale factor:
fp16: 0.123456 (2 bytes)
int8: 16 after scaling (1 byte, ÷2 size)
int4: 2 after scaling (0.5 byte, ÷4 size)
Each weight uses fewer bits: coarser precision, but a smaller footprint. At inference the approximate value is reconstructed from the stored integer and its scale; with the right error budget, output quality drops only marginally.
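A minimal sketch of that round trip, assuming symmetric per-tensor int8 with a single scale (real libraries use per-channel or per-group scales):

```python
import numpy as np

# Toy fp32 weights
w = np.array([0.1234, -0.56, 0.03, 0.9], dtype=np.float32)

# Symmetric per-tensor int8 quantisation: one scale for the whole tensor
scale = np.abs(w).max() / 127            # map the largest magnitude to 127
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantise at inference: reconstruct an approximate fp value
w_hat = q.astype(np.float32) * scale

print(q)       # stored as 1-byte integers
print(w_hat)   # close to w; the error is bounded by the scale
```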
Analogy#
Full precision = ingredients measured to 7 decimal places — the chef can distinguish 0.1234 g from 0.1235 g.
Quantisation = switch to grams as the smallest unit — most dishes taste the same.
A well-quantised meal is indistinguishable in a blind test.
Key concepts#
VRAM comparison#
| Model | FP16 | INT8 | INT4 |
|---|---|---|---|
| 7B | 14 GB | 7 GB | 3.5 GB |
| 13B | 26 GB | 13 GB | 6.5 GB |
| 70B | 140 GB | 70 GB | 35 GB |
| 175B | 350 GB | 175 GB | 88 GB |
INT4 is typically the boundary for consumer hardware: a 24 GB 4090 fits a 13B at INT4 with room to spare and can squeeze in a ~30B-class model, but a 70B at INT4 (~35 GB) still needs two such cards or CPU offload.
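The table is just parameters × bytes per weight; a rough estimator for the weights alone (runtime overhead and KV cache add several GB on top):

```python
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """Approximate VRAM for weights only: params x (bits / 8) bytes."""
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_vram_gb(70, 16))  # ~140 GB at fp16
print(weight_vram_gb(70, 4))   # ~35 GB at int4
```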
How it works#
Calibration-based algorithms (AWQ / GPTQ) run a small calibration set through the model to find which weights matter most, then choose quantisation scales that protect those weights while dropping the rest to INT4, minimising quality loss.
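Both build on group-wise quantisation: weights are split into small groups, each with its own scale and zero-point, so one outlier can't blow up the error for the whole tensor. A simplified sketch of that mechanism (not the actual AWQ/GPTQ algorithms, which additionally use calibration data to pick the scales):

```python
import numpy as np

def quantize_int4_grouped(w: np.ndarray, group_size: int = 64):
    """Asymmetric 4-bit quantisation with one scale/zero-point per group."""
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15                 # 4 bits -> levels 0..15
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale

w = np.random.randn(4096).astype(np.float32)
q, s, z = quantize_int4_grouped(w)
err = np.abs(dequantize(q, s, z).reshape(-1) - w).mean()
print(f"mean abs error: {err:.4f}")
```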
Practical notes#
- Default to GGUF + Q4_K_M. Largest llama.cpp ecosystem, good quality, runs on CPU/GPU. The default for most local deployments.
- Be careful with sensitive tasks. Programming, math, and long-chain reasoning are quantisation-sensitive. Start with INT8, then drop to INT4 only if it holds up.
- Run benchmarks, not the headline number. "1% quality loss" is an average; your task may drop 5%. Test on your own data.
- Quant + LoRA = QLoRA. Model in INT4, LoRA adapters in FP16. Fine-tune a 70B on a single GPU (see the sketch after this list).
- Production via vLLM + AWQ / GPTQ. Throughput and VRAM utilisation beat GGUF — the high-traffic default.
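A hedged sketch of the QLoRA recipe using Hugging Face transformers, peft, and bitsandbytes; the model name and hyperparameters below are placeholders, not a recommendation:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"   # placeholder; any causal LM works

# Base model loaded in 4-bit (NF4), with bf16 compute
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters stay in higher precision and are the only weights trained
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # a fraction of a percent of the 4-bit base
```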
Easy confusions#
- Quantisation vs. distillation: quantisation keeps the network structure and only lowers numeric precision (structure preserved); distillation trains a smaller model with a different structure whose behaviour is learned from a teacher.
- PTQ vs. QAT: post-training quantisation is simple, fast, and the mainstream choice; quantisation-aware training gives slightly better quality but requires retraining and is rarely used.
Further reading#
- Local Inference — the main destination for quantised models
- LoRA — QLoRA combines the two
- Tools: llama.cpp / Ollama / LM Studio / vLLM