In one line: containers aren't VMs; they are regular processes with a special view of the world. The view comes from namespaces (PID / Net / Mount / UTS / IPC / User / Cgroup, the 7 covered here; Linux 5.6 added an eighth, Time); the quotas come from cgroup v2. Docker / Podman / containerd are wrappers over those primitives.
What it is#
# Build a minimal "container" by hand with unshare
sudo unshare --pid --net --mount --uts --ipc --fork --mount-proc /bin/bash
# inside: ps shows only yourself; ip a shows only a down loopback; hostname xxx doesn't leak out

cgroup v2 uses a single unified hierarchy:
/sys/fs/cgroup/                        ← cgroup v2 root
└── system.slice/web.service/
    ├── memory.max = 512M
    ├── cpu.max    = "20000 100000"    # 20 % of one CPU
    └── pids.max   = 200
Write a PID into the cgroup's cgroup.procs file to move that process into the cgroup; the limits then apply to it.
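The limits in the tree above can be sketched as plain file writes. This is a minimal sketch, not how systemd or a container runtime does it: `write_limits` and `cpu_max_percent` are hypothetical helper names, and the demo points at a scratch directory so it runs without root. On a real cgroup v2 mount you would pass a directory under /sys/fs/cgroup instead, and the parent's cgroup.subtree_control must already enable the cpu, memory, and pids controllers.

```shell
#!/bin/sh
# Hypothetical helper: write the three limits into a cgroup directory.
# cg_dir is a parameter so the same code works on a scratch directory (dry run)
# or on a real cgroup directory (needs root).
write_limits() {
    cg_dir=$1
    mkdir -p "$cg_dir" || return 1
    echo "512M"         > "$cg_dir/memory.max"   # hard memory cap
    echo "20000 100000" > "$cg_dir/cpu.max"      # quota and period, in microseconds
    echo "200"          > "$cg_dir/pids.max"     # maximum number of tasks
}

# Hypothetical helper: turn a cpu.max value ("quota period") into a
# percentage of one CPU. The literal quota "max" means no limit.
cpu_max_percent() {
    set -- $1                      # split "quota period" into $1 and $2
    if [ "$1" = "max" ]; then
        echo "unlimited"
        return
    fi
    echo $(( $1 * 100 / $2 ))
}

demo_dir=$(mktemp -d)              # scratch directory standing in for a cgroup
write_limits "$demo_dir"
cpu_max_percent "$(cat "$demo_dir/cpu.max")"   # prints 20
```

On a real cgroup, the last step would be `echo <pid> > "$cg_dir/cgroup.procs"` to move a process in.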
Analogy#
namespace = strap a VR headset onto the process: the "world" it sees is altered but the hardware is the same; cgroup = assign a meal-portion manager: a budget for CPU / memory / IO; exceed it and you starve (CPU throttling) or get smacked (OOM kill).
Seven namespaces#
- PID: Process-ID space. PID 1 in the container is its init; host processes invisible.
- Net: Own NIC / routes / iptables / ports. Docker creates a veth pair per container.
- Mount: Own mount table. The container's `/` is an overlayfs stitched view.
- UTS: Own hostname / domainname.
- IPC: Own System V IPC / POSIX message queues.
- User: UID/GID mapping. Container root (uid=0) mapped to non-root on host; key for rootless.
- Cgroup: Restricts visibility of the cgroup tree; container can't see host cgroups.
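To see these namespaces on a live system, each one appears as a symlink under /proc/<pid>/ns/. The following is a small sketch (Linux only; `list_ns` is a hypothetical helper name) that prints the namespace type and its identifier for a process, defaulting to the calling shell's own view. Recent kernels list more entries than the seven above, such as `time` and `pid_for_children`.

```shell
#!/bin/sh
# Hypothetical helper: list the namespaces a process belongs to.
# Each link target looks like "pid:[<inode>]"; the inode number
# identifies the namespace instance.
list_ns() {
    pid=${1:-self}
    for link in /proc/"$pid"/ns/*; do
        printf '%-20s %s\n' "${link##*/}" "$(readlink "$link")"
    done
}

list_ns          # namespaces of the current process
```

Running `list_ns` on the host and again inside the `unshare` shell should show different identifiers for the namespace types that were unshared.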
Key cgroup v2 controllers#
How it works#
On each syscall, the kernel resolves resources (PIDs, mounts, network devices, hostname) through the namespaces of the calling process, so the process only ever gets back a restricted view.
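Because the view depends on which namespace instance a process points at, two processes see the same world exactly when their namespace links match. A minimal sketch of that check follows (Linux only; `same_ns` is a hypothetical helper, and reading another user's /proc/<pid>/ns entries may need root).

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if two processes share the given
# namespace type, fail (exit 1) if they differ, exit 2 if a link is unreadable.
same_ns() {
    ns_type=$1
    pid_a=$2
    pid_b=$3
    a=$(readlink "/proc/$pid_a/ns/$ns_type") || return 2
    b=$(readlink "/proc/$pid_b/ns/$ns_type") || return 2
    [ "$a" = "$b" ]
}

# The shell and its own child processes share a PID namespace.
if same_ns pid $$ self; then
    echo "same pid namespace"
fi
```

Comparing a host shell's PID with the PID of a process started by `unshare --pid --fork` would be expected to report a difference for the `pid` type.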
Practical notes#
- Direct observation: `ls /proc/<pid>/ns/` shows what namespaces a process is in; `cat /proc/<pid>/cgroup` shows the cgroup path; `systemctl status <svc>` lists resource usage (systemd uses cgroup v2).
- systemd resource control: `systemctl set-property web.service MemoryMax=1G CPUQuota=50%` applies live.
- Rootless containers: user-namespace maps an unprivileged host user to root inside the container; pair with fuse-overlayfs / slirp4netns for storage / networking.
- OOM-killer behavior: cgroup OOM only kills inside that cgroup; other host services are untouched. This is the foundation of Docker memory limits.
- Mixed workloads: prefer `cpu.weight` over `cpu.max`: a weighted cgroup can use spare CPU while others are idle and gets a proportional share under contention; safer for K8s pod cohabitation.
- cgroup v1 vs v2: v1 had many tangled hierarchies; v2 unified to one tree. Modern distros default to v2; older K8s + older Docker may still be v1.
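The proportional split that `cpu.weight` gives under contention is simple arithmetic: each busy sibling cgroup gets its weight divided by the sum of all busy siblings' weights. A small sketch, assuming every listed cgroup is runnable at the same time (`weight_shares` is a hypothetical helper; the kernel's default weight is 100, and the result here is rounded down to whole percent):

```shell
#!/bin/sh
# Hypothetical helper: given the cpu.weight of each busy sibling cgroup,
# print the approximate CPU share of each one under full contention.
weight_shares() {
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    for w in "$@"; do
        echo "$((w * 100 / total))%"
    done
}

# Three sibling cgroups: default weight, doubled weight, default weight.
weight_shares 100 200 100    # prints 25%, 50%, 25% on separate lines
```

If the middle cgroup goes idle, the other two would each be entitled to half of the CPU, which is the behavior that `cpu.max` cannot give because its cap applies even when the machine is otherwise idle.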
Easy confusions#
- namespace (what a process can see): PID / network / mount / hostname…
- cgroup (how much it can use): CPU / memory / IO / process count.