Namespaces, cgroups, and Containers | Hemanth's Internet Home

Today's topic is a little "lame" in comparison to what I've been writing the past few days, but it's still very useful (and fundamental). I am beginning to question just what my coursework has been amounting to, because as I'm reading some of these subjects, I find a lot of interesting stuff. Maybe less solitaire and more attention should do.

This began with me asking a friend of mine to teach me about containerisation, where I was really obsessed with the idea of kernel headers being used somewhere. I was promptly corrected: containers have nothing to do with kernel headers, and it's mostly just namespaces.

So what is a namespace? It's a feature of the Linux kernel that lets you isolate processes, and their view of the system, from each other. There are several kinds of namespaces, and each of them isolates a different thing. In fact, there are 8, which are:

PID Namespaces - This isolates the process IDs. So in a new namespace, a process can have PID 1 even though it isn't the host's init process. This is what gives containers the illusion that they have their own PID 1.
Network Namespaces - This isolates the network stack (interfaces, routing tables, firewall rules). So a process in a new namespace gets its own set of network interfaces, separate from the host's. This is what gives containers the illusion that they have their own network.
Mount Namespaces - This isolates the list of mount points. So a process in a new namespace can mount and unmount filesystems without affecting the host. This is what gives containers the illusion that they have their own filesystem.
IPC (Inter Process Communication) Namespaces - Isolates System V IPC objects (shared memory segments, semaphores, message queues) and POSIX message queues, so that processes in different containers cannot attach to each other's shared memory.
UTS (Unix Time-Sharing) Namespaces - Isolates the hostname (e.g. "my-laptop") and the NIS domain name, so that the container can have its own name (like container1).
USER Namespaces - Isolates user and group IDs, allowing a user to be root (UID 0) inside the container but a normal user on the host.
CGROUP Namespaces - Isolates the view of the cgroup hierarchy, so that the container sees its own cgroup as the root and cannot see how the host's cgroups are laid out.
TIME Namespaces - Isolates the offsets of the monotonic and boot-time clocks, so that the container can have its own uptime without altering the host. The wall clock (CLOCK_REALTIME) is not namespaced and stays shared with the host.
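
To see these on a live system, here's a minimal Python sketch, assuming a Linux machine with /proc mounted. The kernel exposes every namespace a process belongs to as a symlink under /proc/<pid>/ns, and two processes share a namespace exactly when the numbers in the link targets match. The helper names (parse_ns_link, list_namespaces) are my own, not part of any library.

```python
import os
import re


def parse_ns_link(target):
    """Split a link target like 'pid:[4026531836]' into ('pid', 4026531836)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def list_namespaces(pid="self"):
    """Return {entry name: namespace inode number} for a process.

    Returns an empty dict when /proc/<pid>/ns is unavailable (e.g. not Linux).
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        return result
    for entry in sorted(entries):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        except (OSError, ValueError):
            continue  # skip entries we cannot read or parse
        result[entry] = inode
    return result


if __name__ == "__main__":
    for name, inode in list_namespaces().items():
        print(f"{name:20} {inode}")
```

Running this once on the host and once inside a container should show different numbers for most entries, which is really all that "being in a container" means at this level.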

All of these namespaces isolate different aspects of a process's view of the system, hence making the processes "contained" inside the host operating system. But namespaces only control what a process can see, not how much it can use.

That part is done using cgroups (control groups), which are used to control the resources available to a process. A cgroup is essentially a node in a hierarchy of kernel data structures for tracking and limiting the usage of resources like CPU, memory, I/O bandwidth, etc.
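
In practice a cgroup is just a directory, and limits are set by writing to files inside it. Here's a rough sketch of that, assuming cgroup v2 (where the files are called cpu.max, memory.max and cgroup.procs). The function names and the "demo" cgroup in the usage note are made up for illustration, and on a real system this needs root plus the cpu and memory controllers enabled in the parent's cgroup.subtree_control.

```python
import os


def cpu_max_line(cpus, period_us=100_000):
    """Format a cgroup v2 cpu.max value: '<quota> <period>' in microseconds."""
    quota_us = int(cpus * period_us)
    return f"{quota_us} {period_us}"


def apply_limits(cgroup_dir, cpus, memory_bytes, pid):
    """Create a cgroup directory, then write CPU/memory limits and one member PID."""
    # On a real cgroup filesystem, creating the directory creates the cgroup.
    os.makedirs(cgroup_dir, exist_ok=True)
    settings = {
        "cpu.max": cpu_max_line(cpus),
        "memory.max": str(memory_bytes),
        "cgroup.procs": str(pid),  # writing a PID here moves that process in
    }
    for filename, value in settings.items():
        with open(os.path.join(cgroup_dir, filename), "w") as f:
            f.write(value + "\n")
    return settings
```

As root, something like apply_limits("/sys/fs/cgroup/demo", cpus=2, memory_bytes=256 * 1024**2, pid=os.getpid()) would cap the current process at two CPUs' worth of time and 256 MiB of memory. Pointing cgroup_dir at a scratch directory instead lets you inspect what would be written without touching the real hierarchy.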

So we begin with creating a new cgroup, and then we place the container's process into it. The clone system call is very similar to fork, except that it takes flags for creating new namespaces. Oversimplifying things a lot, we clone the initial process (i.e. the process that sets up the container) into the required new namespaces, and then replace the child with whatever program we'd like to run in the container. runc (which is an OCI-compliant runtime) is usually used to set all of this up. The container will also need pseudo-filesystems like /proc and /sys to be mounted in order to function properly, so namespace-aware instances of these are also mounted during the setup.
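
The fork, change namespaces, exec sequence can be sketched in Python too. This is a simplification and not what runc does: Python has no direct wrapper for clone, so I'm swapping in fork followed by the unshare system call (called through ctypes), which moves an already-running process into fresh namespaces. The flag values are the CLONE_NEW* constants from the kernel's linux/sched.h; the function names are my own.

```python
import ctypes
import os

# CLONE_NEW* flag values from linux/sched.h, keyed by namespace kind.
CLONE_FLAGS = {
    "time": 0x00000080,
    "mount": 0x00020000,
    "cgroup": 0x02000000,
    "uts": 0x04000000,
    "ipc": 0x08000000,
    "user": 0x10000000,
    "pid": 0x20000000,
    "net": 0x40000000,
}


def namespace_flags(names):
    """OR together the flags for the requested namespace kinds."""
    flags = 0
    for name in names:
        flags |= CLONE_FLAGS[name]
    return flags


def run_isolated(argv, names):
    """Fork, unshare the requested namespaces in the child, then exec argv.

    Needs root (or a user namespace). Note that unshare with the pid flag
    only affects later children of the caller, so becoming PID 1 would need
    one more fork than this sketch performs.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    flags = namespace_flags(names)
    pid = os.fork()
    if pid == 0:
        try:
            if libc.unshare(flags) == 0:
                os.execvp(argv[0], argv)  # does not return on success
        finally:
            os._exit(127)  # reached only if unshare or exec failed
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

As root, run_isolated(["hostname"], ["uts"]) would run hostname in its own UTS namespace; a real runtime would also set the hostname, mount /proc and so on between the unshare and the exec.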

The cgroup carries the resource half of the illusion; when a process asks for CPU time or memory, the kernel checks the limits of the cgroup the process belongs to, and only then decides whether to allocate. What I found really interesting is what happens when you set a limit for CPU usage, asking for just 1 or 2 CPU cores. The process can still see all the cores, and the kernel instead gives the cgroup a quota of CPU time per scheduling period. Once the quota is used up, the processes in the cgroup are throttled (frozen) for the rest of that period, so while the process thinks it has been using multiple cores completely, it actually spends part of every period not running at all. Linux kernels from v5.14 onwards support bursting, where unused CPU time from one period can be saved up and spent in a later one.
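
The arithmetic behind that throttling is easy to work out by hand. Here's a small sketch of it, under the simplifying assumptions that every thread is CPU-bound and each one gets its own core (the real scheduler hands out quota in smaller slices, and bursting is ignored). The function name is my own.

```python
def throttle_profile(quota_us, period_us, busy_threads):
    """Return (running_us, throttled_us) of wall-clock time within one period.

    Assumes every thread is CPU-bound and runs on its own core.
    """
    if busy_threads < 1:
        raise ValueError("need at least one busy thread")
    # N threads running in parallel drain the quota N times faster than wall time.
    running_us = min(period_us, quota_us // busy_threads)
    return running_us, period_us - running_us


if __name__ == "__main__":
    # A "2 CPU" limit is 200 ms of CPU time per 100 ms period.
    for threads in (1, 2, 8):
        print(threads, throttle_profile(200_000, 100_000, threads))
```

With a 2 CPU limit, one or two busy threads are never throttled, but eight busy threads burn through the quota in 25 ms and then sit frozen for the remaining 75 ms of every 100 ms period.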

I have glossed over a lot of things here, but I hope this gives a pretty good overview. Here's some further reading if you'd like:

Also something I want to go through: Containers from scratch in Go