Way back in January, Sam Newman tweeted this (perhaps rhetorical) question:

It got a handful of retweets recently, and I responded with:

Which definitely needs some expansion as Ola Bini pointed out. So here goes. (Caution: long ramble ahead. Second caution: I’m going to gloss over a lot of details in an effort to convey a bigger picture.)

The textbook definition of an operating system is that it provides process isolation, memory management, and hardware abstraction. Some useful operating systems have been built that remove various parts of this definition, but for now I’ll use that.

Let’s look at what various degrees of process isolation could mean and why we’ve got that stack of stuff in Sam’s slide.

The most basic degree of isolation would be that one process cannot read or modify another process’s memory. “Modern” operating systems like Linux, Windows, and macOS do pretty well on that. (They’re modern in the sense of “widely used today” but all are based on 30+ year old foundations.)

Memory isolation offers some degree of security. Security will be a recurring theme.

Other conflicts between processes might arise from error rather than malice. Overusing the CPU, for example. Or consuming all available memory. This would allow one process to unfairly deny service to other processes, so an operating system must also enforce usage limits.
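As a rough illustration of what "enforce usage limits" looks like from user space, here is a minimal sketch using Python's standard `resource` module on a Unix-like system. The helper name `run_with_cpu_limit` is my own invention for this example. It lowers only the soft CPU-time limit of a child process and leaves the hard limit alone.

```python
import resource
import subprocess
import sys


def run_with_cpu_limit(args, cpu_seconds):
    """Run a command with a soft cap on CPU seconds (hypothetical helper, Unix only)."""

    def apply_limit():
        # Runs in the child just before exec. Keep the hard limit untouched
        # and never ask for a soft limit above it.
        _soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
        if hard == resource.RLIM_INFINITY:
            new_soft = cpu_seconds
        else:
            new_soft = min(cpu_seconds, hard)
        resource.setrlimit(resource.RLIMIT_CPU, (new_soft, hard))

    return subprocess.run(
        args, preexec_fn=apply_limit, capture_output=True, text=True, check=False
    )


if __name__ == "__main__":
    # The child simply reports the soft limit it was started with.
    child = "import resource; print(resource.getrlimit(resource.RLIMIT_CPU)[0])"
    print(run_with_cpu_limit([sys.executable, "-c", child], 5).stdout.strip())
```

The kernel does the enforcing here. The parent only declares the quota, which is the division of labor the paragraph above describes.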

When today’s operating systems were invented (and here I’m describing the Linux kernel as an instance of the Unix family), the idea of multitenant workload was strictly a mainframe concern. Mini- and micro-computers barely existed. The largest networks consisted of a few dozen intermittently connected machines. Most of the users knew each other by first name. Active, anonymous threats were unknown.

Benign noninterference between processes sufficed.

Process isolation now needs to mean much more than just memory protection and quota enforcement. In fact, the definition of “process” breaks down a bit, too.

In an operating system, a “process” consists of allocated memory (some of which may be paged out to storage), a memory mapping, and control information: threads’ stacks, open files, network sockets, entitlements or permissions, interrupt vectors, and so on. The operating system prevents one process from interfering with another, but it doesn’t prevent it from detecting the presence of others.

Preventing that detection is exactly what’s needed for multitenant cloud workload. A process from user A should have no way to detect the presence or absence of a process from user B. They might come from competing organizations. For government workload, they might operate under different security classification schemes.
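To make the detection gap concrete: on a typical Linux system, an unprivileged process can usually enumerate other processes just by listing `/proc`. The sketch below assumes a Linux-style procfs. The function name `visible_pids` is made up for this example. Some systems restrict this view, for example with the procfs `hidepid` mount option, so treat the result as "what this process is allowed to see."

```python
import os


def visible_pids(proc_root="/proc"):
    """Return the PIDs this process can see under a Linux-style procfs."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        # No procfs here (for example, not Linux), so nothing to report.
        return []
    # Numeric directory names under /proc correspond to process IDs.
    return sorted(int(name) for name in entries if name.isdigit())


if __name__ == "__main__":
    pids = visible_pids()
    print(f"this process can see {len(pids)} process IDs")
```

Nothing in this snippet needs elevated privileges, and that is the point.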

As we look at the stack of virtualization and containerization in Sam’s slide, we can see how each layer attempts to plug some detection holes in lower layers.

The hypervisor is an operating system. It runs other operating systems because the guest operating systems are bad at preventing detection.

For example, each process should have its own IP address so it cannot detect other processes by their use of TCP ports it would like to occupy.
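Port conflicts are an easy detection channel to demonstrate. The sketch below tries to bind a TCP port on a shared address and treats "address already in use" as evidence that some other workload is present. The helper name `port_is_taken` is hypothetical, and the loopback address is just a convenient stand-in for any shared IP.

```python
import errno
import socket


def port_is_taken(port, host="127.0.0.1"):
    """Return True if binding (host, port) fails because someone else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True  # another socket already owns this port
            raise  # permission problems etc. are not evidence either way
        return False


if __name__ == "__main__":
    # Occupy an ephemeral port, then probe it the way a neighbor would.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen()
        print(port_is_taken(listener.getsockname()[1]))
```

With a private IP address per workload, the probe has nothing to collide with, so the channel closes.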

Each process should appear to have full control over the filesystem. Otherwise, processes could detect each other via changes to files. (Implemented by the VM, and again by the container.) That means the application’s own configuration files should be isolated. But it also means the operating system configurations should be isolated.

Each process should have its own namespace for users. Otherwise they could detect each other via the user listing. (Implemented by the VM and again by containers.)

An aside about containers: a “container” is a process with its own view of a filesystem plus an isolated “namespace” for kernel objects. That means a process running in a container is really executing on the same underlying kernel as the host operating system. It’s just not allowed to see other processes. Add a virtual NIC and IP address to the container and it has the kind of isolation I’m talking about.

When we look at this stack of layers in terms of detection-prevention, the crucial need for strong patch hygiene becomes clear. Any hole in an underlying layer allows detections that should not be allowed. Since no layer really provides perfect isolation, we must treat a patch at any layer with the same priority as a ring 0 bug in the lowest level.

(I also wonder if mainframes still have something to offer here. I just don’t know enough about their operating systems to say one way or the other. But think about this: IBM had virtual machines in the 1960s.)

What could we do to create an operating system that meets our needs today?

Elevate non-detectability to the primary design goal. There should be no call or action an isolated workload can perform that would reveal the presence or absence of other workload on the same system. That includes other instances of the same workload!

A program can’t know what physical host it runs on. In a really extreme interpretation, programs can’t even be allowed to sample the clock too quickly, or else they could use timing attacks to detect other workloads!
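One way to picture "not sampling the clock too quickly" is a clock that only reports coarse ticks. This is a sketch of the idea, not a real defense: the `coarse_clock` wrapper and its parameters are invented for illustration, and an actual system would have to enforce this below the program rather than inside it.

```python
import time


def coarse_clock(resolution_s, source=time.monotonic):
    """Wrap a clock so callers only ever see multiples of resolution_s."""
    if resolution_s <= 0:
        raise ValueError("resolution_s must be positive")

    def read():
        # Round down to the start of the current tick, hiding finer detail.
        ticks = source() // resolution_s
        return ticks * resolution_s

    return read


if __name__ == "__main__":
    clock = coarse_clock(0.5)
    first, second = clock(), clock()
    # Two reads inside the same half-second tick are indistinguishable.
    print(first, second)
```

Any two events that land in the same tick look simultaneous to the program, which blunts fine-grained timing measurements at the cost of a much less useful clock.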

Such non-detectability is not possible with Unix-style kernels. Likewise for Windows kernels. A microkernel like Mach might be able to achieve it, but Darwin as built would not. All of these embed the multi-user, multi-process, shared-filesystem model too deeply. Thus, the stack of virtualization and containerization.

There are some capability-based operating systems that offer promise. seL4 comes to mind.

I find unikernels interesting as a way of packaging applications. An operating system that aims toward true non-detectability might well use such a “super-fat binary” as a unikernel. It would carry the program text along with the expected filesystem. (A program binary today is mostly an image of the bytes that will go into memory for execution, called the text. There is some additional information about variable initialization and relinking symbols based on their actual load address.)

Functions as a service certainly step toward greater isolation. Each function execution might as well happen in a new operating system, as far as the function itself can tell.

It’s likely that this kind of operating system would have a very different notion of the “unit of workload” than a process. A process with threads is a compromise notion anyway. It allows the threads to share each other’s memory but assigns permissions, resources, and quotas for the collection of threads.

In a container, we get these levels of grouping:

  1. The container has a process space (meaning PIDs), IP address, sockets, file descriptors, file system, and user base. It has an overall quota on CPU, memory, and network usage.
  2. A process in the container has permissions of one user, resources, fine-grained quotas. It cannot see the memory of other processes in the container.
  3. Threads in a process share memory, but do not have their own permissions or quotas.

If we extend that to cover the VM, hypervisor, and host operating system, we get 6 levels of grouping but each level has a totally different model.

I don’t know what the design would look like if we aimed for a homogeneous structure that allowed grouping or isolation at each level. It would probably look more like an Erlang supervision tree or seL4 style capability delegation. It would look very different from the Unix-derived systems we have now.
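Purely as a thought experiment, here is a toy model of what "the same structure at every level" could mean. Everything in it is hypothetical: the `Scope` class, its fields, and the rule that a scope can only see itself and its descendants are my own assumptions for illustration, not a description of Erlang, seL4, or any existing system.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Scope:
    """One node in a uniform grouping tree (host, VM, container, process...)."""

    name: str
    memory_limit: Optional[int] = None  # None means "inherit from ancestors"
    parent: Optional[Scope] = field(default=None, repr=False)
    children: list[Scope] = field(default_factory=list, repr=False)

    def spawn(self, name, memory_limit=None):
        """Create a child scope; every level uses the same operation."""
        child = Scope(name, memory_limit, parent=self)
        self.children.append(child)
        return child

    def effective_memory_limit(self):
        """The tightest limit set anywhere between this scope and the root."""
        limits = []
        node = self
        while node is not None:
            if node.memory_limit is not None:
                limits.append(node.memory_limit)
            node = node.parent
        return min(limits) if limits else None

    def visible(self):
        """Names this scope may observe: itself and its descendants only."""
        names = [self.name]
        for child in self.children:
            names.extend(child.visible())
        return names


if __name__ == "__main__":
    host = Scope("host", memory_limit=64)
    vm = host.spawn("vm", memory_limit=32)
    a = vm.spawn("container-a", memory_limit=16)
    b = vm.spawn("container-b")
    print(a.visible(), b.effective_memory_limit())
```

In this toy, quotas and visibility work the same way at every depth, so "VM", "container", and "process" are just names for positions in the tree rather than six different models.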