A proposal for (almost) fully asynchronous system calls and IPC

Long-time readers of this blog will know that I have no fondness for the monolithic kernel design used by legacy operating systems. In addition to stuffing too much functionality into one piece of software, which is a failure at basic software design, and doing so on the single most privileged piece of code on a machine, which is a failure at basic software security, this design practice also has averse effects on performance by requiring frequent context switches between the OS kernel on one hand, and user-mode processes on the other hand.

In this blog post, I will highlight how this context switching overhead can be minimized through the use of more asynchronous system call and interprocess communication primitives. I will then show how these conclusions apply not only to traditional monolithic kernel designs, but also to more modern OS designs.

The problem with synchronous syscalls

Essentially, traditional system calls suffer from the following issues :

They cause the client software to wait for an unspecified amount of time, potentially missing latency deadlines in doing so.
Each system call causes a CPU context switch between the kernel and user-mode software. Context switching is far from having a negligible cost, with latencies of the order of 10~100 µs to be compared with the nanosecond overhead of a regular function call, so we want to avoid it as much as possible.
The waiting client thread is essentially a wasted system resource, which could potentially be used to perform extra client-side activities instead, leveraging the parallelism of modern CPUs.

Asynchronicity to the rescue

The aforementioned problems with syscalls, and synchronous communication between processes in general, have been known for a long time. In modern days, a consensus emerged that the answer to this is asynchronous communication, where a client sends a message to a server, but does not wait for the server to reply in any way unless it really means to.

When applied to syscalls, asynchronous communication means that client processes will fill a command buffer which is adressed to the kernel, but not wait for the kernel to execute these commands or otherwise perform a context switch to kernel code, until they are truly done sending commands and doing their own processing. Alternatively, when waiting represents an unacceptable overhead, clients may also choose to never wait at all, and instead simply poll the kernel for command completion regularly.

This way of operating has several benefits :

Clients do not need to wait for the kernel if they do not want to.
In this framework, user processes are free to perform all the parallel processing they want while the kernel runs the commands they sent.
Context switching overhead can be made negligibly small by combining kernel-side command buffer readout with scheduler-driven context switches. That is, the kernel can simply check if a user-mode thread has sent in new commands whenever it turns off that thread, because it has started waiting for something, or because it has become time to schedule a new thread.

Regarding this last point, it is also important to point out that such deferred command fetches may or may not be desirable. It is essentially a way to increase command processing throughput and decrease command emission latency at the cost of an increased total command servicing latency. So ideally, clients should have a way to synchronously notify the kernel that asynchronous commands are pending when command processing latency is an important concern.

From asynchronous processing to events

Now, at some point, user-mode processes will need to request the kernel whether a particular command has completed. To handle this use case, we need another concept from the asynchronous world, namely events.

In event-based asynchronous processing, each asynchronous command is assigned an identifier, which may be unique on a system-wide or per-process basis. On your average desktop CPU with relatively few threads, all that is needed to generate such unique identifiers is an atomic counter, whereas larger-scale asynchronous processing will usually require randomly generated event IDs whose generation comes at a slightly higher processing cost.

It is important to realize that assigning system-wide identifiers to events needs not to require many system calls : a user-mode process may just request a bunch of unused event IDs to the kernel at once, then gradually attach these identifiers to the commands associated to its asynchronous system calls.

Let us now add an event input queue to our user-mode threads. Whenever a system call runs to completion, the kernel will append the associated event ID to the associated thread’s event queue, then wake up the thread if it is waiting for an event. Additional OS primitives, akin to UNIX’ select and epoll, may be used to make waiting for multiple events faster, at the cost of extra kernel event reporting complexity.

And now, here comes the killer feature of good event handling systems: the kernel may not even need to wake up user processes when a request completes. And I do not mean that in sense that clients need not care about when their requests complete, or can poll instead of wait, but rather in the sense that when events are added to an asynchronous syscall mechanism, it makes it possible for a client to chain multiple system calls to one another, in a fashion that will require no client intervention to complete.

The power of event chaining

Personally, I have discovered such event chaining techniques while studying OpenCL, and I have fallen in love with it ever since. It is a truly beautiful way to express complex asynchronous computation in a manner that can be handled sequentially or in parallel, and most importantly in such a way that the client which started the computation never needs to wake up or care about what is happening until the very end.

To see how this can work, let’s imagine that we are writing the equivalent of the UNIX “cat” program. We read data from a number of files and send it to standard output. For 2 small files, where the overhead of a real data stream is not needed, we could express this operation in the following way :

Let there be command FILE_READOUT_1, which loads the full contents of the first file into a memory buffer that we’ll call CAT_BUFFER, starting at the beginning, and stores the number of bytes read into an integer variable BYTES_READ.
Let there be command FILE_READOUT_2, which depends on FILE_READOUT_1. Once FILE_READOUT_1 is over, FILE_READOUT_2 will load the full contents of the second file into CAT_BUFFER, starting at offset BYTES_READ, and store the final offset within CAT_BUFFER back into BYTES_READ.
Finally, let there be command FILE_OUTPUT, which depends on FILE_READOUT_2. Once FILE_READOUT_2 is over, FILE_OUTPUT will transfer the full contents of CAT_BUFFER into the standard output of our cat program.

In this framework, our cat wannabe basically starts all three tasks FILE_READOUT_1, FILE_READOUT_2 and FILE_OUTPUT above, then enters a state in which it waits for all of its asynchronous system calls to complete before exiting. Since there is an underlying guarantee that the program will never run any code again, the OS kernel can tremendously optimize this process state by only keeping the minimal resources needed to process event, and throwing away all the rest. So in the end, only minimal system resources are needed to manage such processes which start lots of asynchronous work, then wait for its completion.

Meanwhile, the kernel knows all it needs to do in order to asynchronously perform all the work that was requested from it. So it can use some kind of dependency solving algorithm to efficiently schedule all these syscall tasks one after the other, without ever being interrupted by a context switch if no other process is running (or the scheduler is setup for FIFO scheduling). This means that a well-chosen set of asynchronous command, coupled with command chaining ability, can effectively make syscall-heavy programs much more efficient.

And by the way, for those who are curious, if you want to turn the asynchronous syscall program above into a parallel one (which may or may not help performance), all you’d need to do is to change the dependency chain and command list somewhat, and allocate more buffers :

Command FILE_READOUT_1 loads the full contents of the first file into a memory buffer CAT_BUFFER_1, starting at the beginning, and stores the number of bytes read into an integer variable BYTES_READ_1.
Command FILE_READOUT_2 loads the full contents of the second file into another memory buffer CAT_BUFFER_2, starting at the beginning, and stores the number of bytes read into an integer variable BYTES_READ_2.
Command OUTPUT_1 depends on FILE_READOUT_1, and sends the full contents of CAT_BUFFER_1 to the standard output.
Command OUTPUT_2 depends on FILE_READOUT_2 and OUTPUT_1, and sends the full contents of CAT_BUFFER_2 to the standard output.

So yeah, there IS a reason why this event chaining model was introduced into parallel programming frameworks.

Asynchronous processing, the microkernel way

Now, so far, we have discussed the introduction of asynchronous system calls in a monolithic kernel design, in which the OS kernel is in charge of pretty much everything. However, monolithic kernel designs are generally a terrible thing for the reasons mentioned on the top of this article, so we may want to consider how this asynchronous command processing and event handling design maps to a multi-process world.

And I am happy to say that there is absolutely nothing that prevents a microkernel OS from also using this design for communications between threads. In this situation, the whole thing could work as follows :

Client process opens an asynchronous communication channel (command buffer + event mailbox) with the server process. This peeks a hole into process isolation, and thus requires kernel assistance and server consent.
Client submits a number of asynchronous requests, with associated unique IDs, to the server’s command buffer.
At a later point in time, kernel may wake up the server once it notices that it is waiting on the command buffer opened above. Or the server may just be polling its command buffer on its own, if it prefers to do so.
Server process processes the commands, and notifies the client of the associated command completion events.
Similarly to before, the kernel may either wake up a sleeping client, or the client may be polling its event queue on its own.

You may notice that in the end, the kernel is only used as a postman to notify processes that they have received commands and events, and only if said processes have requested it to do so by synchronously waiting on their command buffer or event queue. Otherwise, processes may achieve a purely asynchronous communication pattern, possibly including the event chaining mechanism above, using nothing more than kernel-provided shared ring buffers.

In the end, using kernel primitives is mostly a performance optimization, allowing for lower event processing and command handling latency, and for early system resource liberation when a process enters a state where it simply waits for its dependent asynchronous tasks to complete. Like any performance optimization, it may or may not be worthwhile depending on the workload being considered, and is most certainly optional. So kernel calls are unlikely to become a performance bottleneck in such a communication scenario.

The OS|periment

Musings on personal computer operating systems